[P1] Create incident runbooks #4

@wonderwomancode

Problem

There are no documented procedures for handling incidents, so the team relies on tribal knowledge during outages.

Acceptance Criteria

  • Runbook exists for each high-severity incident type
  • Runbooks include: symptoms, diagnosis steps, resolution, escalation
  • Runbooks tested with tabletop exercise
  • Runbooks accessible within 30 seconds during incident
  • Post-incident review template exists

Runbook Structure

Each runbook follows this template:

# [Service] - [Incident Type]

## Severity
P0/P1/P2

## Symptoms
- What alerts fire
- What users experience
- What metrics show

## Diagnosis
1. Step-by-step commands to identify root cause
2. What to check first, second, third

## Resolution
### Quick Fix (< 5 min)
- Immediate mitigation

### Full Resolution
- Proper fix steps

## Escalation
- Who to contact
- When to escalate

## Post-Incident
- What to document
- Follow-up actions
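A small check can confirm that a runbook contains every section of the template above. This is a minimal sketch, not part of the issue: the function name `missing_sections` is hypothetical, and it assumes each section is written as a level-2 Markdown heading exactly as in the template.

```python
import re

# Section headings taken from the runbook template in this issue.
REQUIRED_SECTIONS = [
    "Severity",
    "Symptoms",
    "Diagnosis",
    "Resolution",
    "Escalation",
    "Post-Incident",
]


def missing_sections(markdown_text: str) -> list[str]:
    """Return the required level-2 headings that the runbook lacks."""
    # Matches lines such as "## Symptoms"; deeper headings such as
    # "### Quick Fix" do not match because "##" must be followed by a space.
    found = {
        match.group(1).strip()
        for match in re.finditer(
            r"^##[ \t]+(.+?)[ \t]*$", markdown_text, flags=re.MULTILINE
        )
    }
    # Preserve template order so the report reads top to bottom.
    return [name for name in REQUIRED_SECTIONS if name not in found]
```

Running this over every file in the runbooks directory before the tabletop exercise would flag runbooks that skip escalation or post-incident steps.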

Required Runbooks

| Runbook | Priority | Service | Trigger |
|---------|----------|---------|---------|
| api-down.md | P0 | service-cloud-api | Health check fails |
| auth-down.md | P0 | service-auth | Auth health fails |
| database-connection.md | P0 | PostgreSQL | DB unreachable |
| proxy-down.md | P0 | infrastructure-proxy | 502/503 errors |
| secrets-down.md | P0 | service-secrets | Infisical unreachable |
| high-latency.md | P1 | Any | P95 > 500ms |
| disk-full.md | P1 | Any | Disk > 90% |
| certificate-expiry.md | P1 | Proxy | Cert expires < 7 days |
| deployment-failed.md | P2 | Any | Akash deploy fails |
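The three P1 triggers in the table are numeric thresholds, so they can be expressed directly in code. The sketch below is illustrative only: the function name `runbooks_for_metrics` and its parameter names are hypothetical, and how the metrics are collected is left open. The thresholds themselves come from the table.

```python
from typing import Optional


def runbooks_for_metrics(
    p95_latency_ms: Optional[float] = None,
    disk_used_percent: Optional[float] = None,
    cert_days_remaining: Optional[float] = None,
) -> list[str]:
    """Return the P1 runbooks whose trigger from the table is met."""
    matches = []
    # Trigger: P95 > 500ms
    if p95_latency_ms is not None and p95_latency_ms > 500:
        matches.append("high-latency.md")
    # Trigger: Disk > 90%
    if disk_used_percent is not None and disk_used_percent > 90:
        matches.append("disk-full.md")
    # Trigger: Cert expires < 7 days
    if cert_days_remaining is not None and cert_days_remaining < 7:
        matches.append("certificate-expiry.md")
    return matches
```

Metrics that are not supplied are skipped rather than treated as zero, so a missing reading never triggers or suppresses a runbook by accident.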

Example Runbook: api-down.md

# service-cloud-api - Service Down

## Severity
P0 - Critical

## Symptoms
- Alert: "api.alternatefutures.ai DOWN"
- Users see: 502 Bad Gateway or connection timeout
- Metrics: 0 requests/sec to api upstream

## Diagnosis
### 1. Check if it's actually down
curl -v https://api.alternatefutures.ai/health

### 2. Check Akash deployment status
akash query deployment get --dseq 24363709 --owner $OWNER

### 3. Check provider status
akash query provider status $PROVIDER

### 4. Check container logs
akash provider lease-logs --dseq 24363709 --provider $PROVIDER

## Resolution
### Quick Fix: Restart Container
# Send manifest to restart (same DSEQ)
akash provider send-manifest deploy.yaml --dseq 24363709 --provider $PROVIDER

### If Provider Issue: Migrate to New Provider
1. Close current deployment
2. Create new deployment (new DSEQ)
3. Accept bid from healthy provider
4. Update DNS to new ingress

### If Code Issue: Rollback
1. Identify last working commit
2. Build image with previous tag
3. Send manifest with old image

## Escalation
- After 15 min: @angela
- After 30 min: Page @hayk
- Database issues: See the PostgreSQL runbook (database-connection.md)

## Post-Incident
- [ ] Create GitHub issue with timeline
- [ ] Document root cause
- [ ] Identify prevention measures
- [ ] Update runbook if needed
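The escalation timings in the example runbook can be captured as a simple lookup, which is useful if escalation reminders are ever automated. This is a sketch under stated assumptions: the function name `escalation_action` is hypothetical, "after N min" is read as "N or more minutes since the alert fired", and the default string for the first 15 minutes is not from the runbook.

```python
def escalation_action(minutes_since_alert: float) -> str:
    """Map elapsed time since the alert to the runbook's escalation step."""
    if minutes_since_alert < 0:
        raise ValueError("minutes_since_alert must not be negative")
    # Check the later threshold first so 30+ minutes never falls through
    # to the 15-minute step.
    if minutes_since_alert >= 30:
        return "Page @hayk"
    if minutes_since_alert >= 15:
        return "Contact @angela"
    # Assumption: before 15 minutes the on-call engineer keeps working
    # through the Diagnosis and Resolution sections.
    return "Continue diagnosis"
```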

Implementation

1. Create runbooks directory

.github/
  runbooks/
    api-down.md
    auth-down.md
    database-connection.md
    proxy-down.md
    secrets-down.md
    high-latency.md
    disk-full.md
    certificate-expiry.md
    deployment-failed.md
    TEMPLATE.md
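The directory layout above could be scaffolded with a short script so that every runbook starts from the same template. This is one possible approach, not a prescribed one: the function name `scaffold_runbooks` and the `repo_root` and `template_text` parameters are hypothetical, and the file names are the nine runbooks listed in this issue.

```python
from pathlib import Path

# The nine runbooks listed under "Required Runbooks".
RUNBOOKS = [
    "api-down.md",
    "auth-down.md",
    "database-connection.md",
    "proxy-down.md",
    "secrets-down.md",
    "high-latency.md",
    "disk-full.md",
    "certificate-expiry.md",
    "deployment-failed.md",
]


def scaffold_runbooks(repo_root: Path, template_text: str) -> list[Path]:
    """Create .github/runbooks/ and seed missing runbooks from the template."""
    runbook_dir = repo_root / ".github" / "runbooks"
    runbook_dir.mkdir(parents=True, exist_ok=True)

    template_path = runbook_dir / "TEMPLATE.md"
    if not template_path.exists():
        template_path.write_text(template_text, encoding="utf-8")

    created = []
    for name in RUNBOOKS:
        path = runbook_dir / name
        # Never overwrite a runbook that someone has already filled in.
        if not path.exists():
            path.write_text(template_text, encoding="utf-8")
            created.append(path)
    return created
```

Because existing files are skipped, the script is safe to re-run after new runbooks are added to the list.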

2. Create post-incident template

File: .github/ISSUE_TEMPLATE/post-incident.md

---
name: Post-Incident Review
about: Document incident timeline and learnings
title: '[PIR] YYYY-MM-DD - Brief description'
labels: post-incident
---
## Incident Summary
**Duration:** 
**Severity:** 
**Services Affected:**

## Timeline
| Time (UTC) | Event |
|------------|-------|
| HH:MM | Alert fired |
| HH:MM | Investigation started |

## Root Cause

## Resolution

## Action Items
- [ ] 

## Lessons Learned
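Two fields of the template are mechanical to fill in: the issue title and the Duration. The helpers below sketch how they might be derived, assuming the title pattern from the template and timeline entries in HH:MM (UTC). The function names `pir_title` and `incident_duration_minutes` are hypothetical, and the sketch assumes an incident lasts less than 24 hours.

```python
from datetime import date, datetime, timedelta


def pir_title(incident_date: date, description: str) -> str:
    """Build a title in the template's '[PIR] YYYY-MM-DD - ...' pattern."""
    return f"[PIR] {incident_date.isoformat()} - {description.strip()}"


def incident_duration_minutes(start_hhmm: str, end_hhmm: str) -> int:
    """Minutes between the first and last timeline entries (HH:MM, UTC)."""
    start = datetime.strptime(start_hhmm, "%H:%M")
    end = datetime.strptime(end_hhmm, "%H:%M")
    # Assumption: an end time earlier than the start means the incident
    # crossed midnight UTC once.
    if end < start:
        end += timedelta(days=1)
    return int((end - start).total_seconds() // 60)
```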

Testing

  • Tabletop exercise: "API is returning 502"
  • Time how long to find runbook
  • Verify commands work

Definition of Done

  • All 9 runbooks created
  • Post-incident template added
  • Team walkthrough completed
  • Runbooks linked from alert messages
  • First tabletop exercise done
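Linking runbooks from alert messages could be as simple as appending a URL to the alert text. The sketch below is an assumption-heavy illustration: the function name `alert_with_runbook` is hypothetical, `repo_url` must be supplied by the caller because the issue does not state it, and the `blob/main` path assumes a GitHub repository whose default branch is `main`.

```python
RUNBOOK_DIR = ".github/runbooks"  # directory proposed in this issue


def alert_with_runbook(message: str, runbook: str, repo_url: str) -> str:
    """Append a link to the matching runbook to an alert message."""
    # Assumption: GitHub-style file URLs on a default branch named "main".
    link = f"{repo_url.rstrip('/')}/blob/main/{RUNBOOK_DIR}/{runbook}"
    return f"{message} | Runbook: {link}"
```

With a link embedded in the alert itself, the 30-second access target from the acceptance criteria depends on a single click rather than on remembering where the runbooks live.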

Labels: documentation