Problem
There are no documented procedures for handling incidents, so the team relies on tribal knowledge during outages.
Acceptance Criteria
- Runbook exists for each high-severity incident type
- Runbooks include: symptoms, diagnosis steps, resolution, escalation
- Runbooks tested with tabletop exercise
- Runbooks accessible within 30 seconds during incident
- Post-incident review template exists
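The second criterion can be checked mechanically. The following is a minimal sketch, assuming each required section appears as a level-two heading (`## Symptoms` and so on) exactly as in the template; the function name `missing_sections` is hypothetical.

```shell
#!/bin/sh
# Sketch: print the required runbook sections that a file is missing.
# Assumption: sections are written as "## <Name>" on a line of their own.
missing_sections() {
  file="$1"
  for section in Symptoms Diagnosis Resolution Escalation; do
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
    if ! grep -qxF "## $section" "$file"; then
      echo "$section"
    fi
  done
}
```

For example, `missing_sections .github/runbooks/api-down.md` prints nothing when all four sections are present.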
Runbook Structure
Each runbook follows this template:
# [Service] - [Incident Type]
## Severity
P0/P1/P2
## Symptoms
- What alerts fire
- What users experience
- What metrics show
## Diagnosis
1. Step-by-step commands to identify root cause
2. What to check first, second, third
## Resolution
### Quick Fix (< 5 min)
- Immediate mitigation
### Full Resolution
- Proper fix steps
## Escalation
- Who to contact
- When to escalate
## Post-Incident
- What to document
- Follow-up actions
Required Runbooks
| Runbook | Priority | Service | Trigger |
|---|---|---|---|
| api-down.md | P0 | service-cloud-api | Health check fails |
| auth-down.md | P0 | service-auth | Auth health fails |
| database-connection.md | P0 | PostgreSQL | DB unreachable |
| proxy-down.md | P0 | infrastructure-proxy | 502/503 errors |
| secrets-down.md | P0 | service-secrets | Infisical unreachable |
| high-latency.md | P1 | Any | P95 > 500ms |
| disk-full.md | P1 | Any | Disk > 90% |
| certificate-expiry.md | P1 | Proxy | Cert expires < 7 days |
| deployment-failed.md | P2 | Any | Akash deploy fails |
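The three P1 triggers in the table are numeric thresholds, so they can be expressed as a small lookup. This is a sketch only: the metric names (`p95_ms`, `disk_pct`, `cert_days_left`) and the function name are hypothetical, and values are assumed to be whole numbers.

```shell
#!/bin/sh
# Sketch: map a measurement to the P1 runbook whose trigger it crosses.
# Assumption: metric names are illustrative; thresholds come from the table.
p1_runbook_for() {
  metric="$1"
  value="$2"
  case "$metric" in
    p95_ms)         if [ "$value" -gt 500 ]; then echo "high-latency.md"; fi ;;
    disk_pct)       if [ "$value" -gt 90 ]; then echo "disk-full.md"; fi ;;
    cert_days_left) if [ "$value" -lt 7 ]; then echo "certificate-expiry.md"; fi ;;
  esac
}
```

The function prints nothing when no trigger is crossed, so a caller can test for empty output before linking a runbook.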
Example Runbook: api-down.md
# service-cloud-api - Service Down
## Severity
P0 - Critical
## Symptoms
- Alert: "api.alternatefutures.ai DOWN"
- Users see: 502 Bad Gateway or connection timeout
- Metrics: 0 requests/sec to api upstream
## Diagnosis
### 1. Check if it's actually down
curl -v https://api.alternatefutures.ai/health
### 2. Check Akash deployment status
akash query deployment get --dseq 24363709 --owner $OWNER
### 3. Check provider status
akash query provider status $PROVIDER
### 4. Check container logs
akash provider lease-logs --dseq 24363709 --provider $PROVIDER
## Resolution
### Quick Fix: Restart Container
# Send manifest to restart (same DSEQ)
akash provider send-manifest deploy.yaml --dseq 24363709 --provider $PROVIDER
### If Provider Issue: Migrate to New Provider
1. Close current deployment
2. Create new deployment (new DSEQ)
3. Accept bid from healthy provider
4. Update DNS to new ingress
### If Code Issue: Rollback
1. Identify last working commit
2. Build image with previous tag
3. Send manifest with old image
## Escalation
- After 15 min: @angela
- After 30 min: Page @hayk
- Database issues: follow the PostgreSQL runbook (database-connection.md)
## Post-Incident
- [ ] Create GitHub issue with timeline
- [ ] Document root cause
- [ ] Identify prevention measures
- [ ] Update runbook if needed
Implementation
1. Create runbooks directory
.github/
runbooks/
api-down.md
auth-down.md
database-connection.md
proxy-down.md
secrets-down.md
high-latency.md
disk-full.md
certificate-expiry.md
deployment-failed.md
TEMPLATE.md
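The directory layout above could be scaffolded with a short script. This is a sketch under the assumption that every runbook starts as a copy of `TEMPLATE.md`; the function name `scaffold_runbooks` is hypothetical, and existing files are left untouched.

```shell
#!/bin/sh
# Sketch: create each runbook from TEMPLATE.md without overwriting existing files.
scaffold_runbooks() {
  dir="$1"
  mkdir -p "$dir"
  # Create an empty template if none exists yet (its content is up to the team).
  [ -f "$dir/TEMPLATE.md" ] || : > "$dir/TEMPLATE.md"
  for name in api-down auth-down database-connection proxy-down secrets-down \
              high-latency disk-full certificate-expiry deployment-failed; do
    if [ ! -e "$dir/$name.md" ]; then
      cp "$dir/TEMPLATE.md" "$dir/$name.md"
    fi
  done
}
```

Running `scaffold_runbooks .github/runbooks` from the repository root would produce the nine runbook files plus the template.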
2. Create post-incident template
File: .github/ISSUE_TEMPLATE/post-incident.md
---
name: Post-Incident Review
about: Document incident timeline and learnings
title: '[PIR] YYYY-MM-DD - Brief description'
labels: post-incident
---
## Incident Summary
**Duration:**
**Severity:**
**Services Affected:**
## Timeline
| Time (UTC) | Event |
|------------|-------|
| HH:MM | Alert fired |
| HH:MM | Investigation started |
## Root Cause
## Resolution
## Action Items
- [ ]
## Lessons Learned
Testing
- Tabletop exercise: "API is returning 502"
- Time how long it takes to find the runbook (target: under 30 seconds)
- Verify that every command in the runbook works
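Part of verifying the commands is confirming that the tools they call are installed on the responder's machine. A minimal sketch, with the hypothetical function name `missing_tools`; it only checks that each tool is on `PATH`, not that the command succeeds against the live deployment.

```shell
#!/bin/sh
# Sketch: print every named CLI tool that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
    fi
  done
}
```

For the example runbook, `missing_tools curl akash` would print nothing on a machine that is ready for the tabletop exercise.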
Definition of Done
- All 9 runbooks created
- Post-incident template added
- Team walkthrough completed
- Runbooks linked from alert messages
- First tabletop exercise done
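The first item can be checked by counting files. The sketch below assumes runbooks live in a single directory as `.md` files and that `TEMPLATE.md` should not be counted; the function name `runbook_count` is hypothetical.

```shell
#!/bin/sh
# Sketch: count runbook files in a directory, excluding TEMPLATE.md.
runbook_count() {
  dir="$1"
  count=0
  for f in "$dir"/*.md; do
    [ -e "$f" ] || continue                              # glob matched nothing
    [ "$(basename "$f")" = "TEMPLATE.md" ] && continue   # skip the template
    count=$((count + 1))
  done
  echo "$count"
}
```

A check such as `[ "$(runbook_count .github/runbooks)" -eq 9 ]` could then gate the Definition of Done.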