[P1] Create incident runbooks

## Problem
There are no documented procedures for handling incidents, so the team relies on tribal knowledge during outages.

## Acceptance Criteria
- [ ] Runbook exists for each high-severity incident type
- [ ] Runbooks include: symptoms, diagnosis steps, resolution, escalation
- [ ] Runbooks tested with tabletop exercise
- [ ] Runbooks accessible within 30 seconds during incident
- [ ] Post-incident review template exists

## Runbook Structure
Each runbook follows this template:
```markdown
# [Service] - [Incident Type]

## Severity
P0/P1/P2

## Symptoms
- What alerts fire
- What users experience
- What metrics show

## Diagnosis
1. Step-by-step commands to identify root cause
2. What to check first, second, third

## Resolution
### Quick Fix (< 5 min)
- Immediate mitigation

### Full Resolution
- Proper fix steps

## Escalation
- Who to contact
- When to escalate

## Post-Incident
- What to document
- Follow-up actions
```
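A template is only useful if runbooks actually follow it, so a small lint step can help. The sketch below checks a runbook's text for the six level-2 headings in the template above. The function name `missing_sections` is hypothetical, and matching on exact heading text is an assumption.

```python
import re

# Level-2 headings required by the runbook template.
REQUIRED_SECTIONS = [
    "Severity",
    "Symptoms",
    "Diagnosis",
    "Resolution",
    "Escalation",
    "Post-Incident",
]


def missing_sections(runbook_text: str) -> list[str]:
    """Return the required level-2 headings that the runbook lacks."""
    # "^## " matches level-2 headings only; "### Quick Fix" has no space
    # in the third position and is therefore skipped.
    found = {
        match.group(1).strip()
        for match in re.finditer(r"^## (.+)$", runbook_text, flags=re.MULTILINE)
    }
    return [name for name in REQUIRED_SECTIONS if name not in found]
```

Running this over every file in the runbooks directory, for example in CI, would flag incomplete runbooks before an incident does.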

## Required Runbooks

| Runbook | Priority | Service | Trigger |
|---------|----------|---------|---------|
| `api-down.md` | P0 | service-cloud-api | Health check fails |
| `auth-down.md` | P0 | service-auth | Auth health fails |
| `database-connection.md` | P0 | PostgreSQL | DB unreachable |
| `proxy-down.md` | P0 | infrastructure-proxy | 502/503 errors |
| `secrets-down.md` | P0 | service-secrets | Infisical unreachable |
| `high-latency.md` | P1 | Any | P95 > 500ms |
| `disk-full.md` | P1 | Any | Disk > 90% |
| `certificate-expiry.md` | P1 | Proxy | Cert expires in < 7 days |
| `deployment-failed.md` | P2 | Any | Akash deploy fails |
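The three P1 triggers in the table are numeric thresholds, so they can be expressed directly in code. The following is a minimal sketch, assuming the metric values are already available as plain numbers. The function name and parameter names are hypothetical; only the thresholds and filenames come from the table.

```python
def runbooks_for_metrics(
    p95_latency_ms: float, disk_used_pct: float, cert_days_left: float
) -> list[str]:
    """Map metric readings to runbook filenames using the table's thresholds."""
    matches = []
    if p95_latency_ms > 500:   # P95 > 500ms
        matches.append("high-latency.md")
    if disk_used_pct > 90:     # Disk > 90%
        matches.append("disk-full.md")
    if cert_days_left < 7:     # Cert expires in < 7 days
        matches.append("certificate-expiry.md")
    return matches
```

The comparisons are strict, matching the `>` and `<` in the table, so a reading exactly at a threshold does not select a runbook.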

## Example Runbook: api-down.md

```markdown
# service-cloud-api - Service Down

## Severity
P0 - Critical

## Symptoms
- Alert: "api.alternatefutures.ai DOWN"
- Users see: 502 Bad Gateway or connection timeout
- Metrics: 0 requests/sec to api upstream

## Diagnosis
### 1. Check if it's actually down

    curl -v https://api.alternatefutures.ai/health

### 2. Check Akash deployment status

    akash query deployment get --dseq 24363709 --owner $OWNER

### 3. Check provider status

    akash query provider status $PROVIDER

### 4. Check container logs

    akash provider lease-logs --dseq 24363709 --provider $PROVIDER

## Resolution
### Quick Fix: Restart Container
Send the manifest again to restart the container (same DSEQ):

    akash provider send-manifest deploy.yaml --dseq 24363709 --provider $PROVIDER

### If Provider Issue: Migrate to New Provider
1. Close current deployment
2. Create new deployment (new DSEQ)
3. Accept bid from healthy provider
4. Update DNS to new ingress

### If Code Issue: Rollback
1. Identify last working commit
2. Build image with previous tag
3. Send manifest with old image

## Escalation
- After 15 min: @angela
- After 30 min: Page @hayk
- Database issues: see the PostgreSQL runbook (`database-connection.md`)

## Post-Incident
- [ ] Create GitHub issue with timeline
- [ ] Document root cause
- [ ] Identify prevention measures
- [ ] Update runbook if needed
```
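Diagnosis step 1 can also be scripted, so the responder gets a single classification instead of reading `curl -v` output. This is a minimal sketch using only the Python standard library. It assumes a healthy service answers `/health` with HTTP 200, which the runbook does not state. The function name `classify_health` and the `opener` parameter (there to allow testing without network access) are hypothetical.

```python
import socket
import urllib.error
import urllib.request


def classify_health(url: str, timeout_s: float = 5.0, opener=urllib.request.urlopen) -> str:
    """Return "up", "http-<code>", "timeout", or "unreachable" for a health URL."""
    try:
        with opener(url, timeout=timeout_s) as response:
            # Assumption: HTTP 200 means healthy.
            return "up" if response.status == 200 else f"http-{response.status}"
    except urllib.error.HTTPError as err:
        # urlopen raises for 4xx/5xx, e.g. the 502 listed under Symptoms.
        return f"http-{err.code}"
    except (socket.timeout, TimeoutError):
        # Timeout while reading the response.
        return "timeout"
    except urllib.error.URLError as err:
        # Timeouts while connecting arrive wrapped in URLError.
        if isinstance(err.reason, (socket.timeout, TimeoutError)):
            return "timeout"
        return "unreachable"
```

Calling `classify_health("https://api.alternatefutures.ai/health")` would then distinguish the two symptoms the runbook names: a `http-502` result versus a `timeout`.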

## Implementation

### 1. Create runbooks directory
```
.github/
  runbooks/
    api-down.md
    auth-down.md
    database-connection.md
    proxy-down.md
    secrets-down.md
    high-latency.md
    disk-full.md
    certificate-expiry.md
    deployment-failed.md
    TEMPLATE.md
```
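To track progress toward the layout above, a short script can report which files are still missing. This is a sketch, assuming it is given the path to the repository root. The function name `missing_runbooks` is hypothetical; the filenames are taken from the directory listing.

```python
from pathlib import Path

# Filenames from the directory listing, including the template.
EXPECTED_RUNBOOKS = [
    "api-down.md",
    "auth-down.md",
    "database-connection.md",
    "proxy-down.md",
    "secrets-down.md",
    "high-latency.md",
    "disk-full.md",
    "certificate-expiry.md",
    "deployment-failed.md",
    "TEMPLATE.md",
]


def missing_runbooks(repo_root: str) -> list[str]:
    """Return expected runbook files that are absent from .github/runbooks/."""
    runbook_dir = Path(repo_root) / ".github" / "runbooks"
    return [name for name in EXPECTED_RUNBOOKS if not (runbook_dir / name).is_file()]
```

An empty result from `missing_runbooks(".")` would mean every file in the listing exists; it does not check their contents.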

### 2. Create post-incident template
**File:** `.github/ISSUE_TEMPLATE/post-incident.md`
```markdown
---
name: Post-Incident Review
about: Document incident timeline and learnings
title: '[PIR] YYYY-MM-DD - Brief description'
labels: post-incident
---
## Incident Summary
**Duration:** 
**Severity:** 
**Services Affected:**

## Timeline
| Time (UTC) | Event |
|------------|-------|
| HH:MM | Alert fired |
| HH:MM | Investigation started |

## Root Cause

## Resolution

## Action Items
- [ ] 

## Lessons Learned
```
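Filling in the title and timeline by hand invites date and time zone mistakes. The helpers below are a sketch that formats both in UTC, following the `[PIR] YYYY-MM-DD - Brief description` title and the `Time (UTC)` column of the template. The function names are hypothetical, and the code assumes timezone-aware `datetime` values.

```python
from datetime import datetime, timezone


def pir_title(started_at: datetime, description: str) -> str:
    """Format an issue title as '[PIR] YYYY-MM-DD - Brief description'."""
    # Assumption: started_at is timezone-aware; it is converted to UTC first.
    day = started_at.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return f"[PIR] {day} - {description}"


def timeline_row(at: datetime, event: str) -> str:
    """Format one Markdown row for the Timeline table, with the time in UTC."""
    clock = at.astimezone(timezone.utc).strftime("%H:%M")
    return f"| {clock} | {event} |"
```

Converting to UTC before formatting keeps the issue title and the timeline rows consistent even when responders are in different time zones.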

## Testing
- [ ] Tabletop exercise: "API is returning 502"
- [ ] Time how long it takes to find the right runbook (target: under 30 seconds)
- [ ] Verify that the commands in each runbook work

## Definition of Done
- [ ] All 9 runbooks created
- [ ] Post-incident template added
- [ ] Team walkthrough completed
- [ ] Runbooks linked from alert messages
- [ ] First tabletop exercise done
