[P1] Document SLOs/SLIs for each service

## Problem
No SLOs (Service Level Objectives) are defined, so we can't measure whether a service is "healthy enough" or needs improvement.

## Acceptance Criteria
- [ ] SLOs defined for all 4 core services
- [ ] SLIs (indicators) measurable with current/planned tooling
- [ ] Error budgets calculated monthly
- [ ] SLO dashboard accessible to team
- [ ] Alert when error budget < 20% remaining
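
The last criterion can be sketched as a small check. This is a minimal illustration, assuming a 30-day month and that downtime is tracked in minutes; the function names `budget_remaining_fraction` and `should_alert` are hypothetical and not part of any existing tooling.

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 (assumes a 30-day month)


def budget_remaining_fraction(slo: float, downtime_minutes: float) -> float:
    """Fraction of the monthly error budget still unspent (negative once overspent)."""
    budget_minutes = (1.0 - slo) * MINUTES_PER_30_DAY_MONTH
    return (budget_minutes - downtime_minutes) / budget_minutes


def should_alert(slo: float, downtime_minutes: float, threshold: float = 0.20) -> bool:
    """True when less than `threshold` of the error budget remains."""
    return budget_remaining_fraction(slo, downtime_minutes) < threshold


if __name__ == "__main__":
    # 99.9% SLO with 35 minutes of downtime so far this month
    print(should_alert(0.999, 35.0))
```

With a 99.9% target, 35 minutes of downtime leaves about 19% of the 43.2-minute budget, so the check fires; 30 minutes leaves about 31%, so it does not.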

## Proposed SLOs

### service-cloud-api (GraphQL)
| SLI | Target | Measurement |
|-----|--------|-------------|
| Availability | 99.9% (43 min/month downtime) | Health check success rate |
| Latency P95 | < 200ms | Proxy histogram |
| Latency P99 | < 500ms | Proxy histogram |
| Error Rate | < 0.1% | 5xx responses / total |
| Throughput | Handle 100 RPS | Load test baseline |
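
As a rough illustration of how the latency and error-rate targets above could be checked, the sketch below evaluates raw request samples. It assumes per-request latency and status code are available; the issue proposes proxy histograms, which estimate percentiles from buckets, so this nearest-rank calculation is a stand-in for explanation only. `Request`, `percentile`, and `evaluate` are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    status: int


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate(requests):
    """Compare one window of requests to the proposed service-cloud-api targets."""
    latencies = [r.latency_ms for r in requests]
    errors = sum(1 for r in requests if 500 <= r.status <= 599)
    return {
        "latency_p95_ok": percentile(latencies, 95) < 200,  # P95 < 200ms
        "latency_p99_ok": percentile(latencies, 99) < 500,  # P99 < 500ms
        "error_rate_ok": errors / len(requests) < 0.001,    # 5xx / total < 0.1%
    }
```

Throughput and availability are left out because they come from load tests and health checks rather than from per-request samples.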

### service-auth
| SLI | Target | Measurement |
|-----|--------|-------------|
| Availability | 99.95% (22 min/month) | Health check |
| Login Success | > 99% | Success / attempts |
| Token Issuance | < 50ms P95 | Response time |
| Error Rate | < 0.05% | 5xx responses / total (rejected credentials count under Login Success) |

### infrastructure-proxy (Pingap)
| SLI | Target | Measurement |
|-----|--------|-------------|
| Availability | 99.99% (4 min/month) | TCP health check |
| Latency P99 | < 50ms added | Proxy metrics |
| TLS Handshake | < 100ms | Proxy metrics |
| Error Rate | < 0.01% | 5xx / total |

### service-secrets (Infisical)
| SLI | Target | Measurement |
|-----|--------|-------------|
| Availability | 99.9% (43 min/month) | Health check |
| Secret Fetch | < 100ms P95 | API response |
| Encryption | 100% at rest | Audit |

## Error Budget Calculation
```
Monthly Error Budget = (1 - SLO) × 43,200 minutes   (30 days × 24 h × 60 min)

Example: 99.9% availability
Budget = 0.1% × 43,200 = 43.2 minutes of downtime allowed
```
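
The formula can be applied to each proposed availability target. The sketch below assumes the same 30-day month; `monthly_error_budget_minutes` is a hypothetical helper name.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 (assumes a 30-day month)


def monthly_error_budget_minutes(slo_percent: float) -> float:
    """Allowed downtime per month for an availability SLO given in percent."""
    return (100.0 - slo_percent) / 100.0 * MINUTES_PER_MONTH


# Availability targets from the Proposed SLOs tables
targets = {
    "service-cloud-api": 99.9,
    "service-auth": 99.95,
    "infrastructure-proxy": 99.99,
    "service-secrets": 99.9,
}

for name, slo in targets.items():
    print(f"{name}: {monthly_error_budget_minutes(slo):.2f} min")
# service-cloud-api: 43.20 min
# service-auth: 21.60 min
# infrastructure-proxy: 4.32 min
# service-secrets: 43.20 min
```

These match the rounded figures in the tables (43, 22, and 4 minutes per month).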

## Implementation

### 1. Create SLO documentation
**File:** `SLO.md`
```markdown
# AlternateFutures Service Level Objectives

## service-cloud-api
- Availability: 99.9% (measured by synthetic health checks)
- Latency P95: 200ms (measured by proxy histogram)
...
```

### 2. Create SLO tracking workflow
**File:** `.github/workflows/slo-report.yml`
```yaml
name: Weekly SLO Report

on:
  schedule:
    - cron: '0 9 * * 1'  # Monday 9am UTC
  workflow_dispatch:       # allow manual runs while testing the report

jobs:
  report:
    runs-on: ubuntu-latest
    steps:
      - name: Calculate SLIs
        run: |
          # Query Prometheus/metrics for last 7 days
          # Calculate availability, latency percentiles
          
      - name: Generate report
        run: |
          # Compare to SLOs, calculate error budget remaining
          
      - name: Post to Discord
        run: |
          # Send weekly SLO summary
```
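
The placeholder steps could call a small script along these lines. This is a sketch under assumptions, not a finished implementation: it uses Prometheus's instant-query HTTP endpoint (`/api/v1/query`) and a Discord webhook, and the Prometheus base URL, the webhook URL, the `User-Agent` string, and all function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url: str, promql: str) -> str:
    """URL for the Prometheus instant-query endpoint."""
    params = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def parse_scalar(response_body: str) -> float:
    """Extract the single sample value from an instant-query vector result."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    result = payload["data"]["result"]
    if len(result) != 1:
        raise ValueError(f"expected one series, got {len(result)}")
    return float(result[0]["value"][1])  # value is [timestamp, "string number"]


def format_summary(service: str, availability: float, slo: float) -> str:
    """One line of the weekly summary."""
    status = "OK" if availability >= slo else "BREACH"
    return f"[{status}] {service}: availability {availability:.3%} (target {slo:.3%})"


def post_to_discord(webhook_url: str, message: str) -> None:
    """Send the summary through a Discord webhook (JSON body with a `content` field)."""
    body = json.dumps({"content": message}).encode("utf-8")
    req = urllib.request.Request(
        webhook_url,
        data=body,
        # Explicit User-Agent: an assumption, set because the default urllib one may be rejected
        headers={"Content-Type": "application/json", "User-Agent": "slo-report/0.1"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        resp.read()
```

A query such as `avg_over_time(up{job="service-cloud-api"}[7d])` would give a 7-day availability ratio, assuming the service is scraped under that job label; the actual metric and label names depend on how the prerequisites below are implemented.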

### 3. SLO Dashboard (Future)
Once Grafana is deployed:
- Create SLO dashboard panel
- Show current vs target
- Error budget burn rate
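
For the burn-rate panel, one common definition is the observed error ratio divided by the ratio the SLO allows, so a value of 1.0 means the budget is being spent exactly on schedule. The sketch below assumes that definition and a 30-day window; `burn_rate` and `hours_until_exhausted` are hypothetical names.

```python
def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How fast the error budget is being spent relative to the allowed pace."""
    allowed = 1.0 - slo
    if allowed <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return observed_error_ratio / allowed


def hours_until_exhausted(rate: float, window_hours: float = 30 * 24) -> float:
    """Time to consume a full budget if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return window_hours / rate
```

For example, a 0.2% error ratio against a 99.9% SLO is a burn rate of about 2, which would exhaust a full 30-day budget in roughly 15 days.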

## Required Before Implementation
- [ ] Prometheus metrics enabled (#1 in infrastructure-proxy)
- [ ] Health checks on all services (#108 in service-cloud-api)
- [ ] Centralized alerting (#2 in this repo)

## Testing
```bash
# Verify SLIs are measurable
curl -fsS https://api.alternatefutures.ai/health  # Availability (-f: non-zero exit on HTTP errors)
curl -fsS http://proxy:3018/metrics | grep latency  # Latency
```

## Definition of Done
- [ ] SLO.md created with all service objectives
- [ ] SLIs are measurable (not aspirational)
- [ ] Team reviewed and agreed on targets
- [ ] Error budget tracking implemented
- [ ] Weekly SLO report workflow running
