Open
Labels
documentation (Improvements or additions to documentation)
Description
Problem
No SLOs (Service Level Objectives) are defined, so we can't measure whether services are "healthy enough" or need improvement.
Acceptance Criteria
- SLOs defined for all 4 core services
- SLIs (indicators) measurable with current/planned tooling
- Error budgets calculated monthly
- SLO dashboard accessible to team
- Alert when error budget < 20% remaining
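The last criterion can be sketched as a small check. This is a minimal sketch, not a decided implementation: the function names and the idea of feeding it downtime minutes are assumptions for illustration, and only the 20% threshold comes from the list above.

```python
def budget_remaining_fraction(budget_minutes: float, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent, clamped to [0, 1]."""
    if budget_minutes <= 0:
        raise ValueError("budget_minutes must be positive")
    if downtime_minutes < 0:
        raise ValueError("downtime_minutes must not be negative")
    return max(0.0, 1 - downtime_minutes / budget_minutes)


def should_alert(budget_minutes: float, downtime_minutes: float,
                 threshold: float = 0.20) -> bool:
    """True when less than `threshold` of the budget remains (20% per the criteria)."""
    return budget_remaining_fraction(budget_minutes, downtime_minutes) < threshold


# Hypothetical month: a 43.2-minute budget with 35 minutes already spent.
print(should_alert(43.2, 35.0))
```

Clamping at zero keeps an overspent budget reporting as 0% remaining rather than a negative number, which is easier to show on a dashboard.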
Proposed SLOs
service-cloud-api (GraphQL)
| SLI | Target | Measurement |
|---|---|---|
| Availability | 99.9% (43 min/month downtime) | Health check success rate |
| Latency P95 | < 200ms | Proxy histogram |
| Latency P99 | < 500ms | Proxy histogram |
| Error Rate | < 0.1% | 5xx responses / total |
| Throughput | Handle 100 RPS | Load test baseline |
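For the latency rows, a percentile can be estimated from cumulative histogram buckets by linear interpolation inside the bucket that contains the target rank, which is the approach Prometheus takes in `histogram_quantile`. The sketch below assumes the proxy exposes cumulative buckets with millisecond upper bounds and that the lowest bucket starts at 0. The bucket values are made up for illustration.

```python
def quantile_from_buckets(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound_ms, cumulative_count), sorted by upper bound.
    Assumes the lowest bucket starts at 0 ms and observations are spread
    evenly inside each bucket (linear interpolation).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        raise ValueError("histogram is empty")
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            # Interpolate between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return buckets[-1][0]


# Hypothetical bucket counts, not real measurements.
example = [(50, 600), (100, 850), (200, 960), (500, 995), (1000, 1000)]
p95 = quantile_from_buckets(0.95, example)
p99 = quantile_from_buckets(0.99, example)
print(f"P95 = {p95:.1f} ms (target < 200), P99 = {p99:.1f} ms (target < 500)")
```

Because the estimate is only as fine as the bucket boundaries, it helps to have bucket edges at or near the 200 ms and 500 ms targets.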
service-auth
| SLI | Target | Measurement |
|---|---|---|
| Availability | 99.95% (22 min/month) | Health check |
| Login Success | > 99% | Success / attempts |
| Token Issuance | < 50ms P95 | Response time |
| Error Rate | < 0.05% | Failed auth / total |
infrastructure-proxy (Pingap)
| SLI | Target | Measurement |
|---|---|---|
| Availability | 99.99% (4 min/month) | TCP health check |
| Latency P99 | < 50ms added | Proxy metrics |
| TLS Handshake | < 100ms | Proxy metrics |
| Error Rate | < 0.01% | 5xx / total |
service-secrets (Infisical)
| SLI | Target | Measurement |
|---|---|---|
| Availability | 99.9% | Health check |
| Secret Fetch | < 100ms P95 | API response |
| Encryption | 100% at rest | Audit |
Error Budget Calculation
Monthly Error Budget = (1 - SLO) × 43,200 minutes (a 30-day month: 30 × 24 × 60)
Example: 99.9% availability
Budget = 0.1% × 43,200 = 43.2 minutes of downtime allowed
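The same formula applied to the four proposed availability targets, as a minimal sketch. The function name is an assumption for illustration, and the 30-day month matches the 43,200-minute figure above.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes, i.e. a 30-day month


def monthly_error_budget_minutes(slo_percent: float) -> float:
    """Allowed downtime in minutes per 30-day month for an availability SLO."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    return (100 - slo_percent) / 100 * MINUTES_PER_MONTH


# Availability targets taken from the Proposed SLOs tables.
for service, slo in [
    ("service-cloud-api", 99.9),
    ("service-auth", 99.95),
    ("infrastructure-proxy", 99.99),
    ("service-secrets", 99.9),
]:
    print(f"{service}: {monthly_error_budget_minutes(slo):.2f} min/month")
```

The results line up with the rounded figures in the tables: about 43.2 minutes for 99.9%, 21.6 for 99.95%, and 4.32 for 99.99%.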
Implementation
1. Create SLO documentation
File: SLO.md
# AlternateFutures Service Level Objectives
## service-cloud-api
- Availability: 99.9% (measured by synthetic health checks)
- Latency P95: 200ms (measured by proxy histogram)
...
2. Create SLO tracking workflow
File: .github/workflows/slo-report.yml
name: Weekly SLO Report
on:
  schedule:
    - cron: '0 9 * * 1' # Monday 9am UTC
jobs:
  report:
    runs-on: ubuntu-latest
    steps:
      - name: Calculate SLIs
        run: |
          # Query Prometheus/metrics for last 7 days
          # Calculate availability, latency percentiles
      - name: Generate report
        run: |
          # Compare to SLOs, calculate error budget remaining
      - name: Post to Discord
        run: |
          # Send weekly SLO summary
3. SLO Dashboard (Future)
Once Grafana is deployed:
- Create SLO dashboard panel
- Show current vs target
- Error budget burn rate
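Burn rate is commonly defined as the observed error ratio divided by the error ratio the SLO allows, so a value of 1 spends the budget exactly over the SLO window. A minimal sketch of that calculation follows. The function names and the request counts are illustrative assumptions, not values from any existing dashboard.

```python
import math


def burn_rate(bad_events: int, total_events: int, slo: float) -> float:
    """Observed error ratio divided by the allowed error ratio (1 - slo).

    slo is a fraction, e.g. 0.999 for 99.9%.
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    return (bad_events / total_events) / (1 - slo)


def days_until_exhausted(rate: float, window_days: float = 30) -> float:
    """Days a full budget lasts if the current burn rate continues."""
    return math.inf if rate == 0 else window_days / rate


# Hypothetical sample: 50 failed requests out of 10,000 against a 99.9% SLO.
rate = burn_rate(50, 10_000, 0.999)
print(f"burn rate {rate:.1f}x, budget lasts {days_until_exhausted(rate):.1f} days")
```

In this sample the error ratio is 0.5% against an allowed 0.1%, so the budget burns about five times too fast and a full 30-day budget would last about six days.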
Required Before Implementation
- Prometheus metrics enabled (Build self-healing backup service for platform databases #1 in infrastructure-proxy)
- Health checks on all services (#108 in service-cloud-api)
- Centralized alerting ([P1] Create centralized alerting system #2 in this repo)
Testing
# Verify SLIs are measurable
curl https://api.alternatefutures.ai/health # Availability
curl http://proxy:3018/metrics | grep latency # Latency
Definition of Done
- SLO.md created with all service objectives
- SLIs are measurable (not aspirational)
- Team reviewed and agreed on targets
- Error budget tracking implemented
- Weekly SLO report workflow running
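The "Generate report" and "Post to Discord" steps of the weekly workflow could be backed by a small script along these lines. This is a sketch under assumptions: the function names and the measured values are hypothetical, and the only Discord detail relied on is that a webhook accepts a JSON body with a `content` field of at most 2000 characters. Sending the request is left out.

```python
import json

DISCORD_MAX_CHARS = 2000  # Discord message content limit


def format_slo_report(rows):
    """Build a plain-text summary.

    rows: iterable of (service, target_percent, measured_percent).
    """
    lines = ["Weekly SLO report"]
    for service, target, measured in rows:
        allowed = 100 - target           # error budget in percentage points
        used = max(0.0, 100 - measured)  # budget consumed in the window
        remaining = max(0.0, 1 - used / allowed) if allowed > 0 else 0.0
        status = "OK" if measured >= target else "BREACH"
        lines.append(
            f"{service}: {measured:.3f}% (target {target}%) "
            f"{status}, budget remaining {remaining:.0%}"
        )
    return "\n".join(lines)


def discord_payload(text):
    """JSON body for a Discord webhook, truncated to the content limit."""
    return json.dumps({"content": text[:DISCORD_MAX_CHARS]})


# Hypothetical measurements for illustration only.
report = format_slo_report([
    ("service-cloud-api", 99.9, 99.95),
    ("service-auth", 99.95, 99.90),
])
print(report)
print(discord_payload(report))
```

Keeping the formatting separate from the HTTP call makes the report easy to test in CI without a webhook URL.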