
[P1] Document SLOs/SLIs for each service #3

@wonderwomancode

Problem

No SLOs (Service Level Objectives) are defined, so we can't measure whether services are "healthy enough" or need improvement.

Acceptance Criteria

  • SLOs defined for all 4 core services
  • SLIs (indicators) measurable with current/planned tooling
  • Error budgets calculated monthly
  • SLO dashboard accessible to team
  • Alert when error budget < 20% remaining

Proposed SLOs

service-cloud-api (GraphQL)

| SLI | Target | Measurement |
|---|---|---|
| Availability | 99.9% (43 min/month downtime) | Health check success rate |
| Latency P95 | < 200ms | Proxy histogram |
| Latency P99 | < 500ms | Proxy histogram |
| Error Rate | < 0.1% | 5xx responses / total |
| Throughput | Handle 100 RPS | Load test baseline |
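
As a rough illustration of how the latency and error-rate SLIs in this table could be checked, here is a minimal Python sketch. It assumes raw latency samples and status codes are available, whereas the issue proposes reading a proxy histogram, so treat it as an approximation of the idea only. The function names and sample data are hypothetical.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding float rounding surprises.
    rank = -(-pct * len(ordered) // 100)
    return ordered[max(rank, 1) - 1]


def error_rate(status_codes):
    """Fraction of responses that returned a 5xx status."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes)


# Hypothetical data: 100 latency samples (1..100 ms) and 2000 responses.
latencies_ms = list(range(1, 101))
codes = [200] * 1999 + [503]

p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
rate = error_rate(codes)

# Targets taken from the service-cloud-api table.
print("P95 ok:", p95 < 200)
print("P99 ok:", p99 < 500)
print("Error rate ok:", rate < 0.001)
```

Nearest-rank is only one of several percentile definitions; a histogram-based estimate from the proxy would interpolate between bucket bounds instead.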

service-auth

| SLI | Target | Measurement |
|---|---|---|
| Availability | 99.95% (22 min/month) | Health check |
| Login Success | > 99% | Success / attempts |
| Token Issuance | < 50ms P95 | Response time |
| Error Rate | < 0.05% | Failed auth / total |

infrastructure-proxy (Pingap)

| SLI | Target | Measurement |
|---|---|---|
| Availability | 99.99% (4 min/month) | TCP health check |
| Latency P99 | < 50ms added | Proxy metrics |
| TLS Handshake | < 100ms | Proxy metrics |
| Error Rate | < 0.01% | 5xx / total |

service-secrets (Infisical)

| SLI | Target | Measurement |
|---|---|---|
| Availability | 99.9% | Health check |
| Secret Fetch | < 100ms P95 | API response |
| Encryption | 100% at rest | Audit |

Error Budget Calculation

Monthly Error Budget = (1 - SLO) × 43,200 minutes (30 days × 24 h × 60 min)

Example: 99.9% availability
Budget = 0.1% × 43,200 = 43.2 minutes of downtime allowed
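
The formula above can be sketched in Python as follows. This is a minimal illustration that assumes a 30-day month (43,200 minutes), as the formula does; the function name is hypothetical.

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200


def monthly_error_budget_minutes(slo_percent, minutes_in_month=MINUTES_PER_30_DAY_MONTH):
    """Minutes of downtime allowed per month for an availability SLO given in percent."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    return (100 - slo_percent) / 100 * minutes_in_month


# Availability targets from the proposed SLO tables.
for service, slo in [
    ("service-cloud-api", 99.9),
    ("service-auth", 99.95),
    ("infrastructure-proxy", 99.99),
    ("service-secrets", 99.9),
]:
    budget = monthly_error_budget_minutes(slo)
    print(f"{service}: {slo}% -> {budget:.2f} min/month")
```

For months that are not 30 days long, pass the actual minute count as `minutes_in_month`.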

Implementation

1. Create SLO documentation

File: SLO.md

# AlternateFutures Service Level Objectives

## service-cloud-api
- Availability: 99.9% (measured by synthetic health checks)
- Latency P95: 200ms (measured by proxy histogram)
...

2. Create SLO tracking workflow

File: .github/workflows/slo-report.yml

name: Weekly SLO Report

on:
  schedule:
    - cron: '0 9 * * 1'  # Monday 9am UTC

jobs:
  report:
    runs-on: ubuntu-latest
    steps:
      - name: Calculate SLIs
        run: |
          # Query Prometheus/metrics for last 7 days
          # Calculate availability, latency percentiles
          
      - name: Generate report
        run: |
          # Compare to SLOs, calculate error budget remaining
          
      - name: Post to Discord
        run: |
          # Send weekly SLO summary
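
The placeholder steps in the workflow above could call a small script along these lines. This is only a sketch: the measured availability values are hypothetical inputs (the issue does not say how they will be queried), the `DISCORD_WEBHOOK_URL` environment variable name and the `User-Agent` value are assumptions, and the only Discord behavior relied on is that a webhook accepts a JSON body with a `content` field.

```python
import json
import os
import urllib.request

# Availability targets copied from the proposed SLO tables.
AVAILABILITY_SLOS = {
    "service-cloud-api": 99.9,
    "service-auth": 99.95,
    "infrastructure-proxy": 99.99,
    "service-secrets": 99.9,
}


def budget_remaining_pct(slo_pct, measured_pct):
    """Share of the error budget still unspent, in percent (negative if overspent)."""
    allowed = 100 - slo_pct
    used = 100 - measured_pct
    return (1 - used / allowed) * 100


def build_report(measured):
    """Compare measured availability (percent, per service) to the SLO targets."""
    lines = ["Weekly SLO Report"]
    for service, slo in AVAILABILITY_SLOS.items():
        actual = measured[service]
        remaining = budget_remaining_pct(slo, actual)
        status = "OK" if actual >= slo else "BREACH"
        warn = " (error budget < 20% remaining)" if remaining < 20 else ""
        lines.append(
            f"{service}: {actual:.3f}% vs {slo}% target, {status}, "
            f"budget left {remaining:.0f}%{warn}"
        )
    return "\n".join(lines)


def post_to_discord(webhook_url, text):
    """POST the report to a Discord webhook as a plain message."""
    body = json.dumps({"content": text}).encode("utf-8")
    request = urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json", "User-Agent": "slo-report/0.1"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical measurements; a real run would query the metrics backend.
    measured = {
        "service-cloud-api": 99.95,
        "service-auth": 99.99,
        "infrastructure-proxy": 99.985,
        "service-secrets": 99.8,
    }
    report = build_report(measured)
    print(report)
    url = os.environ.get("DISCORD_WEBHOOK_URL")  # assumed secret name
    if url:
        post_to_discord(url, report)
```

Because the report covers seven days while the error budget is monthly, the "budget left" figure here describes the reporting window only.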

3. SLO Dashboard (Future)

Once Grafana is deployed:

  • Create SLO dashboard panel
  • Show current vs target
  • Error budget burn rate
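
For the burn-rate panel, one common definition is the observed error ratio divided by the ratio the SLO allows. The Python sketch below illustrates that definition under the assumption that good/bad event counts are available for the window; the function names are hypothetical and this is not tied to any Grafana feature.

```python
def burn_rate(slo_percent, bad_events, total_events):
    """Observed error ratio divided by the allowed error ratio (1 - SLO).

    A value of 1.0 means the budget is being spent exactly as fast as the
    SLO permits; 2.0 means it would run out in half the SLO window.
    """
    if total_events == 0:
        return 0.0
    allowed = (100 - slo_percent) / 100
    if allowed == 0:
        raise ValueError("a 100% SLO leaves no error budget to burn")
    observed = bad_events / total_events
    return observed / allowed


def days_until_exhausted(rate, window_days=30):
    """Days until a full budget is spent at a constant burn rate."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


# Hypothetical example: 2 failed requests out of 1,000 against a 99.9% SLO.
rate = burn_rate(99.9, bad_events=2, total_events=1000)
print(f"burn rate: {rate:.2f}")
print(f"days until budget exhausted: {days_until_exhausted(rate):.1f}")
```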

Required Before Implementation

Testing

# Verify SLIs are measurable
curl -fsS https://api.alternatefutures.ai/health  # Availability (-f: non-zero exit on HTTP errors)
curl -fsS http://proxy:3018/metrics | grep latency  # Latency
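
To turn the manual health check above into an availability SLI, repeated probe results have to be aggregated. The following Python sketch shows one way to do that, assuming a probe counts as successful only on a 2xx response; the function names are hypothetical and the probe schedule is left open.

```python
import urllib.error
import urllib.request


def is_healthy(url, timeout=5.0):
    """Return True if the endpoint answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors, connection failures, and timeouts.
        return False


def availability_pct(results):
    """Percentage of successful probes in a list of booleans."""
    if not results:
        raise ValueError("no probe results")
    return 100 * sum(results) / len(results)


# Hypothetical probe history: 999 successes and 1 failure.
history = [True] * 999 + [False]
print(f"availability: {availability_pct(history):.2f}%")
```

A single probe would be `is_healthy("https://api.alternatefutures.ai/health")`, with each result appended to the history that `availability_pct` summarizes.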

Definition of Done

  • SLO.md created with all service objectives
  • SLIs are measurable (not aspirational)
  • Team reviewed and agreed on targets
  • Error budget tracking implemented
  • Weekly SLO report workflow running
