Enterprise-grade telemetry pipeline with AIOps, SRE, MLOps, and self-healing capabilities — deployed on AWS via Terraform & Kubernetes.
┌─────────────────────────────────────────────────────────────┐
│ Presentation Layer │
│ ┌────────────┐ ┌──────────┐ ┌─────────┐ ┌──────────────┐ │
│ │ REST API │ │ WebSocket│ │ Health │ │ /metrics │ │
│ │ (FastAPI) │ │ Streaming│ │ Probes │ │ (Prometheus) │ │
│ └─────┬──────┘ └────┬─────┘ └────┬────┘ └──────┬───────┘ │
├────────┼──────────────┼───────────┼──────────────┼──────────┤
│ Application Layer │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Ingest │ │ Process │ │ Evaluate │ │ Detect │ │
│ │ Telemetry│ │ Telemetry│ │ SLO │ │ Anomaly │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ ┌──────────┐ ┌──────────┐ │
│ │ Trigger │ │ Retrain │ │
│ │ Remediate│ │ Model │ │
│ └──────────┘ └──────────┘ │
├─────────────────────────────────────────────────────────────┤
│ Domain Layer │
│ Entities: TelemetryEvent, MetricPoint, Alert, SLO │
│ Value Objects: EventPriority, SeverityLevel, CorrelationId │
│ Interfaces: Repository, AnomalyDetector, RemediationCmd │
│ Services: SLOCalculator, ErrorBudgetTracker │
├─────────────────────────────────────────────────────────────┤
│ Infrastructure Layer │
│ ┌───────────────┐ ┌───────────────┐ ┌───────────────────┐ │
│ │ Data Structures│ │ Resilience │ │ AIOps │ │
│ │ PriorityQueue │ │ CircuitBreaker│ │ ZScoreDetector │ │
│ │ RingBuffer │ │ RetryHandler │ │ MovingAvg │ │
│ │ SlidingWindow │ │ RateLimiter │ │ VarianceSpike │ │
│ │ TokenBucket │ │ │ │ DriftDetector │ │
│ │ DeadLetterQ │ │ │ │ LogClusterer │ │
│ └───────────────┘ └───────────────┘ └───────────────────┘ │
│ ┌───────────────┐ ┌───────────────┐ ┌───────────────────┐ │
│ │ Telemetry │ │ Persistence │ │ Healing │ │
│ │ StreamFactory │ │ InMemoryRepo │ │ RestartService │ │
│ │ Simulator │ │ │ │ ScaleReplicas │ │
│ │ ProcEngine │ │ Messaging │ │ OpenCircuitBrkr │ │
│ │ │ │ InMemoryBus │ │ ShedTraffic │ │
│ │ Observability│ │ RedisBus │ │ TriggerRetrain │ │
│ │ PromCollector │ │ │ │ │ │
│ │ StructLogger │ │ MLOps │ │ │ │
│ │ MetricEventBus│ │ ModelRegistry │ │ │ │
│ │ │ │ DriftMonitor │ │ │ │
│ └───────────────┘ └───────────────┘ └───────────────────┘ │
└─────────────────────────────────────────────────────────────┘
| Pattern | Implementation |
|---|---|
| Singleton | ConfigurationManager — thread-safe, double-checked locking |
| Factory | TelemetryStreamFactory — creates typed stream configs |
| Strategy | AnomalyDetector ABC → ZScore, MovingAvg, VarianceSpike |
| Observer | MetricPublisher/MetricSubscriber → PrometheusMetricsCollector |
| Command | RemediationCommand → 5 self-healing actions with undo |
| Repository | TelemetryRepository ABC → InMemoryTelemetryRepository |
| Adapter | MessageBus ABC → InMemoryMessageBus, RedisMessageBus |
| State | CircuitBreaker — CLOSED/OPEN/HALF_OPEN transitions |
| DI Container | Custom IoC container with singleton/transient lifetimes |
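
The Strategy row above can be illustrated with a minimal sketch. The class names `AnomalyDetector`, `ZScoreDetector`, and `MovingAvgDetector` follow the table, but the method name `is_anomaly`, the constructor parameters, and the thresholds are assumptions made for illustration and may not match the project's actual interfaces.

```python
from abc import ABC, abstractmethod
from statistics import mean, pstdev
from typing import Sequence


class AnomalyDetector(ABC):
    """Strategy interface. The method name `is_anomaly` is hypothetical."""

    @abstractmethod
    def is_anomaly(self, history: Sequence[float], value: float) -> bool:
        """Return True if `value` looks anomalous relative to `history`."""


class ZScoreDetector(AnomalyDetector):
    """Flags values that fall outside mean ± k * stddev of the history."""

    def __init__(self, k: float = 3.0) -> None:
        self.k = k

    def is_anomaly(self, history: Sequence[float], value: float) -> bool:
        if len(history) < 2:
            return False  # not enough data to estimate spread
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            return value != mu  # constant history: any change is a deviation
        return abs(value - mu) > self.k * sigma


class MovingAvgDetector(AnomalyDetector):
    """Flags values that deviate from an exponential moving average
    by more than a relative tolerance (assumed parameterization)."""

    def __init__(self, alpha: float = 0.3, tolerance: float = 0.5) -> None:
        self.alpha = alpha
        self.tolerance = tolerance

    def is_anomaly(self, history: Sequence[float], value: float) -> bool:
        if not history:
            return False
        ema = history[0]
        for x in history[1:]:
            ema = self.alpha * x + (1 - self.alpha) * ema
        return abs(value - ema) > self.tolerance * max(abs(ema), 1e-9)


def check(detector: AnomalyDetector, history: Sequence[float], value: float) -> bool:
    """Caller depends only on the interface, so detectors are swappable."""
    return detector.is_anomaly(history, value)
```

Because `check` only sees the abstract interface, a different detector can be selected at runtime (for example from configuration) without touching the calling code.
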
- Language: Python 3.11+ with full type hints
- API: FastAPI + Uvicorn (ASGI)
- Observability: Prometheus + Grafana
- Messaging: Redis Pub/Sub (with in-memory fallback)
- Containerization: Docker (multi-stage) + Docker Compose
- Orchestration: Kubernetes (Deployment, HPA, Ingress)
- IaC: Terraform (AWS VPC, EKS, IAM)
- CI/CD: GitHub Actions
- Load Testing: k6
cloud-native-reliability-platform/
├── src/
│ ├── config/ # Settings, DI container
│ ├── domain/ # Entities, value objects, interfaces, services
│ │ ├── entities/ # TelemetryEvent, MetricPoint, Alert, SLO
│ │ ├── value_objects/ # EventPriority, SeverityLevel, CorrelationId
│ │ ├── interfaces/ # ABCs: Repository, Detector, Command, Registry
│ │ └── services/ # SLOCalculator, ErrorBudgetTracker
│ ├── application/ # Use cases
│ │ └── use_cases/ # Ingest, Process, Evaluate, Detect, Remediate, Retrain
│ ├── infrastructure/ # Implementations
│ │ ├── data_structures/ # PriorityQueue, RingBuffer, SlidingWindow, TokenBucket, DLQ
│ │ ├── telemetry/ # Factory, Simulator, ProcessingEngine
│ │ ├── resilience/ # CircuitBreaker, RetryHandler, RateLimiter
│ │ ├── aiops/ # Anomaly detectors, drift, clustering
│ │ ├── healing/ # Self-healing commands
│ │ ├── observability/ # Prometheus metrics, structured logging
│ │ ├── persistence/ # InMemoryRepository
│ │ ├── messaging/ # Message bus (InMemory, Redis)
│ │ └── mlops/ # Model registry, drift monitor, auto-retrain
│ └── presentation/ # FastAPI app, routers
│ └── api/ # REST, WebSocket, health, SLO endpoints
├── tests/ # Unit tests
├── infrastructure/
│ ├── k8s/ # Kubernetes manifests
│ └── terraform/ # AWS VPC, EKS
├── monitoring/ # Prometheus, Grafana configs
├── .github/workflows/ # CI/CD pipeline
├── Dockerfile # Multi-stage production build
├── docker-compose.yml # Local dev stack
├── k6-load-test.js # Load testing script
├── pyproject.toml # Project metadata
└── requirements.txt # Pinned dependencies
# 1. Clone & install
git clone https://github.com/your-org/cloud-native-reliability-platform.git
cd cloud-native-reliability-platform
pip install -e ".[dev]"
# 2. Run the platform
uvicorn src.presentation.api.app:create_app --factory --reload
# 3. Access endpoints
# API docs: http://localhost:8000/docs
# Health: http://localhost:8000/health/live
# Metrics: http://localhost:8000/metrics
# SLO: http://localhost:8000/slo/status

docker-compose up --build
# Services:
# App: http://localhost:8000
# Prometheus: http://localhost:9090
# Grafana: http://localhost:3000 (admin/platform2024)

pytest tests/ -v --tb=short --cov=src

k6 run k6-load-test.js

| Method | Path | Description |
|---|---|---|
| GET | /health/live | Liveness probe |
| GET | /health/ready | Readiness probe |
| POST | /telemetry/ | Ingest telemetry event |
| GET | /telemetry/{id} | Get event by ID |
| GET | /telemetry/source/{source} | Events by source |
| GET | /telemetry/stats/engine | Processing engine stats |
| GET | /slo/status | SLO compliance status |
| GET | /slo/error-budget | Error budget details |
| WS | /ws/telemetry | Real-time telemetry stream |
| GET | /metrics | Prometheus metrics |
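
The `/slo/status` and `/slo/error-budget` endpoints report against an availability target. The arithmetic behind such figures can be sketched as follows. This is a generic illustration, not the project's `SLOCalculator` or `ErrorBudgetTracker`: the function names and the request-count model are assumptions, and the default target of 0.999 is taken from `PLATFORM_SLO_AVAILABILITY_TARGET`.

```python
def availability(total: int, failed: int) -> float:
    """Fraction of successful requests; an empty window counts as fully available."""
    if total == 0:
        return 1.0
    return 1.0 - failed / total


def error_budget_remaining(total: int, failed: int, target: float = 0.999) -> float:
    """Fraction of the error budget still unspent.

    The budget is the number of failures the target tolerates:
    (1 - target) * total. The result is 1.0 when nothing has failed,
    0.0 when the budget is exactly spent, and negative when overspent.
    """
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    if total == 0:
        return 1.0
    allowed_failures = (1.0 - target) * total
    return 1.0 - failed / allowed_failures
```

With a 99.9% target, 100,000 requests tolerate about 100 failures, so 25 failures leave roughly three quarters of the budget unspent.
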
- Min-Heap Priority Queue — O(log n) push/pop for event ordering by priority
- Ring Buffer — O(1) fixed-capacity circular buffer with eviction tracking
- Sliding Window — Time-based window with automatic eviction and statistical aggregation
- Token Bucket — Thread-safe rate limiting with configurable refill
- Dead Letter Queue — Failed event tracking with retry counting and permanent failure separation
- Z-Score Detection — Statistical anomaly detection (μ ± kσ)
- EMA Detection — Exponential Moving Average deviation detector
- PSI Drift Detection — Population Stability Index for distribution shift monitoring
- Jaccard Clustering — Token-level log line similarity grouping
- Circuit Breaker State Machine — CLOSED → OPEN → HALF_OPEN transitions
- Exponential Backoff with Jitter — Retry strategy to avoid thundering herd
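
The last two items in the list above can be sketched together. This is a minimal illustration under stated assumptions, not the project's `CircuitBreaker` or `RetryHandler`: the parameter names (`failure_threshold`, `reset_timeout`, `base`, `cap`), the injectable `clock`, and the choice of "full jitter" are all hypothetical.

```python
import random
import time
from enum import Enum
from typing import Callable, Optional


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreaker:
    """CLOSED -> OPEN after repeated failures, OPEN -> HALF_OPEN after a
    timeout, HALF_OPEN -> CLOSED on success or back to OPEN on failure."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None
        self._state = State.CLOSED

    @property
    def state(self) -> State:
        # OPEN lazily becomes HALF_OPEN once the reset timeout has elapsed.
        if (
            self._state is State.OPEN
            and self._opened_at is not None
            and self._clock() - self._opened_at >= self.reset_timeout
        ):
            self._state = State.HALF_OPEN
        return self._state

    def allow_request(self) -> bool:
        return self.state is not State.OPEN

    def record_success(self) -> None:
        self._failures = 0
        self._state = State.CLOSED

    def record_failure(self) -> None:
        if self.state is State.HALF_OPEN:
            self._trip()  # the trial request failed: reopen immediately
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._trip()

    def _trip(self) -> None:
        self._state = State.OPEN
        self._opened_at = self._clock()


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    """Full jitter: a uniform delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

Randomizing the delay spreads retries from many clients over time, which is what avoids the thundering-herd effect; the injectable clock is only there to make the state transitions easy to test.
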
| Variable | Default | Description |
|---|---|---|
| PLATFORM_APP_NAME | CloudNativeReliabilityPlatform | Application name |
| PLATFORM_LOG_LEVEL | INFO | Log level |
| PLATFORM_DEBUG | false | Debug mode |
| PLATFORM_REDIS_URL | redis://localhost:6379/0 | Redis connection |
| PLATFORM_TELEMETRY_STREAM_COUNT | 1000 | Number of sim streams |
| PLATFORM_TELEMETRY_FREQUENCY_MS | 100 | Emission frequency |
| PLATFORM_SLO_AVAILABILITY_TARGET | 0.999 | SLO target (99.9%) |
| PLATFORM_CHAOS_ENABLED | false | Enable chaos mode |
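
As an illustration of how the `PLATFORM_`-prefixed variables and their defaults could be read, here is a standard-library sketch. The `Settings` dataclass and `load_settings` function are hypothetical names; the project's own settings module or `ConfigurationManager` may work differently.

```python
import os
from dataclasses import dataclass, fields
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Defaults mirror the table above; field names are assumptions."""

    app_name: str = "CloudNativeReliabilityPlatform"
    log_level: str = "INFO"
    debug: bool = False
    redis_url: str = "redis://localhost:6379/0"
    telemetry_stream_count: int = 1000
    telemetry_frequency_ms: int = 100
    slo_availability_target: float = 0.999
    chaos_enabled: bool = False


def _as_bool(raw: str) -> bool:
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(
    env: Mapping[str, str] = os.environ, prefix: str = "PLATFORM_"
) -> Settings:
    """Build Settings from PLATFORM_* variables, falling back to defaults."""
    defaults = Settings()
    overrides = {}
    for f in fields(Settings):
        raw = env.get(prefix + f.name.upper())
        if raw is None:
            continue
        default = getattr(defaults, f.name)
        # Check bool before int/float: bool is a subclass of int in Python.
        if isinstance(default, bool):
            overrides[f.name] = _as_bool(raw)
        else:
            overrides[f.name] = type(default)(raw)
    return Settings(**overrides)
```

For example, exporting `PLATFORM_DEBUG=true` and `PLATFORM_TELEMETRY_STREAM_COUNT=50` before calling `load_settings()` would override those two fields and leave the rest at their defaults.
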
MIT