omercangumus/cloud-native-reliability-platform

Cloud-Native Intelligent Reliability Platform

Enterprise-grade telemetry pipeline with AIOps, SRE, MLOps, and self-healing capabilities — deployed on AWS via Terraform & Kubernetes.

Python 3.11+ | FastAPI | Terraform | Kubernetes | License: MIT


Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Presentation Layer                        │
│  ┌────────────┐ ┌──────────┐ ┌─────────┐ ┌──────────────┐  │
│  │  REST API   │ │ WebSocket│ │ Health  │ │  /metrics     │  │
│  │  (FastAPI)  │ │ Streaming│ │ Probes  │ │ (Prometheus)  │  │
│  └─────┬──────┘ └────┬─────┘ └────┬────┘ └──────┬───────┘  │
├────────┼──────────────┼───────────┼──────────────┼──────────┤
│                   Application Layer                         │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐       │
│  │ Ingest   │ │ Process  │ │ Evaluate │ │ Detect   │       │
│  │ Telemetry│ │ Telemetry│ │   SLO    │ │ Anomaly  │       │
│  └──────────┘ └──────────┘ └──────────┘ └──────────┘       │
│  ┌──────────┐ ┌──────────┐                                  │
│  │ Trigger  │ │ Retrain  │                                  │
│  │ Remediate│ │  Model   │                                  │
│  └──────────┘ └──────────┘                                  │
├─────────────────────────────────────────────────────────────┤
│                     Domain Layer                            │
│  Entities: TelemetryEvent, MetricPoint, Alert, SLO          │
│  Value Objects: EventPriority, SeverityLevel, CorrelationId │
│  Interfaces: Repository, AnomalyDetector, RemediationCmd    │
│  Services: SLOCalculator, ErrorBudgetTracker                │
├─────────────────────────────────────────────────────────────┤
│                  Infrastructure Layer                       │
│  ┌───────────────┐ ┌───────────────┐ ┌───────────────────┐  │
│  │ Data Structures│ │   Resilience  │ │     AIOps         │  │
│  │ PriorityQueue │ │ CircuitBreaker│ │  ZScoreDetector   │  │
│  │ RingBuffer    │ │ RetryHandler  │ │  MovingAvg        │  │
│  │ SlidingWindow │ │ RateLimiter   │ │  VarianceSpike    │  │
│  │ TokenBucket   │ │               │ │  DriftDetector    │  │
│  │ DeadLetterQ   │ │               │ │  LogClusterer     │  │
│  └───────────────┘ └───────────────┘ └───────────────────┘  │
│  ┌───────────────┐ ┌───────────────┐ ┌───────────────────┐  │
│  │  Telemetry    │ │  Persistence  │ │     Healing       │  │
│  │ StreamFactory │ │ InMemoryRepo  │ │  RestartService   │  │
│  │ Simulator     │ │               │ │  ScaleReplicas    │  │
│  │ ProcEngine    │ │  Messaging    │ │  OpenCircuitBrkr  │  │
│  │               │ │ InMemoryBus   │ │  ShedTraffic      │  │
│  │  Observability│ │ RedisBus      │ │  TriggerRetrain   │  │
│  │ PromCollector │ │               │ │                   │  │
│  │ StructLogger  │ │    MLOps      │ │                   │  │
│  │ MetricEventBus│ │ ModelRegistry │ │                   │  │
│  │               │ │ DriftMonitor  │ │                   │  │
│  └───────────────┘ └───────────────┘ └───────────────────┘  │
└─────────────────────────────────────────────────────────────┘

Design Patterns & Principles

Pattern        Implementation
Singleton      ConfigurationManager (thread-safe, double-checked locking)
Factory        TelemetryStreamFactory (creates typed stream configs)
Strategy       AnomalyDetector ABC → ZScore, MovingAvg, VarianceSpike
Observer       MetricPublisher/MetricSubscriber → PrometheusMetricsCollector
Command        RemediationCommand → 5 self-healing actions with undo
Repository     TelemetryRepository ABC → InMemoryTelemetryRepository
Adapter        MessageBus ABC → InMemoryMessageBus, RedisMessageBus
State          CircuitBreaker (CLOSED/OPEN/HALF_OPEN transitions)
DI Container   Custom IoC container with singleton/transient lifetimes
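
As an illustration of the Strategy row, here is a minimal sketch of interchangeable detectors behind one abstract base class. It is not the project's actual code: the method name `is_anomaly`, the constructor parameters, and the `detect` helper are assumptions made for this example, and the moving-average variant uses a plain windowed mean with an absolute deviation limit.

```python
from abc import ABC, abstractmethod
from statistics import fmean, pstdev
from typing import Sequence


class AnomalyDetector(ABC):
    """Strategy interface shared by all detectors (method name is hypothetical)."""

    @abstractmethod
    def is_anomaly(self, history: Sequence[float], value: float) -> bool:
        """Return True if `value` looks anomalous given `history`."""


class ZScoreDetector(AnomalyDetector):
    """Flags values farther than k standard deviations from the history mean."""

    def __init__(self, k: float = 3.0) -> None:
        self.k = k

    def is_anomaly(self, history: Sequence[float], value: float) -> bool:
        if len(history) < 2:
            return False  # too little data to estimate the spread
        mu = fmean(history)
        sigma = pstdev(history)
        if sigma == 0.0:
            return value != mu  # constant history: any change is a deviation
        return abs(value - mu) > self.k * sigma


class MovingAverageDetector(AnomalyDetector):
    """Flags values that stray too far from the mean of the last `window` points."""

    def __init__(self, window: int = 5, max_deviation: float = 5.0) -> None:
        self.window = window
        self.max_deviation = max_deviation

    def is_anomaly(self, history: Sequence[float], value: float) -> bool:
        recent = history[-self.window:]
        if not recent:
            return False
        return abs(value - fmean(recent)) > self.max_deviation


def detect(detector: AnomalyDetector, history: Sequence[float], value: float) -> bool:
    """Caller code depends only on the interface, so detectors can be swapped freely."""
    return detector.is_anomaly(history, value)
```

Because `detect` only sees the abstract type, a new detector (for example a variance-spike check) can be added without touching the calling code.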

Tech Stack

  • Language: Python 3.11+ with full type hints
  • API: FastAPI + Uvicorn (ASGI)
  • Observability: Prometheus + Grafana
  • Messaging: Redis Pub/Sub (with in-memory fallback)
  • Containerization: Docker (multi-stage) + Docker Compose
  • Orchestration: Kubernetes (Deployment, HPA, Ingress)
  • IaC: Terraform (AWS VPC, EKS, IAM)
  • CI/CD: GitHub Actions
  • Load Testing: k6

Project Structure

cloud-native-reliability-platform/
├── src/
│   ├── config/              # Settings, DI container
│   ├── domain/              # Entities, value objects, interfaces, services
│   │   ├── entities/        # TelemetryEvent, MetricPoint, Alert, SLO
│   │   ├── value_objects/   # EventPriority, SeverityLevel, CorrelationId
│   │   ├── interfaces/      # ABCs: Repository, Detector, Command, Registry
│   │   └── services/        # SLOCalculator, ErrorBudgetTracker
│   ├── application/         # Use cases
│   │   └── use_cases/       # Ingest, Process, Evaluate, Detect, Remediate, Retrain
│   ├── infrastructure/      # Implementations
│   │   ├── data_structures/ # PriorityQueue, RingBuffer, SlidingWindow, TokenBucket, DLQ
│   │   ├── telemetry/       # Factory, Simulator, ProcessingEngine
│   │   ├── resilience/      # CircuitBreaker, RetryHandler, RateLimiter
│   │   ├── aiops/           # Anomaly detectors, drift, clustering
│   │   ├── healing/         # Self-healing commands
│   │   ├── observability/   # Prometheus metrics, structured logging
│   │   ├── persistence/     # InMemoryRepository
│   │   ├── messaging/       # Message bus (InMemory, Redis)
│   │   └── mlops/           # Model registry, drift monitor, auto-retrain
│   └── presentation/        # FastAPI app, routers
│       └── api/             # REST, WebSocket, health, SLO endpoints
├── tests/                   # Unit tests
├── infrastructure/
│   ├── k8s/                 # Kubernetes manifests
│   └── terraform/           # AWS VPC, EKS
├── monitoring/              # Prometheus, Grafana configs
├── .github/workflows/       # CI/CD pipeline
├── Dockerfile               # Multi-stage production build
├── docker-compose.yml       # Local dev stack
├── k6-load-test.js          # Load testing script
├── pyproject.toml            # Project metadata
└── requirements.txt          # Pinned dependencies

Quick Start

Local Development

# 1. Clone & install
git clone https://github.com/omercangumus/cloud-native-reliability-platform.git
cd cloud-native-reliability-platform
pip install -e ".[dev]"

# 2. Run the platform
uvicorn src.presentation.api.app:create_app --factory --reload

# 3. Access endpoints
#    API docs: http://localhost:8000/docs
#    Health:   http://localhost:8000/health/live
#    Metrics:  http://localhost:8000/metrics
#    SLO:      http://localhost:8000/slo/status

Docker Compose (Full Stack)

docker-compose up --build

# Services:
#   App:        http://localhost:8000
#   Prometheus: http://localhost:9090
#   Grafana:    http://localhost:3000 (admin/platform2024)

Run Tests

pytest tests/ -v --tb=short --cov=src

Load Testing

k6 run k6-load-test.js

API Endpoints

Method  Path                        Description
GET     /health/live                Liveness probe
GET     /health/ready               Readiness probe
POST    /telemetry/                 Ingest telemetry event
GET     /telemetry/{id}             Get event by ID
GET     /telemetry/source/{source}  Events by source
GET     /telemetry/stats/engine     Processing engine stats
GET     /slo/status                 SLO compliance status
GET     /slo/error-budget           Error budget details
WS      /ws/telemetry               Real-time telemetry stream
GET     /metrics                    Prometheus metrics
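
A small client sketch for the ingest endpoint, using only the standard library. The request body schema is not documented here, so the event fields in the usage note (`source`, `metric`, `value`) are hypothetical placeholders; check the generated API docs at /docs for the real model. The helper name `build_ingest_request` is also an assumption for this example.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # local address used in Quick Start


def build_ingest_request(event: dict, base_url: str = BASE_URL) -> Request:
    """Build (but do not send) a POST /telemetry/ request carrying `event` as JSON."""
    body = json.dumps(event).encode("utf-8")
    return Request(
        url=f"{base_url}/telemetry/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To send it against a running instance, pass the result to `urllib.request.urlopen`, for example `urlopen(build_ingest_request({"source": "demo", "metric": "latency_ms", "value": 42.0}))`, remembering that those field names are illustrative only.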

Key Algorithms & Data Structures

  • Min-Heap Priority Queue — O(log n) push/pop for event ordering by priority
  • Ring Buffer — O(1) fixed-capacity circular buffer with eviction tracking
  • Sliding Window — Time-based window with automatic eviction and statistical aggregation
  • Token Bucket — Thread-safe rate limiting with configurable refill
  • Dead Letter Queue — Failed event tracking with retry counting and permanent failure separation
  • Z-Score Detection — Statistical anomaly detection (μ ± kσ)
  • EMA Detection — Exponential Moving Average deviation detector
  • PSI Drift Detection — Population Stability Index for distribution shift monitoring
  • Jaccard Clustering — Token-level log line similarity grouping
  • Circuit Breaker State Machine — CLOSED → OPEN → HALF_OPEN transitions
  • Exponential Backoff with Jitter — Retry strategy to avoid thundering herd
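
To make the circuit breaker item concrete, here is a minimal sketch of the state machine with the usual closing edge (HALF_OPEN back to CLOSED on success, back to OPEN on failure). It is an illustration under stated assumptions, not the repository's implementation: the class layout, the parameter names `failure_threshold` and `recovery_timeout`, and the injectable `clock` are all hypothetical, and the sketch is not thread-safe.

```python
import time
from enum import Enum
from typing import Callable


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        recovery_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,  # injectable for tests
    ) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._state = State.CLOSED
        self._failures = 0
        self._opened_at = 0.0

    @property
    def state(self) -> State:
        # OPEN becomes HALF_OPEN once the recovery timeout has elapsed.
        if (
            self._state is State.OPEN
            and self._clock() - self._opened_at >= self.recovery_timeout
        ):
            self._state = State.HALF_OPEN
        return self._state

    def allow_request(self) -> bool:
        """Calls pass in CLOSED and HALF_OPEN (trial call) and are rejected in OPEN."""
        return self.state is not State.OPEN

    def record_success(self) -> None:
        # A successful call, including a HALF_OPEN trial, closes the breaker.
        self._failures = 0
        self._state = State.CLOSED

    def record_failure(self) -> None:
        current = self.state
        if current is State.OPEN:
            return  # already open; nothing more to count
        if current is State.HALF_OPEN:
            self._trip()  # the trial call failed, so reopen immediately
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._trip()

    def _trip(self) -> None:
        self._state = State.OPEN
        self._opened_at = self._clock()
```

Callers would check `allow_request()` before each downstream call and report the outcome with `record_success()` or `record_failure()`. Passing a fake clock makes the timeout transitions deterministic in unit tests.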

Environment Variables

Variable                          Default                         Description
PLATFORM_APP_NAME                 CloudNativeReliabilityPlatform  Application name
PLATFORM_LOG_LEVEL                INFO                            Log level
PLATFORM_DEBUG                    false                           Debug mode
PLATFORM_REDIS_URL                redis://localhost:6379/0        Redis connection URL
PLATFORM_TELEMETRY_STREAM_COUNT   1000                            Number of simulated telemetry streams
PLATFORM_TELEMETRY_FREQUENCY_MS   100                             Emission frequency, in milliseconds
PLATFORM_SLO_AVAILABILITY_TARGET  0.999                           Availability SLO target (99.9%)
PLATFORM_CHAOS_ENABLED            false                           Enable chaos mode
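
The variables above could be read into a typed settings object roughly as follows. This is a standard-library sketch and not the project's configuration code, which may use a different mechanism. The `Settings` class, its field names, the `from_env` method, and the accepted boolean spellings are assumptions; only the variable names and defaults come from the table.

```python
import os
from dataclasses import dataclass, fields
from typing import Mapping

PREFIX = "PLATFORM_"


def _as_bool(raw: str) -> bool:
    # Accepted truthy spellings are an assumption of this sketch.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    # Defaults mirror the table; field names are the variable names
    # without the PLATFORM_ prefix, lowercased.
    app_name: str = "CloudNativeReliabilityPlatform"
    log_level: str = "INFO"
    debug: bool = False
    redis_url: str = "redis://localhost:6379/0"
    telemetry_stream_count: int = 1000
    telemetry_frequency_ms: int = 100
    slo_availability_target: float = 0.999
    chaos_enabled: bool = False

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "Settings":
        """Override defaults with any PLATFORM_* variables present in `env`."""
        overrides = {}
        for f in fields(cls):
            raw = env.get(PREFIX + f.name.upper())
            if raw is None:
                continue
            # bool("false") would be True, so booleans need explicit parsing.
            overrides[f.name] = _as_bool(raw) if f.type is bool else f.type(raw)
        return cls(**overrides)
```

Calling `Settings.from_env()` with no arguments reads the real process environment, while passing a plain dict keeps tests independent of it.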

License

MIT
