Skip to content

Latest commit

 

History

History
267 lines (213 loc) · 10.3 KB

File metadata and controls

267 lines (213 loc) · 10.3 KB

Cluster Autoscaler Implementation Summary

What Was Implemented

A complete Kubernetes Cluster Autoscaler solution for automatic node scaling (1-8 nodes) managed via GitOps.

Changes Made

1. Terraform Updates

  • File: infrastructure/terraform/prod/modules/kubernetes/variables.tf

    • Updated asg_max_size from 5 to 8 nodes
  • File: infrastructure/terraform/prod/variables.tf

    • Updated asg_max_size from 5 to 8 nodes

2. GitOps Manifests Created

Base Manifests (gitops/base/cluster-autoscaler/):

  • namespace.yaml - Creates cluster-autoscaler namespace
  • serviceaccount.yaml - Service account for the autoscaler
  • rbac.yaml - ClusterRole, Role, and bindings with required permissions
  • deployment.yaml - Autoscaler deployment with optimized configuration
  • service.yaml - Service for metrics exposure
  • podmonitor.yaml - Prometheus monitoring integration
  • kustomization.yaml - Kustomize configuration
  • README.md - Component documentation

ArgoCD Configuration:

  • gitops/apps/cluster-autoscaler.yaml - ArgoCD Application definition
  • gitops/argocd/projects/infrastructure.yaml - Updated to include cluster-autoscaler namespace

3. Documentation

  • infrastructure/docs/cluster-autoscaler-setup-guide.md - Complete setup and operations guide
  • infrastructure/docs/asg-ssm-setup-guide.md - Updated with autoscaler information
  • CLUSTER_AUTOSCALER_DEPLOYMENT.md - Step-by-step deployment checklist

Key Features

Intelligent Scaling

  • Scale Up: Automatic when pods can't be scheduled (2-3 minutes)
  • Scale Down: Automatic when nodes idle for 10+ minutes below 50% utilization
  • Anti-thrashing: 10-minute delay after scale-up prevents rapid cycling
  • Graceful draining: Respects PodDisruptionBudgets and pod constraints

Configuration

Min nodes:                    1
Max nodes:                    8
Scale-down threshold:         50% utilization
Scale-down unneeded time:     10 minutes
Scale-down delay after add:   10 minutes
Max node provision time:      15 minutes
Expander strategy:            least-waste

Kubernetes-Aware

  • ✅ Monitors pending pods
  • ✅ Respects node taints and labels
  • ✅ Honors pod affinity/anti-affinity
  • ✅ Respects PodDisruptionBudgets
  • ✅ Gracefully drains nodes
  • ✅ Auto-discovers ASGs via tags

Monitoring

  • ✅ Prometheus metrics on port 8085
  • ✅ PodMonitor for automatic scraping
  • ✅ Health checks (liveness/readiness)
  • ✅ Detailed logging (v=4)

Security

  • ✅ Runs on control plane node
  • ✅ Non-root user (65534)
  • ✅ Read-only root filesystem
  • ✅ No privilege escalation
  • ✅ Dropped capabilities
  • ✅ RBAC with minimal required permissions

GitOps Integration

  • ✅ Fully declarative configuration
  • ✅ Auto-sync enabled
  • ✅ Self-healing enabled
  • ✅ Version controlled
  • ✅ Easy to modify and rollback

Prerequisites Already Met

✅ ASG configured with autoscaler tags:

  • k8s.io/cluster-autoscaler/enabled=true
  • k8s.io/cluster-autoscaler/kubestock=owned

✅ IAM permissions already configured in kubestock-node-role:

  • AutoScaling permissions (SetDesiredCapacity, TerminateInstance, Describe*)
  • EC2 permissions (DescribeInstances, DescribeInstanceTypes, etc.)

✅ Nodes auto-join via golden AMI and SSM parameters

✅ Token refresh Lambda keeps join-token current

Deployment Steps

  1. Update Terraform (apply ASG max capacity change):

    cd infrastructure/terraform/prod
    terraform plan
    terraform apply
  2. Commit and push GitOps changes:

    git add gitops/ infrastructure/
    git commit -m "Add Kubernetes Cluster Autoscaler"
    git push
  3. Deploy via ArgoCD:

    kubectl apply -f gitops/apps/cluster-autoscaler.yaml
  4. Verify deployment:

    kubectl get pods -n cluster-autoscaler
    kubectl logs -n cluster-autoscaler -l app=cluster-autoscaler -f

Testing Commands

Scale Up Test:

kubectl create deployment scale-test --image=nginx --replicas=20
kubectl set resources deployment scale-test --requests=cpu=500m,memory=512Mi
watch kubectl get nodes

Scale Down Test:

kubectl delete deployment scale-test
watch kubectl get nodes  # Wait 10+ minutes

View Logs:

kubectl logs -n cluster-autoscaler -l app=cluster-autoscaler -f

View Metrics:

kubectl port-forward -n cluster-autoscaler svc/cluster-autoscaler 8085:8085
curl http://localhost:8085/metrics

Monitoring Metrics

Key metrics available:

  • cluster_autoscaler_nodes_count - Current node count by state
  • cluster_autoscaler_unschedulable_pods_count - Pods waiting for resources
  • cluster_autoscaler_scaled_up_nodes_total - Total scale-up events
  • cluster_autoscaler_scaled_down_nodes_total - Total scale-down events
  • cluster_autoscaler_failed_scale_ups_total - Failed scale attempts
  • cluster_autoscaler_last_activity - Timestamp of last scaling action

Cost Optimization

The autoscaler will:

  • ✅ Automatically remove idle nodes after 10 minutes (saves ~85% of idle costs)
  • ✅ Use least-waste strategy to minimize over-provisioning
  • ✅ Balance nodes across AZs for efficient resource usage
  • ✅ Scale based on actual demand, not static schedules

Example savings:

  • Without autoscaler: 5 nodes × 24h × $0.05/h = $6/day
  • With autoscaler (avg 2-3 nodes): 2.5 nodes × 24h × $0.05/h = $3/day
  • Savings: ~50% on compute costs during low-demand periods

Troubleshooting Quick Reference

Issue Check Command
Pod not running kubectl get pods -n cluster-autoscaler
View errors kubectl logs -n cluster-autoscaler -l app=cluster-autoscaler
Check ASG aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names kubestock-workers-asg
Node utilization kubectl top nodes
Status ConfigMap kubectl get cm cluster-autoscaler-status -n cluster-autoscaler -o yaml

Documentation References

  1. Setup Guide: infrastructure/docs/cluster-autoscaler-setup-guide.md
  2. ASG Documentation: infrastructure/docs/asg-ssm-setup-guide.md
  3. Deployment Checklist: CLUSTER_AUTOSCALER_DEPLOYMENT.md
  4. Official Docs: https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler

Success Indicators

After deployment, you should see:

  • ✅ Autoscaler pod running in cluster-autoscaler namespace
  • ✅ No errors in logs
  • ✅ Status ConfigMap shows cluster discovered
  • ✅ Scale-up test adds nodes within 2-3 minutes
  • ✅ Scale-down test removes nodes after 10+ minutes
  • ✅ Prometheus metrics available at :8085/metrics
  • ✅ ArgoCD Application shows "Healthy" and "Synced"

Architecture Diagram

┌──────────────────────────────────────────────────────────────────┐
│                     Control Plane Node                            │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │  Cluster Autoscaler (Deployment)                           │  │
│  │  - Monitors: Pending pods, node utilization                │  │
│  │  - Decides: When to scale up/down                          │  │
│  │  - Acts: Modifies ASG desired capacity                     │  │
│  └────────────────────────────────────────────────────────────┘  │
└──────────────────────┬───────────────────────┬───────────────────┘
                       │                       │
            Scale Up   │                       │  Scale Down
                       ▼                       ▼
┌──────────────────────────────────────────────────────────────────┐
│              AWS Auto Scaling Group (1-8 nodes)                   │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐        │
│  │ Worker 1 │  │ Worker 2 │  │ Worker N │  │ (Future) │        │
│  │ (Ready)  │  │ (Ready)  │  │ (Ready)  │  │ (Scaled) │        │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘        │
│                                                                   │
│  Tags:                                                            │
│  - k8s.io/cluster-autoscaler/enabled = true                      │
│  - k8s.io/cluster-autoscaler/kubestock = owned                   │
└──────────────────────────────────────────────────────────────────┘
         │
         │ Golden AMI + SSM Join
         ▼
   Nodes auto-join cluster

Next Steps

After successful deployment:

  1. Monitor scaling behavior for a few days
  2. Adjust thresholds if needed based on workload patterns
  3. Set up Grafana dashboards for visualization
  4. Configure HPA for pod-level autoscaling (complements Cluster Autoscaler)
  5. Consider spot instances for cost savings on non-critical workloads

Rollback Plan

If issues occur:

# Delete ArgoCD Application
kubectl delete application -n argocd cluster-autoscaler

# Delete namespace
kubectl delete namespace cluster-autoscaler

# Revert Terraform (if needed)
cd infrastructure/terraform/prod
# Edit variables.tf to set max_size back to 5
terraform apply

Implementation Date: December 14, 2025
Version: 1.0
Autoscaler Version: v1.31.0
Kubernetes Version: v1.31+