PRD: PodDisruptionBudgets require HA (2+ replicas) for stateful services #2497

@groundnuty

Context

During the dome-prod capacity stabilization work (PRs #2491, #2492), we identified that none of the 16 single-replica StatefulSets in production have PodDisruptionBudgets (PDBs). This means any voluntary disruption — node drain, cluster upgrade, autoscaler scale-down — can kill critical databases and stateful services without warning.

However, adding PDBs without first addressing the single-replica problem creates a different problem: on a single pod, a PDB either blocks maintenance entirely or protects nothing. This issue documents the full picture.

What is a PDB and why we need one

A PDB tells Kubernetes: "during voluntary disruptions, keep at least N pods running." Without PDBs:

  • kubectl drain (node maintenance) evicts pods immediately
  • Cluster upgrades kill pods one node at a time with no availability guarantee
  • Autoscaler can remove nodes hosting critical databases

With PDBs, the eviction API respects availability constraints. But PDBs only work meaningfully with 2+ replicas — you can't keep "at least 1 of 1" running while also evicting it.
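
As a minimal sketch of what such a budget looks like, the manifest below asks Kubernetes to keep at least one pod of a workload running during voluntary disruptions. The name, namespace, and label selector are hypothetical placeholders, not objects that exist in dome-prod.

```yaml
# Hypothetical example: name, namespace, and labels are placeholders.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-service-pdb
  namespace: example-namespace
spec:
  # An eviction is refused if it would leave fewer than 1 healthy pod.
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: example-service
```

With 2 replicas this budget permits one eviction at a time; with 1 replica it permits none.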

Current state (dome-prod)

Existing PDBs (3)

| PDB | Namespace | MaxUnavailable |
| --- | --- | --- |
| dome-dss-vault-server | in2 | 1 |
| wallet-vault-server | in2 | 1 |
| elasticsearch-master-pdb | search-engine | 1 |

StatefulSets without PDBs (16, all single-replica)

Critical (data loss / outage risk):

| StatefulSet | Namespace | Replicas | Impact if evicted |
| --- | --- | --- | --- |
| mysql-til | marketplace | 1 | TIL, CCS, trusted-issuers-list all go down |
| bae-marketplace-biz-ecosystem-logic-proxy | marketplace | 1 | Marketplace frontend unavailable |
| cs-identity-keycloak | cs-identity | 1 | Authentication broken for all users |
| cs-identity-postgresql | cs-identity | 1 | Keycloak loses its database |

Medium (service degradation):

| StatefulSet | Namespace | Replicas | Impact if evicted |
| --- | --- | --- | --- |
| elasticsearch-master (zammad) | zammad | 1 | Ticketing search broken |
| zammad-postgresql | zammad | 1 | Ticketing data unavailable |
| zammad2-postgresql | zammad2 | 1 | Second ticketing instance down |
| zammad2-elasticsearch-master | zammad2 | 1 | Second ticketing search broken |
| loki-distributed-ingester | loki-distributed | 1 | Log ingestion stops |

Low (non-critical):

| StatefulSet | Namespace | Replicas | Impact if evicted |
| --- | --- | --- | --- |
| argocd-application-controller | argocd | 1 | GitOps paused (no new deploys) |
| mysql-knowledgebase | knowledgebase | 1 | Bookstack unavailable |
| dekra-postgres | dome-certification | 1 | Certification service down |
| zammad-redis-master | zammad | 1 | Ticketing sessions lost |
| zammad2-redis-master | zammad2 | 1 | Second ticketing sessions lost |
| wallet-vault-server | in2 | 1 | Wallet vault (has PDB already) |
| prometheus/alertmanager | kube-prometheus-stack | 1 | Monitoring gap |

Multi-replica deployments without PDBs

| Deployment | Namespace | Replicas | Notes |
| --- | --- | --- | --- |
| zammad-railsserver | zammad | 2 | PDB would work here, but non-critical |
| zammad2-railsserver | zammad2 | 2 | Same |
| coredns | kube-system | 2 | Cluster DNS, should have PDB |

The single-replica PDB dilemma

With 1 replica, there are only two PDB options, both problematic:

Option A: maxUnavailable: 0

  • Effect: Pod can never be voluntarily evicted
  • Problem: Blocks all node drains, cluster upgrades, and autoscaler operations. Operations team must manually delete the PDB before any maintenance, then recreate it after.
  • When it makes sense: Never, for routine operations. Only as a temporary "do not touch" signal.

Option B: maxUnavailable: 1

  • Effect: Allows eviction of the single pod (same as having no PDB)
  • Problem: Provides no actual protection, because the single pod can still be evicted at any time. The only benefit is that a disruption budget is declared explicitly for the workload, which may help where audits expect one.
  • When it makes sense: Compliance/audit requirements only.

Conclusion: PDBs on single-replica services are not a solution. The prerequisite is scaling to 2+ replicas.
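
To make the two options concrete, here is a sketch of both as manifests for mysql-til. The PDB names and the label selector are assumptions for illustration; the real pod labels should be checked before use. The two manifests are alternatives and are not meant to be applied together.

```yaml
# Option A (sketch): the single pod can never be voluntarily evicted.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: mysql-til-pdb-option-a   # hypothetical name
  namespace: marketplace
spec:
  maxUnavailable: 0              # draining the hosting node is blocked while this PDB exists
  selector:
    matchLabels:
      app.kubernetes.io/instance: mysql-til   # assumed label, verify on the pod
---
# Option B (sketch): eviction is always allowed, same outcome as no PDB.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: mysql-til-pdb-option-b   # hypothetical name
  namespace: marketplace
spec:
  maxUnavailable: 1              # the only pod may be evicted at any time
  selector:
    matchLabels:
      app.kubernetes.io/instance: mysql-til   # assumed label, verify on the pod
```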

What each critical service needs for proper HA + PDB

1. mysql-til (marketplace)

  • Current: Bitnami MySQL 8.0.31, standalone, 1 replica
  • Required changes:
    • Enable MySQL replication: architecture: replication in values.yaml
    • Configure secondary.replicaCount: 1
    • Set primary.pdb.create: true, primary.pdb.maxUnavailable: 0
    • Set secondary.pdb.create: true, secondary.pdb.maxUnavailable: 0
    • Verify applications (TIL, CCS) handle failover (connect to MySQL service, not pod directly)
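  • Risk: Replication setup requires data migration or a maintenance window. Read-after-write consistency must be verified. With one primary and one secondary, maxUnavailable: 0 on each PDB blocks drains in the same way as Option A above, so node maintenance still needs a manual step.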
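
The settings listed above could be combined into a values fragment like the following. This is a sketch that takes the key names from the list as given; they should be checked against the values.yaml of the Bitnami MySQL chart version actually deployed, and nested under the subchart key if MySQL is installed as a dependency of another chart.

```yaml
# Sketch of Helm values for mysql-til; key names taken from the list above.
architecture: replication   # switch from standalone to primary + secondary
primary:
  pdb:
    create: true
    maxUnavailable: 0
secondary:
  replicaCount: 1
  pdb:
    create: true
    maxUnavailable: 0
```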

2. cs-identity-keycloak

  • Current: Bitnami Keycloak 24.0.4, 1 replica
  • Required changes:
    • Set replicaCount: 2
    • Set pdb.create: true, pdb.minAvailable: 1
    • Keycloak natively supports clustering via Infinispan — should work with multiple replicas out of the box
    • Verify session replication works (distributed caches)
  • Risk: Low — Keycloak is designed for HA. May need cache.stack: kubernetes or similar JGroups/DNS_PING config.
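
As a sketch, the Keycloak changes above might look like this in the chart values. The key names come from the list above and are not verified against the deployed chart version; the cache setting is left commented out because its exact key is uncertain.

```yaml
# Sketch of Helm values for cs-identity-keycloak; key names taken from the list above.
replicaCount: 2
pdb:
  create: true
  minAvailable: 1      # with 2 replicas, one pod may be evicted at a time
# Possibly needed for clustering; the exact key name is unverified:
# cache:
#   stack: kubernetes
```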

3. cs-identity-postgresql

  • Current: Bitnami PostgreSQL 16.3.0, primary only
  • Required changes:
    • Enable replication: architecture: replication in values.yaml
    • Configure readReplicas.replicaCount: 1
    • Set primary.pdb.create: true, primary.pdb.maxUnavailable: 0
    • Keycloak connects to the primary service — no app changes needed
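  • Risk: Streaming replication is straightforward for PostgreSQL. Needs a short maintenance window for initial replica sync. As with Option A above, maxUnavailable: 0 on the single primary blocks drains of the node hosting it.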
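
A values fragment for these PostgreSQL changes might look like the sketch below. It assumes the key names in the list above match the deployed Bitnami PostgreSQL chart version, which should be confirmed before applying.

```yaml
# Sketch of Helm values for cs-identity-postgresql; key names taken from the list above.
architecture: replication   # primary plus streaming read replicas
readReplicas:
  replicaCount: 1
primary:
  pdb:
    create: true
    maxUnavailable: 0
```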

4. bae-marketplace-biz-ecosystem-logic-proxy

  • Current: StatefulSet, 1 replica, managed by bae-marketplace Helm chart
  • Required changes:
    • Increase replicas in BAE chart values
    • Verify the logic-proxy supports horizontal scaling (stateless request handling with shared MongoDB backend)
    • Add PDB via chart values or standalone manifest
  • Risk: Need to verify the app doesn't use local state/sessions. MongoDB is the shared backend, so this should scale horizontally.

Recommended approach

Phase 1 — Quick wins (no replication needed)

  • Keycloak → scale to 2 replicas + PDB (natively supports clustering)
  • coredns → add PDB minAvailable: 1 (already 2 replicas, just needs PDB)
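
For the coredns item, a standalone manifest could be as small as the sketch below. The PDB name is hypothetical, and the selector assumes the CoreDNS pods carry the common k8s-app: kube-dns label, which should be checked against the pods in kube-system first.

```yaml
# Sketch of a PDB for coredns; name and selector label are assumptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: coredns-pdb            # hypothetical name
  namespace: kube-system
spec:
  minAvailable: 1              # with 2 replicas, at most one pod is evicted at a time
  selector:
    matchLabels:
      k8s-app: kube-dns        # assumed label, verify on the running pods
```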

Phase 2 — Database HA (requires planning + maintenance window)

  • cs-identity-postgresql → enable streaming replication + PDB
  • mysql-til → enable MySQL replication + PDB

Phase 3 — Remaining services

  • logic-proxy → scale + PDB (verify statelessness first)
  • Loki ingester → scale to 2 + PDB (Loki supports this natively)
  • Zammad/Zammad2 databases — only if ticketing becomes critical

Related PRs
