Controller: InferencePolicy/ToolPolicy watch goes stale on AKS — CR edits don't reach the ConfigMap without a controller restart #451

Description

@pallakatos

Summary

On AKS, the controller's reconcilers intermittently stop propagating CR spec edits to their compiled ConfigMaps. Editing an InferencePolicy (e.g. via kubectl patch or kars policy) bumps metadata.generation, but the controller keeps recompiling the pre-edit object on its 15s requeue. The ConfigMap only updates after a controller restart, which forces a fresh List.

This is separate from the router hot-reload bug fixed in #450 / v0.1.14 (that one is the router side; this is the controller side).

Evidence (live, kars-aks, 2026-06-24)

  • Patched palkarstest-inference requirePromptShields from true to false, and kubectl get confirmed false in etcd. The controller logged InferencePolicyCompiled every 15s with a frozen version_hash/compiled_digest, and the ConfigMap stayed at true (resourceVersion unchanged) until a rollout restart, after which it compiled false.
  • Reconcile logs showed generation: 3 (fresh metadata) yet a frozen compiled hash — i.e. reconciling fresh metadata against a stale reflector store spec.
  • Leader election healthy (single leader; standby = 0 reconciles); no panics; no watch errors logged; lease renewing normally. The reconcile loop ran in bursts with multi-minute gaps.
  • Propagation was intermittent: sometimes it took minutes, sometimes it never happened without a restart.

Likely cause

All controllers use watcher::Config::default() (kube-rs 3.1.0) with no watch timeout. On AKS, konnectivity/LB can silently drop an idle watch connection (no FIN/RST); kube-rs doesn't detect the dead stream and never re-lists, so the reflector store freezes while the 15s requeue keeps reconciling stale objects.

Suggested investigation / fix

  • Add a bounded watch timeout (e.g. watcher::Config::default().timeout(~290)) and/or TCP keepalive / HTTP2 ping on the kube client so a silently-dropped watch is re-established (relist) within a bounded window.
  • Consider a reconcile-activity watchdog that relists (or restarts) if no events arrive for N minutes.
  • Repro with RUST_LOG=kube=debug on AKS to capture the watch lifecycle.
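
The reconcile-activity watchdog idea above could be sketched roughly as follows. This is a minimal std-only illustration, not the controller's code: WatchWatchdog, WatchdogAction, and the 300s idle threshold are hypothetical names and values, and wiring it to the real watch stream (calling record_event for each watcher event, and triggering a relist or restart on Relist) is left as an assumption.

```rust
use std::time::{Duration, Instant};

/// What the caller should do after a watchdog check (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum WatchdogAction {
    /// An event arrived recently enough; nothing to do.
    Healthy,
    /// No events for at least `max_idle`; the watch may be dead, so relist.
    Relist,
}

/// Tracks when the last watch event arrived (hypothetical helper).
struct WatchWatchdog {
    max_idle: Duration,
    last_event: Instant,
}

impl WatchWatchdog {
    fn new(max_idle: Duration, now: Instant) -> Self {
        Self { max_idle, last_event: now }
    }

    /// Call for every event received from the watch stream.
    fn record_event(&mut self, now: Instant) {
        self.last_event = now;
    }

    /// Call periodically, e.g. from the existing 15s requeue path.
    fn check(&mut self, now: Instant) -> WatchdogAction {
        if now.saturating_duration_since(self.last_event) >= self.max_idle {
            // Reset the timer so one idle period triggers a single relist
            // instead of one relist per tick.
            self.last_event = now;
            WatchdogAction::Relist
        } else {
            WatchdogAction::Healthy
        }
    }
}

fn main() {
    let start = Instant::now();
    // Assumed threshold: 5 minutes without any watch event.
    let mut wd = WatchWatchdog::new(Duration::from_secs(300), start);

    // Simulated timeline: an event at 60s, then silence.
    wd.record_event(start + Duration::from_secs(60));
    for secs in [120u64, 300, 360, 375] {
        let action = wd.check(start + Duration::from_secs(secs));
        println!("t={}s -> {:?}", secs, action);
    }
}
```

Passing the current time in explicitly keeps the logic testable without sleeping. One caveat with this approach: a genuinely quiet cluster also produces no events, so the watchdog would relist periodically even on a healthy connection. That costs an extra List but should not affect correctness.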

Impact

kars policy / InferencePolicy / ToolPolicy / EgressApproval live edits may not apply until a controller restart. Does not affect first-time deploys (the first reconcile compiles correctly from the initial List).

Labels

bug (Something isn't working)