Summary
On AKS, the controller's reconcilers intermittently stop reflecting CR spec edits into their compiled ConfigMaps. Editing an InferencePolicy (e.g. kubectl patch / kars policy) bumps metadata.generation, but the controller keeps recompiling the pre-edit object on its 15s requeue; the ConfigMap only updates after a controller restart (which forces a fresh List).
This is separate from the router hot-reload bug fixed in #450 / v0.1.14 (that one is the router side; this is the controller side).
Evidence (live, kars-aks, 2026-06-24)
- Patched palkarstest-inference requirePromptShields: true→false; kubectl get confirmed false in etcd. The controller kept logging InferencePolicyCompiled every 15s with a frozen version_hash/compiled_digest, and the ConfigMap stayed true (resourceVersion unchanged) until a rollout restart, after which it compiled false.
- Reconcile logs showed generation: 3 (fresh metadata) yet a frozen compiled hash, i.e. the reconciler was pairing fresh metadata with a stale spec from the reflector store.
- Leader election healthy (single leader; standby = 0 reconciles); no panics; no watch errors logged; lease renewing normally. The reconcile loop ran in bursts with multi-minute gaps.
- Propagation was intermittent: sometimes an edit took minutes to apply, sometimes it never applied without a restart.
Likely cause
All controllers use watcher::Config::default() (kube-rs 3.1.0) with no watch timeout. On AKS, konnectivity/LB can silently drop an idle watch connection (no FIN/RST); kube-rs doesn't detect the dead stream and never re-lists, so the reflector store freezes while the 15s requeue keeps reconciling stale objects.
Suggested investigation / fix
- Add a bounded watch timeout (e.g. watcher::Config::default().timeout(~290)) and/or TCP keepalive / HTTP2 ping on the kube client so a silently-dropped watch is re-established (relist) within a bounded window.
- Consider a reconcile-activity watchdog that relists (or restarts) if no events arrive for N minutes.
- Repro with RUST_LOG=kube=debug on AKS to capture the watch lifecycle.
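The reconcile-activity watchdog suggested above could be sketched roughly as follows. This is a std-only illustration, not the controller's code: WatchActivityWatchdog, Verdict, and the 300s threshold are hypothetical names and values chosen for the example, and wiring it into the kube-rs watch stream is left out. The one design point it tries to capture is that the watchdog must be fed from items arriving on the watch stream, not from reconcile invocations, because the 15s requeue keeps firing even while the reflector store is frozen.

```rust
use std::time::{Duration, Instant};

/// Hypothetical watchdog: remembers when the watch stream last yielded
/// anything and reports when it has been silent for too long.
#[derive(Debug)]
struct WatchActivityWatchdog {
    last_event: Instant,
    max_silence: Duration,
}

/// Result of a watchdog check.
#[derive(Debug, PartialEq, Eq)]
enum Verdict {
    /// The stream produced something within the allowed window.
    Healthy,
    /// The stream has been silent longer than allowed; the caller should
    /// force a relist (or restart the watcher).
    Stale { silent_for: Duration },
}

impl WatchActivityWatchdog {
    fn new(now: Instant, max_silence: Duration) -> Self {
        Self { last_event: now, max_silence }
    }

    /// Call for every item the watch stream yields.
    /// Do NOT call this from the timed requeue: reconciles keep running
    /// against a frozen store, so they say nothing about stream health.
    fn record_event(&mut self, now: Instant) {
        self.last_event = now;
    }

    /// Call from a periodic timer that is independent of the watch stream.
    fn check(&self, now: Instant) -> Verdict {
        let silent_for = now.saturating_duration_since(self.last_event);
        if silent_for > self.max_silence {
            Verdict::Stale { silent_for }
        } else {
            Verdict::Healthy
        }
    }
}

fn main() {
    let start = Instant::now();
    // Assumed threshold: 300s of silence before forcing a relist.
    let mut wd = WatchActivityWatchdog::new(start, Duration::from_secs(300));

    // An event arrives 60s in; a check at 120s is still within the window.
    wd.record_event(start + Duration::from_secs(60));
    println!("{:?}", wd.check(start + Duration::from_secs(120)));

    // No further events; a check at 600s reports the stream as stale.
    println!("{:?}", wd.check(start + Duration::from_secs(600)));
}
```

Timestamps are passed in explicitly so the logic can be exercised without sleeping. On a genuinely quiet cluster a threshold like this may trigger spurious relists; that is wasteful but harmless, which is the trade-off the watchdog makes against a silently frozen store.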
Impact
Live edits to InferencePolicy / ToolPolicy / EgressApproval (including those made via kars policy) may not apply until a controller restart. First-time deploys are not affected (the first reconcile compiles correctly from the initial List).