fix(router): hot-reload ConfigMap edits (follow symlinks) + kars connect wait-for-ready (v0.1.14)#450

Merged: pallakatos merged 1 commit into main from fix/router-configmap-hot-reload on Jun 24, 2026
Conversation

@pallakatos

Collaborator

The bug (live-confirmed in production)

Router policy edits silently did not take effect until a pod restart. Root cause: all four router change-detection watchers (InferencePolicy, EgressAllowlist, KarsMemory, AGT governance) used dir_max_mtime built on DirEntry::metadata() — an lstat that does not follow symlinks.

Kubernetes projects ConfigMap mounts as per-key symlinks into an atomically-swapped ..data dir. On update kubelet swaps ..data but never recreates the per-key symlink, so its lstat mtime is frozen at pod start. The 5s poll therefore never saw a change → the router kept enforcing the boot-time policy.
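The lstat-versus-stat distinction above can be checked directly. The following is a minimal sketch, not code from this PR: `lstat_vs_stat` is a hypothetical helper that looks up one entry of a directory and reports what `DirEntry::metadata()` and `std::fs::metadata()` each see for it.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper (not from the PR): for the entry called `name` in `dir`,
/// return (DirEntry::metadata() says "symlink", fs::metadata() says "regular file").
pub fn lstat_vs_stat(dir: &Path, name: &str) -> io::Result<(bool, bool)> {
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        if entry.file_name().to_str() != Some(name) {
            continue;
        }
        // DirEntry::metadata() does not traverse symlinks (lstat semantics),
        // so for a per-key symlink it describes the link itself.
        let via_entry = entry.metadata()?;
        // fs::metadata() follows symlinks (stat semantics),
        // so it describes the file the link resolves to.
        let via_path = fs::metadata(entry.path())?;
        return Ok((via_entry.file_type().is_symlink(), via_path.is_file()));
    }
    Err(io::Error::new(io::ErrorKind::NotFound, "no such entry"))
}
```

For a symlink pointing at a regular file this returns `(true, true)`: the two calls are describing different inodes, which is why their mtimes can disagree.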

Live proof: patched an InferencePolicy → controller updated the ConfigMap in 22s → router logged no reload in 120s. Only a pod bump applied it.

Blast radius

Every router-enforced live edit was ignored until pod restart: prompt shields, content-safety floors, token budgets, model prefs, egress allowlists/approvals, memory bindings, AGT governance. This is a fail-open gap: tightening a policy did not apply.

The fix

One shared config_mount::dir_max_mtime(dir, exts) that stats via std::fs::metadata(e.path()) (follows symlinks → real file mtime, which advances on every ..data swap). All four watchers delegate to it so it can't diverge again.
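A rough sketch of what such a shared helper could look like follows. Only the name `dir_max_mtime(dir, exts)` and the use of `std::fs::metadata` come from this PR; the parameter types, the `io::Result<Option<SystemTime>>` return type, and the decision to skip unresolvable entries are assumptions made for illustration.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::time::SystemTime;

/// Newest mtime among the regular files in `dir` whose extension is in `exts`.
/// Assumed signature: the PR states only `dir_max_mtime(dir, exts)`.
pub fn dir_max_mtime(dir: &Path, exts: &[&str]) -> io::Result<Option<SystemTime>> {
    let mut newest: Option<SystemTime> = None;
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        let wanted = path
            .extension()
            .and_then(|e| e.to_str())
            .map_or(false, |e| exts.contains(&e));
        if !wanted {
            continue;
        }
        // fs::metadata follows symlinks, so this is the target file's mtime.
        // Assumption: an entry that cannot be resolved (for example a symlink
        // that is dangling for a moment) is skipped instead of failing the poll.
        let Ok(meta) = fs::metadata(&path) else { continue };
        if !meta.is_file() {
            continue;
        }
        let mtime = meta.modified()?;
        newest = Some(newest.map_or(mtime, |n| n.max(mtime)));
    }
    Ok(newest)
}
```

A watcher would then compare the returned value against the one from the previous poll and reload when it differs, for example `dir_max_mtime(Path::new("/etc/policy"), &["yaml", "json"])` (path and extensions are placeholders).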

Why the existing tests missed it

watcher_reloads_on_mtime_change writes a plain file, so its mtime changes and the lstat bug never shows. The new detects_configmap_data_symlink_swap reproduces kubelet's atomic ..data symlink rename and is proven to FAIL on the old lstat code and PASS on the fix.
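The swap such a regression test has to reproduce can be sketched as below. This is an illustration under stated assumptions, not the body of detects_configmap_data_symlink_swap: `simulate_data_swap` and `write_version` are hypothetical helpers, `..v1` / `..v2` stand in for kubelet's timestamped version directories, and the file mtimes are pinned explicitly so the outcome does not depend on clock resolution. It is Unix-only and needs Rust 1.75+ for `File::set_modified`.

```rust
use std::fs::{self, File};
use std::io;
use std::os::unix::fs::symlink;
use std::path::Path;
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Write `<mount>/<version>/policy.yaml` and pin its mtime.
fn write_version(mount: &Path, version: &str, body: &str, mtime: SystemTime) -> io::Result<()> {
    let dir = mount.join(version);
    fs::create_dir_all(&dir)?;
    let file = dir.join("policy.yaml");
    fs::write(&file, body)?;
    File::options().write(true).open(&file)?.set_modified(mtime)
}

/// Build a ConfigMap-style mount, swap `..data`, and return
/// (was the per-key symlink's lstat mtime unchanged?, stat mtime after the swap).
pub fn simulate_data_swap(mount: &Path) -> io::Result<(bool, SystemTime)> {
    let t1 = UNIX_EPOCH + Duration::from_secs(1_000_000);
    let t2 = UNIX_EPOCH + Duration::from_secs(2_000_000);

    // Initial projection: policy.yaml -> ..data/policy.yaml, ..data -> ..v1
    write_version(mount, "..v1", "budget: 100\n", t1)?;
    symlink("..v1", mount.join("..data"))?;
    let key = mount.join("policy.yaml");
    symlink("..data/policy.yaml", &key)?;
    let lstat_before = fs::symlink_metadata(&key)?.modified()?;

    // Update: write the new version, then atomically rename a temporary
    // symlink over ..data. The per-key symlink is never touched.
    write_version(mount, "..v2", "budget: 50\n", t2)?;
    symlink("..v2", mount.join("..data_tmp"))?;
    fs::rename(mount.join("..data_tmp"), mount.join("..data"))?;

    let lstat_after = fs::symlink_metadata(&key)?.modified()?;
    let stat_after = fs::metadata(&key)?.modified()?;
    Ok((lstat_before == lstat_after, stat_after))
}
```

The first element of the result is what an lstat-based poll sees (no change), and the second is what a stat-based poll sees (the new file's mtime).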

Also: kars connect first-run footguns

kars connect now waits for the pod to be Running + agent Ready before port-forwarding (fail-fast on ImagePull/CrashLoop) and auto-picks a free local port. This fixes the 'pod is not running. status=Pending' and 'address already in use' errors.
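Both behaviours can be sketched in a few lines. The verification notes suggest the CLI itself is TypeScript, so the Rust below only illustrates the technique and is not the CLI's code. `pick_free_local_port`, `next_step`, and `Next` are hypothetical names, and the exact set of container waiting reasons treated as fatal is an assumption.

```rust
use std::io;
use std::net::{Ipv4Addr, TcpListener};

/// Ask the OS for an unused loopback port by binding to port 0.
/// The listener is dropped on return, so another process could still grab
/// the port before the port-forward binds it; callers should retry on failure.
pub fn pick_free_local_port() -> io::Result<u16> {
    let listener = TcpListener::bind((Ipv4Addr::LOCALHOST, 0))?;
    Ok(listener.local_addr()?.port())
}

#[derive(Debug, PartialEq, Eq)]
pub enum Next {
    Proceed,
    KeepWaiting,
    FailFast(String),
}

/// Decide what a wait loop should do for one observed pod status.
/// `waiting_reason` is the agent container's waiting reason, if any.
pub fn next_step(phase: &str, agent_ready: bool, waiting_reason: Option<&str>) -> Next {
    match waiting_reason {
        // Assumed fatal reasons: waiting longer will not fix these.
        Some(r @ ("ErrImagePull" | "ImagePullBackOff" | "CrashLoopBackOff")) => {
            Next::FailFast(r.to_string())
        }
        _ if phase == "Running" && agent_ready => Next::Proceed,
        _ => Next::KeepWaiting,
    }
}
```

A caller would poll the pod status, feed each observation to `next_step`, and start the port-forward on the picked port only after `Next::Proceed`.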

Verification

  • 944 router lib tests pass (939 + 5 new); clippy clean.
  • 821 CLI vitest pass; tsc + oxlint clean.
  • Security audit: docs/internal/security-audits/2026-06-25-router-configmap-hot-reload.md (2 sign-offs).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ect wait-for-ready (v0.1.14)

Router watchers never detected ConfigMap updates: dir_max_mtime used
DirEntry::metadata() (lstat), which reads the frozen mtime of the per-key
symlink kubelet lays down — not the target. kubelet swaps ..data atomically
without recreating the symlink, so the 5s mtime poll saw no change and the
router silently enforced the BOOT-TIME policy until a pod restart. Affected
InferencePolicy (prompt shields, content-safety floors, token budgets, model
prefs), EgressAllowlist, KarsMemory bindings, and AGT governance policies.

Fix: shared config_mount::dir_max_mtime(dir, exts) that stats via
std::fs::metadata (follows symlinks). All four watchers delegate to it so it
can't diverge again. Regression test reproduces kubelet's atomic ..data
symlink swap — proven to FAIL on the old lstat code and PASS on the fix
(the existing test missed it by writing a plain file).

Also: kars connect now waits for the pod to be Running + agent container
Ready before port-forwarding (fail-fast on ImagePull/CrashLoop) and
auto-picks a free local port — fixes the 'pod is not running. status=Pending'
and 'address already in use' first-run footguns.

Security audit: docs/internal/security-audits/2026-06-25-router-configmap-hot-reload.md (2 sign-offs).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions


Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

@pallakatos pallakatos merged commit c5649f9 into main Jun 24, 2026
35 checks passed
@pallakatos pallakatos deleted the fix/router-configmap-hot-reload branch June 24, 2026 22:56