fix(router): hot-reload ConfigMap edits (follow symlinks) + kars connect wait-for-ready (v0.1.14) #450
Merged
…ect wait-for-ready (v0.1.14)

Router watchers never detected ConfigMap updates: dir_max_mtime used DirEntry::metadata() (lstat), which reads the frozen mtime of the per-key symlink kubelet lays down, not the target. kubelet swaps ..data atomically without recreating the symlink, so the 5s mtime poll saw no change and the router silently enforced the BOOT-TIME policy until a pod restart. Affected InferencePolicy (prompt shields, content-safety floors, token budgets, model prefs), EgressAllowlist, KarsMemory bindings, and AGT governance policies.

Fix: shared config_mount::dir_max_mtime(dir, exts) that stats via std::fs::metadata (follows symlinks). All four watchers delegate to it so it can't diverge again. Regression test reproduces kubelet's atomic ..data symlink swap; it is proven to FAIL on the old lstat code and PASS on the fix (the existing test missed it by writing a plain file).

Also: kars connect now waits for the pod to be Running + agent container Ready before port-forwarding (fail-fast on ImagePull/CrashLoop) and auto-picks a free local port. This fixes the 'pod is not running. status=Pending' and 'address already in use' first-run footguns.

Security audit: docs/internal/security-audits/2026-06-25-router-configmap-hot-reload.md (2 sign-offs).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Dependency Review: ✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found. Scanned files: none.
The bug (live-confirmed in production)
Router policy edits silently did not take effect until a pod restart. Root cause: all four router change-detection watchers (`InferencePolicy`, `EgressAllowlist`, `KarsMemory`, AGT governance) used `dir_max_mtime` built on `DirEntry::metadata()`, an `lstat` that does not follow symlinks.

Kubernetes projects ConfigMap mounts as per-key symlinks into an atomically swapped `..data` dir. On update kubelet swaps `..data` but never recreates the per-key symlink, so its `lstat` mtime is frozen at pod start. The 5s poll therefore never saw a change, and the router kept enforcing the boot-time policy.

Live proof: patched an `InferencePolicy`; the controller updated the ConfigMap in 22s; the router logged no reload in 120s. Only a pod bump applied it.

Blast radius
Every router-enforced live edit was ignored until pod restart: prompt shields, content-safety floors, token budgets, model prefs, egress allowlists/approvals, memory bindings, AGT governance. This is a fail-open gap (tightening a policy didn't apply).
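The root cause can be reproduced outside a cluster. The sketch below is illustrative only and is not the router's code: it builds a temp directory shaped like a ConfigMap mount (a hypothetical `policy.yaml` key and made-up `..v1`/`..v2` version dirs), performs the `..data` swap with a rename, and compares an `lstat`-style scan (`fs::symlink_metadata`) with one that follows symlinks (`fs::metadata`). The `follow` flag, `simulate_swap`, and the other helper names are assumptions for the demo. It is Unix-only and assumes Rust 1.75+ for `File::set_modified`.

```rust
use std::fs::{self, OpenOptions};
use std::io;
use std::os::unix::fs::symlink;
use std::path::Path;
use std::time::{Duration, SystemTime};

/// Newest mtime among entries of `dir` whose extension is in `exts`.
/// `follow = false` stats the symlink itself (lstat); `follow = true`
/// stats whatever the symlink currently resolves to.
fn dir_max_mtime(dir: &Path, exts: &[&str], follow: bool) -> io::Result<Option<SystemTime>> {
    let mut newest: Option<SystemTime> = None;
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        let wanted = path
            .extension()
            .and_then(|e| e.to_str())
            .map_or(false, |e| exts.contains(&e));
        if !wanted {
            continue;
        }
        let meta = if follow { fs::metadata(&path)? } else { fs::symlink_metadata(&path)? };
        let mtime = meta.modified()?;
        if newest.map_or(true, |n| mtime > n) {
            newest = Some(mtime);
        }
    }
    Ok(newest)
}

/// Write one "version" directory holding policy.yaml with a fixed mtime.
fn write_version(root: &Path, name: &str, body: &str, mtime: SystemTime) -> io::Result<()> {
    let dir = root.join(name);
    fs::create_dir(&dir)?;
    let file = dir.join("policy.yaml");
    fs::write(&file, body)?;
    OpenOptions::new().write(true).open(&file)?.set_modified(mtime)
}

#[derive(Debug)]
struct Observed {
    lstat_before: Option<SystemTime>,
    lstat_after: Option<SystemTime>,
    stat_before: Option<SystemTime>,
    stat_after: Option<SystemTime>,
}

/// Lay out `root` like a ConfigMap mount, swap `..data`, and record both readings.
/// `root` must exist and be empty.
fn simulate_swap(root: &Path) -> io::Result<Observed> {
    let t0 = SystemTime::UNIX_EPOCH + Duration::from_secs(1_700_000_000);
    let t1 = t0 + Duration::from_secs(60);

    // Initial projection: ..data -> ..v1, policy.yaml -> ..data/policy.yaml
    write_version(root, "..v1", "token_budget: 100\n", t0)?;
    symlink("..v1", root.join("..data"))?;
    symlink("..data/policy.yaml", root.join("policy.yaml"))?;
    let lstat_before = dir_max_mtime(root, &["yaml"], false)?;
    let stat_before = dir_max_mtime(root, &["yaml"], true)?;

    // Update: new version dir, then an atomic rename over ..data.
    // The per-key symlink policy.yaml is never touched.
    write_version(root, "..v2", "token_budget: 10\n", t1)?;
    symlink("..v2", root.join("..data_tmp"))?;
    fs::rename(root.join("..data_tmp"), root.join("..data"))?;
    let lstat_after = dir_max_mtime(root, &["yaml"], false)?;
    let stat_after = dir_max_mtime(root, &["yaml"], true)?;

    Ok(Observed { lstat_before, lstat_after, stat_before, stat_after })
}

fn main() -> io::Result<()> {
    let root = std::env::temp_dir().join(format!("cm-swap-demo-{}", std::process::id()));
    let _ = fs::remove_dir_all(&root);
    fs::create_dir_all(&root)?;
    let seen = simulate_swap(&root)?;
    println!("lstat changed: {}", seen.lstat_before != seen.lstat_after);
    println!("stat changed:  {}", seen.stat_before != seen.stat_after);
    fs::remove_dir_all(&root)
}
```

In this layout the `lstat` reading is the same before and after the swap, while the followed reading moves from the first fixed timestamp to the second, which is the difference the 5s poll depends on.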
The fix
One shared `config_mount::dir_max_mtime(dir, exts)` that stats via `std::fs::metadata(e.path())` (follows symlinks, so it reads the real file mtime, which advances on every `..data` swap). All four watchers delegate to it so it can't diverge again.

Why the existing tests missed it
`watcher_reloads_on_mtime_change` writes a plain file (mtime changes). The new `detects_configmap_data_symlink_swap` reproduces kubelet's atomic `..data` symlink rename; it is proven to FAIL on the old `lstat` code and PASS on the fix.

Also: `kars connect` first-run footguns
Waits for the pod to be Running + agent Ready before port-forward (fail-fast on ImagePull/CrashLoop), and auto-picks a free local port. This fixes `pod is not running. status=Pending` and `address already in use`.

Verification
Security audit: `docs/internal/security-audits/2026-06-25-router-configmap-hot-reload.md` (2 sign-offs).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
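For readers unfamiliar with the `kars connect` pre-flight described above, here is a minimal sketch of the two ideas, assuming nothing about the actual implementation. `PodState`, `wait_until_ready`, and `pick_free_port` are hypothetical names; the `probe` closure stands in for whatever Kubernetes status lookup the CLI really performs, and binding to port 0 is one common way to let the OS choose a free port.

```rust
use std::io;
use std::net::TcpListener;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical, simplified view of what a status probe might report.
#[derive(Debug, Clone, PartialEq)]
enum PodState {
    Pending,
    Ready,
    /// Unrecoverable waiting reason, e.g. an image pull or crash loop.
    Failing(String),
}

/// Ask the OS for an unused local port by binding to port 0.
/// The listener is dropped on return, so another process could still
/// grab the port before the caller binds it; callers should handle that.
fn pick_free_port() -> io::Result<u16> {
    let listener = TcpListener::bind(("127.0.0.1", 0))?;
    Ok(listener.local_addr()?.port())
}

/// Poll `probe` until it reports Ready, fail fast on Failing,
/// and give up once `timeout` has elapsed.
fn wait_until_ready<F>(mut probe: F, timeout: Duration, interval: Duration) -> Result<(), String>
where
    F: FnMut() -> PodState,
{
    let deadline = Instant::now() + timeout;
    loop {
        match probe() {
            PodState::Ready => return Ok(()),
            PodState::Failing(reason) => return Err(format!("pod cannot start: {reason}")),
            PodState::Pending => {}
        }
        if Instant::now() >= deadline {
            return Err("timed out waiting for pod to become ready".to_string());
        }
        thread::sleep(interval);
    }
}

fn main() -> io::Result<()> {
    // Fake probe: Pending twice, then Ready.
    let mut polls = 0;
    let outcome = wait_until_ready(
        || {
            polls += 1;
            if polls < 3 { PodState::Pending } else { PodState::Ready }
        },
        Duration::from_secs(5),
        Duration::from_millis(10),
    );
    println!("wait outcome: {outcome:?} after {polls} polls");

    let port = pick_free_port()?;
    println!("would port-forward on 127.0.0.1:{port}");
    Ok(())
}
```

Separating "still pending" from "will never start" is what makes the fail-fast behavior possible: the loop only keeps waiting for states that can still resolve on their own.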