feat: MCP out-of-the-box (egress auto-derive + session keepalive), kars update, CI fix #481
Merged
The MCP forwarder spoke request/response only: it POSTed tools/call and read the immediate response, but never opened the standalone GET /mcp SSE stream nor answered server-initiated pings. Heartbeating MCP servers (Playwright MCP runs one: HTTP transport, runHeartbeat=true) send the client a JSON-RPC ping every ~3s and call server.close() if no pong arrives within PLAYWRIGHT_MCP_PING_TIMEOUT_MS (default 5000). So every session was reaped ~5s after creation; the next tools/call got 404 "Session not found", the forwarder re-initialized, and the retry landed on a brand-new blank browser context, so the agent saw about:blank mid-task (navigate/click state lost).

Fix: act as a well-formed MCP client. For each stateful session, spawn a keepalive task that holds the standalone GET SSE stream open and replies pong to server pings, keeping the session (and the agent's live page) alive. The task is cancelled/replaced when the session is re-initialized.

Also:
- Tighten the session-loss classifier so it only triggers on genuine 4xx hard signals (never on healthy 2xx tool output that merely mentions "session", e.g. browser_evaluate returning sessionStorage), since a false positive is destructive for stateful servers.
- Carry the triggering status+body on CallAttempt::SessionLost so the re-init log records exactly why a session was deemed lost.

Tests: keepalive_holds_get_stream_and_pongs_server_ping plus the existing stateful/session-loss suite. 1048 router tests pass; clippy + fmt clean.

Validated live on AKS: both sandbox routers hold a persistent GET stream to Playwright that survives well past the 5s reaper; zero session-loss.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
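The two pure decisions in this commit (how to answer a server ping, and when a response really means the session is gone) can be sketched without any networking. This is a minimal illustration, not the router's code: RpcId, pong_for and is_session_lost are hypothetical names, and the exact rule (404, or another 4xx whose body says "session not found") is an assumption based on the 404 "Session not found" response quoted above. A JSON-RPC ping is answered with an empty result that echoes the request id.

```rust
/// JSON-RPC request id: a number or a string (hypothetical type for this sketch).
#[derive(Debug, Clone, PartialEq)]
enum RpcId {
    Num(i64),
    Str(String),
}

/// Build the reply to a server-initiated `ping`: an empty result echoing the id.
fn pong_for(id: &RpcId) -> String {
    let id_json = match id {
        RpcId::Num(n) => n.to_string(),
        // Minimal escaping for the sketch; a real client would use a JSON library.
        RpcId::Str(s) => format!("\"{}\"", s.replace('\\', "\\\\").replace('"', "\\\"")),
    };
    format!(r#"{{"jsonrpc":"2.0","id":{},"result":{{}}}}"#, id_json)
}

/// Decide whether a response is a hard "session is gone" signal.
/// Assumed rule: only 4xx responses qualify, and a 2xx body that merely
/// mentions "session" (e.g. sessionStorage output) never does.
fn is_session_lost(status: u16, body: &str) -> bool {
    match status {
        404 => true,
        400..=499 => body.to_ascii_lowercase().contains("session not found"),
        _ => false,
    }
}
```

Keeping the classifier a pure function of status and body also makes it cheap to log exactly which pair triggered a re-initialization.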
cargo-audit and cargo-deny on main fail on RUSTSEC-2026-0190 — unsoundness in anyhow's Error::downcast_mut() (UB when downcasting a context-wrapped error). Fixed in 1.0.103. Lockfile-only bump; no API changes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…r URLs
Adding an MCP server should 'just work': the per-pod router is the only
network path to it, so the sandbox's default-deny NetworkPolicy must admit
the router→MCP hop. Until now the operator also had to hand-write a
networkPolicy.allowedEndpoints entry for every MCP; miss it and calls to an
in-cluster MCP silently time out.
The controller now parses each referenced McpServer's spec.url and emits the
correct egress rule automatically:
- in-cluster Service DNS (*.svc.cluster.local) -> namespaceSelector rule.
Under Cilium a K8s NetworkPolicy ipBlock (even 0.0.0.0/0) only matches the
reserved 'world' entity and never an in-cluster pod, so an ipBlock rule
would never admit traffic to another pod — namespaceSelector is the only
form that works.
- external host, non-443 -> coarse port-level ipBlock (router enforces host).
- external host, 443 -> already covered by the blanket HTTPS rule.
New helpers mcp_url_host_port / cluster_internal_namespace / mcp_egress_rule
with unit tests; the reconcile loop walks effective_mcp_server_refs and adds
de-duplicated rules. Missing/unparseable referents are logged and skipped.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
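The three-way classification above can be sketched as a pure function from URL to rule. The function names reuse the helper names from the commit message, but the bodies, the EgressRule enum and the example hostnames are illustrative assumptions, not the controller's code; the sketch handles only http/https URLs with a plain host (no userinfo, no IPv6 literals) and treats the label just before .svc.cluster.local as the namespace.

```rust
/// Simplified rule shapes (hypothetical; not the real Kubernetes types).
#[derive(Debug, PartialEq)]
enum EgressRule {
    /// In-cluster Service: admit by namespace selector.
    NamespaceSelector { namespace: String, port: u16 },
    /// External host on a non-443 port: coarse port-level ipBlock.
    PortIpBlock { port: u16 },
    /// External host on 443: the blanket HTTPS rule already covers it.
    CoveredByHttps,
}

/// Split an http(s) URL into (host, port), defaulting the port from the scheme.
fn mcp_url_host_port(url: &str) -> Option<(String, u16)> {
    let (scheme, rest) = url.split_once("://")?;
    let default_port = match scheme {
        "http" => 80,
        "https" => 443,
        _ => return None,
    };
    // The authority ends at the first '/', '?' or '#'.
    let authority = rest.split(|c: char| c == '/' || c == '?' || c == '#').next()?;
    if authority.is_empty() || authority.contains('@') || authority.starts_with('[') {
        return None; // unsupported in this sketch
    }
    match authority.rsplit_once(':') {
        Some((host, port)) if !host.is_empty() => {
            Some((host.to_ascii_lowercase(), port.parse::<u16>().ok()?))
        }
        Some(_) => None,
        None => Some((authority.to_ascii_lowercase(), default_port)),
    }
}

/// For `<service>.<namespace>.svc.cluster.local`, return the namespace.
fn cluster_internal_namespace(host: &str) -> Option<String> {
    let prefix = host.strip_suffix(".svc.cluster.local")?;
    let (service, namespace) = prefix.rsplit_once('.')?;
    if service.is_empty() || namespace.is_empty() {
        return None;
    }
    Some(namespace.to_string())
}

/// Pick the egress rule for one MCP server URL; None means "log and skip".
fn mcp_egress_rule(url: &str) -> Option<EgressRule> {
    let (host, port) = mcp_url_host_port(url)?;
    if let Some(namespace) = cluster_internal_namespace(&host) {
        return Some(EgressRule::NamespaceSelector { namespace, port });
    }
    Some(if port == 443 {
        EgressRule::CoveredByHttps
    } else {
        EgressRule::PortIpBlock { port }
    })
}
```

Returning None for anything unparseable mirrors the "logged and skipped" behaviour described above, and de-duplication can then be done on the resulting rule values.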
Every kars run now does a quiet, cached check (at most once per day) of the npm registry and prints a one-line notice when a newer @kars-runtime/cli has been published, with a short changelog summary pulled from the GitHub release. The new 'kars update' command is the explicit, always-fresh path: it shows the changelog and offers to install (npm install -g @kars-runtime/cli@latest), with --check (report-only, non-zero exit) and --yes (non-interactive) modes. Best-effort and unobtrusive by design: hard 1.5s network timeout, all errors swallowed, result cached in ~/.kars/update-check.json, never prompts after an arbitrary command, and silent in CI / non-TTY / when KARS_NO_UPDATE_CHECK=1. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
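The "quiet, cached, silent in CI" gating can be expressed as one small predicate. The sketch below is written in Rust to match the rest of the repository, although the CLI itself is an npm package; RunContext, its fields and should_check_for_update are hypothetical names, and how each flag is detected (for example from KARS_NO_UPDATE_CHECK or the cached timestamp in ~/.kars/update-check.json) is left out.

```rust
use std::time::Duration;

/// At most one registry check per day, as described above.
const CHECK_INTERVAL: Duration = Duration::from_secs(24 * 60 * 60);

/// Inputs to the decision (hypothetical struct for this sketch).
struct RunContext {
    in_ci: bool,
    stdout_is_tty: bool,
    /// True when KARS_NO_UPDATE_CHECK=1 is set.
    opted_out: bool,
    /// Time since the cached check, or None if no cache entry exists.
    since_last_check: Option<Duration>,
}

/// Decide whether this run should contact the registry at all.
fn should_check_for_update(ctx: &RunContext) -> bool {
    if ctx.in_ci || !ctx.stdout_is_tty || ctx.opted_out {
        return false; // stay silent in CI, non-TTY, or when opted out
    }
    match ctx.since_last_check {
        None => true,
        Some(age) => age >= CHECK_INTERVAL,
    }
}
```

Deciding this before any network call means the 1.5s timeout only ever applies to the at-most-daily check, not to ordinary runs.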
Helm v4 applies releases server-side, so a Deployment field owned by another field manager (e.g. a manual 'kubectl set env HERMES_RUNTIME_IMAGE') makes both the real apply AND its atomic rollback fail with a conflict, wedging the release mid-upgrade. kars upgrade now runs a server-side dry-run first and parses any field-manager conflicts. On conflict it stops BEFORE any change and prints the offending object/field/manager plus two remediation paths: --force-conflicts to take ownership, or a precise copy-paste command to hand the field back. It never silently clobbers a field a live operator legitimately manages. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add docs/mcp.md — how to add an MCP server (McpServer CR + mcpServerRefs), how tool calls are governed through the per-pod router, out-of-the-box egress auto-derivation, and the session keepalive that keeps stateful flows (e.g. browser automation) on one live page instead of resetting to about:blank. Add examples/playwright-mcp/ — a browser-automation OpenClaw agent on the official Playwright MCP, end to end: in-cluster MCP Deployment/Service, the McpServer CR, and a KarsSandbox whose only MCP-specific line is one mcpServerRefs entry (no hand-written egress). Wire both into the docs nav, examples catalogue, and CLI reference (kars update). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Dependency Review: ✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found.
Move the MCP egress derivation (mcp_url_host_port / cluster_internal_namespace / mcp_egress_rule + the per-ref derive_mcp_egress_rules loop) and its unit tests out of reconciler/mod.rs into a focused reconciler::mcp_egress module. Keeps mod.rs under the ci/check-loc.sh phase0 budget (3700 LOC) and groups the egress logic in one place. Pure code-move + the reconcile loop now calls derive_mcp_egress_rules; no behaviour change. 854 controller tests pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pal Lakatos-Toth <pallakatos@microsoft.com> Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Address four CodeQL findings in the new update-check lib:
- Strictly validate every version token (VERSION_RE / sanitizeVersion) before it is cached to disk or interpolated into a registry/GitHub URL, so untrusted response/cache bytes can't reach the filesystem or an outbound request (fixes 'network data written to file' + 'file data in outbound request').
- Percent-encode every '/' in the package name (replaceAll), not just the first (fixes 'incomplete string escaping').
- Tighten the test's registry-URL match to startsWith (fixes 'incomplete URL substring sanitization').
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
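The validate-then-interpolate pattern behind these fixes looks roughly like the sketch below. It is a Rust analogue for illustration only: the real lib is part of the npm CLI, and sanitize_version, registry_url, the accepted version grammar (three numeric parts plus an optional prerelease tag) and the registry URL shape are all assumptions rather than the PR's code.

```rust
/// True when `part` is 1 to 9 ASCII digits.
fn is_numeric(part: &str) -> bool {
    !part.is_empty() && part.len() <= 9 && part.bytes().all(|b| b.is_ascii_digit())
}

/// Accept only `MAJOR.MINOR.PATCH` with an optional `-prerelease` suffix made
/// of ASCII letters, digits, '.' and '-'. Anything else is rejected outright.
fn sanitize_version(raw: &str) -> Option<&str> {
    let (core, pre) = match raw.split_once('-') {
        Some((core, pre)) => (core, Some(pre)),
        None => (raw, None),
    };
    let parts: Vec<&str> = core.split('.').collect();
    if parts.len() != 3 || !parts.iter().all(|p| is_numeric(p)) {
        return None;
    }
    if let Some(pre) = pre {
        let ok = !pre.is_empty()
            && pre.bytes().all(|b| b.is_ascii_alphanumeric() || b == b'.' || b == b'-');
        if !ok {
            return None;
        }
    }
    Some(raw)
}

/// Build a registry URL only from a validated version, encoding every '/'
/// in the (possibly scoped) package name.
fn registry_url(package: &str, version: &str) -> Option<String> {
    let version = sanitize_version(version)?;
    let encoded = package.replace('/', "%2F"); // replaces all occurrences
    Some(format!("https://registry.npmjs.org/{}/{}", encoded, version))
}
```

Because the version is validated before it is formatted into the URL, a cached or network-supplied value containing path separators never reaches an outbound request.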
…ssion keepalive) + kars update Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Launch-day work to make MCP servers work out of the box on kars, add a CLI self-update flow, and turn the failing main CI green.

Fixes red CI on main
- fix(deps): bump anyhow → 1.0.103 (RUSTSEC-2026-0190). cargo-audit and cargo-deny were failing on main on the anyhow::Error::downcast_mut() unsoundness advisory. Lockfile-only.

MCP out-of-the-box
- fix(mcp): router session keepalive. The forwarder is now a well-formed MCP client: it holds the standalone GET SSE stream open and answers server heartbeat pings with pongs. This fixes stateful MCPs (Playwright) getting reaped after ~5s and the agent landing on about:blank mid-task. Validated live on AKS (persistent stream survives past the reaper; zero session-loss).
- feat(controller): auto-derive sandbox egress from referenced McpServer URLs. No more hand-written allowedEndpoints for MCP hosts. In-cluster Services get a Cilium-correct namespaceSelector rule; external non-443 hosts get a coarse port rule; 443 is covered by the blanket HTTPS rule.
- docs(mcp) + example: new docs/mcp.md guide and a runnable examples/playwright-mcp/ (browser agent on the official Playwright MCP; one mcpServerRefs line, no manual egress). All CRDs validated against the live cluster.

CLI self-update
- feat(cli): kars update + automatic update notice. Quiet, cached (≤1/day) npm check on every run prints a one-line notice + changelog when a newer @kars-runtime/cli is published; kars update shows the changelog and offers to install. Best-effort, 1.5s timeout, silent in CI/non-TTY/KARS_NO_UPDATE_CHECK=1.

Upgrade hardening
- feat(upgrade): pre-flight server-side-apply conflict detection. Detects Deployment fields owned by another field manager before the atomic Helm upgrade and surfaces remediation instead of wedging the release.

Validation
- cargo test --all (controller 854 + router 957 + integration) green; cargo clippy --all-targets -D warnings clean; cargo audit / cargo deny check clean.
- oxlint (0 errors) + 928 vitest tests green.
- kubectl apply --dry-run=server against the live AKS cluster.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>