feat: MCP out-of-the-box (egress auto-derive + session keepalive), kars update, CI fix #481

Merged
pallakatos merged 10 commits into main from launch/mcp-out-of-the-box
Jun 30, 2026

@pallakatos

Collaborator

Summary

Launch-day work to make MCP servers work out of the box on kars, add a CLI self-update flow, and turn the failing main CI green.

Fixes red CI on main

  • fix(deps): bump anyhow → 1.0.103 (RUSTSEC-2026-0190). cargo-audit and cargo-deny were failing on main on the anyhow::Error::downcast_mut() unsoundness advisory. Lockfile-only.

MCP out-of-the-box

  • fix(mcp): router session keepalive. The forwarder is now a well-formed MCP client — it holds the standalone GET SSE stream open and answers server heartbeat pings with pongs. This fixes stateful MCPs (Playwright) getting reaped after ~5s and the agent landing on about:blank mid-task. Validated live on AKS (persistent stream survives past the reaper; zero session-loss).
  • feat(controller): auto-derive sandbox egress from referenced McpServer URLs. No more hand-written allowedEndpoints for MCP hosts. In-cluster Services get a Cilium-correct namespaceSelector rule; external non-443 hosts get a coarse port rule; 443 is covered by the blanket HTTPS rule.
  • docs(mcp) + example: new docs/mcp.md guide and a runnable examples/playwright-mcp/ (browser agent on the official Playwright MCP — one mcpServerRefs line, no manual egress). All CRDs validated against the live cluster.

CLI self-update

  • feat(cli): kars update + automatic update notice. A quiet, cached (at most once per day) npm registry check on every run prints a one-line notice and a short changelog summary when a newer @kars-runtime/cli is published; kars update shows the changelog and offers to install. The check is best-effort, with a 1.5s timeout, and is silent in CI, on a non-TTY, or when KARS_NO_UPDATE_CHECK=1.

Upgrade hardening

  • feat(upgrade): pre-flight server-side-apply conflict detection. Detects Deployment fields owned by another field manager before the atomic Helm upgrade and surfaces remediation instead of wedging the release.

Validation

  • Rust: cargo test --all (controller 854 + router 957 + integration) green; cargo clippy --all-targets -D warnings clean; cargo audit / cargo deny check clean.
  • CLI: build + typecheck + oxlint (0 errors) + 928 vitest tests green.
  • Example CRDs validated with kubectl apply --dry-run=server against the live AKS cluster.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Pal Lakatos-Toth and others added 6 commits June 30, 2026 14:56
The MCP forwarder spoke request/response only: it POSTed tools/call and
read the immediate response, but never opened the standalone GET /mcp SSE
stream nor answered server-initiated pings. Heartbeating MCP servers —
Playwright MCP runs one (HTTP transport, runHeartbeat=true) — send the
client a JSON-RPC ping every ~3s and call server.close() if no pong
arrives within PLAYWRIGHT_MCP_PING_TIMEOUT_MS (default 5000). So every
session was reaped ~5s after creation; the next tools/call got
404 "Session not found", the forwarder re-initialized, and the retry
landed on a brand-new blank browser context — the agent saw about:blank
mid-task (navigate/click state lost).

Fix: act as a well-formed MCP client. For each stateful session, spawn a
keepalive task that holds the standalone GET SSE stream open and replies
pong to server pings, keeping the session — and the agent's live page —
alive. The task is cancelled/replaced when the session is re-initialized.
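
The pong half of the keepalive can be sketched as follows. This is a minimal illustration, not the router's actual code: `RequestId`, `ServerMessage`, and `reply_for` are hypothetical names, and it assumes the SSE frame has already been decoded into a message. The reply shape (an empty `result` echoing the ping's `id`) follows the MCP ping convention.

```rust
/// JSON-RPC request ids may be numbers or strings.
#[derive(Debug, Clone, PartialEq)]
enum RequestId {
    Num(i64),
    Str(String),
}

/// Hypothetical, already-decoded view of a message read from the GET SSE stream.
#[derive(Debug, PartialEq)]
enum ServerMessage {
    /// Server-initiated heartbeat: {"jsonrpc":"2.0","id":...,"method":"ping"}
    Ping { id: RequestId },
    /// Anything else (notifications and so on) needs no reply here.
    Other,
}

/// Render the id as a JSON value. Only backslash and double quote are
/// escaped for string ids, which is enough for this sketch but not a
/// complete JSON string encoder.
fn id_to_json(id: &RequestId) -> String {
    match id {
        RequestId::Num(n) => n.to_string(),
        RequestId::Str(s) => {
            format!("\"{}\"", s.replace('\\', "\\\\").replace('"', "\\\""))
        }
    }
}

/// Build the body the keepalive task would POST back for a ping:
/// an empty result that echoes the ping's id.
fn reply_for(msg: &ServerMessage) -> Option<String> {
    match msg {
        ServerMessage::Ping { id } => Some(format!(
            r#"{{"jsonrpc":"2.0","id":{},"result":{{}}}}"#,
            id_to_json(id)
        )),
        ServerMessage::Other => None,
    }
}
```

In a real client this would sit inside the per-session task that reads the stream, so that a pong goes out well inside the server's ping timeout.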

Also:
- Tighten the session-loss classifier so it only triggers on genuine
  4xx hard signals (never on healthy 2xx tool output that merely mentions
  "session", e.g. browser_evaluate returning sessionStorage), since a
  false positive is destructive for stateful servers.
- Carry the triggering status+body on CallAttempt::SessionLost so the
  re-init log records exactly why a session was deemed lost.
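
A rough sketch of the tightened classifier described above. The variant name `CallAttempt::SessionLost` comes from the commit message; the other variants, the `classify` function, and the exact set of "hard signals" (a 404, or any 4xx whose body says "session not found") are assumptions made for illustration.

```rust
/// Outcome of one forwarded tools/call attempt (shape assumed for this sketch).
#[derive(Debug, PartialEq)]
enum CallAttempt {
    /// Healthy response; the body is passed through untouched.
    Ok(String),
    /// The server no longer knows the session. Status and body are kept
    /// so the re-init log can say exactly why.
    SessionLost { status: u16, body: String },
    /// Any other failure; must not trigger a destructive re-init.
    Failed { status: u16, body: String },
}

fn classify(status: u16, body: &str) -> CallAttempt {
    let is_4xx = (400..=499).contains(&status);
    let names_lost_session = body.to_ascii_lowercase().contains("session not found");

    if (200..=299).contains(&status) {
        // A 2xx is never session loss, even if the tool output happens
        // to mention "session" (e.g. a dump of sessionStorage).
        CallAttempt::Ok(body.to_string())
    } else if status == 404 || (is_4xx && names_lost_session) {
        CallAttempt::SessionLost { status, body: body.to_string() }
    } else {
        CallAttempt::Failed { status, body: body.to_string() }
    }
}
```

The ordering matters: the 2xx check comes first so that body text alone can never cause a re-initialization, which is the false positive the commit guards against.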

Tests: keepalive_holds_get_stream_and_pongs_server_ping plus the existing
stateful/session-loss suite. 1048 router tests pass; clippy + fmt clean.
Validated live on AKS: both sandbox routers hold a persistent GET stream
to Playwright that survives well past the 5s reaper; zero session-loss.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
cargo-audit and cargo-deny on main fail on RUSTSEC-2026-0190 — unsoundness
in anyhow's Error::downcast_mut() (UB when downcasting a context-wrapped
error). Fixed in 1.0.103. Lockfile-only bump; no API changes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…r URLs

Adding an MCP server should 'just work': the per-pod router is the only
network path to it, so the sandbox's default-deny NetworkPolicy must admit
the router→MCP hop. Until now the operator also had to hand-write a
networkPolicy.allowedEndpoints entry for every MCP; miss it and calls to an
in-cluster MCP silently time out.

The controller now parses each referenced McpServer's spec.url and emits the
correct egress rule automatically:
  - in-cluster Service DNS (*.svc.cluster.local) -> namespaceSelector rule.
    Under Cilium a K8s NetworkPolicy ipBlock (even 0.0.0.0/0) only matches the
    reserved 'world' entity and never an in-cluster pod, so an ipBlock rule
    would never admit traffic to another pod — namespaceSelector is the only
    form that works.
  - external host, non-443 -> coarse port-level ipBlock (router enforces host).
  - external host, 443 -> already covered by the blanket HTTPS rule.

New helpers mcp_url_host_port / cluster_internal_namespace / mcp_egress_rule
with unit tests; the reconcile loop walks effective_mcp_server_refs and adds
de-duplicated rules. Missing/unparseable referents are logged and skipped.
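
The three-way decision above can be sketched with the standard library alone. The helper names match the commit message, but the bodies, the `EgressRule` enum, the `dedup_rules` helper, and the example hostnames are assumptions for illustration; the real controller emits Kubernetes NetworkPolicy objects and presumably uses a proper URL parser. IPv6 literal hosts are not handled here.

```rust
/// Simplified stand-in for the NetworkPolicy egress rule the controller emits.
#[derive(Debug, Clone, PartialEq)]
enum EgressRule {
    /// In-cluster Service: admit pods in `namespace` on `port`.
    NamespaceSelector { namespace: String, port: u16 },
    /// External host on a non-443 port: coarse port-level ipBlock.
    ExternalPort { port: u16 },
    /// External host on 443: the blanket HTTPS rule already applies.
    CoveredByHttps,
}

/// Extract (host, port) from an http(s) URL, defaulting the port by scheme.
fn mcp_url_host_port(url: &str) -> Option<(String, u16)> {
    let (scheme, rest) = url.split_once("://")?;
    let default_port = match scheme {
        "http" => 80,
        "https" => 443,
        _ => return None,
    };
    let authority = rest.split(&['/', '?', '#'][..]).next()?;
    // Drop any userinfo ("user:pass@host").
    let authority = authority.rsplit_once('@').map_or(authority, |(_, h)| h);
    let (host, port) = match authority.rsplit_once(':') {
        Some((h, p)) => (h, p.parse::<u16>().ok()?),
        None => (authority, default_port),
    };
    if host.is_empty() {
        return None;
    }
    Some((host.to_ascii_lowercase(), port))
}

/// For `<service>.<namespace>.svc.cluster.local`, return the namespace.
fn cluster_internal_namespace(host: &str) -> Option<String> {
    let prefix = host.strip_suffix(".svc.cluster.local")?;
    let mut labels = prefix.rsplit('.');
    let namespace = labels.next()?;
    let _service = labels.next()?; // a service label must be present
    if namespace.is_empty() {
        None
    } else {
        Some(namespace.to_string())
    }
}

fn mcp_egress_rule(url: &str) -> Option<EgressRule> {
    let (host, port) = mcp_url_host_port(url)?;
    Some(match cluster_internal_namespace(&host) {
        // In-cluster always wins, even on 443: an ipBlock-based HTTPS rule
        // would not match a pod under Cilium.
        Some(namespace) => EgressRule::NamespaceSelector { namespace, port },
        None if port == 443 => EgressRule::CoveredByHttps,
        None => EgressRule::ExternalPort { port },
    })
}

/// Collect de-duplicated rules; unparseable URLs are logged and skipped.
fn dedup_rules(urls: &[&str]) -> Vec<EgressRule> {
    let mut rules = Vec::new();
    for url in urls {
        match mcp_egress_rule(url) {
            Some(EgressRule::CoveredByHttps) => {} // nothing to add
            Some(rule) if !rules.contains(&rule) => rules.push(rule),
            Some(_) => {} // duplicate
            None => eprintln!("skipping unparseable MCP URL: {}", url),
        }
    }
    rules
}
```

For example, under these assumptions `http://playwright-mcp.tools.svc.cluster.local:8931/mcp` would map to a namespaceSelector rule for namespace `tools` on port 8931, while `https://mcp.example.com/mcp` would add no rule at all.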

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Every kars run now does a quiet, cached check (at most once per day) of the
npm registry and prints a one-line notice when a newer @kars-runtime/cli has
been published, with a short changelog summary pulled from the GitHub release.
The new 'kars update' command is the explicit, always-fresh path: it shows the
changelog and offers to install (npm install -g @kars-runtime/cli@latest), with
--check (report-only, non-zero exit) and --yes (non-interactive) modes.

Best-effort and unobtrusive by design: hard 1.5s network timeout, all errors
swallowed, result cached in ~/.kars/update-check.json, never prompts after an
arbitrary command, and silent in CI / non-TTY / when KARS_NO_UPDATE_CHECK=1.
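
The gating described here reduces to a small pure function. The CLI itself is TypeScript; what follows is a Rust rendering for illustration only, and `RunContext`, `should_query_registry`, and the reading of "silent" as "skip the check entirely" are assumptions rather than the CLI's actual code.

```rust
use std::time::Duration;

/// At most one registry check per day.
const CHECK_INTERVAL: Duration = Duration::from_secs(24 * 60 * 60);

/// Hypothetical snapshot of the environment a command runs in.
#[derive(Debug, Clone, Copy)]
struct RunContext {
    is_tty: bool,
    in_ci: bool,
    /// True when KARS_NO_UPDATE_CHECK=1 is set.
    opted_out: bool,
}

/// Decide whether this run should hit the npm registry.
/// `cache_age` is the time since the last cached check, or None if
/// there is no usable cache entry.
fn should_query_registry(ctx: RunContext, cache_age: Option<Duration>) -> bool {
    if ctx.in_ci || !ctx.is_tty || ctx.opted_out {
        return false;
    }
    match cache_age {
        None => true,
        Some(age) => age >= CHECK_INTERVAL,
    }
}
```

Keeping the decision separate from the network call makes the quiet-by-default behaviour easy to test without touching the registry.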

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Helm v4 applies releases server-side, so a Deployment field owned by another
field manager (e.g. a manual 'kubectl set env HERMES_RUNTIME_IMAGE') makes both
the real apply AND its atomic rollback fail with a conflict, wedging the
release mid-upgrade.

kars upgrade now runs a server-side dry-run first and parses any field-manager
conflicts. On conflict it stops BEFORE any change and prints the offending
object/field/manager plus two remediation paths: --force-conflicts to take
ownership, or a precise copy-paste command to hand the field back. It never
silently clobbers a field a live operator legitimately manages.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add docs/mcp.md — how to add an MCP server (McpServer CR + mcpServerRefs), how
tool calls are governed through the per-pod router, out-of-the-box egress
auto-derivation, and the session keepalive that keeps stateful flows (e.g.
browser automation) on one live page instead of resetting to about:blank.

Add examples/playwright-mcp/ — a browser-automation OpenClaw agent on the
official Playwright MCP, end to end: in-cluster MCP Deployment/Service, the
McpServer CR, and a KarsSandbox whose only MCP-specific line is one
mcpServerRefs entry (no hand-written egress). Wire both into the docs nav,
examples catalogue, and CLI reference (kars update).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions


Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

OpenSSF Scorecard

| Package | Version | Score |
| --- | --- | --- |
| cargo/anyhow | 1.0.103 | 🟢 4.7 |

Details

| Check | Score | Reason |
| --- | --- | --- |
| Dangerous-Workflow | 🟢 10 | no dangerous workflow patterns detected |
| Maintained | 🟢 7 | 6 commit(s) and 3 issue activity found in the last 90 days -- score normalized to 7 |
| Packaging | ⚠️ -1 | packaging workflow not detected |
| Code-Review | ⚠️ 0 | Found 2/21 approved changesets -- score normalized to 0 |
| Binary-Artifacts | 🟢 10 | no binaries found in the repo |
| Token-Permissions | 🟢 10 | GitHub workflow tokens follow principle of least privilege |
| Pinned-Dependencies | ⚠️ 0 | dependency not pinned by hash detected -- score normalized to 0 |
| CII-Best-Practices | ⚠️ 0 | no effort to earn an OpenSSF best practices badge detected |
| Fuzzing | ⚠️ 0 | project is not fuzzed |
| License | 🟢 10 | license file detected |
| Signed-Releases | ⚠️ -1 | no releases found |
| Branch-Protection | ⚠️ 0 | branch protection not enabled on development/release branches |
| Security-Policy | 🟢 3 | security policy file detected |
| SAST | ⚠️ 0 | SAST tool is not run on all commits -- score normalized to 0 |

Scanned Files

  • Cargo.lock

Comment thread cli/src/lib/update-check.test.ts Fixed
Comment thread cli/src/lib/update-check.ts Fixed
Comment thread cli/src/lib/update-check.ts Fixed
Comment thread cli/src/lib/update-check.ts Fixed
Pal Lakatos-Toth and others added 4 commits June 30, 2026 15:39
Move the MCP egress derivation (mcp_url_host_port / cluster_internal_namespace
/ mcp_egress_rule + the per-ref derive_mcp_egress_rules loop) and its unit
tests out of reconciler/mod.rs into a focused reconciler::mcp_egress module.
Keeps mod.rs under the ci/check-loc.sh phase0 budget (3700 LOC) and groups the
egress logic in one place. Pure code-move + the reconcile loop now calls
derive_mcp_egress_rules; no behaviour change. 854 controller tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pal Lakatos-Toth <pallakatos@microsoft.com>
Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Address four CodeQL findings in the new update-check lib:
- Strictly validate every version token (VERSION_RE / sanitizeVersion) before
  it is cached to disk or interpolated into a registry/GitHub URL, so untrusted
  response/cache bytes can't reach the filesystem or an outbound request
  (fixes 'network data written to file' + 'file data in outbound request').
- Percent-encode every '/' in the package name (replaceAll), not just the first
  (fixes 'incomplete string escaping').
- Tighten the test's registry-URL match to startsWith (fixes 'incomplete URL
  substring sanitization').
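
The validate-before-use idea behind these fixes can be sketched as below. The actual library is TypeScript (VERSION_RE / sanitizeVersion); this is a Rust rendering for illustration, and the accepted grammar (three numeric parts plus an optional pre-release tag) and the `registry_path` helper are assumptions, not the CLI's real rules.

```rust
/// Accept only `MAJOR.MINOR.PATCH` with an optional `-prerelease` tag made
/// of ASCII alphanumerics, dots, and hyphens. Anything else is rejected,
/// so path separators, query strings, or whitespace can never reach a
/// file name or URL.
fn sanitize_version(raw: &str) -> Option<&str> {
    if raw.len() > 64 {
        return None;
    }
    let (core, pre) = match raw.split_once('-') {
        Some((c, p)) => (c, Some(p)),
        None => (raw, None),
    };
    let parts: Vec<&str> = core.split('.').collect();
    let numeric =
        |s: &&str| !s.is_empty() && s.len() <= 9 && s.bytes().all(|b| b.is_ascii_digit());
    if parts.len() != 3 || !parts.iter().all(numeric) {
        return None;
    }
    if let Some(p) = pre {
        let ok = !p.is_empty()
            && p.bytes().all(|b| b.is_ascii_alphanumeric() || b == b'.' || b == b'-');
        if !ok {
            return None;
        }
    }
    Some(raw)
}

/// Percent-encode every '/' in a scoped package name, not just the first.
fn registry_path(package: &str) -> String {
    package.replace('/', "%2F")
}
```

Rust's `str::replace` already replaces all occurrences, so the "first match only" pitfall that CodeQL flagged does not arise in this rendering; the point of the sketch is the allow-list check applied before any value is cached or interpolated.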

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ssion keepalive) + kars update

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@pallakatos pallakatos merged commit 7b882dc into main Jun 30, 2026
36 checks passed
@pallakatos pallakatos deleted the launch/mcp-out-of-the-box branch June 30, 2026 14:43

2 participants