
fix(egress): repair learn/enforce flow — operator toggle + CLI approve/deny/enforce#480

Merged
pallakatos merged 3 commits into main from fix/egress-learn-enforce-flow on Jun 29, 2026

Conversation

@pallakatos pallakatos commented Jun 29, 2026

Collaborator

Fixes two operator-reported bugs in the egress learning/enforcement flow. Both are caused by CLI/operator code still calling router endpoints that Slice 5c.1 removed (/egress/approve, /egress/deny, /egress/enforce, /egress/pending).

The two bugs

1. kars operator can't move Strict → Learn (Shift-L errors).
learnEgress called the runtime /egress/learn probe first and did not catch its errors, so a probe failure skipped the authoritative CRD patch entirely. The symmetric enforceEgress only patches the CRD, which is why Strict worked and Learn did not. The probe also sent no body, so the router defaulted to enabled:false (i.e. it disabled learn).
Fix: patch the CRD egressMode first (authoritative), then a best-effort {enabled:true} probe in its own .catch. Added a symmetric best-effort {enabled:false} toggle to enforceEgress.
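
The ordering in this fix can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the PR: `EgressDeps`, `setEgressMode`, and the three injected functions are hypothetical names standing in for the CRD patch, the `/egress/learn` call, and the warning output.

```typescript
// Hypothetical dependency shapes; the real operator code may differ.
type EgressMode = "Learn" | "Strict";

interface EgressDeps {
  // Authoritative step: patch egressMode on the sandbox CR.
  patchEgressMode(mode: EgressMode): Promise<void>;
  // Best-effort step: POST /egress/learn with an explicit body.
  postLearnToggle(body: { enabled: boolean }): Promise<void>;
  // Sink for non-fatal warnings.
  warn(message: string): void;
}

async function setEgressMode(deps: EgressDeps, mode: EgressMode): Promise<void> {
  // 1. CRD patch first. A failure here is a real failure and propagates.
  await deps.patchEgressMode(mode);

  // 2. Live toggle second, in its own catch, so a probe failure
  //    cannot skip or mask the patch that already happened.
  await deps
    .postLearnToggle({ enabled: mode === "Learn" })
    .catch((err: unknown) => {
      deps.warn(`live /egress/learn toggle failed (CRD already patched): ${String(err)}`);
    });
}
```

Sending an explicit `enabled` value in both directions avoids relying on the router's default for a missing body.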

2. kars egress prod-agent --approve api.telegram.org fails (exit code 1).
--approve/--deny/--enforce/--pending and the default status view all called removed endpoints. Re-pointed to the real, authoritative model:

  • --approve <domain[:port]> → add host:port to baseline allowedEndpoints + re-sign. The port defaults to :443, so it does not need to be passed explicitly.
  • --deny <domain> → remove host + re-sign. --deny is now in the signing context so the revocation is authoritative (previously it did not re-sign, so the revocation failed open).
  • --enforce → patch egressMode=Strict + sign baseline.
  • --pending / status → show learned-but-not-allowlisted domains.
  • --learn/--no-learn → durable CRD patch in k8s, runtime-only in Docker.
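
As an illustration of the `<domain[:port]>` argument and its :443 default, here is a minimal parsing sketch. `parseDomainPort` is named in this PR's test list, but its signature, return type, and validation rules are not shown there, so everything below (the `Endpoint` shape, the error messages, the port range check) is an assumption. IPv6 literals are deliberately not handled.

```typescript
// Assumed return shape; the real helper may return something different.
interface Endpoint {
  host: string;
  port: number;
}

function parseDomainPort(input: string, defaultPort = 443): Endpoint {
  const trimmed = input.trim();
  if (trimmed === "") {
    throw new Error("empty domain");
  }

  const idx = trimmed.lastIndexOf(":");
  if (idx === -1) {
    // No port given: fall back to the default (:443).
    return { host: trimmed, port: defaultPort };
  }

  const host = trimmed.slice(0, idx);
  const portText = trimmed.slice(idx + 1);
  if (host === "" || !/^\d+$/.test(portText)) {
    throw new Error(`invalid domain[:port]: ${input}`);
  }

  const port = Number(portText);
  if (port < 1 || port > 65535) {
    throw new Error(`port out of range: ${input}`);
  }
  return { host, port };
}
```

Under these assumptions, `parseDomainPort("api.telegram.org")` yields host `api.telegram.org` with port 443, and an explicit `:8443` suffix overrides the default.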

For temporary, TTL-scoped grants (the user's "EgressApproval for a period of time"), the existing kars egress allow-extra <name> --host <h> --ttl PT4H --reason "<why>" creates an EgressApproval CR — port is optional there.

Robustness (from two rubber-duck reviews)

  • Port-less baseline entries are normalized to :443 before signing (the signer requires a port and would otherwise silently drop them).
  • --approve/--deny self-heal a prior failed sign by always re-signing in a signing context; runSignFlow stays fail-closed (patches allowlistRef only after a successful cosign) and now returns status so callers warn "not yet authoritative" on failure.
  • Local Docker dev preserved: --approve/--deny/--enforce refuse clearly; --learn/--learned keep the runtime path. Merge-patch replaces only allowedEndpoints (preserves egressMode/allowlistRef).
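
The list handling described above can be sketched like this. `unionEndpoint` and `removeHost` are names from this PR's test list, but their real signatures are not shown; the string-array representation, `normalizeForSigning`, and the helper functions are assumptions made for illustration. The sketch keeps port-less entries untouched in the inline list and only adds :443 on the copy handed to the signer.

```typescript
const DEFAULT_PORT = 443;

// Append the default port to an entry that has none (assumed "host[:port]" strings).
function withPort(entry: string): string {
  return entry.includes(":") ? entry : `${entry}:${DEFAULT_PORT}`;
}

// Extract the host part of a "host[:port]" entry.
function hostOf(entry: string): string {
  const idx = entry.lastIndexOf(":");
  return idx === -1 ? entry : entry.slice(0, idx);
}

// Copy of the baseline with every entry carrying a port, for the signer only.
function normalizeForSigning(entries: string[]): string[] {
  return entries.map(withPort);
}

// Add an endpoint unless an equivalent one (after normalization) already exists.
// Existing entries are returned verbatim, so port-less entries are preserved.
function unionEndpoint(entries: string[], endpoint: string): string[] {
  const key = withPort(endpoint);
  if (entries.some((e) => withPort(e) === key)) {
    return [...entries];
  }
  return [...entries, endpoint];
}

// Remove every entry for the given host, regardless of port.
function removeHost(entries: string[], host: string): string[] {
  return entries.filter((e) => hostOf(e) !== host);
}
```

Keeping normalization separate from the list edits means a merge-patch of allowedEndpoints does not rewrite entries the operator wrote by hand.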

Also

Rewrote tests/e2e-manual/scenarios/egress_lifecycle.sh to the CRD model (egressMode patch + EgressApproval create/delete) and replaced its stale spec.egress.mode field with spec.networkPolicy.egressMode.

Tests

egress.test.ts +16 (parseDomainPort, unionEndpoint/removeHost incl. port-less preservation, updated error text). CLI typecheck + oxlint (0 errors) + build clean; vitest 903 pass / 2 skipped. All 7 ci-gates pass. Security audit: docs/security-audits/2026-06-29-egress-learn-enforce-flow-repair.md.


Follow-up: full CRD-move leftover sweep (2nd commit)

After the core fix, swept the whole repo for anything else left over from the egress→CRD move:

  • No live callers of the removed endpoints remained — but the router’s /egress/fetch 403 response was still telling agents to run the removed --pending/--approve flow (now points to the real --approve re-sign / allow-extra remediation).
  • Fixed stale comments (routes/internal.rs), operator drawer labels (learned, not pending), and rewrote the stale docs (egress-proxy.md API + CLI tables, cli-reference.md “enforce promotes all learned” → “Strict + sign”, gitops.md).
  • Verified learn_egress (spawn/handoff) is current, not stale (it bridges to networkPolicy.egressMode).

43 egress router tests pass; router compiles; CLI 903 pass.

…e/deny/enforce

Both bugs stem from CLI/operator code still calling router endpoints that
Slice 5c.1 removed (/egress/approve, /egress/deny, /egress/enforce,
/egress/pending). The authoritative model is now the KarsSandbox CRD
(allowedEndpoints → controller-published, cosign-verified bundle), the
EgressApproval CRD (TTL grants), and egressMode (Learn|Strict) with a live
POST /egress/learn toggle.

1. Operator (TUI) could not move Strict → Learn. learnEgress called the
   runtime /egress/learn probe FIRST, uncaught, so a probe failure skipped the
   authoritative CRD patch entirely. It also sent no body (defaulting the
   router to enabled:false, i.e. DISABLING learn). Fix: patch the CRD
   egressMode first (mirrors enforceEgress), then a best-effort {enabled:true}
   probe wrapped in its own .catch. Added a symmetric best-effort {enabled:false}
   toggle to enforceEgress.

2. kars egress --approve/--deny/--enforce/--pending + the default status view
   hit the removed endpoints (the reported exit code 1). Re-pointed to the real
   mechanisms:
   - --approve <domain[:port]>: add host:port (default :443) to baseline
     allowedEndpoints + re-sign (sign-by-default).
   - --deny <domain>: remove host + re-sign. --deny is now in the signing
     context so a revocation is authoritative (was fail-open).
   - --enforce: patch egressMode=Strict + sign baseline; best-effort live
     learn-off toggle.
   - --pending / status: show learned-but-not-allowlisted domains.
   - --learn/--no-learn: durable CRD patch in k8s mode, runtime-only in Docker.

Robustness: port-less baseline entries are normalized to :443 before signing
(the signer requires a port and would otherwise silently drop them); approve/
deny self-heal a prior failed sign by always re-signing in a signing context;
runSignFlow stays fail-CLOSED (allowlistRef patched only after a successful
cosign) and now returns status so callers warn "not yet authoritative" on
failure. Docker local-dev paths preserved (--approve/--deny/--enforce refuse
clearly; --learn keeps the runtime toggle).
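
The fail-closed behaviour described for runSignFlow can be illustrated with a minimal sketch. This is not the function from the PR: `SignDeps`, `SignStatus`, `signThenPublish`, and the injected `sign` and `patchAllowlistRef` functions are hypothetical stand-ins for the cosign step and the allowlistRef patch.

```typescript
// Hypothetical status type returned to callers.
type SignStatus =
  | { ok: true; ref: string }
  | { ok: false; reason: string };

interface SignDeps {
  // Signs the endpoint list and returns a reference to the signed bundle.
  sign(endpoints: string[]): Promise<string>;
  // Points the CR at the signed bundle.
  patchAllowlistRef(ref: string): Promise<void>;
}

async function signThenPublish(deps: SignDeps, endpoints: string[]): Promise<SignStatus> {
  let ref: string;
  try {
    ref = await deps.sign(endpoints);
  } catch (err: unknown) {
    // Fail closed: without a successful signature the reference is never patched.
    return { ok: false, reason: String(err) };
  }
  await deps.patchAllowlistRef(ref);
  return { ok: true, ref };
}
```

A caller can then inspect the returned status and print a "not yet authoritative" warning when `ok` is false, instead of treating the inline list change as enforced.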

Also rewrote tests/e2e-manual/scenarios/egress_lifecycle.sh to the CRD model
(egressMode patch + EgressApproval create/delete) and fixed its stale
spec.egress.mode field → spec.networkPolicy.egressMode.

Tests: egress.test.ts +16 (parseDomainPort, unionEndpoint/removeHost incl.
port-less preservation, updated signing-context error text). CLI typecheck +
oxlint (0 errors) + build clean; vitest 903 pass / 2 skipped. All 7 ci-gates
pass. Two rubber-duck reviews (k8s + architecture) addressed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions


Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

Pal Lakatos-Toth and others added 2 commits June 29, 2026 23:07
…ce + docs)

Follow-up sweep after the learn/enforce flow repair found more references to the
removed Slice 5c.1 endpoints — none were live callers, but several gave users
stale instructions:

- inference-router/src/routes/egress.rs: the /egress/fetch 403 response told
  agents to run the REMOVED `kars egress --pending`/`--approve` workflow. Now
  points to the real remediation (operator `--approve` re-sign, or a temporary
  EgressApproval via `allow-extra`). Also fixed the handler's doc comment
  ("create pending approval request" → deny in Strict / log in Learn; no queue).
- inference-router/src/routes/internal.rs: corrected a stale comment about the
  `/egress/learned/blocked` "old shape" / --pending workflow.
- cli/src/commands/operator/dialogs/egress.ts: legend "learned (pending
  approval)" → "learned (not yet approved)"; status label `pending=` → `learned=`
  (the `signed-pending` cosign-verification state is unrelated and unchanged).
- docs/egress-proxy.md: rewrote the lifecycle diagram, operator workflow, CLI
  table, and API-endpoints table (removed /egress/approve|deny|pending|enforce,
  documented the CRD + signed-bundle + EgressApproval model and :443 default).
- docs/cli-reference.md: fixed the egress section — `--enforce` is "Strict +
  sign", NOT "promote all learned"; corrected --approve/--deny/--pending
  descriptions; added allow-extra.
- docs/operations/gitops.md: error-text now lists --deny too.

Verified `learn_egress` (spawn/handoff) is CURRENT, not stale — spawn reads
networkPolicy.egressMode ↔ maps it back, so it's the internal bool form of the
CRD field. check_egress already returns the correct new-model deny message.

Router compiles; 43 egress router tests pass; CLI typecheck + oxlint (0 errors)
+ 903 vitest pass; all 7 ci-gates pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…y, deny-empty, e2e rigor)

Third review pass on the egress flow repair. All blocking findings resolved:

- operator/actions.ts: learnEgress/enforceEgress no longer early-return when the
  pod is down. The authoritative CRD egressMode patch now runs regardless — the
  operator MUST be able to flip the mode to recover a crashing sandbox — and only
  the best-effort live /egress/learn probe is gated on a running pod.
- egress.ts discoverKarsSandboxNamespace: a name that exists in MULTIPLE
  namespaces is now a hard error ("pass --namespace to disambiguate") instead of
  silently patching the first match (wrong-sandbox-mutation guard).
- egress.ts --deny: removing the LAST endpoint leaves an empty baseline, which
  the canonical signer refuses. Detect this and tell the operator clearly (the
  empty inline list already denies all egress under Strict) rather than failing
  deep inside runSignFlow.
- Messaging: "only allowlisted host:port pairs will pass" → "hosts" — the router
  enforces an L7 host match today; per-endpoint port enforcement is reserved for
  a later slice (the signed bundle already carries ports). CLI + docs.
- manual e2e: the core learn/enforce/approve/revoke assertions now log_fail (not
  log_skip) after prerequisites pass, using bounded poll_blocked/poll_allowed
  helpers to tolerate reconcile + NetworkPolicy propagation lag. Prerequisite
  failures (CRD missing, admission rejected) still skip.
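
The wrong-sandbox-mutation guard from the list above can be sketched as a small pure function. The name `resolveSandboxNamespace`, its parameters, and the "not found" wording are assumptions for illustration; only the "pass --namespace to disambiguate" text comes from the commit message.

```typescript
// matches: namespaces in which a sandbox with this name was found (assumed input).
// explicit: value of --namespace, if the operator passed one.
function resolveSandboxNamespace(
  name: string,
  matches: string[],
  explicit?: string,
): string {
  if (explicit !== undefined) {
    if (!matches.includes(explicit)) {
      throw new Error(`sandbox "${name}" not found in namespace "${explicit}"`);
    }
    return explicit;
  }
  if (matches.length === 0) {
    throw new Error(`sandbox "${name}" not found`);
  }
  if (matches.length > 1) {
    // Hard error instead of silently patching the first match.
    throw new Error(
      `sandbox "${name}" exists in multiple namespaces (${matches.join(", ")}); pass --namespace to disambiguate`,
    );
  }
  return matches[0];
}
```

Refusing to guess keeps a scripted call from mutating a sandbox in another namespace that happens to share the name.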

Non-blocking, documented: operator TUI actions target the CR in kars-system (the
default operator namespace); a non-default release surfaces a clear "not found"
rather than a wrong-namespace mutation. The scriptable `kars egress` path
resolves the CR namespace + guards ambiguity.

egress.test.ts +1 (removeHost to empty). bash -n + shellcheck -S error clean;
CLI typecheck + oxlint (0 errors) + 904 vitest pass; all 7 ci-gates pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@pallakatos pallakatos merged commit 4538d3f into main Jun 29, 2026
36 checks passed
@pallakatos pallakatos deleted the fix/egress-learn-enforce-flow branch June 29, 2026 22:09