fix(egress): repair learn/enforce flow — operator toggle + CLI approve/deny/enforce#480
Merged
…e/deny/enforce
Both bugs stem from CLI/operator code still calling router endpoints that
Slice 5c.1 removed (/egress/approve, /egress/deny, /egress/enforce,
/egress/pending). The authoritative model is now the KarsSandbox CRD
(allowedEndpoints → controller-published, cosign-verified bundle), the
EgressApproval CRD (TTL grants), and egressMode (Learn|Strict) with a live
POST /egress/learn toggle.
1. Operator (TUI) could not move Strict → Learn. learnEgress called the
runtime /egress/learn probe FIRST, uncaught, so a probe failure skipped the
authoritative CRD patch entirely. It also sent no body (defaulting the
router to enabled:false, i.e. DISABLING learn). Fix: patch the CRD
egressMode first (mirrors enforceEgress), then a best-effort {enabled:true}
probe wrapped in its own .catch. Added a symmetric best-effort {enabled:false}
toggle to enforceEgress.
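
A minimal sketch of the ordering described in item 1, assuming a hypothetical `EgressDeps` interface in place of the real Kubernetes and router clients (neither `EgressDeps` nor `setEgressMode` is taken from the codebase). The authoritative patch is awaited first and may throw; the live toggle always carries an explicit body and its failure is swallowed:

```typescript
type EgressMode = "Learn" | "Strict";

// Hypothetical dependency shape, for illustration only.
interface EgressDeps {
  // Authoritative: patches egressMode on the KarsSandbox CR.
  patchEgressMode(sandbox: string, mode: EgressMode): Promise<void>;
  // Best-effort: live POST /egress/learn toggle on the running router.
  postLearnToggle(sandbox: string, body: { enabled: boolean }): Promise<void>;
}

async function setEgressMode(
  deps: EgressDeps,
  sandbox: string,
  mode: EgressMode,
): Promise<{ probeOk: boolean }> {
  // 1. CRD patch first. If this throws, nothing authoritative changed
  //    and the caller sees the error.
  await deps.patchEgressMode(sandbox, mode);

  // 2. Runtime probe with an explicit body, in its own catch, so a
  //    router failure cannot undo or mask the CRD change.
  const probeOk = await deps
    .postLearnToggle(sandbox, { enabled: mode === "Learn" })
    .then(() => true)
    .catch(() => false);

  return { probeOk };
}
```

Returning `probeOk` instead of throwing lets a caller report that the mode is durably set even when the live toggle did not land.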
2. kars egress --approve/--deny/--enforce/--pending + the default status view
hit the removed endpoints (the reported exit code 1). Re-pointed to the real
mechanisms:
- --approve <domain[:port]>: add host:port (default :443) to baseline
allowedEndpoints + re-sign (sign-by-default).
- --deny <domain>: remove host + re-sign. --deny is now in the signing
context so a revocation is authoritative (was fail-open).
- --enforce: patch egressMode=Strict + sign baseline; best-effort live
learn-off toggle.
- --pending / status: show learned-but-not-allowlisted domains.
- --learn/--no-learn: durable CRD patch in k8s mode, runtime-only in Docker.
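
The `--approve` and `--deny` list edits can be sketched as below. The function names match the helpers the test summary mentions (`parseDomainPort`, `unionEndpoint`, `removeHost`), but the signatures and bodies here are assumptions, not the actual implementation; `hostOf` is a hypothetical helper and IPv6 literals are not handled:

```typescript
const DEFAULT_PORT = 443;

// Split "host" or "host:port", defaulting the port to 443.
function parseDomainPort(input: string): { host: string; port: number } {
  const trimmed = input.trim().toLowerCase();
  const idx = trimmed.lastIndexOf(":");
  const host = idx === -1 ? trimmed : trimmed.slice(0, idx);
  const portText = idx === -1 ? String(DEFAULT_PORT) : trimmed.slice(idx + 1);
  const port = Number(portText);
  if (host === "" || !/^\d+$/.test(portText) || port < 1 || port > 65535) {
    throw new Error(`invalid endpoint: ${input}`);
  }
  return { host, port };
}

// Host part of a baseline entry, whether or not it carries a port.
function hostOf(entry: string): string {
  const idx = entry.lastIndexOf(":");
  return (idx === -1 ? entry : entry.slice(0, idx)).toLowerCase();
}

// --approve: add host:port unless already present. Existing entries,
// including port-less ones, are returned unchanged.
function unionEndpoint(endpoints: string[], input: string): string[] {
  const { host, port } = parseDomainPort(input);
  const entry = `${host}:${port}`;
  return endpoints.indexOf(entry) !== -1 ? [...endpoints] : [...endpoints, entry];
}

// --deny: drop every entry for the host, regardless of port.
function removeHost(endpoints: string[], host: string): string[] {
  const target = host.trim().toLowerCase();
  return endpoints.filter((e) => hostOf(e) !== target);
}
```

Both list helpers return a new array, so the caller can compare old and new lists before deciding whether to patch the CR.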
Robustness: port-less baseline entries are normalized to :443 before signing
(the signer requires a port and would otherwise silently drop them); approve/
deny self-heal a prior failed sign by always re-signing in a signing context;
runSignFlow stays fail-CLOSED (allowlistRef patched only after a successful
cosign) and now returns status so callers warn "not yet authoritative" on
failure. Docker local-dev paths preserved (--approve/--deny/--enforce refuse
clearly; --learn keeps the runtime toggle).
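
The fail-closed behaviour can be illustrated with a small sketch. Everything here is hypothetical (`SignDeps`, `signFailClosed`, `withDefaultPort`); the real `runSignFlow` and its cosign invocation are not shown in this PR text. The point is only the ordering: normalize, sign, and touch `allowlistRef` strictly after a successful sign:

```typescript
// Hypothetical dependencies, for illustration only.
interface SignDeps {
  // Signs the endpoint list and resolves to a bundle reference.
  signBundle(endpoints: string[]): Promise<string>;
  // Patches allowlistRef on the CR to point at the signed bundle.
  patchAllowlistRef(ref: string): Promise<void>;
}

type SignStatus = { ok: true; ref: string } | { ok: false; reason: string };

// Give port-less entries an explicit :443 so a signer that requires a
// port does not drop them.
function withDefaultPort(endpoints: string[]): string[] {
  return endpoints.map((e) => (e.indexOf(":") === -1 ? `${e}:443` : e));
}

async function signFailClosed(deps: SignDeps, baseline: string[]): Promise<SignStatus> {
  const endpoints = withDefaultPort(baseline);
  let ref: string;
  try {
    ref = await deps.signBundle(endpoints);
  } catch (err) {
    // Fail closed: allowlistRef is left untouched, and the caller gets
    // a status it can turn into a "not yet authoritative" warning.
    return { ok: false, reason: err instanceof Error ? err.message : String(err) };
  }
  await deps.patchAllowlistRef(ref);
  return { ok: true, ref };
}
```

Because the function re-signs the full baseline on every call, a later `--approve` or `--deny` would also repair an earlier failed sign.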
Also rewrote tests/e2e-manual/scenarios/egress_lifecycle.sh to the CRD model
(egressMode patch + EgressApproval create/delete) and fixed its stale
spec.egress.mode field → spec.networkPolicy.egressMode.
Tests: egress.test.ts +16 (parseDomainPort, unionEndpoint/removeHost incl.
port-less preservation, updated signing-context error text). CLI typecheck +
oxlint (0 errors) + build clean; vitest 903 pass / 2 skipped. All 7 ci-gates
pass. Two rubber-duck reviews (k8s + architecture) addressed.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Dependency Review: ✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found. Scanned files: none.
…ce + docs)
Follow-up sweep after the learn/enforce flow repair found more references to the
removed Slice 5c.1 endpoints — none were live callers, but several gave users
stale instructions:
- inference-router/src/routes/egress.rs: the /egress/fetch 403 response told
agents to run the REMOVED `kars egress --pending`/`--approve` workflow. Now
points to the real remediation (operator `--approve` re-sign, or a temporary
EgressApproval via `allow-extra`). Also fixed the handler's doc comment
("create pending approval request" → deny in Strict / log in Learn; no queue).
- inference-router/src/routes/internal.rs: corrected a stale comment about the
`/egress/learned/blocked` "old shape" / --pending workflow.
- cli/src/commands/operator/dialogs/egress.ts: legend "learned (pending
approval)" → "learned (not yet approved)"; status label `pending=` → `learned=`
(the `signed-pending` cosign-verification state is unrelated and unchanged).
- docs/egress-proxy.md: rewrote the lifecycle diagram, operator workflow, CLI
table, and API-endpoints table (removed /egress/approve|deny|pending|enforce,
documented the CRD + signed-bundle + EgressApproval model and :443 default).
- docs/cli-reference.md: fixed the egress section — `--enforce` is "Strict +
sign", NOT "promote all learned"; corrected --approve/--deny/--pending
descriptions; added allow-extra.
- docs/operations/gitops.md: error-text now lists --deny too.
Verified `learn_egress` (spawn/handoff) is CURRENT, not stale: spawn maps it to
and from networkPolicy.egressMode, so it is the internal bool form of the CRD
field. check_egress already returns the correct new-model deny message.
Router compiles; 43 egress router tests pass; CLI typecheck + oxlint (0 errors)
+ 903 vitest pass; all 7 ci-gates pass.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…y, deny-empty, e2e rigor)
Third review pass on the egress flow repair. All blocking findings resolved:
- operator/actions.ts: learnEgress/enforceEgress no longer early-return when the
pod is down. The authoritative CRD egressMode patch now runs regardless — the
operator MUST be able to flip the mode to recover a crashing sandbox — and only
the best-effort live /egress/learn probe is gated on a running pod.
- egress.ts discoverKarsSandboxNamespace: a name that exists in MULTIPLE
namespaces is now a hard error ("pass --namespace to disambiguate") instead of
silently patching the first match (wrong-sandbox-mutation guard).
- egress.ts --deny: removing the LAST endpoint leaves an empty baseline, which
the canonical signer refuses. Detect this and tell the operator clearly (the
empty inline list already denies all egress under Strict) rather than failing
deep inside runSignFlow.
- Messaging: "only allowlisted host:port pairs will pass" → "hosts" — the router
enforces an L7 host match today; per-endpoint port enforcement is reserved for
a later slice (the signed bundle already carries ports). CLI + docs.
- manual e2e: the core learn/enforce/approve/revoke assertions now log_fail (not
log_skip) after prerequisites pass, using bounded poll_blocked/poll_allowed
helpers to tolerate reconcile + NetworkPolicy propagation lag. Prerequisite
failures (CRD missing, admission rejected) still skip.
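
Two of the guards in the list above, sketched under assumptions: `SandboxRef`, `resolveSandboxNamespace`, and `planDeny` are hypothetical names, and the input list stands in for whatever the CLI gets back from listing KarsSandbox resources across namespaces. The error wording follows the message quoted above:

```typescript
interface SandboxRef {
  name: string;
  namespace: string;
}

// Resolve the namespace of a sandbox by name, refusing to guess when
// the same name exists in more than one namespace.
function resolveSandboxNamespace(all: SandboxRef[], name: string, explicit?: string): string {
  const namespaces = all
    .filter((s) => s.name === name && (explicit === undefined || s.namespace === explicit))
    .map((s) => s.namespace);
  const [only, ...rest] = namespaces;
  if (only === undefined) {
    throw new Error(`KarsSandbox "${name}" not found`);
  }
  if (rest.length > 0) {
    throw new Error(
      `KarsSandbox "${name}" exists in multiple namespaces (${namespaces.join(", ")}); ` +
        "pass --namespace to disambiguate",
    );
  }
  return only;
}

// Plan a --deny: remove the host, and flag the empty-baseline case up
// front instead of letting the signer reject it later.
function planDeny(endpoints: string[], host: string): { endpoints: string[]; resign: boolean } {
  const target = host.trim().toLowerCase();
  const remaining = endpoints.filter((e) => {
    const idx = e.lastIndexOf(":");
    return (idx === -1 ? e : e.slice(0, idx)).toLowerCase() !== target;
  });
  return { endpoints: remaining, resign: remaining.length > 0 };
}
```

When `resign` is false the caller can explain that the empty inline list already denies all egress under Strict, rather than surfacing a signer error.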
Non-blocking, documented: operator TUI actions target the CR in kars-system (the
default operator namespace); a non-default release surfaces a clear "not found"
rather than a wrong-namespace mutation. The scriptable `kars egress` path
resolves the CR namespace + guards ambiguity.
egress.test.ts +1 (removeHost to empty). bash -n + shellcheck -S error clean;
CLI typecheck + oxlint (0 errors) + 904 vitest pass; all 7 ci-gates pass.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fixes two operator-reported bugs in the egress learning/enforcement flow. Both are caused by CLI/operator code still calling router endpoints that Slice 5c.1 removed (`/egress/approve`, `/egress/deny`, `/egress/enforce`, `/egress/pending`).

The two bugs

1. `kars operator` can't move Strict → Learn (Shift-L errors). `learnEgress` called the runtime `/egress/learn` probe first, uncaught, so a probe failure skipped the authoritative CRD patch entirely (while the symmetric `enforceEgress`, which only patches the CRD, worked → Strict worked, Learn didn't). It also sent no body, defaulting the router to `enabled:false` (i.e. disabling learn).
   Fix: patch the CRD `egressMode` first (authoritative), then a best-effort `{enabled:true}` probe in its own `.catch`. Added a symmetric best-effort `{enabled:false}` toggle to `enforceEgress`.
2. `kars egress prod-agent --approve api.telegram.org` fails (exit code 1). `--approve`/`--deny`/`--enforce`/`--pending` and the default status view all called removed endpoints. Re-pointed to the real, authoritative model:
   - `--approve <domain[:port]>` → add `host:port` (default :443, answering "do I need 443?": yes) to baseline `allowedEndpoints` + re-sign.
   - `--deny <domain>` → remove host + re-sign. `--deny` is now in the signing context so the revocation is authoritative (previously it would not re-sign → fail-open).
   - `--enforce` → patch `egressMode=Strict` + sign baseline.
   - `--pending` / status → show learned-but-not-allowlisted domains.
   - `--learn`/`--no-learn` → durable CRD patch in k8s, runtime-only in Docker.

Robustness (from two rubber-duck reviews)

- Port-less baseline entries are normalized to `:443` before signing (the signer requires a port and would otherwise silently drop them).
- `--approve`/`--deny` self-heal a prior failed sign by always re-signing in a signing context; `runSignFlow` stays fail-closed (patches `allowlistRef` only after a successful cosign) and now returns status so callers warn "not yet authoritative" on failure.
- Docker local-dev paths preserved: `--approve`/`--deny`/`--enforce` refuse clearly; `--learn`/`--learned` keep the runtime path. Merge-patch replaces only `allowedEndpoints` (preserves `egressMode`/`allowlistRef`).

Also

Rewrote `tests/e2e-manual/scenarios/egress_lifecycle.sh` to the CRD model (egressMode patch + EgressApproval create/delete) and fixed its stale `spec.egress.mode` → `spec.networkPolicy.egressMode`.

Tests

`egress.test.ts` +16 (parseDomainPort, unionEndpoint/removeHost incl. port-less preservation, updated error text). CLI typecheck + oxlint (0 errors) + build clean; vitest 903 pass / 2 skipped. All 7 ci-gates pass. Security audit: `docs/security-audits/2026-06-29-egress-learn-enforce-flow-repair.md`.

Follow-up: full CRD-move leftover sweep (2nd commit)

After the core fix, swept the whole repo for anything else left over from the egress→CRD move:

- The `/egress/fetch` 403 response was still telling agents to run the removed `--pending`/`--approve` flow (now points to the real `--approve` re-sign / `allow-extra` remediation).
- Corrected a stale comment (`routes/internal.rs`), operator drawer labels (`learned`, not `pending`), and rewrote the stale docs (`egress-proxy.md` API + CLI tables, `cli-reference.md` "enforce promotes all learned" → "Strict + sign", `gitops.md`).
- Verified `learn_egress` (spawn/handoff) is current, not stale (it bridges to `networkPolicy.egressMode`).

43 egress router tests pass; router compiles; CLI 903 pass.