feat(up): Foundry auto-setup, best-model selection, memory CRD parity + fix kars up hang #454

Merged
pallakatos merged 1 commit into main from feat/foundry-setup-best-model-ux
Jun 25, 2026
@pallakatos

Summary

Makes kars up --foundry-endpoint actually set up a BYO Foundry project so Memory Store works out of the box, stops hardcoding a stale model, fixes the post-deploy hang, and surfaces previously-masked memory errors.

Surfaced from real clean-room runs (Pal + @laevenso). Not for merge until tested.

Foundry auto-setup (new cli/src/commands/up/foundry_setup.ts)

  • Discover the project; list deployed models via ARM control-plane (caller's own az token — no Graph).
  • Pick the best deployed chat model instead of the hardcoded gpt-4.1 (pure, tested ranking; --model always wins; embedding/image/audio models are excluded). On the live azureclaw-foundry set this picks gpt-5.4.
  • Ensure an embedding model (Memory Store needs one) — best-effort deploy text-embedding-3-small if absent.
  • Enable the project system-assigned MI if missing (Memory Store authenticates internally as the project MI), then re-read its principalId for the existing Azure AI User RBAC grant.
  • All idempotent + non-fatal — every failure degrades to a clear note; the deploy never aborts.
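
The model-selection rule above can be sketched as a pure function. This is only an illustration, not the ranking in foundry_setup.ts: the `Deployment` shape, the `NON_CHAT` name filter, and the "highest version number wins" heuristic are all assumptions made for the example.

```typescript
// Hypothetical shape of one deployed model from the control-plane listing.
interface Deployment {
  name: string;  // deployment name
  model: string; // underlying model name, e.g. "gpt-4.1"
}

// Assumption: non-chat models can be recognised by fragments of their name.
const NON_CHAT = /embedding|dall-e|image|whisper|tts|audio/i;

// Assumption: the first dotted number in the model name is its version.
function versionOf(model: string): number[] {
  const match = model.match(/(\d+(?:\.\d+)*)/);
  return match ? match[1].split(".").map(Number) : [0];
}

// Compare two versions component by component; positive means a is newer.
function compareVersions(a: number[], b: number[]): number {
  const length = Math.max(a.length, b.length);
  for (let i = 0; i < length; i++) {
    const diff = (a[i] ?? 0) - (b[i] ?? 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

function pickChatModel(
  deployments: Deployment[],
  override?: string,
): string | undefined {
  if (override) return override; // an explicit --model always wins
  const chat = deployments.filter((d) => !NON_CHAT.test(d.model));
  const isSmall = (model: string) => (/mini|nano/i.test(model) ? 1 : 0);
  const ranked = [...chat].sort((x, y) => {
    // Newest version first.
    const byVersion = compareVersions(versionOf(y.model), versionOf(x.model));
    if (byVersion !== 0) return byVersion;
    // On a version tie, the full-size variant beats mini/nano.
    return isSmall(x.model) - isSmall(y.model);
  });
  return ranked[0]?.name; // undefined when no chat model is deployed
}
```

Because the function takes the deployment list as input and performs no I/O, it can be unit-tested without an Azure subscription. A name-based heuristic this naive would misrank some real model names, so a production ranking needs an explicit ordering.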

CRD parity + status

  • Emit a KarsMemory binding CR on kars up (Foundry endpoints only), matching what kars dev already creates.
  • CRD status report after apply (InferencePolicy / ToolPolicy / KarsMemory / KarsSandbox + phase).

Fix the kars up hang (two causes)

  • cli/src/preflight.ts: the RBAC spinner was only concluded when fetchSubscriptionPermissions threw or returned a non-empty set. An empty [] (no throw) left it spinning — its setInterval kept Node alive, so kars up hung after the summary with the spinner still animating (reproduced by two operators). Now concluded on the empty path. A second identical leak in the provider notFound path is also fixed.
  • up.ts: process.exit(0) on success (belt-and-suspenders for the detached kubectl port-forward handle).
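
The spinner leak comes down to a `setInterval` that is never cleared on one code path. The sketch below shows one way to rule out that class of bug by concluding the spinner in a `finally` block. It is not the code in preflight.ts, where the fix adds a conclusion on the empty path; `startSpinner` and `checkPermissions` are hypothetical names.

```typescript
// Hypothetical minimal spinner; the real CLI spinner is not shown in this PR.
interface Spinner {
  stop(note: string): void;
  isActive(): boolean;
}

function startSpinner(): Spinner {
  // A live interval keeps the Node event loop, and so the process, alive.
  let timer: ReturnType<typeof setInterval> | undefined = setInterval(() => {
    /* draw the next animation frame */
  }, 80);
  return {
    stop(_note: string): void {
      if (timer !== undefined) {
        clearInterval(timer);
        timer = undefined;
      }
    },
    isActive: () => timer !== undefined,
  };
}

async function checkPermissions(
  fetchPermissions: () => Promise<string[]>,
): Promise<string> {
  const spinner = startSpinner();
  let outcome = "permission check failed";
  try {
    const permissions = await fetchPermissions();
    outcome =
      permissions.length > 0
        ? `${permissions.length} permissions found`
        : "no permissions returned"; // the empty path that used to leak
  } catch {
    // Keep the default failure outcome.
  } finally {
    spinner.stop(outcome); // runs on the throw, empty, and non-empty paths
  }
  return outcome;
}
```

With the conclusion in `finally`, a new return path added later cannot leave the timer running, which is the failure mode described above.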

Memory error unmasking (runtime)

  • ensureStore uses the strict router call for POST /memory_stores so the real 403/400 surfaces (MI not enabled / RBAC still propagating / no embedding model) instead of the generic "could not be created".
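
The difference between a masking call and a strict call can be illustrated as follows. The router API in the runtime is not shown in this PR, so `Post`, `postLenient`, `postStrict`, and `HttpStatusError` are hypothetical stand-ins for the idea.

```typescript
interface HttpResult {
  status: number;
  body: string;
}

// Hypothetical transport function; stands in for the runtime's router.
type Post = (path: string) => Promise<HttpResult>;

class HttpStatusError extends Error {
  status: number;
  body: string;
  constructor(status: number, body: string) {
    super(`HTTP ${status}: ${body}`);
    this.name = "HttpStatusError";
    this.status = status;
    this.body = body;
  }
}

// Lenient: every failure collapses to null, so the caller can only report
// a generic "could not be created".
async function postLenient(post: Post, path: string): Promise<string | null> {
  try {
    const result = await post(path);
    return result.status >= 200 && result.status < 300 ? result.body : null;
  } catch {
    return null;
  }
}

// Strict: a non-2xx response is thrown with its status and body, so the
// caller can tell a 403 from a 400.
async function postStrict(post: Post, path: string): Promise<string> {
  const result = await post(path);
  if (result.status < 200 || result.status >= 300) {
    throw new HttpStatusError(result.status, result.body);
  }
  return result.body;
}
```

A caller of `postStrict` can branch on `error.status` to print a targeted hint, for example pointing at RBAC propagation on a 403.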

Security audit

docs/internal/security-audits/2026-06-25-foundry-autosetup-bestmodel-memory-spinner.md (2 sign-offs). No new role/scope/principal — the two writes are operator-scoped, idempotent, best-effort, on their own Foundry resource. security-audit-required + copyright-headers pass locally.

Verification

  • CLI: tsc clean, oxlint 0 errors, 831 tests (+10 new).
  • Runtime: tsc clean, oxlint 0 errors, 244 tests.

Note

The memory-unmask change lives in the sandbox image — needs kars push --only sandbox --apply (or the release build) to reach a running pod. The kars up changes are CLI-only and effective immediately.

… + fix kars up hang

Make `kars up --foundry-endpoint` actually set up a BYO Foundry project for
Memory Store, stop hardcoding a stale model, and fix the post-deploy hang.

Foundry auto-setup (new cli/src/commands/up/foundry_setup.ts):
- Discover the project; list deployed models (ARM control-plane, no Graph).
- Pick the BEST deployed chat model instead of hardcoded gpt-4.1 (pure, tested
  ranking; --model always wins). Excludes embedding/image/audio.
- Ensure an embedding model (Memory Store needs one); best-effort deploy
  text-embedding-3-small if absent.
- Enable the project's system-assigned managed identity if missing (Memory
  Store authenticates internally as the project MI), then re-read principalId
  for the existing Azure AI User RBAC grant. All idempotent + non-fatal.

CRD parity + status:
- Emit a KarsMemory binding CR on `kars up` (Foundry endpoints only), matching
  what `kars dev` already creates (refs.ts buildKarsMemory/memoryRefName).
- Print a CRD status report (InferencePolicy/ToolPolicy/KarsMemory/KarsSandbox).

Fix the hang (two causes):
- cli/src/preflight.ts: the RBAC spinner was only concluded when
  fetchSubscriptionPermissions threw or returned a non-empty set; an empty []
  left it spinning, and its setInterval kept Node alive — `kars up` hung after
  the summary with the spinner still animating. Conclude it on the empty path.
  Also fix a second identical leak in the provider notFound path.
- up.ts: process.exit(0) on success (belt-and-suspenders for the detached
  kubectl port-forward handle).

Memory error unmasking (runtime):
- foundry.ts ensureStore uses the STRICT router call for POST /memory_stores so
  the real 403/400 surfaces (MI not enabled / RBAC propagating / no embedding
  model) instead of the generic "could not be created".

Security audit: docs/internal/security-audits/2026-06-25-foundry-autosetup-bestmodel-memory-spinner.md (2 sign-offs).
Verification: CLI tsc+oxlint clean, 831 tests (+10); runtime tsc+oxlint clean,
244 tests; model ranking validated against the live azureclaw-foundry set.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

@pallakatos merged commit b719465 into main on Jun 25, 2026
35 checks passed
@pallakatos deleted the feat/foundry-setup-best-model-ux branch on June 25, 2026, 20:37