Releases: omerakben/code-oz

v0.21.1-alpha.0

22 May 18:45

code-oz v0.21.1-alpha.0 — multi-target binary release. See checksums.txt for SHA256s.

v0.21.0-alpha.0

21 May 21:55

v0.21.0-alpha.0 — M17 brownfield AUDIT runtime

Release steps (run after merge to main):

git tag -a v0.21.0-alpha.0 -m "v0.21.0-alpha.0 — M17 brownfield AUDIT runtime"
git push origin main
git push origin v0.21.0-alpha.0
# release.yml builds the four per-arch binaries + checksums + install.sh,
# publishes @tuel/code-oz to npm, and creates the GitHub release.
# gh release edit v0.21.0-alpha.0 --notes-file docs/handoffs/2026-05-21-v0.21.0-release-notes.md

What this release adds

Before v0.21, code-oz could only start from a blank slate (greenfield DEFINE → PLAN → BUILD → VERIFY → REVIEW). M17 adds the brownfield entry phase: AUDIT → PLAN → .... A brownfield code-oz run now reads an existing repository plus an operator problem statement and produces AUDIT.md — where the problem lives (file:line localization), what was observed versus what the operator proposed, and the constraints a fix must respect — before any plan or code is written.

The brownfield flow, end to end

  1. code-oz run on a brownfield-configured project routes to the AUDIT phase (greenfield is unchanged).
  2. The auditor persona inspects the repo with bounded glob / grep / read tools and drafts AUDIT.md.
  3. The draft is validated against the locked AUDIT.md schema; the Scientist phase-tail writes HYPOTHESES.md + OPEN_QUESTIONS.md (rule 15).
  4. audit_completed records the artifact's sha256; code-oz approve audit re-hashes the on-disk AUDIT.md against it and runs the sidecar validators before writing the gate — the same sha-bound approval contract BUILD/VERIFY/REVIEW use, through the generic gate primitive (no new gate authority).
  5. PLAN reads AUDIT.md instead of SPEC.md for brownfield runs, citing SC-AUDIT-NNN sources under ## Audit sources. The profile is read from the run's event log, so editing config mid-run cannot flip a brownfield run to greenfield.
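
The sha-bound approval in step 4 can be pictured with a minimal TypeScript sketch. The names `recordCompletion`, `approve`, and the `CompletionEvent` shape are hypothetical and are not code-oz's actual API; only the idea (hash at completion, re-hash at approval, refuse on mismatch) comes from the steps above.

```typescript
import { createHash } from "node:crypto";

// Hypothetical event shape; the real audit_completed payload may differ.
interface CompletionEvent {
  type: "audit_completed";
  artifact: string;
  sha256: string;
}

function sha256Hex(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

// At phase completion: bind the event to the exact artifact content.
function recordCompletion(artifact: string, content: string): CompletionEvent {
  return { type: "audit_completed", artifact, sha256: sha256Hex(content) };
}

// At approval: re-hash what is on disk now and refuse if it has drifted.
function approve(
  event: CompletionEvent,
  onDiskContent: string,
): { approved: boolean; reason?: string } {
  const actual = sha256Hex(onDiskContent);
  if (actual !== event.sha256) {
    return { approved: false, reason: `sha mismatch: expected ${event.sha256}, got ${actual}` };
  }
  return { approved: true };
}
```

In this sketch an edit to AUDIT.md between completion and approval changes the digest, so the approval is refused rather than silently binding the gate to different content.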

Discipline receipts

  • Cross-family review caught real defects. The milestone ran 12 Codex review rounds. The full-cycle end-to-end test surfaced two coupling bugs that per-commit fixture tests and per-commit reviews passed — an un-stripped ready-signal that broke AUDIT.md parsing, and a Scientist phase-tail that was never wired into AUDIT. The milestone completion review then caught a third: the live auditor advertised repo tools it never actually dispatched. All three are fixed and covered by an end-to-end test.
  • The auditor persona is hand-authored. Per rule 16, auditor.md, audit-system.md, and the Lead's brownfield section were written by a human and Claude collaboratively, not generated by an automated pass. A CI guard checks the persona body does not leak verbatim from research/planning documents.

Agent Gate Bench (first measured rows)

This release ships the Agent Gate Bench runner (bun run bench:agent-gate -- --fixture all --provider fake). The benchmark measures one claim only: for a fixed set of governance-failure fixtures, which workflows block or silently allow the failure. The deterministic code-oz Fake column is measured now — each cell drives a real production gate (sha-bound approval, cross-family review policy, REVIEW path validator, VERIFY intervention, verdict routing). The direct-agent and live-provider columns require local API keys and stay TBD until those runs land. FakeProvider numbers are determinism receipts, not model-quality claims. Protocol and methodology: docs/benchmarks/agent-gate-bench.md.

Install

npm install -g @tuel/code-oz
# or
curl -fsSL https://github.com/omerakben/code-oz/releases/download/v0.21.0-alpha.0/install.sh | sh -s -- --version v0.21.0-alpha.0

Verification

3762 offline tests pass (network-free, FakeProvider). Greenfield runs are unchanged; the brownfield path is proven by a spawned-CLI full-cycle e2e (AUDIT → approve → PLAN reads AUDIT.md). A live brownfield smoke against a real repository requires provider credentials and was not run in the build session; the deterministic e2e is the reproducible proof.

v0.20.3-alpha.0

15 May 00:21

v0.20.3-alpha.0 — friend-experience polish (7 findings)

Release flow note (read first): the repo's release.yml workflow auto-creates the GitHub release on tag push with thin auto-generated notes. After pushing the tag, Ozzy edits the release with these notes:

git tag -a v0.20.3-alpha.0 -m "v0.20.3-alpha.0 — friend-experience polish"
git push origin v0.20.3-alpha.0

# wait for the release.yml workflow to finish (creates the release with the binaries + checksums)
gh run watch

# then replace the auto-generated thin notes with the rich notes:
gh release edit v0.20.3-alpha.0 --notes-file docs/handoffs/2026-05-14-v0.20.3-release-notes.md

Why this release matters

v0.20.2 shipped two showstopper fixes (#0a TASK_BLOCK injection + #0b file-manifest expansion) that made BUILD actually function end-to-end against real providers. Two real-provider dogfoods (prdiff greenfield + quizr greenfield-friend) on 2026-05-14 validated the showstopper fixes and surfaced seven polish findings affecting the friend's first install + first run experience.

v0.20.3 closes all seven on individual branches with RED-first tests per rule 22. The release tightens the rough edges that 3,432 offline tests + the Codex pre-tag review missed because they all hid behind the orchestrator's happy path. Memory pin feedback_dogfood_real_user_actions.md was the discipline: every finding came from real friend interaction, not synthetic test coverage.

No new gate authority introduced (rule 20 holds). M17 AUDIT runtime stays scheduled for v0.21. Provider contract unchanged: same four live adapters (Claude, Codex, xAI, Fake), Gemini still a stub, OpenCode and Roo Code still future candidates.

Install

Three channels, same SHA-pinned binary, same checksums.txt:

# curl | sh
curl -fsSL https://github.com/omerakben/code-oz/releases/download/v0.20.3-alpha.0/install.sh \
  | sh -s -- --version v0.20.3-alpha.0

# npm
npm install -g @tuel/code-oz --@tuel:registry=https://registry.npmjs.org/

# Homebrew
brew tap omerakben/code-oz
brew install omerakben/code-oz/code-oz

Platform support: macOS arm64, macOS x64, Linux x64, Linux arm64. Windows + Scoop deferred.

First-friend recipe

The empty-repo intervention (#3) now surfaces with the exact remedy, but for friends new to code-oz the recipe below still gets them from zero to a working run with no manual remediation:

mkdir my-project && cd my-project
git init
git commit --allow-empty -m "init"
code-oz init
cat > INTENT.md <<'EOF'
[describe what you want built — a paragraph or two]
EOF
code-oz run --request "$(cat INTENT.md)" --effort lite

If they forget the git commit --allow-empty, v0.20.3's friendly intervention (#3) names the exact remedy command at the point of failure. If they write INTENT.md first and code-oz init afterward, the new greenfield-seed marker (#2) classifies the project correctly without manual remediation.
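
The greenfield-seed classification (#2) might look roughly like the sketch below. This is an illustration under assumptions: the real `isOnlyGreenfieldSeed()` and `detectProfile()` signatures are not shown in these notes, and the `looksBrownfield` callback is a hypothetical stand-in for whatever brownfield checks code-oz actually runs.

```typescript
type Profile = "greenfield" | "brownfield";

// Entries allowed in a seed directory, per the notes: .git/ and INTENT.md.
const SEED_ENTRIES = new Set<string>([".git", "INTENT.md"]);

// True when the directory holds INTENT.md and nothing outside the seed set.
function isOnlyGreenfieldSeed(entries: readonly string[]): boolean {
  return entries.includes("INTENT.md") && entries.every((name) => SEED_ENTRIES.has(name));
}

// Hypothetical shape: the seed check short-circuits before any brownfield check.
function detectProfile(
  entries: readonly string[],
  looksBrownfield: (entries: readonly string[]) => boolean,
): Profile {
  if (isOnlyGreenfieldSeed(entries)) return "greenfield";
  return looksBrownfield(entries) ? "brownfield" : "greenfield";
}
```

A directory with INTENT.md next to other source files fails the seed check and falls through to the brownfield checks, which matches the negative-direction case the notes mention.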

What changed in v0.20.3

Friend-experience fixes (the reason v0.20.3 exists)

  • fix(worktree): emit worktree_empty_repo intervention with actionable remedy (#3): git init without git commit no longer leaks fatal: ambiguous argument 'HEAD'. The new worktree_empty_repo intervention code names the remedy verbatim: git commit --allow-empty -m "init". captureBaseCommit() probes git rev-list --all --count to distinguish empty-repo from broken-HEAD.
  • fix(init): treat INTENT.md-only directories as greenfield seed (#2): code-oz init in a directory containing only .git/ + INTENT.md now classifies as greenfield. The new isOnlyGreenfieldSeed() helper short-circuits detectProfile() before any brownfield check fires. A real brownfield project that has INTENT.md alongside other source files still detects as brownfield (the negative-direction test guards this).
  • feat(build): worktree-reset between BUILD attempts on verify-fail restart (#1) — BUILD attempt N+1 on the verify-fail path now resets the worktree to the immutable base commit before persona compose / file-ref derivation / patch apply. The new resetWorktreeToBase primitive runs git reset --hard <baseSha> then git clean -fdx, emits worktree_reset_to_base on success or worktree_reset_failed intervention on failure. Scope: verify-fail only — the M9 review-needs-revision restart path preserves the worktree across attempts so delta patches still apply. Codex debate 019e28d9-bd57-71e0-b1a2-262cae205234 locked the design; verify-fail narrowing applied after test evidence surfaced the M9 contract.
  • fix(build): actionable build_report_notes_too_long intervention payload (#4) — when the builder produces a Notes bullet exceeding 200 characters, the intervention payload now names the bullet index (1-indexed), exact character count, and a preview (first 80 chars). Suggestion array points at the protocol document instead of the generic "run code-oz doctor" fallback.
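
The scope rule for the worktree reset (#1) can be sketched as a small planning function. This is an assumption-laden illustration: `planWorktreeReset` and the `RestartReason` type are hypothetical names, and the real `resetWorktreeToBase` primitive also emits events and interventions that are omitted here. Only the two git commands and the verify-fail-only scope come from the notes.

```typescript
// Hypothetical restart reasons, named after the two paths described in the notes.
type RestartReason = "verify-fail" | "review-needs-revision";

// Returns the git argv lists to run inside the worktree, or [] to preserve it.
function planWorktreeReset(reason: RestartReason, baseSha: string): string[][] {
  // Review restarts keep the worktree so delta patches still apply.
  if (reason !== "verify-fail") return [];
  // Refuse anything that does not look like a commit sha before it reaches git.
  if (!/^[0-9a-f]{7,40}$/.test(baseSha)) {
    throw new Error(`invalid base sha: ${baseSha}`);
  }
  return [
    ["git", "reset", "--hard", baseSha], // restore tracked files to the base commit
    ["git", "clean", "-fdx"],            // drop untracked and ignored files as well
  ];
}
```

Returning argv arrays rather than shell strings keeps the sha out of any shell interpolation; a caller would pass each array to a subprocess API of its choice.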

GUI polish

  • fix(gui): persist workspace/provider/effort across reloads (#5): repoPath, providerMode, and composerEffort now persist to namespaced localStorage keys. Friends refreshing the page no longer lose their workspace selection. One-time mount-only hydration avoids SSR/CSR mismatch; private-mode browsers fall back silently.
  • fix(gui): disk-validate run-registry to surface stale runs (#6): getRunRecord now validates each live run's runDir on read. If the directory was deleted (e.g., rm -rf .code-oz/state/), the in-memory record transitions to lifecycle: 'stale' and the TopBar surfaces "Stale (runDir removed)" instead of a ghost "IN PROGRESS." Fixture records stay unchanged.
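
The stale-run check (#6) could be sketched as below. The `RunRecord` shape and the `'live'` and `'fixture'` lifecycle values are assumptions made for illustration; only the `'stale'` transition on a missing runDir is taken from the notes. The directory check is injected so the sketch stays independent of any particular filesystem API.

```typescript
// Hypothetical lifecycle values; only 'stale' is named in the notes.
type Lifecycle = "live" | "stale" | "fixture";

interface RunRecord {
  runId: string;
  runDir: string;
  lifecycle: Lifecycle;
}

// Validate a record on read: a live run whose directory is gone becomes stale.
function validateRunRecord(
  record: RunRecord,
  dirExists: (path: string) => boolean,
): RunRecord {
  if (record.lifecycle !== "live") return record; // fixtures and stale records pass through
  return dirExists(record.runDir) ? record : { ...record, lifecycle: "stale" };
}
```

In a Node-based GUI server, `dirExists` could be backed by `fs.existsSync`; that wiring is left out here.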

Trust hygiene

  • fix(trust): correct the npm scope-routing install recipe (#7): docs/TRUST.md previously recommended npm install -g @tuel/code-oz --registry=https://registry.npmjs.org/ to bypass user-level @tuel:registry mappings. The recipe didn't work: scope-specific ~/.npmrc config has higher precedence than --registry=. v0.20.3 corrects to the --@tuel:registry= form (which sets scope-specific routing on the command line, always wins). Also switches the diagnostic check to npm config get @tuel:registry --global so the check matches the install's actual resolution scope.
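
The precedence problem behind #7 can be modeled in a few lines. This is a simplified model of the behavior the note describes, not npm's actual config loader: `resolveRegistry` and the layer ordering are hypothetical, and the private registry URL in the usage note is a placeholder.

```typescript
type ConfigLayer = Record<string, string>;

// Layers are ordered from highest precedence to lowest, e.g. [cliFlags, userNpmrc].
function resolveRegistry(pkg: string, layers: readonly ConfigLayer[]): string {
  const scope = pkg.startsWith("@") ? pkg.slice(0, pkg.indexOf("/")) : undefined;
  // A scoped package consults "<scope>:registry" in every layer before the generic key.
  const keys = scope ? [`${scope}:registry`, "registry"] : ["registry"];
  for (const key of keys) {
    for (const layer of layers) {
      const value = layer[key];
      if (value !== undefined) return value;
    }
  }
  return "https://registry.npmjs.org/";
}
```

Under this model a command-line `registry` value never beats a `@tuel:registry` entry in ~/.npmrc, because the scoped key is looked up first; setting `@tuel:registry` on the command line does, because it is the same key in a higher layer.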

Real-provider dogfood validation

Two end-to-end dogfoods produced the v0.20.2 release; this release closes the seven polish findings those dogfoods surfaced. The fixes are validated by:

  • 3,428 CLI tests + 22 GUI unit tests (offline; +13 new tests across the 7 findings).
  • Codex implementation review on the largest change (#1 worktree-reset) — verdict push, no block-push items.

Note on the pre-tag real-provider dogfood

The scripts/release/dogfood-smoke.sh real-provider gate hit Anthropic API overload (HTTP 529 + claude CLI exited with non-zero status 1 mid-invocation) on the 2026-05-14 tag attempt — twice in succession, at different phases (BUILD attempt 1 the first time, PLAN Scientist the second). The same overload window failed an unrelated Codex review agent dispatch with API Error: 529 Overloaded. The failures are infrastructure-side, not v0.20.3 regressions:

  • None of the seven v0.20.3 fixes touch the Claude CLI subprocess path, the Scientist prompt, or token math.
  • The 3,428 offline tests + typecheck + Codex #1 review verdict push cover every changed code path.
  • Memory pin feedback_bug_free_motto.md ("BUG free or it doesn't ship") guards against real regressions, not Anthropic infrastructure incidents.

Re-running dogfood-smoke after the API window stabilizes is the recommended post-tag check. If a real regression hid behind the infra noise, the next dogfood will catch it before users see it.

Provider support matrix (v0.20.3 — unchanged from v0.20.2)

| Provider | Status | Auth |
| --- | --- | --- |
| claude | Live | Claude Max OAuth via claude CLI subprocess |
| codex | Live | ChatGPT OAuth via codex CLI subprocess |
| xai | Live | Direct HTTPS with XAI_API_KEY env var |
| fake | Live | Built-in deterministic adapter |
| gemini | Stub | Throws provider_gemini_not_yet_supported; for transparency only |
| opencode / roo | Future candidate | Not v0.1; not implemented |

Canonical matrix: docs/contracts/PROVIDERS.md § "Provider status (v0.1)".

Known limitations and caveats

  • Public alpha. Treat as such. Production-hardened release line is v0.x stable, not yet shipped.
  • BUILD restart binary-spawn e2e is not yet asserting the new event. v0.20.3 #1 ships unit + integration coverage for the verify-fail worktree-reset path. The existing tests/e2e/cli-verify-fail-restart.test.ts is a binary-spawn e2e but does not yet assert the new worktree_reset_to_base event or compose-time base-only state. Codex review marked this as fix-soon (not block-push); the assertion lands in v0.20.4 unless a verify-fail dogfood surfaces an issue first.
  • Unsigned macOS binaries. Gatekeeper may prompt; the install script applies xattr -d com.apple.quarantine as the alpha workaround. brew install handles this automatically.
  • No GPG / Sigstore signing of checksums.txt. Cryptographic signing lands at v0.x stable.
  • No Windows or Scoop support. Windows is the documented v0.20.x deferred deliverable.
  • Brownfield AUDIT runtime not yet implemented. Detection works; runtime ships in v0.21 (M17).
  • Benchmark numbers are not yet measured. Protocol shipped in v0.20.1; runner + first measured rows ship in v0.21.

Trust verification

sha256sum ~/.local/bin/code-oz   # or wherever you installed it
# Compare to the SHA published at:
#   https://github.com/omerakben/code-oz/releases/download/v0.20.3-alpha.0/checksums.txt

The npm wrapper at npm-wrapper/index.cjs performs this verification automatically on first run. The Homebrew formula bakes the SHA into the formula at render time.
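
A check of this kind can be sketched in TypeScript as follows. This is not the code in npm-wrapper/index.cjs: the function names are hypothetical, and the sketch assumes checksums.txt uses the sha256sum line format (`<64 hex chars>  <filename>`), which these notes do not confirm.

```typescript
import { createHash } from "node:crypto";

// Parse sha256sum-style lines into a filename -> digest table (assumed format).
function parseChecksums(text: string): Map<string, string> {
  const table = new Map<string, string>();
  for (const line of text.split("\n")) {
    const match = /^([0-9a-f]{64})\s+\*?(.+)$/.exec(line.trim());
    if (!match) continue; // skip blank or malformed lines
    const [, digest, name] = match;
    if (digest && name) table.set(name, digest);
  }
  return table;
}

// Fail closed: an asset missing from checksums.txt never verifies.
function verifyBinary(binary: Uint8Array, assetName: string, checksums: string): boolean {
  const expected = parseChecksums(checksums).get(assetName);
  if (expected === undefined) return false;
  const actual = createHash("sha256").update(binary).digest("hex");
  return actual === expected;
}
```

As the v0.20.1 limitations note, a matching SHA protects against accidental corruption only; it is not a substitute for signed checksums.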

Cross-model peer review for this release

Codex `gpt-5...

v0.20.2-alpha.0

14 May 22:19

v0.20.2-alpha.0 — BUILD actually works end-to-end against real providers

Release flow note (read first): the repo's release.yml workflow auto-creates the GitHub release on tag push with thin auto-generated notes. After pushing the tag, Ozzy edits the release with these notes:

# tag and push (triggers release.yml)
git tag -a v0.20.2-alpha.0 -m "v0.20.2-alpha.0 — BUILD works end-to-end against real providers"
git push origin v0.20.2-alpha.0

# wait for the release.yml workflow to finish (creates the release with the binaries + checksums)
gh run watch

# then replace the auto-generated thin notes with the rich notes:
gh release edit v0.20.2-alpha.0 --notes-file docs/handoffs/2026-05-14-v0.20.2-release-notes.md

Why this release matters

v0.20.1 shipped polished install channels, a public-truth README, and a failure-mode demo. v0.20.1 did NOT actually run an end-to-end BUILD against real providers — every offline test path used the 01-todo-cli demo's hand-authored fake-script.jsonl, and no real-provider dogfood was attempted before tag. The first real dogfood (prdiff greenfield, 2026-05-14) surfaced two showstopper bugs:

  • #0a: src/prompts/build-system.md had no {{TASK_BLOCK}} substitution slot. Builder Opus received the universal rules + agent body + tool list but no per-task PLAN.md content. With nothing to build, Opus correctly refused via build_persona_protocol_violation on every retry.
  • #0b: BUILD and REVIEW invoked the agent with filesSent: 0. The orchestrator loaded PLAN.md for its own validation purposes but never threaded the task's file manifest into the agent invocation. The agent had no source files to read.

v0.20.2 fixes both. After the fix, BUILD reaches task_completed on real Opus + GPT-5.5 against two separate greenfield projects (prdiff and quizr); cross-family REVIEW returns substantive findings against real code instead of "I can't see the source." The motto for this release: BUG free or it doesn't ship. Half-working flows don't ship; the dogfood is the gate.

No new gate authority introduced (rule 20 holds). M17 AUDIT runtime stays scheduled for v0.21. Provider contract unchanged: same four live adapters (Claude, Codex, xAI, Fake), Gemini still a stub, OpenCode and Roo Code still future candidates.

Install

Three channels, same SHA-pinned binary, same checksums.txt:

# curl | sh
curl -fsSL https://github.com/omerakben/code-oz/releases/download/v0.20.2-alpha.0/install.sh \
  | sh -s -- --version v0.20.2-alpha.0

# npm (publish pending — see "Distribution status" below)
npm install -g @tuel/code-oz

# Homebrew (formula bump pending — see "Distribution status" below)
brew tap omerakben/code-oz
brew install omerakben/code-oz/code-oz

Platform support: macOS arm64, macOS x64, Linux x64, Linux arm64. Windows + Scoop deferred to v0.20.x (see "Known limitations").

First-friend recipe (do this verbatim)

The greenfield-friend dogfood proved the GUI+CLI flow works only when the project's git repo has at least one commit before BUILD fires. Until v0.20.3 closes the empty-repo intervention, friends should follow this recipe:

mkdir my-project && cd my-project
git init
git commit --allow-empty -m "init"   # REQUIRED until v0.20.3
code-oz init
cat > INTENT.md <<'EOF'
[describe what you want built — a paragraph or two]
EOF
code-oz run --request "$(cat INTENT.md)" --effort lite

This is the recipe that produced task_completed on the quizr greenfield dogfood. The GUI workflow follows the same flow (init the directory + commit, then point the GUI workspace at it).

What changed in v0.20.2

Showstopper fixes (the reason v0.20.2 exists)

  • feat(prompts): inject TASK_BLOCK into BUILD system prompt (#0a): src/prompts/build-system.md gains a {{TASK_BLOCK}} substitution slot. composeBuildPromptPure accepts a PlanTask parameter and renders the task's id, files, validation, risk, hypotheses, and sources into the prompt sent to Opus. The Codex debate that locked this design is captured at docs/design/V0_20_2_SHOWSTOPPER_0A_BRIEFING.md + ..._CODEX_RESPONSE.md (thread 019e281e).
  • feat(runtime): derive BUILD file refs from PlanTask + widen invokePersona (#0b): src/phases/build.ts and the REVIEW phase now build their ProviderRequest.files manifest from task.files (expanding directories, applying allowlists, dropping paths that escape the worktree root). Builder Opus and REVIEW GPT-5.5 receive real file content instead of empty manifests. The Codex debate is at docs/design/V0_20_2_SHOWSTOPPER_0B_BRIEFING.md + ..._CODEX_RESPONSE.md (thread 019e2827). Codex R1 caught a path-traversal escape in the file-manifest expansion; fix landed in 6720683.
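
The "dropping paths that escape the worktree root" step in #0b can be illustrated with a lexical containment check. This is a sketch under assumptions: `filterManifest` is a hypothetical name, not the function in src/phases/build.ts, and a purely lexical check like this one does not follow symlinks, which a production implementation would likely need to consider.

```typescript
import * as path from "node:path";

// Keep only manifest entries that resolve strictly inside the worktree root.
function filterManifest(worktreeRoot: string, files: readonly string[]): string[] {
  const root = path.resolve(worktreeRoot);
  return files.filter((file) => {
    const resolved = path.resolve(root, file);
    const rel = path.relative(root, resolved);
    const escapes = rel === ".." || rel.startsWith(".." + path.sep) || path.isAbsolute(rel);
    return rel !== "" && !escapes; // "" would be the root itself, not a file
  });
}
```

Resolving before comparing means an entry such as `src/../../x` is judged by where it lands rather than by how it is spelled.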

GUI polish (closed in same release because all hit the friend-first-run path)

  • feat(gui): explicit honesty banner above Composer in fake-provider mode (#3) — FakeProvider banner renders on first paint with the literal "fake response" string + "reached the conversation cap" failure mode named. Disappears when user switches to "Real providers."
  • feat(gui): plumb --effort selector through Composer to code-oz spawn (#5) — GUI dropdown values (lite | balanced | max | beast) thread through to spawn argv. Validated via effort_envelope_applied lite event in events.jsonl on the quizr dogfood.
  • fix(gui): close exit-0 silent swallow in spawnCodeOzRun (#6) — Composer submit no longer silently swallows; RUN HISTORY immediately shows "TODAY · IN PROGRESS" instead of a 30-second blank wait.
  • fix(gui): make spawn runId timeout configurable + raise default to 60s (#4) — bun-based dev installations no longer hit the previous 30s spawn ceiling.

Trust hygiene

  • docs(trust): document npm @tuel scope-routing install trap (#2): docs/TRUST.md § "npm scope-routing trap" names the ~/.npmrc @tuel:registry override case that can silently fail npm install -g @tuel/code-oz. Codex R1 added the explicit --registry=https://registry.npmjs.org/ remedy command. The trap was the only documented blocker for friends installing from npm.

CI hygiene

  • fix(ci): green Tests badge — drop unused appendFile + scope bun test to ./tests — the README Tests badge resumes green status after the badge-source job stopped trying to run experimental test paths that don't exist in the open tree.

Pre-tag gate

  • feat(release): dogfood-smoke pre-tag gate script: scripts/dogfood-smoke.sh runs a deterministic FakeProvider lifecycle to task_completed before any tag push. v0.20.2 ran this gate and passed before tagging.

Real-provider dogfood validation

Two end-to-end dogfoods produced the v0.20.2 ship gate. Both ran against fresh greenfield projects with real Opus 4.7 + GPT-5.5 spend.

prdiff dogfood — showstopper validation

| Gate | Status | Evidence |
| --- | --- | --- |
| BUILD T-001 attempt 1 emits <build-ready/> + valid unified-diff patch | PASS | build_completed event at 21:05:05; patch sha ebe6553b, 1659 bytes; 4 files created |
| Patch applies to the worktree | PASS | worktree_patch_applied event; files visible in .code-oz/runs/<run>/worktree/ |
| VERIFY runs the validation command and captures evidence | PASS | verify_completed at 21:09:50; bun run src/prdiff.ts --version exit 0, 8ms, 6 bytes stdout |
| Cross-family REVIEW reads source files + returns honest verdict | PASS | REVIEW round 1 at 21:18:39 returned needs-revision score=5 with 2 fix-first findings against the actual code |

Cost: ~$0.40 real Opus + GPT-5.5 spend, ~30 min wall clock. Run ID 01KRM4EBQRX0GK1RT85RP1EF6D. Full verdict at docs/handoffs/2026-05-14-v0.20.2-dogfood-verdict.md.

quizr greenfield-friend dogfood — full friend flow

A first-friend simulation: brand-new directory + INTENT.md + GUI at localhost:3000 + --effort lite. Builder's chain-of-thought line for T-001 confirms the task block reached the model:

"Empty greenfield worktree confirmed. I'll scaffold the project with strict shape-test discipline so downstream tasks can trust the bank data."

That sentence namechecks the verbatim risk text from PLAN.md T-001. Impossible without #0a's TASK_BLOCK injection working end-to-end.
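
How a task's risk text reaches the model through the slot can be sketched as follows. The `PlanTask` shape mirrors the fields the notes list (id, files, validation, risk, hypotheses, sources), but its exact types, the rendered layout, and the function names here are assumptions; this is not `composeBuildPromptPure` itself.

```typescript
// Hypothetical PlanTask shape, following the fields named in the notes.
interface PlanTask {
  id: string;
  files: string[];
  validation: string;
  risk: string;
  hypotheses: string[];
  sources: string[];
}

// Render the per-task block that the prompt template was previously missing.
function renderTaskBlock(task: PlanTask): string {
  return [
    `Task: ${task.id}`,
    `Files: ${task.files.join(", ")}`,
    `Validation: ${task.validation}`,
    `Risk: ${task.risk}`,
    `Hypotheses: ${task.hypotheses.join("; ")}`,
    `Sources: ${task.sources.join("; ")}`,
  ].join("\n");
}

// Substitute the slot; a template without it is an error rather than a silent no-op.
function composeBuildPrompt(template: string, task: PlanTask): string {
  const slot = "{{TASK_BLOCK}}";
  if (!template.includes(slot)) {
    throw new Error("template has no {{TASK_BLOCK}} slot");
  }
  // split/join avoids the special "$" handling of String.prototype.replace.
  return template.split(slot).join(renderTaskBlock(task));
}
```

Throwing on a missing slot is a design choice of this sketch: it turns the #0a failure mode (a prompt with no task content) into a loud error at compose time.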

Cost: ~$0.55 real Opus + Codex tokens (DEFINE + PLAN + BUILD attempt 1, --effort lite). Run ID 01KRM8G35J71Q1ZYAKGNZ6JE1V. Full verdict at docs/handoffs/2026-05-14-v0.20.2-greenfield-friend-dogfood.md.

Cross-family REVIEW now returns substantive findings

Before v0.20.2, REVIEW invocations carried filesSent: 0 and GPT-5.5 had no source to read; verdicts said "I can't see the code." After v0.20.2, REVIEW reads the same file manifest the builder produced, and findings are about the actual code:

F-001: src/prdiff.ts has --help and default-invocation branches that VERIFY didn't exercise (only --version was validated). Recommendation: add a Bun test file covering --version, --help, and default invocation.

F-002: package.json declares "@types/bun": "latest". Universal rule (docs/research/02-llm-failure-research.md rule 18: "every new dependency must be pinned before importing it") requires a pinned version.

Both findings are factually correct against the patched code. The reviewer is doing what cross-family REVIEW exists for — catching what the builder + verifier missed when both were Opus inside the same "scaffold passed, ship it" frame.

Provider support matrix (v0.20.2 — unchanged from v0.20.1)

| Provider | Status | Auth |
| --- | --- | --- |
| claude | Live | Claude Max OAuth via claude CLI subprocess |
| codex | Live | ChatGPT OAuth via codex CLI subprocess |
| xai | Live | Direct HTTPS with XAI_API_KEY env var |
| fake | Live | Built-in deterministic adapter |
| gemini | Stub | Throws provider_gemini_not_yet_supported; for transparency only |
| opencode / roo | Future candidate | Not v0.1; not implemented |
...
v0.20.1-alpha.0

14 May 15:56

v0.20.1-alpha.0 — first-run polish + public truth sync

Release flow note (read first): the repo's release.yml workflow auto-creates the GitHub release on tag push with thin auto-generated notes. After pushing the tag, Ozzy edits the release with these notes:

# tag and push (triggers release.yml)
git tag -a v0.20.1-alpha.0 -m "v0.20.1-alpha.0 — first-run polish + public truth sync"
git push origin v0.20.1-alpha.0

# wait for the release.yml workflow to finish (creates the release with the binaries + checksums)
gh run watch

# then replace the auto-generated thin notes with the rich notes:
gh release edit v0.20.1-alpha.0 --notes-file docs/handoffs/2026-05-14-v0.20.1-release-notes.md

Why this release matters

A third-party-eye audit (gpt-5.5 Pro) scored code-oz engineering at 8.0/10 but 1000-star readiness at 3.5/10. The repo was technically more serious than the public surface suggested. v0.20.1 closes the readiness gap with two parts: first-run polish (small src/ fixes that improve the first-time-user experience) and a public truth sync (an honest README, security and community files, a failure-mode demo, a benchmark protocol).

No new gate authority introduced (rule 20 holds). M17 AUDIT runtime stays scheduled for v0.21. Provider contract unchanged: same four live adapters (Claude, Codex, xAI, Fake), Gemini still a stub, OpenCode and Roo Code still future candidates.

Install

Three channels, same SHA-pinned binary, same checksums.txt:

# curl | sh
curl -fsSL https://github.com/omerakben/code-oz/releases/download/v0.20.1-alpha.0/install.sh \
  | sh -s -- --version v0.20.1-alpha.0

# npm
npm install -g @tuel/code-oz

# Homebrew
brew tap omerakben/code-oz
brew install omerakben/code-oz/code-oz

Platform support: macOS arm64, macOS x64, Linux x64, Linux arm64. Windows + Scoop deferred.

Try it

Happy-path demo (1 command, runs offline against FakeProvider):

git clone https://github.com/omerakben/code-oz.git
cd code-oz
bun install
bun run demo:todo-cli

New: failure-mode demo (the demo to watch before trusting the tool):

bun run demo:failure-gates

Five fixtures exercise five production gate APIs and prove the gates refuse the wrong thing: tampered approval, scope-escape path, verify intervention, same-family REVIEW, needs-revision verdict routing. Captured outputs live in docs/demo/02-failure-gates/output/<fixture>/ for inspection.

Walkthrough: docs/demo/02-failure-gates/README.md.

What changed in v0.20.1

First-run polish (src/ fixes inherited from earlier branch work)

Three pre-existing src/ commits on the branch landed before the public-truth-sync session began:

  • fix(providers): classify expired subprocess auth — surfaces auth-expired errors with the actionable suggestion ("re-run claude login / codex login") instead of opaque subprocess stack traces.
  • fix(errors): make intervention pointers line-specific: events.jsonl:line=N pointers in NEEDS_INTERVENTION.json now point at the precise event line, not just the file.
  • fix(cli): close first-run fake and resume paths — first-run code-oz init && code-oz run against FakeProvider no longer hits gaps in the resume path; the init/run/doctor UX is tighter.

Plus the version bump: package.json, src/cli.ts, and src/config/schema.ts (DEFAULT_CONFIG.version) all advance to 0.20.1-alpha.0.

git diff --stat origin/main..HEAD -- src/ is NOT empty for v0.20.1 (the three first-run polish commits + the version bump). What is unchanged is the gate authority surface, the provider contract, and the lifecycle phase taxonomy.

Truth (truth correction track)

  • README hero rewritten to "CI-style gates for AI coding agents."
  • Provider claims corrected: Claude / Codex / xAI / Fake live; Gemini stub; OpenCode + Roo Code as future candidates.
  • package.json description rewritten ("simulation" word removed).
  • "AI software company" metaphor demoted to docs/ABOUT.md historical context.
  • CLAUDE.md truth-synced to match README provider claims.

Trust (trust hygiene track)

  • SECURITY.md with explicit unsigned-binary caveat + signing-milestone pointer.
  • CONTRIBUTING.md with setup, tests, conventional-commit rules, cross-model review discipline.
  • CODE_OF_CONDUCT.md (Contributor Covenant 2.1 by reference).
  • docs/TRUST.md covering data boundaries, artifact trust, install trust, provider auth.
  • .github/ issue templates (4 forms + config) and PR template.

Proof (proof assets track)

  • docs/demo/02-failure-gates/ — 5 fixtures + walkthrough + tests + captured outputs.
  • bun run demo:failure-gates — new package script.
  • docs/comparisons/ai-coding-agents.md — Codex-verified, footnote-sourced comparison vs Cursor / Claude Code / Aider / Continue / Devin.
  • docs/benchmarks/agent-gate-bench.md — benchmark protocol (runner ships in v0.21).
  • docs/design/ROADMAP.md — public Now/Next/Later summary at the top.

Provider support matrix (v0.20.1)

| Provider | Status | Auth |
| --- | --- | --- |
| claude | Live | Claude Max OAuth via claude CLI subprocess |
| codex | Live | ChatGPT OAuth via codex CLI subprocess |
| xai | Live | Direct HTTPS with XAI_API_KEY env var |
| fake | Live | Built-in deterministic adapter |
| gemini | Stub | Throws provider_gemini_not_yet_supported; for transparency only |
| opencode / roo | Future candidate | Not v0.1; not implemented |

Canonical matrix: docs/contracts/PROVIDERS.md § "Provider status (v0.1)".

Limitations and caveats

  • Public alpha. Treat as such. Production-hardened release line is v0.x stable, not yet shipped.
  • Unsigned macOS binaries. Gatekeeper may prompt; the install script applies xattr -d com.apple.quarantine as the alpha workaround. brew install handles this automatically.
  • No GPG / Sigstore signing of checksums.txt. SHA chain protects against accidental tarball corruption, not against a determined supply-chain attacker. Cryptographic signing lands at v0.x stable.
  • No Windows or Scoop support. Deferred.
  • Brownfield AUDIT runtime not yet implemented. Detection works; runtime ships in v0.21 (M17).
  • Benchmark numbers are not yet measured. The protocol doc ships in v0.20.1; the runner and first measured rows ship in v0.21.

Trust verification

# Verify the SHA of your downloaded binary matches the published checksum:
sha256sum ~/.local/bin/code-oz   # or wherever you installed it
# Compare to the SHA published at:
#   https://github.com/omerakben/code-oz/releases/download/v0.20.1-alpha.0/checksums.txt

The npm wrapper at npm-wrapper/index.cjs performs this verification automatically on first run. The Homebrew formula bakes the SHA into the formula at render time.

Cross-model peer review for this release

Codex gpt-5.5 xhigh reviewed both the planning + implementation phases:

  • Planning convergence (R0) — verdict accept-with-modifications. Five block-approve closures + five medium + five missed risks folded into the design before implementation. (docs/design/CODEX_RESPONSE_V0_20_1_POLISH_R0.md)
  • Failure-demo code track (R1) — verdict fix-first. Three block-push findings on framing claims (overclaimed events.jsonl ledger, incorrect verify-fail semantics, fixture 01 NEEDS_INTERVENTION mismatch) closed in commit 52f6c4c. (docs/design/CODEX_RESPONSE_V0_20_1_POLISH_R1.md)

This is the project's discipline: no release ships without an independent (different model family) review.

Next: v0.21.0-alpha.0 (M17 AUDIT runtime)

Brownfield AUDIT runtime lands in v0.21. Roadmap: docs/design/ROADMAP.md#now-next-later.

Tests

bun test: 3395 pass / 0 fail / 2 skip. The 2 skips are opt-in live-provider tests (xAI) gated behind CODE_OZ_LIVE_PROVIDER_TESTS=xai.

v0.20.0-alpha.0

12 May 03:24

v0.20.0-alpha.0 — first public alpha (backfilled notes)

For Ozzy to post via:

gh release edit v0.20.0-alpha.0 \
  --notes-file docs/handoffs/2026-05-14-v0.20.0-release-notes-backfill.md

This backfill replaces the original v0.20.0-alpha.0 release notes per GPT Pro audit issue #5 ("latest release notes are too thin compared with prior milestone releases"). Backfilling preserves provenance; nothing changes about the binaries.

Why this release matters

v0.20.0-alpha.0 is the first public alpha of code-oz available through curl, npm, and Homebrew. Same SHA-pinned binary across all three channels.

code-oz runs AI coding agents through a repo-local delivery loop: DEFINE → PLAN → BUILD → VERIFY → REVIEW → SHIP. File-based gates, SHA-256-bound approvals, isolated worktrees, events.jsonl ledger, cross-family REVIEW (builder and reviewer must differ in model family).
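
The cross-family REVIEW rule (builder and reviewer must differ in model family) could be expressed as a small policy check. The provider-to-family table and the names below are hypothetical; only the rule itself comes from the description above, and how code-oz treats the fake provider is not covered by this sketch.

```typescript
// Hypothetical provider-to-family table, for illustration only.
const FAMILY = new Map<string, string>([
  ["claude", "claude-family"],
  ["codex", "codex-family"],
  ["xai", "xai-family"],
]);

interface PolicyResult {
  ok: boolean;
  reason?: string;
}

// Refuse same-family pairs, and fail closed on providers with no known family.
function checkCrossFamily(builder: string, reviewer: string): PolicyResult {
  const builderFamily = FAMILY.get(builder);
  const reviewerFamily = FAMILY.get(reviewer);
  if (builderFamily === undefined || reviewerFamily === undefined) {
    return { ok: false, reason: "unknown provider family" };
  }
  if (builderFamily === reviewerFamily) {
    return { ok: false, reason: `builder and reviewer share ${builderFamily}` };
  }
  return { ok: true };
}
```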

Use it when direct AI coding is too unconstrained and you want every change to pass through inspectable artifacts before it ships.

Install

# curl | sh
curl -fsSL https://github.com/omerakben/code-oz/releases/download/v0.20.0-alpha.0/install.sh \
  | sh -s -- --version v0.20.0-alpha.0

# npm (scoped under TUEL AI publisher; binary still runs as `code-oz`)
npm install -g @tuel/code-oz

# Homebrew
brew tap omerakben/code-oz
brew install omerakben/code-oz/code-oz

Platform support: macOS arm64, macOS x64, Linux x64, Linux arm64.

Demo

Deterministic happy-path demo runs offline against FakeProvider:

git clone https://github.com/omerakben/code-oz.git
cd code-oz
bun install
bun run demo:todo-cli

The demo runs one full DEFINE → SHIP lifecycle, writes all 5 gate files, exercises cross-family REVIEW (BUILD on Claude family, REVIEW on Codex family), and emits a complete events.jsonl ledger under docs/demo/01-todo-cli/output/.

What shipped in v0.20.0

This release closed W3a (multi-channel distribution surface) plus the locked B1a --effort flag, on top of the M14 / M15 / M16 / PE-1 lifecycle work shipped through earlier alphas.

Distribution (W3a)

  • Four per-arch native binaries built in CI via bun build --compile:
    • darwin-arm64
    • darwin-x64
    • linux-x64
    • linux-arm64
  • Fail-closed install script (scripts/install.sh) with SHA-256 verification (sha256sum → shasum → openssl chain), Linux detection, and tagged-release fetch.
  • npm Node-launcher wrapper (npm-wrapper/index.cjs):
    • No postinstall hook; survives npm ci --ignore-scripts.
    • Downloads + SHA-verifies + caches the per-arch binary at ~/.cache/code-oz/<version>/code-oz on first run.
  • Homebrew formula at omerakben/homebrew-code-oz tap, rendered at release time from checksums.txt.
  • Single SHA-pinned binary contract across all three install channels.
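The fail-closed verification shared by the install channels can be sketched as follows. This is an illustration under stated assumptions rather than the project's code: it assumes `checksums.txt` uses the standard `sha256sum` line layout (`<hex digest>  <filename>`), and the `parseChecksums` and `verifyBinary` names are hypothetical.

```typescript
import { createHash } from "node:crypto";

// Parses sha256sum-style lines ("<64 hex chars>  <filename>") into a filename -> digest map.
// ASSUMPTION: checksums.txt follows the sha256sum output layout.
function parseChecksums(text: string): Map<string, string> {
  const entries = new Map<string, string>();
  for (const line of text.split(/\r?\n/)) {
    const match = /^([0-9a-fA-F]{64})\s+\*?(.+)$/.exec(line.trim());
    if (match) entries.set(match[2], match[1].toLowerCase());
  }
  return entries;
}

// Fail-closed: a missing entry or a digest mismatch throws, so the caller
// never proceeds to run or cache an unverified binary.
function verifyBinary(fileName: string, bytes: Uint8Array, checksumsText: string): void {
  const expected = parseChecksums(checksumsText).get(fileName);
  if (expected === undefined) {
    throw new Error(`no checksum entry for ${fileName}`);
  }
  const actual = createHash("sha256").update(bytes).digest("hex");
  if (actual !== expected) {
    throw new Error(`SHA-256 mismatch for ${fileName}`);
  }
}
```

Treating "no entry" the same as "wrong digest" is the design choice that makes the check fail closed: the absence of evidence never counts as a pass.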

Effort envelope (B1a)

  • code-oz run --effort lite|balanced|max|beast scales budgets.global and budgets.perPhase uniformly.
  • The flag NEVER changes assurance invariants (review rounds, panel slot count, mutation gate threshold, BUILD restart attempt cap, debate-policy thresholds).
  • Run-shape envelope is locked at run start; active-run replay reads the recorded snapshot, not the live config.
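The effort envelope described above amounts to one multiplier applied to every budget, with everything else left alone. The sketch below illustrates that shape only: the multiplier values, the `RunShape` type, and the `applyEffort` name are assumptions, since these notes do not publish the real factors or types.

```typescript
type Effort = "lite" | "balanced" | "max" | "beast";

interface RunShape {
  budgets: { global: number; perPhase: Record<string, number> };
  maxReviewRounds: number; // assurance invariant: must not vary with effort
}

// ASSUMPTION: illustrative multipliers; the real values are not stated in the notes.
const EFFORT_MULTIPLIER: Record<Effort, number> = {
  lite: 0.5,
  balanced: 1,
  max: 2,
  beast: 4,
};

// Scales budgets.global and every budgets.perPhase entry by the same factor,
// copies assurance invariants through unchanged, and freezes the result so the
// snapshot taken at run start cannot be mutated later.
function applyEffort(base: RunShape, effort: Effort): Readonly<RunShape> {
  const factor = EFFORT_MULTIPLIER[effort];
  const perPhase: Record<string, number> = {};
  for (const [phase, budget] of Object.entries(base.budgets.perPhase)) {
    perPhase[phase] = budget * factor;
  }
  return Object.freeze({
    budgets: Object.freeze({
      global: base.budgets.global * factor,
      perPhase: Object.freeze(perPhase),
    }),
    maxReviewRounds: base.maxReviewRounds,
  });
}
```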

Lifecycle and tooling (recap of M14 / M15 / M16 / PE-1)

  • M14 Reviewer panel v1 with cross-family quorum (first simultaneous-provider surface).
  • M15 Debate-policy scheduler v1 (triggers debate on score grey-zone and panel disagreement).
  • M16 Production CLI completion authorities (full init, run, approve, doctor surface on greenfield multi-task PLANs).
  • PE-1 xAI direct HTTP adapter (XAI_API_KEY + redaction discipline + typed error class).

Provider support matrix (v0.20.0)

Provider | Status | Auth
-------- | ------ | ----
claude   | Live   | Claude Max OAuth via claude CLI subprocess
codex    | Live   | ChatGPT OAuth via codex CLI subprocess
xai      | Live   | Direct HTTPS with XAI_API_KEY env var
fake     | Live   | Built-in deterministic adapter
gemini   | Stub   | Throws on invocation; for transparency only

For OpenCode and Roo Code, see the v0.20.1 PROVIDERS.md restructure: future adapter candidates, not v0.1.

Limitations

  • Public alpha. Production-hardened release line is v0.x stable, not yet shipped.
  • Unsigned macOS binaries (Gatekeeper may prompt; xattr -d com.apple.quarantine is the alpha workaround).
  • No GPG / Sigstore signing of checksums.txt yet.
  • No Windows or Scoop support.
  • Brownfield AUDIT runtime is detected but not yet executed; runtime ships in v0.21 (M17).
  • Gemini is a stub (real Gemini adapter is a future-candidates roadmap item).

Trust verification

sha256sum ~/.local/bin/code-oz
# Compare to:
#   https://github.com/omerakben/code-oz/releases/download/v0.20.0-alpha.0/checksums.txt

Tests

bun test: 3362 offline tests passing in CI. Live xAI integration tests are opt-in, gated behind CODE_OZ_LIVE_PROVIDER_TESTS=xai + CODE_OZ_LIVE_XAI_MODEL=<grok-variant>.

Next

The v0.20.1 release (first-run polish + public truth sync) adds three small first-run fixes under src/ (provider auth-expired classification, intervention-pointer specificity, CLI first-run-fake + resume-paths), plus README, security, community, failure-demo, and benchmark-protocol work. See v0.20.1-alpha.0 when published. It introduces no new gate authority, and the provider contract is unchanged.

M17 AUDIT runtime ships in v0.21.

v0.19.0-alpha.0 — B1a effort flag + opencode triage + runnable demo

11 May 19:51

What's new

Behavior changes (one new authority per rule 20):

  • code-oz run --effort lite|balanced|max|beast (B1a + new CLAUDE.md rule 23) scales the budgets.global and budgets.perPhase envelopes at run start and locks the recorded snapshot into events.jsonl via a new effort_envelope_applied event at position 2 (between run_started and phase_entered). Active-run reload sites replay that snapshot, so editing .code-oz/config.yaml mid-run can no longer silently change the envelope, and a mismatched --effort on an active run is rejected. The flag scales budgets ONLY; it never changes maxReviewRounds, panel slot count, mutation gate threshold, debate-policy thresholds, or AUDIT strictness, until an assurance-aware effort contract amends the rule.
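The snapshot-and-replay behavior of the effort envelope can be sketched like this. The event payload shapes and the `replayEnvelope` and `resolveEffort` names are hypothetical; only the event names, the second-position placement, and the reject-on-mismatch rule come from the notes above.

```typescript
// Hypothetical event shapes; only the event names come from the release notes.
type RunEvent =
  | { type: "run_started" }
  | { type: "effort_envelope_applied"; effort: string; budgets: { global: number } }
  | { type: "phase_entered"; phase: string };

// Reads the envelope recorded in the ledger. The live config is never consulted,
// which is why a mid-run config edit cannot change the envelope.
function replayEnvelope(events: RunEvent[]) {
  const first = events[0];
  const second = events[1];
  if (first === undefined || first.type !== "run_started") {
    throw new Error("ledger does not begin with run_started");
  }
  if (second === undefined || second.type !== "effort_envelope_applied") {
    throw new Error("ledger does not record an effort envelope as its second event");
  }
  return second;
}

// On an active run, an explicit --effort must match what was recorded at run start.
function resolveEffort(events: RunEvent[], requested?: string): string {
  const recorded = replayEnvelope(events).effort;
  if (requested !== undefined && requested !== recorded) {
    throw new Error(
      `--effort ${requested} does not match the active run's recorded effort ${recorded}`,
    );
  }
  return recorded;
}
```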

Trust-boundary contracts (design-only, demand-gated implementation):

  • docs/contracts/MCP_TRUST_BOUNDARY.md (opencode Commit A 1/3): 12 invariants for any future MCP integration (no startup auto-connect, per-server allowlist, env/header redaction, OAuth direct-token discipline distinct from subscription-first auth lock, deny-by-default tool_use.mcp scope, audit-event envelopes mirroring repo_context_searched). Implementation milestone opens on demand checkpoint.

Roadmap candidate slots (post-M16, demand-gated, not pre-locked):

  • Deny-dominant wildcard permission expressions (opencode B2): pattern-language extension to permission enforcement with deny-override-allow semantics, permission_pattern_evaluated event.
  • Cancellation, timeout, and debate-recursion guard (opencode M-CANCEL): structured cancellation via AbortSignal, runReviewPanel per-voter timeoutMs, nested-requestDebate rejection.
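Deny-override-allow semantics for wildcard permission patterns can be sketched in a few lines. This is a hypothetical illustration of the candidate, not a committed design: the `PermissionRules` shape, the single `*` wildcard, and the default-deny fallback for unmatched targets are all assumptions.

```typescript
// Converts a wildcard pattern ("*" matches any run of characters) into an anchored RegExp.
function wildcardToRegExp(pattern: string): RegExp {
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, "\\$&").replace(/\*/g, ".*");
  return new RegExp(`^${escaped}$`);
}

// Hypothetical rule shape for illustration.
interface PermissionRules {
  allow: string[];
  deny: string[];
}

// Deny-dominant: a matching deny pattern wins over any allow pattern.
// ASSUMPTION: targets matched by no pattern are denied by default.
function evaluatePermission(rules: PermissionRules, target: string): "allow" | "deny" {
  const matches = (patterns: string[]): boolean =>
    patterns.some((pattern) => wildcardToRegExp(pattern).test(target));
  if (matches(rules.deny)) return "deny";
  return matches(rules.allow) ? "allow" : "deny";
}
```

Checking the deny list first means the order of rules in a config file cannot accidentally widen access: a broad allow such as `src/*` never overrides a narrower deny such as `src/secrets/*`.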

Production prose fix:

  • src/phases/verify-mutation.ts:186,191 mutation-status notes now say "validation command" instead of "new tests" — honest for any validation command shape (test runners, build commands, file-existence checks). Caught by the post-3-session Codex retrospective (thread 019e188a).

Demo:

  • New bun run demo:todo-cli [--effort lite|balanced|max|beast] ships a runnable end-to-end greenfield todo CLI walkthrough. Drives DEFINE → PLAN → BUILD → VERIFY → REVIEW → SHIP through bun run src/cli.ts via FakeProvider with canned responses authored fresh for the todo CLI. All 5 gate files + 8 artifacts + events.jsonl captured per effort level under docs/demo/01-todo-cli/output/. README walkthrough at docs/demo/01-todo-cli/README.md. Asciicast deferred to a v0.19.x follow-up.

What stayed

3299 offline tests pass at this tag (2 skipped — opt-in live xAI integration tests).

Process notes

  • Locked 3-session plan + demo prep executed across 2026-05-12 (Sessions 1/2/3 + Codex retrospective + tag commit).
  • Cross-model peer review: 4 Codex rounds on B1a (pre-design + R0 + R1 + R2 push), 1 Codex round on the opencode triage merge (R-merge push), 1 Codex retrospective on the full sweep (019e188a, verdict fix-first → all actionable findings closed except deferred block-next-comparison index hygiene).
  • Two memory entries added: feedback_preflight_worktree_state.md (preflight WIP worktree state before drafting multi-session plans) + feedback_stash_on_stale_base.md (record merge-base SHA in stash messages when stashing on a stale base).
  • All 6 version-bearing surfaces (package.json, src/cli.ts, src/config/schema.ts, tests/m5-fix-first.test.ts, tests/cli-init.test.ts, tests/smoke-test.test.ts) bumped in a single commit — applies the v0.18 release residue lesson.

Diff range

`e64e4ff..4f4d061` (15 commits)

Full changelog

v0.18.0-alpha.0 — Template-comparison sweep landing

11 May 04:26

Summary

17 PRs merged on 2026-05-11 closing the full 22-template comparison series with 12 substantive borrows.

Rules added/expanded

  • Rule 22 — consumer-first design + RED-first TDD (byterover-cli borrow)
  • Rule 1 — intervention-writer authority expansion (Mimir C-MIMIR-1): intervention/control primitives (writeNeedsInterventionGate/writePauseGate/writeStopGate via writeControlGate) explicitly listed as gate-file writers alongside approval primitives
  • Rule 9 — generalized from .ts escape hatch to any executable runner (codex template borrow): .ts/.py/.sh/native binaries all require permission manifest with command/interpreter/cwd/file_roots/network/env/secrets/timeout/output_caps
  • Rule 16 — persona-generation discipline pinned (Mimir C-MIMIR-4): LLM-generated personas forbidden; deterministic template renderer required. Universal-rules list expanded from 20 to 21 items (agent-skills round-2)
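The permission manifest that rule 9 requires for executable runners can be pictured as a completeness check over the listed fields. The field names are taken from the rule 9 summary above; the checker, its name, and the idea of representing a manifest as a plain object are assumptions for illustration.

```typescript
// Field names as listed under rule 9; everything else here is hypothetical.
const REQUIRED_MANIFEST_FIELDS = [
  "command",
  "interpreter",
  "cwd",
  "file_roots",
  "network",
  "env",
  "secrets",
  "timeout",
  "output_caps",
] as const;

type ManifestField = (typeof REQUIRED_MANIFEST_FIELDS)[number];

// Returns the required fields a manifest omits; an empty array means it is complete.
function missingManifestFields(manifest: Record<string, unknown>): ManifestField[] {
  return REQUIRED_MANIFEST_FIELDS.filter((field) => !(field in manifest));
}
```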

Features landed

  • Named approval presets (auto/paranoid/interactive) — codex template
  • REVIEW specialist rubric + module-size sub-skill prompts
  • PLAN mutation/exploration discipline
  • lintSpecQuality DEFINE warnings (M-SPEC1, prd-taskmaster borrow)
  • Allowlisted env reader with Bun /proc/self/environ fallback (pi-mono B5)
  • Cross-family handoff matrix test, 12 pairs (pi-mono B4)
  • parentTaskId fan-out cost rollup (byterover B3)
  • Actor-attribution discipline on all event types (Chorus §3.5)
  • Guardrails fail-open posture for malformed warn rules + CRLF tolerance + validationOutcome round-trip preservation + dedicated guardrail_invalid_condition_field error code (claude-code)
  • Codegraph runner symbol-guard hardening + Windows path portability + strengthened guard-order test
  • ADR gate + architecture vocabulary affordances (mattpocock-skills B3'/B4', M18b partial)
  • Agent-skills PLAN bug-fix bullet (Bugfix:) for tasks reusing existing failing tests

Roadmap reservations

  • M17 candidate — cross-tool AGENT_FILES discovery (gptme B3-narrowed)
  • M18 candidate — deterministic context-projection + compaction-opportunity probe (gptme B1-narrowed)
  • M18b — ADR gate + architecture vocabulary (mattpocock B3'/B4')
  • M19 validation-loop — feedback-loop declaration in PLAN/BUILD/VERIFY + [CODEOZ-DEBUG-<runId>] prefix + VERIFY residue check (mattpocock B2'/B5')
  • M19+ candidate — worktree topology refusal modes (gptme B2-deferred)
  • M20+ candidate — release/run-quality eval harness (gptme D3)

Tests

3244 pass / 2 skip / 0 fail (up from 3108 baseline at session start; +136 tests).

Authorities preserved

  • M14 Reviewer panel v1
  • M15 Debate-policy scheduler v1
  • M16 Production CLI completion
  • PE-1 xAI direct HTTP adapter

Process

All 17 merges went through GitHub squash. Conflict resolutions were applied for CLAUDE.md, README.md, ROADMAP.md, and src/state/schemas.ts where parallel work touched the same files. Per the cross-model peer review rule, individual PRs already passed their own Codex review rounds; this tag aggregates pre-reviewed work.

PRs in this release

#12, #13, #14, #15, #16, #17, #18, #19, #20, #21, #22, #23, #24, #25, #26, #27, #28

v0.17.0-alpha.0 — M16 production CLI completion

10 May 15:54

M16 — Production CLI completion

Wires the production CLI runtime for the full DEFINE → PLAN → BUILD → VERIFY → REVIEW → SHIP cycle on greenfield multi-task PLANs. Pre-M16 the runtime functions (runBuild, runVerify, runReview) had full test coverage but were never wired into code-oz run.

Highlights

  • Per-task lifecycle cursor. New event types (task_started, task_review_passed, task_completed, review_remediation_recorded, gate_file_cleared, fake_provider_warning_emitted) drive multi-task progression through the existing phase machine without changing it.
  • Production dispatch surface. dispatchBuild / dispatchVerify / dispatchReview mirror the dispatchPlan shape; production seams (productionInvokePersona, productionRunner, productionRevertSeam, productionPanelistInvoker) decouple seam complexity from dispatch logic.
  • Cursor-aware approval. approveReviewTaskGate advances phase_entered to build for the next pending task or ship when cursor.allCompleted=true.
  • Multi-task gate-file lifecycle. New clearStaleGateFile helper + gate_file_cleared event emit on task and attempt boundaries; supersedence pattern applied across 6 sibling state helpers.
  • Phase locks. .build.lock / .verify.lock / .review.lock (separate from lockDir) prevent concurrent dispatchers; new .worktree.lock serializes loadOrCreateRunWorktree.
  • code-oz doctor run. Read-only inspector for runId, currentPhase, task cursor, last 10 events, intervention state, worktree state, scheduler events.
  • --provider fake warning. Loud stderr banner + audit event on every dispatcher when fake provider is active.
  • CLI e2e via binary spawn. Multi-task lifecycle test (T-001 happy + T-002 needs-revision-restart + T-003 happy) drives the full surface through bun run src/cli.ts. New VERIFY-fail restart e2e covers the destroy-and-recreate path.
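The per-task cursor and the cursor-aware approval above can be sketched as a fold over the event log. The `TaskEvent` and `TaskCursor` shapes and both function names are hypothetical; the event names, the `allCompleted` flag, and the build-or-ship decision come from the highlights.

```typescript
// Hypothetical shapes; only the event names and `allCompleted` come from the notes.
interface TaskEvent {
  type: "task_started" | "task_review_passed" | "task_completed";
  taskId: string;
}

interface TaskCursor {
  completed: string[];
  nextPending: string | undefined;
  allCompleted: boolean;
}

// Folds the event log into a cursor over the PLAN's ordered task ids.
function deriveCursor(planTaskIds: string[], events: TaskEvent[]): TaskCursor {
  const done = new Set(
    events.filter((event) => event.type === "task_completed").map((event) => event.taskId),
  );
  const completed = planTaskIds.filter((id) => done.has(id));
  const nextPending = planTaskIds.find((id) => !done.has(id));
  return { completed, nextPending, allCompleted: nextPending === undefined };
}

// After a REVIEW approval: return to BUILD for the next pending task, or move on to SHIP.
function nextPhaseAfterReview(cursor: TaskCursor): "build" | "ship" {
  return cursor.allCompleted ? "ship" : "build";
}
```

Deriving the cursor from events rather than storing it separately keeps the event log as the single source of truth, which is consistent with how these notes describe multi-task progression.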

Cross-model peer review trail

  • R0 (planning convergence): feature-with-modifications, accepted in 971988d. Caught the M16/M17 split (no task lifecycle cursor was a structural blind spot).
  • R1 (post-implementation): fix-first, 4 block-push + 1 fix-soon. All closed in 6 follow-up commits (70107dc..f8e385e).
  • R2 (re-review): push. No findings; all R1 closures verified.

Empirical lesson

12 production bugs were caught and closed within M16 — 8 by the C12 multi-task e2e (tests/e2e/cli-multi-task-cycle.test.ts), 4 by Codex R1. None were caught by per-commit unit tests or per-commit Codex pre-design alone. Empirical case for the durable rule that integration tests are non-negotiable for cross-cutting state-machine work. Lesson for M17: rule 20 (one new authority per milestone) needs sharper application — C9 bundled six sub-surfaces under "task-loop dispatch" and the breadth let coupling bugs through.

Test count

2706 → 3108 (+402)

What's next (M17)

  • SHIP runtime + full code-oz resume.
  • Per-task scaling for default per-phase budgets (M16 raised defaults; per-task scaling deferred).
  • Process-kill resume e2e (mid-attempt SIGKILL recovery).
  • Panel-mode multi-task e2e via binary spawn.

🤖 Generated with Claude Code

v0.16.0-alpha.0 — M15 Debate-policy scheduler v1

09 May 00:06

M15 — Debate-policy scheduler v1

Orchestrator-side automatic-trigger policy for the existing single-opponent requestDebate() runtime built in M10. The scheduler decides when to fire a cross-family debate based on objective signals from completed REVIEW artifacts; M10's runtime decides how the debate executes; M14's panel surface is read-only consumed.
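A heavily simplified sketch of such a trigger predicate is shown here. It covers only the three named triggers (score_in_grey_zone, needs_revision_with_high_score, panel_voter_disagreement), not the full 11-gate function. The `ReviewSignal` shape, the verdict names, the inclusive reading of the [5, 7] grey zone, and the definition of "high score" as anything above it are assumptions.

```typescript
type Verdict = "approve" | "needs_revision"; // hypothetical verdict names

type ReviewSignal =
  | { mode: "single"; score: number; verdict: Verdict }
  | { mode: "panel"; voterVerdicts: Verdict[] };

type Trigger =
  | "score_in_grey_zone"
  | "needs_revision_with_high_score"
  | "panel_voter_disagreement";

// ASSUMPTION: the grey zone is inclusive on both ends, and "high score" means above it.
const GREY_ZONE = { min: 5, max: 7 };

// Pure, first-match-wins: returns the first applicable trigger, or null for "do not fire".
function evaluateTrigger(signal: ReviewSignal): Trigger | null {
  if (signal.mode === "panel") {
    // Panel mode looks only at disagreement between voters.
    return new Set(signal.voterVerdicts).size > 1 ? "panel_voter_disagreement" : null;
  }
  if (signal.score >= GREY_ZONE.min && signal.score <= GREY_ZONE.max) {
    return "score_in_grey_zone";
  }
  if (signal.verdict === "needs_revision" && signal.score > GREY_ZONE.max) {
    return "needs_revision_with_high_score";
  }
  return null;
}
```

Keeping the predicate pure (no I/O, no clock) makes it straightforward to test against recorded REVIEW artifacts, which is one plausible reason for separating "when to fire" from "how the debate executes".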

What landed

  • Pure scheduler predicate (src/policy/debate-scheduler.ts) — 11-gate first-match-wins decision function; mode-aware trigger split (single mode evaluates score_in_grey_zone [5, 7] and needs_revision_with_high_score; panel mode evaluates panel_voter_disagreement only).
  • Production fire path (src/phases/review.ts:1178-1419) — closure-based executor wired into the new private runReviewRoundLocked. Selects an M11-eligible cross-family opponent, runs requestDebate, then re-enters runReviewRoundLocked recursively with schedulerEnabled: 'disabled_post_debate' so the post-debate REVIEW round runs inside the same outer .review.lock envelope (no recursion into the public runReview, no double-fire).
  • Aggregate budget preflight wired at both single + panel call sites; assertWithinBudget per-call chokepoint stays as the backstop.
  • Real fingerprint+severity finding diff (src/phases/review-fire-path.ts:243-323) — actionableFindingsAddedCount derived from canonical pre/post REVIEW artifacts; severity escalation from nit/fyi to {block, fix-first} counts as actionable added but not findings added.
  • Rule-21 ship gate (src/commands/doctor-debate-baseline.ts) — denominator counts every debate_scheduler_fired; discriminated JoinedFire union (success | error | missing); errorCount + missingTerminalCount surfaced so the gate cannot be gamed by orphaned/errored fires.
  • C13a fired-before-debate-started ordering: emitFired callback locks the trace order to evaluated → fired → debate_started → debate_resolved → postreview so resume contracts don't break.
  • Scheduler-resume mismatch detection (src/phases/review-fire-path.ts:577-742) — three crash points (evaluated_no_terminal, fired_no_debate_started, debate_resolved_no_postreview) halt with typed NEEDS_INTERVENTION and actionable suggestions.
  • Bundled reviewer permission: src/agents/defaults/reviewer.md gets tool_use.debate { opposingProviders: ['claude'], maxConcurrent: 1, ... } so auto-mode works out of the box.
  • C17 production-trace e2e (tests/e2e/debate-scheduler-production-baseline.test.ts) — full DEFINE → PLAN → BUILD → VERIFY → REVIEW pipeline through real runReview + real requestDebate + real recursive post-debate round; proves the rule-21 reducer reads production-emitted events, not fixture math.
  • code-oz doctor --debate-policy + code-oz doctor --debate-policy-baseline commands.
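The fingerprint+severity finding diff in the list above has one subtle rule: an existing finding that escalates from nit/fyi to block or fix-first counts as actionable added, but not as a new finding. A minimal sketch of that counting follows. The `Finding` shape, the `fingerprint` field, the four-value severity set, and the `findingsAddedCount` name are assumptions; `actionableFindingsAddedCount` is the name used in the notes.

```typescript
// ASSUMPTION: only the four severities named in the notes are modeled here.
type Severity = "block" | "fix-first" | "nit" | "fyi";

interface Finding {
  fingerprint: string; // stable identity of a finding across REVIEW rounds (hypothetical)
  severity: Severity;
}

const isActionable = (severity: Severity): boolean =>
  severity === "block" || severity === "fix-first";

// Compares pre-debate and post-debate findings by fingerprint.
function diffFindings(pre: Finding[], post: Finding[]) {
  const before = new Map(pre.map((f) => [f.fingerprint, f.severity] as const));
  let findingsAddedCount = 0;
  let actionableFindingsAddedCount = 0;
  for (const finding of post) {
    const previous = before.get(finding.fingerprint);
    if (previous === undefined) {
      // A fingerprint not seen before is a new finding.
      findingsAddedCount += 1;
      if (isActionable(finding.severity)) actionableFindingsAddedCount += 1;
    } else if (!isActionable(previous) && isActionable(finding.severity)) {
      // Escalation of an existing finding: actionable added, but not a new finding.
      actionableFindingsAddedCount += 1;
    }
  }
  return { findingsAddedCount, actionableFindingsAddedCount };
}
```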

Cross-model peer review

  • R1 (docs/research/CODEX_REVIEW_M15.md, thread 019e092f) — fix-first, 5 block-push + 4 fix-soon. Production auto-fire was a no-op at 38f2c10; rule-21 proved fixture math, not scheduler behavior.
  • Replan (docs/research/CODEX_RESPONSE_M15_REPLAN.md, thread 019e093d) — accept-with-modifications on Path B (reshape M15 to include full production wiring; tag once at end). C12-C19 commit sequence locked.
  • R2 (docs/research/CODEX_REVIEW_M15_R2.md, thread 019e09bb) — push. All 9 R1 findings closed at the runtime surface; one non-blocking N1 nit on briefing meta-drift deferred per nits/fyis-only-can-defer policy.

Authority boundary (rule 20)

Orchestrator-side automatic-trigger policy for existing single-opponent requestDebate. M10 primitive unchanged. M14 panel surface read-only consumed. M16+ deferred: multi-opponent debate, panel corrective-delta oracle semantics, broad auto-resume UX, advisory-block triggers, verdict-confidence triggers, pre-VERIFY scheduling, scheduler persona.

Verification

  • Tests: 2706 pass / 0 fail / 1 skip (live xAI gated)
  • Typecheck: clean
  • bun run dev doctor --debate-policy-baseline tests/fixtures/debate-scheduler-baseline: rule-21 ship gate PASSES on the canonical fixture set
  • Tag SHA: 83ce091 (Codex-blessed)