111 changes: 111 additions & 0 deletions docs/blog/2026-05-ai-release-gate.md
@@ -0,0 +1,111 @@
---
title: I built an AI release gate. It just blocked my own release.
date: 2026-05
canonical: https://github.com/omerakben/code-oz/blob/main/docs/blog/2026-05-ai-release-gate.md
status: DRAFT for Ozzy review — personalize voice, add your own framing where marked [OZZY]
---

# I built an AI release gate. It just blocked my own release.

I was about to tag `v0.20.0-alpha.0` of code-oz and push it to npm, Homebrew, and a curl installer. The local test suite was green: 3361 pass, 0 fail, 2 skipped. Typecheck said nothing. Every check I could run on my machine was happy.

Then the release gate refused the release. It was right to.

[OZZY: a sentence or two of personal context here — why you were shipping that night, how close you were to just pushing. Your voice carries the hook.]

## What code-oz is

code-oz runs coding agents through a software delivery lifecycle with hard gates between phases: define, plan, build, verify, review, ship. Each phase writes a plain Markdown artifact, and the next phase cannot start until a schema-validated gate file approves the last one. The agents are Claude and Codex through their own CLI logins, and xAI through an API key; code-oz is the orchestrator around them, not a model itself.
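
The gating rule above can be sketched in a few lines. This is a minimal illustration, not code-oz's implementation: the phase names come from this post, while the `GateFile` shape, `isGateFile`, and `canStart` are hypothetical names assumed for the example.

```ts
// Phase order as described in the post.
const PHASES = ["define", "plan", "build", "verify", "review", "ship"] as const;
type Phase = (typeof PHASES)[number];

// Assumed gate-file shape; the real schema may look different.
interface GateFile {
  phase: Phase;
  approved: boolean;
  artifact: string; // path to the phase's Markdown artifact
}

// Minimal runtime check standing in for real schema validation.
function isGateFile(value: unknown): value is GateFile {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v["phase"] === "string" &&
    (PHASES as readonly string[]).includes(v["phase"]) &&
    typeof v["approved"] === "boolean" &&
    typeof v["artifact"] === "string"
  );
}

// A phase may start only if the previous phase has a valid, approved gate.
function canStart(next: Phase, gates: unknown[]): boolean {
  const idx = PHASES.indexOf(next);
  if (idx === 0) return true; // "define" has no predecessor
  const previous = PHASES[idx - 1];
  return gates.some((g) => isGateFile(g) && g.phase === previous && g.approved);
}
```

With only an approved `build` gate on disk, `canStart("verify", gates)` would be true and `canStart("review", gates)` false.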

One rule sits underneath all of it: the REVIEW phase must run on a different model family than the one that wrote the code. If Claude builds, Codex reviews. Same-family review tends to share the same blind spots, so code-oz treats cross-family review as a requirement, not a setting. The builder and reviewer families are written into the run's event log, so the pairing is auditable rather than assumed.
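
As a rough sketch of that rule, assuming a simple in-memory event log: the family names mirror the providers mentioned in this post, but the `ReviewPairingEvent` shape and the `pairForReview` function are hypothetical and not taken from code-oz.

```ts
// Families mirror the providers named in the post.
type Family = "claude" | "codex" | "xai";

// Assumed event shape; the real event log format may differ.
interface ReviewPairingEvent {
  type: "review.pairing";
  builder: Family;
  reviewer: Family;
}

// Refuse a same-family pairing, and record every accepted pairing so it can be audited later.
function pairForReview(builder: Family, reviewer: Family, log: ReviewPairingEvent[]): void {
  if (builder === reviewer) {
    throw new Error(`REVIEW must run on a different family than the builder (${builder})`);
  }
  log.push({ type: "review.pairing", builder, reviewer });
}
```

The point of the design is that the check is a hard failure rather than a warning, and that the accepted pairing leaves a record.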

I hold the project to that rule twice. The product enforces it inside `code-oz run`. And I run code-oz's own development through it: before any milestone tag lands, a Codex review reads the diff and returns one of `push`, `fix-first`, or `debate-required`. No tag ships on a `fix-first` until the findings are closed. The v0.20.0 release was sitting in that gate.
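
The tag decision can be sketched as a small predicate. The three verdict strings are the ones named above; the `Review` shape, the `mayTag` name, and the choice to treat `debate-required` as blocking are assumptions made for illustration.

```ts
// Verdicts as named in the post.
type Verdict = "push" | "fix-first" | "debate-required";

// Assumed summary of one review round.
interface Review {
  verdict: Verdict;
  openFindings: number;
}

// Assumption: only a `push` verdict with no open findings allows the tag.
function mayTag(review: Review): boolean {
  return review.verdict === "push" && review.openFindings === 0;
}
```

Under this sketch, a `fix-first` round never permits a tag by itself; the findings get closed and a later round has to return `push`.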

## The bug

The first review round, R1, had already found six issues — one that would block the push, three to fix soon, two nits. All closed. I queued a second round, R2, expecting a clean `push`.

R2 found a new one. The release workflow built the binaries before installing dependencies.

`.github/workflows/release.yml` ran `bun build --compile` to produce the native binaries, but it never ran `bun install` first. On my laptop that is invisible: `node_modules` already exists from months of work. On a clean GitHub Actions runner there is no `node_modules` and no Bun cache. The build step would reach `src/config/schema.ts`, which imports the `yaml` package, fail to resolve it, and exit — before producing a single release asset.

The tag push triggers that workflow. So the moment I pushed the tag, the release would have failed in public, with no binaries, in front of whatever audience the launch brought. Nothing on my machine could see it, because the bug only exists in the one environment I never run: a checkout with nothing installed.

## The catch

Here is the finding, verbatim, from the review response file (`docs/design/CODEX_RESPONSE_W3A_R2.md`, thread `019e1a2c-9fbe-7742-88c7-7e9808434bd5`, model `gpt-5.5`, verdict `fix-first`):

```
### Block-push (new in R2)

`.github/workflows/release.yml:35` does not install dependencies
before the build step at `release.yml:53`. In a clean `git archive
HEAD` temp checkout, `bun build --compile --target=bun-linux-x64
src/cli.ts` fails with:

> Could not resolve: "yaml". Maybe you need to "bun install"?

A tag push would run this workflow and fail before release assets are
produced. Fix by adding `bun install --frozen-lockfile` after `Setup
Bun` in the `build` job, and add a workflow test for it.
```

A different model family, reading the diff with no stake in my deadline, traced an execution path that my green test suite could not reach.

## The fix

The fix is three lines (commit `1d520fe`):

```diff
+ - name: Install dependencies
+ run: bun install --frozen-lockfile
+
- name: Resolve VERSION
```

The review also asked for a test, which matters more than the three lines. A three-line fix that nobody pins will rot back the next time the workflow is edited. So the same commit added `tests/ci-workflows.test.ts`, which parses the workflow, finds the install step and the build step, and asserts the build runs after the install:

```ts
const installIdx = steps.findIndex((step) => /\bbun install\b/.test(step.run))
expect(installIdx).toBeGreaterThan(-1)
const buildIdx = steps.findIndex((step) => /bun build/.test(step.run))
expect(buildIdx).toBeGreaterThan(installIdx)
```

Review comment on lines +69 to +72 (medium):

In TypeScript, if `step.run` is optional (`string | undefined`), passing it directly to `RegExp.prototype.test()` causes a type error under strict null checks. Additionally, if `step` is null or undefined, accessing `step.run` throws a runtime error. It is safer to guard the access by checking `step.run` first.

Suggested change:

```diff
- const installIdx = steps.findIndex((step) => /\bbun install\b/.test(step.run))
- expect(installIdx).toBeGreaterThan(-1)
- const buildIdx = steps.findIndex((step) => /bun build/.test(step.run))
- expect(buildIdx).toBeGreaterThan(installIdx)
+ const installIdx = steps.findIndex((step) => step?.run && /\bbun install\b/.test(step.run))
+ expect(installIdx).toBeGreaterThan(-1)
+ const buildIdx = steps.findIndex((step) => step?.run && /bun build/.test(step.run))
+ expect(buildIdx).toBeGreaterThan(installIdx)
```

Run that test against the pre-fix workflow and `installIdx` is `-1`. It fails for the right reason, which is the only way I trust a test. After the fix it passes, and it will fail again if anyone reorders those steps.
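
A self-contained sketch of the same ordering check, runnable without a test framework or a YAML parser. The `Step` shape and the `buildRunsAfterInstall` name are hypothetical, and the sketch assumes `run` can be absent on steps that invoke an action instead of a shell command. Step names and commands are the ones quoted in this post.

```ts
// Assumed minimal shape of a parsed workflow step.
interface Step {
  name?: string;
  run?: string; // absent on steps that do not run a shell command
}

// True only when an install step exists and the build step comes after it.
function buildRunsAfterInstall(steps: Step[]): boolean {
  const installIdx = steps.findIndex((s) => /\bbun install\b/.test(s.run ?? ""));
  const buildIdx = steps.findIndex((s) => /\bbun build\b/.test(s.run ?? ""));
  return installIdx > -1 && buildIdx > installIdx;
}

// Pre-fix workflow: no install step at all.
const preFix: Step[] = [
  { name: "Setup Bun" },
  { name: "Resolve VERSION" },
  { name: "Build", run: "bun build --compile --target=bun-linux-x64 src/cli.ts" },
];

// Post-fix workflow: install lands right after Setup Bun.
const postFix: Step[] = [
  { name: "Setup Bun" },
  { name: "Install dependencies", run: "bun install --frozen-lockfile" },
  { name: "Resolve VERSION" },
  { name: "Build", run: "bun build --compile --target=bun-linux-x64 src/cli.ts" },
];

console.log(buildRunsAfterInstall(preFix), buildRunsAfterInstall(postFix));
```

Defaulting a missing `run` to an empty string keeps the regular expressions from ever seeing `undefined`, which would otherwise be coerced to the string `"undefined"` at runtime.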

The release shipped a few commits later, with binaries that actually build.

## Why a second model caught what mine could not

This is not a story about a smart model. It is a story about a different one.

My loop (write code, run tests, read the diff myself) is one perspective applied repeatedly. It is good at the failure modes I already think about and blind to the ones I do not. A clean-checkout dependency-ordering bug is squarely in my blind spot, because my working checkout is never clean: `node_modules` is always there. More rounds of my own review would not have found it. The test suite could not, because the failure lives outside the environment the suite runs in.

A reviewer from another model family does not share that blind spot. Two models trained by different labs on different data tend to fail on different inputs, so their mistakes tend not to line up. The reviewer is not better at CI than I am; it is differently wrong, which is exactly what you want at a gate. The improvement came from the disagreement, not from the intelligence.
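
A back-of-the-envelope way to see why that matters, under an idealized assumption that real models only approximate. Suppose the builder and the reviewer each miss a given bug with probability $p$ (an illustrative symbol, not a measured rate). If the reviewer shares the builder's blind spots exactly, a second review adds nothing; if the two miss independently, the chance that both miss is the product:

$$
P(\text{both miss}) =
\begin{cases}
p & \text{fully shared blind spots} \\
p^{2} & \text{independent misses}
\end{cases}
$$

With a purely hypothetical $p = 0.2$, that is $0.2$ versus $0.04$. Real cross-family pairs presumably sit somewhere between the two cases, which is why the claim here is "tends to", not "always".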

That is the whole bet code-oz makes. Model bias and provider bias are real. A single agent run to completion inherits one model's blind spots. Putting agent output through evidence gates and a cross-family reviewer trades some speed and tokens for a reviewer that fails differently than the builder. On this release that trade caught a bug that would have failed the launch in public.

A fair caveat, because it is the first question a careful reader should ask. code-oz also ships deterministic demos that run the full lifecycle offline with a built-in fake provider. Those demos prove the gate, worktree, and event machinery is real and replayable; they prove nothing about model quality, because no real model runs in them. The release-gate story above is the opposite kind of evidence: a real `gpt-5.5` read real code and found a real bug, and the fix is in git history. code-oz keeps those two kinds of evidence labeled and separate, on a [receipts page](../RECEIPTS.md) you can check.

## Try it

code-oz is MIT-licensed open source and runs on the CLI logins you already have for Claude and Codex (xAI needs a key).

```sh
# npm
npm install -g @tuel/code-oz

# Homebrew
brew tap omerakben/code-oz
brew install omerakben/code-oz/code-oz

# curl
curl -fsSL https://github.com/omerakben/code-oz/releases/download/v0.21.1-alpha.0/install.sh | sh
```

Then `code-oz init` and `code-oz run`. The cross-family review is on by default at the REVIEW gate.

Review comment on lines +100 to +101 (medium):

The `brew tap` command is redundant when installing with the fully qualified formula name `omerakben/code-oz/code-oz`. Homebrew automatically taps the repository when the fully qualified name is provided. Alternatively, if you do tap first, you can install using the short name `code-oz`.

Suggested change:

```diff
- brew tap omerakben/code-oz
- brew install omerakben/code-oz/code-oz
+ brew install omerakben/code-oz/code-oz
```

The receipts behind this post, plus the M14 and M15 review trails and the deterministic demo ledgers, are at [docs/RECEIPTS.md](../RECEIPTS.md). The full comparison against using Claude Code, Codex, Cursor, and Aider directly is at [docs/comparisons/ai-coding-agents.md](../comparisons/ai-coding-agents.md).

[OZZY: close in your own voice. The receipts carry the argument; the closer should be you.]
135 changes: 135 additions & 0 deletions docs/blog/2026-05-launch-copy.md
@@ -0,0 +1,135 @@
---
title: Launch copy — Show HN, X thread, community submissions
status: DRAFT for Ozzy review. Nothing here is published. Personalize voice before posting.
source-of-truth: docs/blog/2026-05-ai-release-gate.md (the essay), docs/RECEIPTS.md, docs/comparisons/ai-coding-agents.md
---

# Launch copy (Phase 5.2 – 5.4)

All copy below tells the same true story as the essay and stays inside the calibrated claims in the comparison table. No claim here is stronger than what `docs/comparisons/ai-coding-agents.md` can defend with a footnote.

Two facts to keep straight while editing, because a skeptical reader will test both:
- The cross-family adversarial REVIEW row is the one capability where every compared competitor is marked `❌`. Lead with it. It is the real wedge.
- "Runs on CLI auth" is `partial` for code-oz, not a clean win — Claude and Codex are keyless through their CLI logins, xAI needs `XAI_API_KEY`. Say "keyless for Claude and Codex" and you stay honest.

---

## 5.2 Show HN

**Title** (75 chars, under the 80 limit):

```
Show HN: code-oz – AI agents through a gated SDLC, with cross-family review
```
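
A quick way to re-check the title length after any edit. This is a small sketch: the 80-character limit is the one stated above, and `titleLength` and `HN_TITLE_LIMIT` are names made up for the example.

```ts
// Limit as stated in this document.
const HN_TITLE_LIMIT = 80;

// Count Unicode code points rather than UTF-16 code units,
// so a character outside the Basic Multilingual Plane is not counted twice.
function titleLength(title: string): number {
  return Array.from(title).length;
}

const title =
  "Show HN: code-oz – AI agents through a gated SDLC, with cross-family review";

console.log(titleLength(title), titleLength(title) <= HN_TITLE_LIMIT);
```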

**Link:** the GitHub repo (`https://github.com/omerakben/code-oz`), so the click that follows interest is a star, not a scroll.

**First comment** (post immediately after submitting; this is where the story goes):

> I build code-oz, an orchestrator that runs coding agents — Claude and Codex through their own CLI logins, xAI through an API key — across a gated software lifecycle: define, plan, build, verify, review, ship. Each phase writes a Markdown artifact and the next phase is blocked by a schema-validated gate file. The one rule underneath it: REVIEW must run on a different model family than the one that built the code.
>
> The story I want to put in front of you is one where that rule cost me, not one where it looked good in a demo. Before tagging v0.20.0, the cross-family review I run on every release (a Codex pass over the diff) blocked the release. My local suite was green — 3361 pass, 0 fail. The bug: the release workflow built binaries before running `bun install`, so it would have failed on a clean GitHub Actions checkout and shipped zero assets in public. Invisible to me because my laptop always has node_modules. A different model family, with no stake in my deadline, traced the path mine couldn't. Fix is commit 1d520fe, three lines plus a test that pins the ordering. This is not the only one: the receipts page has three more real-model reviews — M14 ran nine cross-family rounds and closed seven block-push findings to zero before shipping, and M15's planning review caught four design gaps before a line of code landed.
>
> Full write-up with the verbatim review excerpt and the SHAs: [essay link]. Receipts page separates the real-model reviews (Tier 1) from the deterministic FakeProvider demos (Tier 2) so the two never get conflated: [receipts link].
>
> It's MIT. The cross-family review needs both Claude and Codex configured. Honest about limits: it's alpha, the macOS binaries aren't signed yet, and the offline demos prove the gate machinery is real, not that any model writes good code. Happy to answer anything for the next few hours.

[OZZY: rewrite the first sentence in your own voice. HN rewards a real person, not a press release. Keep the "it blocked my own release" framing and the 3361/0 number — those are what make it land.]

**Timing:** Tuesday or Wednesday, US morning (roughly 8–10am ET). Not Friday, not the weekend. [OZZY decision: pick the day.]

---

## 5.3 X / Twitter thread

11 tweets. Each is under 280 characters. Asset slots are marked; the GIF and the two screenshots need you (B6). Post the thread the same day as Show HN, after the first HN comments land, so you can link the discussion.

**1/ (hook + GIF)**
> I built an AI release gate for my own coding-agent tool. Last week it blocked my own release. It was right to. 🧵
>
> [OZZY asset: GIF of the `fix-first` verdict / the gate refusing — B6]

**2/ (what it is)**
> code-oz runs coding agents — Claude, Codex, xAI — through a gated lifecycle: define → plan → build → verify → review → ship. Each phase writes an artifact; a schema-validated gate file blocks the next one. The agents are workers. It's the discipline around them.

**3/ (the rule)**
> One rule underneath all of it: REVIEW must run on a different model family than the builder. Claude builds, Codex reviews. Different labs and training mean their mistakes don't line up — same-family review shares the same blind spots. Cross-family is required, not a toggle.

**4/ (the setup)**
> I hold my own releases to that rule. Before tagging v0.20.0 I ran the Codex review over the diff. Local suite: 3361 pass, 0 fail, typecheck silent. Everything I could run on my machine was green.

**5/ (the bug)**
> The review blocked it. The release workflow built the binaries before running `bun install`. On my laptop, invisible — node_modules already exists. On a clean GitHub runner it would fail to resolve `yaml` and ship zero binaries. In public.

**6/ (the catch — screenshot)**
> The reviewer traced the exact failure path my green suite couldn't reach:
>
> [OZZY asset: screenshot of the verbatim Codex finding from RECEIPTS.md lines 19–34]

**7/ (the fix — screenshot)**
> The fix was 3 lines (commit 1d520fe). The part that matters more: the same commit added a test that finds the install and build steps and asserts build runs after install, so it can't rot back.
>
> [OZZY asset: screenshot of the diff + test]

**8/ (the point)**
> This isn't a story about a smart model. It's about a different one. A reviewer from another family isn't better at CI — it's *differently wrong*, which is exactly what you want at a gate. The win came from the disagreement.

**9/ (the bet)**
> That's the whole bet: model bias is real, a single agent inherits one model's blind spots, and trading some speed for a cross-family reviewer catches what self-review can't. On this release it caught a launch-breaking bug.

**10/ (honesty)**
> Fair caveat: code-oz also ships deterministic offline demos. Those prove the gate machinery runs and replays — not that any model writes good code. The release story above is the other kind of evidence: a real model, real bug, fix in git. Both are labeled, never mixed.

**11/ (CTA)**
> MIT. Keyless for Claude and Codex via their CLI logins; xAI needs `XAI_API_KEY`. Cross-family review needs both Claude and Codex.
> npm: `npm i -g @tuel/code-oz`
> brew: `brew tap omerakben/code-oz && brew install omerakben/code-oz/code-oz`
> Repo + receipts: [repo link]
> Full write-up: [essay link]

Review comment on the `brew:` line of tweet 11 (medium): The `brew tap` command is redundant when using the fully qualified formula name. You can simplify this to a single `brew install` command. Suggested change: replace `` > brew: `brew tap omerakben/code-oz && brew install omerakben/code-oz/code-oz` `` with `brew: brew install omerakben/code-oz/code-oz`.

[OZZY: tweets 1, 8, and 9 are where your voice matters most. The numbers and SHAs are verified — keep them exact.]

---

## 5.4 Community submissions

Same story, fitted to each venue. Submit in the 24 hours after Show HN.

**lobste.rs** (tags: `devops`, `ai`, `practices`)
> Title: code-oz: a gated SDLC around coding agents, with cross-family review
> Link: the essay (lobste.rs prefers the write-up over a repo).
> Note: lobste.rs is invite-only and allergic to marketing. Lead with the bug, not the product.

**r/programming** (link the essay, not the repo)
> Title: My AI release gate blocked my own release — a different model family caught a CI bug my green test suite couldn't

**r/coolgithubprojects**
> Title: code-oz — run coding agents through a gated SDLC with cross-family adversarial review (MIT)
> Body: two sentences + the bug anecdote + repo link.

**dev.to + Hashnode** (cross-post the full essay)
> Add a canonical-URL header pointing back to `docs/blog/2026-05-ai-release-gate.md` so the GitHub copy stays canonical. dev.to flags AI-generated submissions; the essay's specific commits, SHAs, and personal framing are the defense — do not strip them.

**Newsletter pitches** (Ben's Bites, TLDR AI, The Batch — one short paragraph each)
> Subject: A coding-agent tool whose own release gate caught a launch-breaking bug
> Body: I make code-oz (MIT), which runs coding agents through a gated SDLC where REVIEW is forced onto a different model family than the builder. Before my last release, that cross-family review caught a CI bug — the release workflow built before installing deps and would have shipped zero binaries from a clean runner. Write-up with the verbatim review and the fix commit: [link]. Happy to share the receipts page that separates real-model reviews from deterministic demos.

---

## Needs Ozzy before launch

- **[B6] Demo asset (GIF + 2 screenshots).** The X thread has three asset slots: the gate refusing (tweet 1), the verbatim Codex finding (tweet 6), the diff + test (tweet 7). The text in RECEIPTS.md lines 19–56 is the source for the two screenshots. Only you can record/capture these.
- **[B9] Friend reactions / first-impressions pass.** Phase 3.5 asks for 3 unprompted developer reactions to the README + essay before Show HN. Not drafted here — it needs real people.
- **Voice.** The essay and the Show HN first comment are drafted in a plausible first-person voice. They are your launch under your name; personalize the marked spots.
- **Launch day + the live brownfield smoke.** Tue/Wed is recommended; the day is yours. The optional M17 brownfield smoke (a real model fixing a real bug through the AUDIT phase) would be a strong second receipt for tweet 7 / the Show HN secondary link, but it needs live credentials and hasn't been run.

---

## Adversarial review trail

These drafts went through a three-lens adversarial pass (fact/overclaim auditor, hostile HN reader, writing-rules/authenticity) before this commit. The writing-rules lens returned clean. Applied fixes:

- xAI auth precision in the essay and Show HN comment (xAI needs a key, it is not a CLI login).
- The X thread `brew` command now includes `brew tap` (the bare `brew install` fails with "formula not found").
- The Show HN comment now names the M14/M15 receipts up front to answer the "n=1 anecdote" attack.
- A grounded one-line reason for why different families have different blind spots (uncorrelated errors from independent training, not an unverifiable "Claude does correctness, Codex does velocity" claim).

Two findings were not applied as suggested:

- **Flagged for you (not applied):** the comparison page (`docs/comparisons/ai-coding-agents.md`) says "external `gpt-5.5` review." The HN lens noted a reader may assume that is Claude reviewing its own product. Naming it as a Codex / OpenAI-family review would strengthen the cross-model-fact-check credibility. It is a one-line clarity fix on a canonical, already-shipped doc, so it is your call, not folded into this launch branch.
- **Rejected (would introduce an inaccuracy):** the HN lens wanted the essay to say the release catch used "the same `code-oz run` command users get." That is false — the catch was the milestone Codex-review discipline, not the product's automated REVIEW phase running on itself. The essay keeps the honest distinction: the product enforces cross-family review inside `code-oz run`; the release story is the human-orchestrated version of the same principle. Conflating them is exactly the overclaim a skeptic would catch.