From 5cba612a334f397818ce00af3ff12dfc87d1fa9b Mon Sep 17 00:00:00 2001
From: omerakben
Date: Tue, 2 Jun 2026 14:58:00 -0400
Subject: [PATCH 1/3] =?UTF-8?q?test(aha):=20A/B=20evaluation=20=E2=80=94?=
 =?UTF-8?q?align-then-build=20vs=20build-direct?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Controlled experiment measuring what the AHA framework buys. 3 fuzzy
frontend tasks (Pomodoro, to-do, tip splitter), both arms on
claude-sonnet-4-6, blind Opus 4.8 referee, runtime-verified in-browser.
Tokens summed from per-subagent transcripts, USD-weighted at Sonnet 4.6
rates.

Result: AHA cost 4.6x and won 2 of 3 on quality (68 vs 61 /75). Wins
were robustness/scope surfaced at the ask gate (persistence, validation,
graceful degradation). Loss (tip splitter): alignment over-specified
penny reconciliation into a self-contradictory spec; the build dropped
the remainder (silent money loss) and the simpler bare app scored higher.
AHA reads as a requirements amplifier, not a coding upgrade; cost figure
is an upper bound (isolated subagents re-read context cold vs a real warm
/aha session).

Includes both arms' source per task, AHA alignment artifacts, maestro
gate answers, blind judge cards + keys, raw token table, screenshots,
and RESULTS.md.
Co-Authored-By: Claude Opus 4.8 (1M context)
---
 Repos/ab-aha-vs-bare-2026-06-02/README.md | 32 +
 Repos/ab-aha-vs-bare-2026-06-02/RESULTS.md | 128 ++++
 .../arm-a/task1/00-aha/critique.md | 56 ++
 .../arm-a/task1/00-aha/executable-prompt.md | 165 ++++++
 .../arm-a/task1/00-aha/handoff-packet.md | 43 ++
 .../arm-a/task1/00-aha/ledger.md | 38 ++
 .../arm-a/task1/app/BUILD-NOTES.md | 49 ++
 .../arm-a/task1/app/index.html | 552 ++++++++++++++++++
 .../arm-a/task2/00-aha/critique.md | 40 ++
 .../arm-a/task2/00-aha/executable-prompt.md | 123 ++++
 .../arm-a/task2/00-aha/handoff-packet.md | 64 ++
 .../arm-a/task2/00-aha/ledger.md | 35 ++
 .../arm-a/task2/app/BUILD-NOTES.md | 47 ++
 .../arm-a/task2/app/index.html | 485 +++++++++++++++
 .../arm-a/task3/00-aha/critique.md | 40 ++
 .../arm-a/task3/00-aha/executable-prompt.md | 122 ++++
 .../arm-a/task3/00-aha/handoff-packet.md | 46 ++
 .../arm-a/task3/00-aha/ledger.md | 33 ++
 .../arm-a/task3/app/BUILD-NOTES.md | 42 ++
 .../arm-a/task3/app/index.html | 481 +++++++++++++++
 .../arm-b/task1/app/BUILD-NOTES.md | 23 +
 .../arm-b/task1/app/index.html | 387 ++++++++++++
 .../arm-b/task2/app/BUILD-NOTES.md | 37 ++
 .../arm-b/task2/app/index.html | 358 ++++++++++++
 .../arm-b/task3/app/BUILD-NOTES.md | 31 +
 .../arm-b/task3/app/index.html | 305 ++++++++++
 .../gate-answers/task1.md | 22 +
 .../gate-answers/task2.md | 5 +
 .../gate-answers/task3.md | 6 +
 .../metrics/judge/task1-blind-key.txt | 1 +
 .../metrics/judge/task1.md | 23 +
 .../metrics/judge/task2-blind-key.txt | 1 +
 .../metrics/judge/task2.md | 8 +
 .../metrics/judge/task3-blind-key.txt | 1 +
 .../metrics/judge/task3.md | 8 +
 .../metrics/raw/tokens.md | 19 +
 .../screenshots/abtest-task1-armA-aha.png | Bin 0 -> 30845 bytes
 .../screenshots/abtest-task1-armB-bare.png | Bin 0 -> 35473 bytes
 .../screenshots/abtest-task3-armA-aha.png | Bin 0 -> 32256 bytes
 .../screenshots/abtest-task3-armB-bare.png | Bin 0 -> 37319 bytes
 .../prompts/task1-pomodoro.md | 3 +
 .../prompts/task2-todo.md | 3 +
 .../prompts/task3-tip-splitter.md | 3 +
 43 files changed, 3865 insertions(+)
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/README.md
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/RESULTS.md
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/arm-a/task1/00-aha/critique.md
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/arm-a/task1/00-aha/executable-prompt.md
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/arm-a/task1/00-aha/handoff-packet.md
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/arm-a/task1/00-aha/ledger.md
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/arm-a/task1/app/BUILD-NOTES.md
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/arm-a/task1/app/index.html
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/arm-a/task2/00-aha/critique.md
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/arm-a/task2/00-aha/executable-prompt.md
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/arm-a/task2/00-aha/handoff-packet.md
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/arm-a/task2/00-aha/ledger.md
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/arm-a/task2/app/BUILD-NOTES.md
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/arm-a/task2/app/index.html
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/arm-a/task3/00-aha/critique.md
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/arm-a/task3/00-aha/executable-prompt.md
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/arm-a/task3/00-aha/handoff-packet.md
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/arm-a/task3/00-aha/ledger.md
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/arm-a/task3/app/BUILD-NOTES.md
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/arm-a/task3/app/index.html
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/arm-b/task1/app/BUILD-NOTES.md
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/arm-b/task1/app/index.html
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/arm-b/task2/app/BUILD-NOTES.md
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/arm-b/task2/app/index.html
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/arm-b/task3/app/BUILD-NOTES.md
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/arm-b/task3/app/index.html
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/gate-answers/task1.md
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/gate-answers/task2.md
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/gate-answers/task3.md
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/metrics/judge/task1-blind-key.txt
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/metrics/judge/task1.md
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/metrics/judge/task2-blind-key.txt
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/metrics/judge/task2.md
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/metrics/judge/task3-blind-key.txt
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/metrics/judge/task3.md
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/metrics/raw/tokens.md
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/metrics/screenshots/abtest-task1-armA-aha.png
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/metrics/screenshots/abtest-task1-armB-bare.png
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/metrics/screenshots/abtest-task3-armA-aha.png
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/metrics/screenshots/abtest-task3-armB-bare.png
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/prompts/task1-pomodoro.md
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/prompts/task2-todo.md
 create mode 100644 Repos/ab-aha-vs-bare-2026-06-02/prompts/task3-tip-splitter.md

diff --git a/Repos/ab-aha-vs-bare-2026-06-02/README.md b/Repos/ab-aha-vs-bare-2026-06-02/README.md
new file mode 100644
index 0000000..e0d2755
--- /dev/null
+++ b/Repos/ab-aha-vs-bare-2026-06-02/README.md
@@ -0,0 +1,32 @@
+# AHA A/B test — align-then-build vs build-direct
+
+Experiment date: 2026-06-02. Model held constant: claude-sonnet-4-6 (every builder and AHA
+alignment agent). Referee: Opus 4.8, blind. Design spec (local-only): docs/superpowers/specs/2026-06-02-aha-ab-test-design.md
+
+## Question
+What does the AHA framework actually buy? AHA is a planning layer (ask -> align -> critique ->
+optimize -> handoff) whose output is an aligned executable prompt; the build happens downstream.
+So: Arm A = run AHA on a fuzzy prompt, then a Sonnet builder executes the aligned prompt. Arm B =
+a Sonnet builder executes the raw prompt directly. Same model. Measure cost, time, resource use, quality.
+
+## Layout
+- prompts/               the shared moderately-specified starting prompt per task (both arms start here)
+- gate-answers/          the maestro's live answers to AHA's ask-phase gate, per task
+- arm-a/taskN/00-aha/    AHA artifacts: ledger, critique, executable-prompt, handoff-packet
+- arm-a/taskN/app/       Arm A built product
+- arm-b/taskN/app/       Arm B built product
+- metrics/raw/tokens.md  exact per-agent token + USD table
+- metrics/judge/         blind referee cards + blind-key (X/Y -> arm mapping)
+- metrics/screenshots/   runtime screenshots
+- RESULTS.md             scorecard, runtime verification, aggregate, verdict
+
+## Headline
+Cost A/B 4.6x. Quality 68 vs 61 (/75). AHA won Pomodoro + To-do (robustness/scope surfaced at the
+ask gate), lost Tip splitter (alignment over-specified penny reconciliation; build dropped the
+remainder -> silent money loss). AHA is a requirements amplifier: it pays when unstated intent
+matters and the human answers well; it backfires when alignment adds ambition the build botches.
+
+## Reproduce
+Each agent's exact prompt is in RESULTS.md / this folder. Tokens were summed from each subagent's
+transcript JSONL (~/.claude/projects///subagents/agent-*.jsonl), USD-weighted at
+Sonnet 4.6 rates. To view an app: `python3 -m http.server` in this dir, open arm-a|b/taskN/app/index.html.
diff --git a/Repos/ab-aha-vs-bare-2026-06-02/RESULTS.md b/Repos/ab-aha-vs-bare-2026-06-02/RESULTS.md
new file mode 100644
index 0000000..6791a41
--- /dev/null
+++ b/Repos/ab-aha-vs-bare-2026-06-02/RESULTS.md
@@ -0,0 +1,128 @@
+# AHA A/B test — results
+
+Spec: docs/superpowers/specs/2026-06-02-aha-ab-test-design.md (local-only)
+Model held constant: claude-sonnet-4-6 for every builder + alignment agent. Referee: Opus 4.8, blind.
+Cost = USD-weighted billed tokens (Sonnet 4.6 rates: in $3, out $15, cache-write $3.75/5m or $6/1h, cache-read $0.30 per MTok).
+
+## Task 1 — Pomodoro (prompt: moderately specified)
+
+### Cost and resource use (all agents verified claude-sonnet-4-6)
+| agent        | USD     | input | output | cache-wr | cache-rd | tools | time |
+| ------------ | ------- | ----- | ------ | -------- | -------- | ----- | ---- |
+| A-ask        | $0.2188 | 7     | 773    | 32,853   | 33,475   | 2     | 22s  |
+| A-align      | $0.6185 | 24    | 5,000  | 68,455   | 442,380  | 12    | 173s |
+| A-build      | $0.5182 | 17    | 7,790  | 50,761   | 322,599  | 8     | 124s |
+| Arm A (AHA)  | $1.3555 | 48    | 13,563 | 152,069  | 798,454  | 22    | 319s |
+| Arm B (bare) | $0.3447 | 14    | 384*   | 51,803   | 93,480   | 3     | 62s  |
+| ratio A / B  | 3.93x   |       |        |          |          | 7.3x  | 5.1x |
+
+\* B-build's big Write was billed mostly under cache-creation, not output (serving-layer
+categorization quirk). Total billed cost of the turn is captured; only the category split is odd.
+
+### Quality (blind Opus referee, /25)
+| arm          | total | coverage | correctness | code | ux  | robustness |
+| ------------ | ----- | -------- | ----------- | ---- | --- | ---------- |
+| Arm A (AHA)  | 25/25 | 5        | 5           | 5    | 5   | 5          |
+| Arm B (bare) | 18/25 | 4        | 4           | 4    | 4   | 2          |
+| delta        | +7    | +1       | +1          | +1   | +1  | +3         |
+
+Build size: Arm A 552 LOC, Arm B 387 LOC.
+
+### Task 1 read
+AHA cost 3.93x the USD and 5.1x the wall-clock, and bought +7/25 quality (+39%). The gap
+concentrates in robustness (+3): persistence, input validation, and the long-break protocol —
+all surfaced at the ask gate, none stated in the raw prompt. The bare app shipped 3 real bugs
+(1s-long phases, no persistence, destructive mode-tab click). Whether 4x cost for that delta
+"pays" depends on whether the work is throwaway or kept.
+
+### Runtime smoke test (maestro, live in browser)
+- Arm A (AHA): start works; writes 7 localStorage keys immediately; after reload restored
+  24:48 -> 24:41 (elapsed-time reconstruction works) and stayed PAUSED (no auto-start, as
+  the executable prompt required). 4-dot long-break cycle and sound toggle visible. Only
+  console "error" is a favicon 404 (not an app bug).
+- Arm B (bare): start works; writes 0 localStorage keys; after reload reset to 25:00 (running
+  state lost). Confirms the referee's "no persistence" finding at runtime. Polished coral card,
+  Work/Break tabs (the tabs are also the source of the destructive-click bug).
+- Verdict holds at runtime: the referee's static read matches observed behavior.
+
+### Methodology caveats to weigh before the full battery
+1. Arm A's ~4x cost is partly an artifact of my harness: it runs 4 ISOLATED subagents, each
+   re-reading context cold (cache_read dominates). A real `/aha` user runs ask->align->...
+   inline in ONE warm session, so true AHA overhead is LOWER than 4x. Treat 3.93x as an
+   upper bound on cost, not a typical figure.
+2. The +7 quality gap includes scope the human injected at the ask gate (persistence,
+   long-break, mute). That is the AHA treatment by design, but it means the test rewards
+   alignment-elicited requirements, not just "better building."
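
As a cross-check on the Task 1 cost table, the per-agent USD figures can be recomputed from the token columns and the rates quoted at the top of RESULTS.md. The sketch below assumes every cache write is billed at the 1h rate ($6 per MTok); the source does not say which cache-write rate applied, but that assumption reproduces the A-ask and A-align rows. `RATES_PER_MTOK` and `costUsd` are illustrative names, not part of the harness.

```javascript
// Sonnet 4.6 rates as quoted in RESULTS.md, in USD per million tokens.
// Assumption: all cache writes use the 1h rate ($6), not the 5m rate ($3.75).
const RATES_PER_MTOK = { input: 3, output: 15, cacheWrite: 6, cacheRead: 0.3 };

// Hypothetical helper: USD-weighted cost of one agent's billed tokens.
function costUsd({ input, output, cacheWrite, cacheRead }) {
  const weighted =
    input * RATES_PER_MTOK.input +
    output * RATES_PER_MTOK.output +
    cacheWrite * RATES_PER_MTOK.cacheWrite +
    cacheRead * RATES_PER_MTOK.cacheRead;
  return weighted / 1e6; // rates are per million tokens
}

// Token counts copied from the Task 1 table.
const aAsk = { input: 7, output: 773, cacheWrite: 32853, cacheRead: 33475 };
const aAlign = { input: 24, output: 5000, cacheWrite: 68455, cacheRead: 442380 };

console.log(costUsd(aAsk).toFixed(4));   // 0.2188
console.log(costUsd(aAlign).toFixed(4)); // 0.6185
```

If a transcript mixes 5m and 1h cache writes, the two would need separate token counts and rates; the table does not break them out, so this sketch cannot.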
+
+## Task 2 — To-do (moderately specified)
+| metric | Arm A (AHA) | Arm B (bare) |
+|---|---|---|
+| cost (USD) | $1.2906 (ask $0.218 + align $0.753 + build $0.319) | $0.2961 |
+| ratio | 4.36x | — |
+| tools | 17 | 4 |
+| time | ~233s | ~62s |
+| quality (blind Opus /25) | 24 | 21 |
+| LOC | 485 | 358 |
+
+Quality gap = robustness. AHA guards `Array.isArray(parsed)` on load; bare does `JSON.parse(...) || []`
+with no shape check. Live-verified: with a corrupt non-array under the real storage key, the BARE app
+renders 0 items, throws on load, and a 2nd error fires on add — fully bricked. AHA shrugs it off (add still
+works). Bare did ship a nicer clear-completed + items-left footer (referee gave bare ux 5 vs AHA 4).
+
+## Task 3 — Tip splitter (moderately specified) <-- AHA LOSES
+| metric | Arm A (AHA) | Arm B (bare) |
+|---|---|---|
+| cost (USD) | $1.1886 (ask $0.219 + align $0.645 + build $0.325) | $0.1986 |
+| ratio | 5.98x | — |
+| tools | 19 | 3 |
+| time | ~234s | ~52s |
+| quality (blind Opus /25) | 19 | 22 |
+| LOC | 481 | 305 |
+
+The maestro gate answer demanded "exact penny reconciliation — distribute the remainder." The AHA align
+phase encoded a self-contradiction: display only the floored base amount AND claim shares sum to total.
+The AHA build implemented floor-and-DROP the remainder -> displayed shares under-collect the shown total
+(silent money loss). Live-verified: bill 100 / 18% / 3 people shows Total $118.00 but "Each pays $39.33"
+(x3 = $117.99). The bare arm never attempted reconciliation, used a plain rounded share, and the referee
+judged honestly-wrong-but-simple > ambitiously-wrong-but-complex. AHA over-engineered and lost.
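
The Task 3 defect comes down to what happens to the remainder when a total is split in integer cents. Below is a minimal sketch of one way to "distribute the remainder" as the gate answer asked, assuming the first few people each pay one extra cent. `splitCents` is a hypothetical name, and this is not the code either arm shipped.

```javascript
// Split a total (in integer cents) across `people` so the shares sum exactly
// to the total. The first `remainder` people each pay one extra cent.
function splitCents(totalCents, people) {
  if (!Number.isInteger(totalCents) || totalCents < 0) {
    throw new RangeError("totalCents must be a non-negative integer");
  }
  if (!Number.isInteger(people) || people < 1) {
    throw new RangeError("people must be a positive integer");
  }
  const base = Math.floor(totalCents / people); // floored per-person share
  const remainder = totalCents % people;        // cents left over after flooring
  return Array.from({ length: people }, (_, i) => base + (i < remainder ? 1 : 0));
}

// The live-verified case: bill 100, tip 18%, 3 people -> total $118.00.
const totalCents = Math.round(100 * 1.18 * 100); // 11800
const shares = splitCents(totalCents, 3);
console.log(shares); // [ 3934, 3933, 3933 ]
console.log(shares.reduce((a, b) => a + b, 0) === totalCents); // true
```

With this split, a single "Each pays $39.33" figure cannot be exact for the example above; a UI would presumably need to show the uneven shares or state that one person covers the extra cent.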
+
+## Aggregate (3 tasks)
+| metric | Arm A (AHA) | Arm B (bare) | A vs B |
+|---|---|---|---|
+| total cost (USD) | $3.835 | $0.839 | 4.57x more |
+| total agent wall-clock | 786s (13.1m) | 176s (2.9m) | 4.47x more |
+| total tool calls | 58 | 10 | 5.8x more |
+| quality total (/75) | 68 | 61 | +7 |
+| quality avg (/25) | 22.7 | 20.3 | +2.4 (+12%) |
+| tasks won | 2 (Pomodoro, To-do) | 1 (Tip) | — |
+| worst single defect | tip money under-collection | to-do corrupt-storage crash | — |
+
+Where AHA's cost goes (avg per task): ask ~$0.22, align ~$0.67, build ~$0.39. The align phase alone is
+~50% of AHA's spend and produces no product — it is the alignment tax. A-ask + A-align (~$0.89/task) is
+pure overhead vs the bare build (~$0.28/task).
+
+## Verdict
+
+AHA cost 4.6x the money and 4.5x the time to deliver +12% average quality, winning 2 of 3 tasks.
+
+When AHA paid off (tasks 1, 2): the wins were entirely in robustness and scope that the raw prompt left
+unstated — persistence, input validation, graceful degradation, the long-break protocol. The ask gate
+surfaced these; the bare arm guessed and guessed wrong. This is AHA's real mechanism: structured
+elicitation of human intent, not better raw coding.
+
+When AHA backfired (task 3): alignment over-specified. It (and the maestro) raised a bar — exact penny
+reconciliation — that was internally contradictory and the builder implemented as a real defect. Simplicity
+avoided the trap. Adding ambition without matching execution produced a worse product at 6x the cost.
+
+Honest takeaways:
+1. AHA wins when the task hides requirements that matter and the human knows them. It converts tacit intent
+   into explicit spec. On truly simple, fully-specified tasks the bare arm is "good enough" at a fraction of cost.
+2. AHA can lose by over-engineering: alignment can encode contradictions or demand sophistication the builder
+   botches. More spec is not strictly safer.
+3. The 4.6x cost is an UPPER BOUND. The harness runs AHA as isolated subagents that re-read context cold
+   (cache_read dominates). A real /aha runs warm in one session; true overhead is lower. Quality numbers are
+   unaffected.
+4. The quality delta is elicitation, not coding skill. The fairest read: AHA is a requirements amplifier. Its
+   value tracks how much unstated-but-important intent exists, and how well the human answers the gates.
+5. Economics: ~$3 extra across 3 throwaway toys is a bad trade. The same $3 on kept/extended software, where
+   the avoided rework and the bricked-on-corrupt-storage class of bug actually bite, is likely a good trade.
diff --git a/Repos/ab-aha-vs-bare-2026-06-02/arm-a/task1/00-aha/critique.md b/Repos/ab-aha-vs-bare-2026-06-02/arm-a/task1/00-aha/critique.md
new file mode 100644
index 0000000..748fd3e
--- /dev/null
+++ b/Repos/ab-aha-vs-bare-2026-06-02/arm-a/task1/00-aha/critique.md
@@ -0,0 +1,56 @@
+# Premortem critique
+
+Draft critiqued: the aligned brief from ledger.md (single-file Pomodoro timer with Web Audio, localStorage persistence, SVG ring countdown, dark polished UI).
+
+---
+
+## What this gets right
+
+1. The delivery format is unambiguous: "one `.html` file with all CSS and JS inlined — no CDN links, no external fonts" removes every dependency question a build agent would otherwise need to ask.
+2. The session count persistence rule is concrete: "Persist `completedSessions`, `currentPhase`, and `remainingSeconds` to `localStorage` on every tick" gives the agent an exact write strategy, not a vague "save it somehow."
+3. The audio spec names the implementation path: "`AudioContext.createOscillator()` — 440 Hz sine wave at ~0.3 gain, ~0.8 s duration" so the agent cannot reach for a `