omerakben · omerakben · Jun 2, 2026 · Jun 2, 2026 · Jun 2, 2026
diff --git a/Repos/ab-aha-vs-bare-2026-06-02/README.md b/Repos/ab-aha-vs-bare-2026-06-02/README.md
@@ -0,0 +1,32 @@
+# AHA A/B test — align-then-build vs build-direct
+
+Experiment date: 2026-06-02. Model held constant: claude-sonnet-4-6 (every builder and AHA
+alignment agent). Referee: Opus 4.8, blind. Design spec (local-only): docs/superpowers/specs/2026-06-02-aha-ab-test-design.md
+
+## Question
+What does the AHA framework actually buy? AHA is a planning layer (ask -> align -> critique ->
+optimize -> handoff) whose output is an aligned executable prompt; the build happens downstream.
+So: Arm A = run AHA on a fuzzy prompt, then a Sonnet builder executes the aligned prompt. Arm B =
+a Sonnet builder executes the raw prompt directly. Same model. Measure cost, time, resource use, quality.
+
+## Layout
+- prompts/        the shared moderately-specified starting prompt per task (both arms start here)
+- gate-answers/   the maestro's live answers to AHA's ask-phase gate, per task
+- arm-a/taskN/00-aha/   AHA artifacts: ledger, critique, executable-prompt, handoff-packet
+- arm-a/taskN/app/      Arm A built product
+- arm-b/taskN/app/      Arm B built product
+- metrics/raw/tokens.md exact per-agent token + USD table
+- metrics/judge/        blind referee cards + blind-key (X/Y -> arm mapping)
+- metrics/screenshots/  runtime screenshots
+- RESULTS.md            scorecard, runtime verification, aggregate, verdict
+
+## Headline
+Cost A/B 4.6x. Quality 68 vs 61 (/75). AHA won Pomodoro + To-do (robustness/scope surfaced at the
+ask gate), lost Tip splitter (alignment over-specified penny reconciliation; build dropped the
+remainder -> silent money loss). AHA is a requirements amplifier: it pays when unstated intent
+matters and the human answers well; it backfires when alignment adds ambition the build botches.
+
+## Reproduce
+Each agent's exact prompt is in RESULTS.md / this folder. Tokens were summed from each subagent's
+transcript JSONL (~/.claude/projects/<proj>/<session>/subagents/agent-*.jsonl), USD-weighted at
+Sonnet 4.6 rates. To view an app: `python3 -m http.server` in this dir, open arm-a|b/taskN/app/index.html.
diff --git a/Repos/ab-aha-vs-bare-2026-06-02/RESULTS.md b/Repos/ab-aha-vs-bare-2026-06-02/RESULTS.md
@@ -0,0 +1,207 @@
+# AHA A/B test — results
+
+Spec: docs/superpowers/specs/2026-06-02-aha-ab-test-design.md (local-only)
+Model held constant: claude-sonnet-4-6 for every builder + alignment agent. Referee: Opus 4.8, blind.
+Cost = USD-weighted billed tokens (Sonnet 4.6 rates: in $3, out $15, cache-write $3.75/5m or $6/1h, cache-read $0.30 per MTok).
+
+## Task 1 — Pomodoro (prompt: moderately specified)
+
+### Cost and resource use (all agents verified claude-sonnet-4-6)
+| agent        | USD     | input | output | cache-wr | cache-rd | tools | time |
+| ------------ | ------- | ----- | ------ | -------- | -------- | ----- | ---- |
+| A-ask        | $0.2188 | 7     | 773    | 32,853   | 33,475   | 2     | 22s  |
+| A-align      | $0.6185 | 24    | 5,000  | 68,455   | 442,380  | 12    | 173s |
+| A-build      | $0.5182 | 17    | 7,790  | 50,761   | 322,599  | 8     | 124s |
+| Arm A (AHA)  | $1.3555 | 48    | 13,563 | 152,069  | 798,454  | 22    | 319s |
+| Arm B (bare) | $0.3447 | 14    | 384*   | 51,803   | 93,480   | 3     | 62s  |
+| ratio A / B  | 3.93x   |       |        |          |          | 7.3x  | 5.1x |
+
+\* B-build's big Write was billed mostly under cache-creation, not output (serving-layer
+categorization quirk). Total billed cost of the turn is captured; only the category split is odd.
+
+### Quality (blind Opus referee, /25)
+| arm          | total | coverage | correctness | code | ux  | robustness |
+| ------------ | ----- | -------- | ----------- | ---- | --- | ---------- |
+| Arm A (AHA)  | 25/25 | 5        | 5           | 5    | 5   | 5          |
+| Arm B (bare) | 18/25 | 4        | 4           | 4    | 4   | 2          |
+| delta        | +7    | +1       | +1          | +1   | +1  | +3         |
+
+Build size: Arm A 552 LOC, Arm B 387 LOC.
+
+### Task 1 read
+AHA cost 3.93x the USD and 5.1x the wall-clock, and bought +7/25 quality (+39%). The gap
+concentrates in robustness (+3): persistence, input validation, and the long-break protocol —
+all surfaced at the ask gate, none stated in the raw prompt. The bare app shipped 3 real bugs
+(1s-long phases, no persistence, destructive mode-tab click). Whether 4x cost for that delta
+"pays" depends on whether the work is throwaway or kept.
+
+### Runtime smoke test (maestro, live in browser)
+- Arm A (AHA): start works; writes 7 localStorage keys immediately; after reload restored
+  24:48 -> 24:41 (elapsed-time reconstruction works) and stayed PAUSED (no auto-start, as
+  the executable prompt required). 4-dot long-break cycle and sound toggle visible. Only
+  console "error" is a favicon 404 (not an app bug).
+- Arm B (bare): start works; writes 0 localStorage keys; after reload reset to 25:00 (running
+  state lost). Confirms the referee's "no persistence" finding at runtime. Polished coral card,
+  Work/Break tabs (the tabs are also the source of the destructive-click bug).
+- Verdict holds at runtime: the referee's static read matches observed behavior.
+
+### Methodology caveats to weigh before the full battery
+1. Arm A's ~4x cost is partly an artifact of my harness: it runs 4 ISOLATED subagents, each
+   re-reading context cold (cache_read dominates). A real `/aha` user runs ask->align->...
+   inline in ONE warm session, so true AHA overhead is LOWER than 4x. Treat 3.93x as an
+   upper bound on cost, not a typical figure.
+2. The +7 quality gap includes scope the human injected at the ask gate (persistence,
+   long-break, mute). That is the AHA treatment by design, but it means the test rewards
+   alignment-elicited requirements, not just "better building."
+
+## Task 2 — To-do (moderately specified)
+| metric | Arm A (AHA) | Arm B (bare) |
+|---|---|---|
+| cost (USD) | $1.2906 (ask $0.218 + align $0.753 + build $0.319) | $0.2961 |
+| ratio | 4.36x | — |
+| tools | 17 | 4 |
+| time | ~233s | ~62s |
+| quality (blind Opus /25) | 24 | 21 |
+| LOC | 485 | 358 |
+
+Quality gap = robustness. AHA guards `Array.isArray(parsed)` on load; bare does `JSON.parse(...) || []`
+with no shape check. Live-verified: with a corrupt non-array under the real storage key, the BARE app
+renders 0 items, throws on load, and a 2nd error fires on add — fully bricked. AHA shrugs it off (add still
+works). Bare did ship a nicer clear-completed + items-left footer (referee gave bare ux 5 vs AHA 4).
+
+## Task 3 — Tip splitter (moderately specified)   <-- AHA LOSES
+| metric | Arm A (AHA) | Arm B (bare) |
+|---|---|---|
+| cost (USD) | $1.1886 (ask $0.219 + align $0.645 + build $0.325) | $0.1986 |
+| ratio | 5.98x | — |
+| tools | 19 | 3 |
+| time | ~234s | ~52s |
+| quality (blind Opus /25) | 19 | 22 |
+| LOC | 481 | 305 |
+
+The maestro gate answer demanded "exact penny reconciliation — distribute the remainder." The AHA align
+phase encoded a self-contradiction: display only the floored base amount AND claim shares sum to total.
+The AHA build implemented floor-and-DROP the remainder -> displayed shares under-collect the shown total
+(silent money loss). Live-verified: bill 100 / 18% / 3 people shows Total $118.00 but "Each pays $39.33"
+(x3 = $117.99). The bare arm never attempted reconciliation, used a plain rounded share, and the referee
+judged honestly-wrong-but-simple > ambitiously-wrong-but-complex. AHA over-engineered and lost.
+
+## Aggregate (3 tasks)
+| metric | Arm A (AHA) | Arm B (bare) | A vs B |
+|---|---|---|---|
+| total cost (USD) | $3.835 | $0.839 | 4.57x more |
+| total agent wall-clock | 786s (13.1m) | 176s (2.9m) | 4.47x more |
+| total tool calls | 58 | 10 | 5.8x more |
+| quality total (/75) | 68 | 61 | +7 |
+| quality avg (/25) | 22.7 | 20.3 | +2.4 (+12%) |
+| tasks won | 2 (Pomodoro, To-do) | 1 (Tip) | — |
+| worst single defect | tip money under-collection | to-do corrupt-storage crash | — |
+
+Where AHA's cost goes (avg per task): ask ~$0.22, align ~$0.67, build ~$0.34. The align phase alone is
+~50% of AHA's spend and produces no product — it is the alignment tax. A-ask + A-align (~$0.89/task) is
+pure overhead vs the bare build (~$0.28/task).
+
+## Verdict
+
+AHA cost 4.6x the money and 4.5x the time to deliver +12% average quality, winning 2 of 3 tasks.
+
+When AHA paid off (tasks 1, 2): the wins were entirely in robustness and scope that the raw prompt left
+unstated — persistence, input validation, graceful degradation, the long-break protocol. The ask gate
+surfaced these; the bare arm guessed and guessed wrong. This is AHA's real mechanism: structured
+elicitation of human intent, not better raw coding.
+
+When AHA backfired (task 3): alignment over-specified. It (and the maestro) raised a bar — exact penny
+reconciliation — that was internally contradictory and the builder implemented as a real defect. Simplicity
+avoided the trap. Adding ambition without matching execution produced a worse product at 6x the cost.
+
+Honest takeaways:
+1. AHA wins when the task hides requirements that matter and the human knows them. It converts tacit intent
+   into explicit spec. On truly simple, fully-specified tasks the bare arm is "good enough" at a fraction of cost.
+2. AHA can lose by over-engineering: alignment can encode contradictions or demand sophistication the builder
+   botches. More spec is not strictly safer.
+3. The 4.6x cost is an UPPER BOUND. The harness runs AHA as isolated subagents that re-read context cold
+   (cache_read dominates). A real /aha runs warm in one session; true overhead is lower. Quality numbers are
+   unaffected.
+4. The quality delta is elicitation, not coding skill. The fairest read: AHA is a requirements amplifier. Its
+   value tracks how much unstated-but-important intent exists, and how well the human answers the gates.
+5. Economics: ~$3 extra across 3 throwaway toys is a bad trade. The same $3 on kept/extended software, where
+   the avoided rework and the bricked-on-corrupt-storage class of bug actually bite, is likely a good trade.
+
+## Scope and external validity (added after maestro review)
+
+This test measured AHA in the regime where it is weakest: tiny, zero-context, single-file tasks that
+a one-shot prompt already builds well. That is, by construction, the wrong place to see AHA's main mechanism.
+
+Model the cost as two parts:
+- Alignment cost (ask + align): roughly FIXED per task (~$0.9 here), nearly independent of codebase size.
+- Build cost: on a real project, dominated by the builder's blind exploration — reading many files,
+  rebuilding context across turns, wandering down wrong paths, verbose dead-ends, scope creep.
+
+On a toy task, exploration is ~0, so alignment is pure overhead and AHA looks 4.6x more expensive. On a
+real multi-layer pre-production codebase, exploration is the dominant token sink, and a bounded handoff
+(contract, what-not-to-touch, acceptance checks) caps it. Past a CROSSOVER point — the codebase
+size/complexity where exploration savings exceed the fixed alignment cost — AHA REDUCES total tokens and
+cost rather than raising them. This experiment sits far below that crossover; real projects sit far above it.
+
+The four token sinks AHA attacks in the real regime: (1) blind file/context discovery, (2) context
+rebuild/re-reading across turns, (3) wandering and rework down wrong paths, (4) verbosity and scope creep.
+ask-me + align-me convert tacit intent into an explicit contract so the builder reads with intent instead
+of crawling the repo at random.
+
+Caveat that does not vanish at scale: AHA's win scales with exploration savings AND with alignment
+correctness. A wrong or contradictory brief (see task 3) sends the builder confidently down a bounded wrong
+path — cheaper to run, but the error is structural and costlier to unwind in a real codebase. AHA lowers
+exploration cost and raises the stakes on getting alignment right; that is where the human's gate answers
+carry the weight.
+
+Refined verdict: AHA is a poor trade for quick, one-shot, well-understood tasks (toys, scripts). It is
+spot-on for ongoing, multi-layer, real projects in or near production, where it shifts from cost-multiplier
+to cost-reducer by replacing blind exploration with an explicit contract. The toy battery here is a lower
+bound on AHA's value, not a representative sample.
+
+## Production validation (operator-reported, 2026-06) — the real-regime ceiling
+
+Not lab-controlled. Reported by the operator (Ozzy) from production use. Included as the above-crossover
+data point the toy battery could not produce. It is the empirical counterpart to the "Scope and external
+validity" model above.
+
+Environment: a large production Playwright test framework with a multi-engineer team and a deep automated suite (smoke, regression, UI, e2e). Far
+above the toy battery's scale and well past the AHA crossover point.
+
+Workflow (AHA pipeline with deliberate cross-model routing):
+1. Run the full /aha pipeline on a cheap fast model (Cursor Composer 2.5, agent mode) -> get the TLDR and
+   the delivered executable prompt.
+2. Open a new session on Opus 4.8; have Opus build a plan from AHA's delivered prompt (one high-leverage step).
+3. Return to Composer 2.5 to execute the plan and create the Playwright tests.
+
+Compared configurations:
+- Bare: Opus + Sonnet
+- Bare: Opus + Composer
+- AHA pipeline: AHA(Composer) -> Opus plan -> Composer execute
+
+Reported result: the AHA pipeline ran ~30% cheaper overall (test creation + validation) than the AVERAGE
+of the two bare configurations.
+
+Quality, concretely (operator-observed, this is an SDET's read of generated test code):
+- AHA arm: generated tests matched the actual acceptance criteria (the real intent) and ran correctly.
+- Bare arm: tests did not run on the first attempt; on review the generated test steps were missing; and
+  one test case was a FALSE POSITIVE — the cheap executor (Composer) shortcut the AC, writing a plain
+  boolean true/false assertion instead of extracting the real value and cross-asserting it (which the AC
+  required). A vacuous test that passes without validating behavior.
+- Why AHA prevented it: AHA's handoff carries explicit acceptance checks / AC, so the executor cannot
+  shortcut to a trivially-true assertion. The contract forces real value extraction and cross-assertion.
+  Without that contract the bare agent took the path of least resistance (reward-hacked the task to green).
+
+This is the highest-value catch in test automation: a false-positive regression test is a silent hole in a
+large regression suite, worse than no test because it manufactures false confidence the whole team then trusts.
+The AC in AHA's aligned prompt is an anti-shortcut guardrail. It is the same mechanism that gave the toy
+AHA builds their robustness/correctness edge, operating here at production stakes.
+
+Why the cost result is consistent with the model: above the crossover, the bare configs spend premium-model
+tokens on blind exploration and rework inside a single session. AHA produces a portable, context-free
+executable prompt that decouples thinking from doing, so each step runs on the cost-appropriate model: cheap
+model for the high-volume alignment and execution, premium model (Opus) for the single high-leverage planning
+step. Premium tokens are spent surgically instead of burned wandering. Net: lower cost AND better output.
+
+The model-routing point generalizes: AHA's clean handoff is what makes cheap-model substitution viable for
+the bulk work. A bare agent keeps its context in one expensive session and cannot route cheaply.
diff --git a/Repos/ab-aha-vs-bare-2026-06-02/arm-a/task1/00-aha/critique.md b/Repos/ab-aha-vs-bare-2026-06-02/arm-a/task1/00-aha/critique.md
@@ -0,0 +1,56 @@
+# Premortem critique
+
+Draft critiqued: the aligned brief from ledger.md (single-file Pomodoro timer with Web Audio, localStorage persistence, SVG ring countdown, dark polished UI).
+
+---
+
+## What this gets right
+
+1. The delivery format is unambiguous: "one `.html` file with all CSS and JS inlined — no CDN links, no external fonts" removes every dependency question a build agent would otherwise need to ask.
+2. The session count persistence rule is concrete: "Persist `completedSessions`, `currentPhase`, and `remainingSeconds` to `localStorage` on every tick" gives the agent an exact write strategy, not a vague "save it somehow."
+3. The audio spec names the implementation path: "`AudioContext.createOscillator()` — 440 Hz sine wave at ~0.3 gain, ~0.8 s duration" so the agent cannot reach for a `<audio src>` tag or an npm package.
+
+---
+
+## Where this fails
+
+### 1. AudioContext is blocked on first user gesture in all modern browsers (H)
+
+The brief specifies playing a tone "at session end" but does not require the AudioContext to be created or resumed inside a user-gesture handler. Browsers block AudioContext creation (or leave it in `suspended` state) until the user interacts with the page. If the timer auto-starts from a restored localStorage state, the very first tone will be silenced and the browser console will log an uncaught rejection.
+
+Fix: add an explicit rule — "Create the AudioContext lazily, on the first user click on any button. Before playing any tone call `audioCtx.resume()` and await the promise."
+
+Severity: H
+
+### 2. localStorage restore can resurrect a phantom running timer (H)
+
+The brief says "save phase and remaining seconds so reloading mid-session resumes." But if the tab closed while the timer was running, there is no way to know whether 10 seconds or 10 minutes have elapsed since the save. Restoring `remainingSeconds` directly would show a time that is wrong by the gap, and auto-resuming the timer would silently skip break triggers. The brief has no rule for handling elapsed wall-clock time.
+
+Fix: add the rule — "On page load, if a saved `lastSavedAt` timestamp exists, compute elapsed wall-clock seconds since that timestamp, subtract from `remainingSeconds`, clamp to zero, and advance phase if the result is negative. Do not auto-start the timer on load; show a 'Resume' button."
+
+Severity: H
+
+### 3. The reset behavior is ambiguous for the in-progress session (M)
+
+The brief says "reset clears current phase timer back to its starting duration and stops the clock." It does not state whether pressing reset during a break resets the break or resets the entire sequence back to a work session. An agent will have to guess. If it resets to work always, a user who accidentally hits reset during a break gets dropped out of their rhythm unexpectedly.
+
+Fix: state "Reset returns the current phase to its starting duration without changing the phase type. A separate 'Restart' button (or long-press) returns to work-phase 1 if that behavior is wanted; it is not required here."
+
+Severity: M
+
+---
+
+## The single change with the highest leverage
+
+Merge the wall-clock drift rule into the brief. The localStorage restore is the only piece of behavior that can silently corrupt the timer state in a way the user cannot see or recover from. If `remainingSeconds` is restored naively, a user who closes the tab for 20 minutes during a work session will reopen to a timer showing 15 minutes remaining, skip the break they were owed, and accumulate incorrect session counts. The AudioContext gesture problem is also real, but a muted tone is merely annoying; a wrong session count is a data integrity failure. Adding the `lastSavedAt` timestamp and the elapsed-time correction at load takes fewer than 10 lines of JavaScript and closes both the drift problem and the auto-start problem at once.
+
+---
+
+## Fixes merged into brief_v2
+
+The following High-severity findings are merged directly into the executable brief (brief_v2):
+
+- **H1 (AudioContext gesture block)**: added rule — create AudioContext lazily on first button click; call `audioCtx.resume()` and await before every tone play.
+- **H2 (localStorage wall-clock drift)**: added rule — save `lastSavedAt` timestamp on every localStorage write; on page load, compute elapsed seconds since `lastSavedAt`, subtract from `remainingSeconds`, clamp to zero, advance phases as needed if the result is negative; do not auto-start on load, show a Resume button instead.
+
+Medium finding M3 (reset phase ambiguity) is carried into the brief as a clarifying constraint: reset returns the current phase to its full starting duration without switching phase type.