Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
32 changes: 32 additions & 0 deletions Repos/ab-aha-vs-bare-2026-06-02/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
# AHA A/B test — align-then-build vs build-direct

Experiment date: 2026-06-02. Model held constant: claude-sonnet-4-6 (every builder and AHA
alignment agent). Referee: Opus 4.8, blind. Design spec (local-only): docs/superpowers/specs/2026-06-02-aha-ab-test-design.md

## Question
What does the AHA framework actually buy? AHA is a planning layer (ask -> align -> critique ->
optimize -> handoff) whose output is an aligned executable prompt; the build happens downstream.
So: Arm A = run AHA on a fuzzy prompt, then a Sonnet builder executes the aligned prompt. Arm B =
a Sonnet builder executes the raw prompt directly. Same model. Measure cost, time, resource use, quality.

## Layout
- prompts/ the shared moderately-specified starting prompt per task (both arms start here)
- gate-answers/ the maestro's live answers to AHA's ask-phase gate, per task
- arm-a/taskN/00-aha/ AHA artifacts: ledger, critique, executable-prompt, handoff-packet
- arm-a/taskN/app/ Arm A built product
- arm-b/taskN/app/ Arm B built product
- metrics/raw/tokens.md exact per-agent token + USD table
- metrics/judge/ blind referee cards + blind-key (X/Y -> arm mapping)
- metrics/screenshots/ runtime screenshots
- RESULTS.md scorecard, runtime verification, aggregate, verdict

## Headline
Cost A/B 4.6x. Quality 68 vs 61 (/75). AHA won Pomodoro + To-do (robustness/scope surfaced at the
ask gate), lost Tip splitter (alignment over-specified penny reconciliation; build dropped the
remainder -> silent money loss). AHA is a requirements amplifier: it pays when unstated intent
matters and the human answers well; it backfires when alignment adds ambition the build botches.

## Reproduce
Each agent's exact prompt is in RESULTS.md / this folder. Tokens were summed from each subagent's
transcript JSONL (~/.claude/projects/<proj>/<session>/subagents/agent-*.jsonl), USD-weighted at
Sonnet 4.6 rates. To view an app: `python3 -m http.server` in this dir, open arm-a|b/taskN/app/index.html.
Comment on lines +29 to +32
207 changes: 207 additions & 0 deletions Repos/ab-aha-vs-bare-2026-06-02/RESULTS.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,207 @@
# AHA A/B test — results

Spec: docs/superpowers/specs/2026-06-02-aha-ab-test-design.md (local-only)
Model held constant: claude-sonnet-4-6 for every builder + alignment agent. Referee: Opus 4.8, blind.
Cost = USD-weighted billed tokens (Sonnet 4.6 rates: in $3, out $15, cache-write $3.75/5m or $6/1h, cache-read $0.30 per MTok).

## Task 1 — Pomodoro (prompt: moderately specified)

### Cost and resource use (all agents verified claude-sonnet-4-6)
| agent | USD | input | output | cache-wr | cache-rd | tools | time |
| ------------ | ------- | ----- | ------ | -------- | -------- | ----- | ---- |
| A-ask | $0.2188 | 7 | 773 | 32,853 | 33,475 | 2 | 22s |
| A-align | $0.6185 | 24 | 5,000 | 68,455 | 442,380 | 12 | 173s |
| A-build | $0.5182 | 17 | 7,790 | 50,761 | 322,599 | 8 | 124s |
| Arm A (AHA) | $1.3555 | 48 | 13,563 | 152,069 | 798,454 | 22 | 319s |
| Arm B (bare) | $0.3447 | 14 | 384* | 51,803 | 93,480 | 3 | 62s |
| ratio A / B | 3.93x | | | | | 7.3x | 5.1x |

\* B-build's big Write was billed mostly under cache-creation, not output (serving-layer
categorization quirk). Total billed cost of the turn is captured; only the category split is odd.

### Quality (blind Opus referee, /25)
| arm | total | coverage | correctness | code | ux | robustness |
| ------------ | ----- | -------- | ----------- | ---- | --- | ---------- |
| Arm A (AHA) | 25/25 | 5 | 5 | 5 | 5 | 5 |
| Arm B (bare) | 18/25 | 4 | 4 | 4 | 4 | 2 |
| delta | +7 | +1 | +1 | +1 | +1 | +3 |

Build size: Arm A 552 LOC, Arm B 387 LOC.

### Task 1 read
AHA cost 3.93x the USD and 5.1x the wall-clock, and bought +7/25 quality (+39%). The gap
concentrates in robustness (+3): persistence, input validation, and the long-break protocol —
all surfaced at the ask gate, none stated in the raw prompt. The bare app shipped 3 real bugs
(1s-long phases, no persistence, destructive mode-tab click). Whether 4x cost for that delta
"pays" depends on whether the work is throwaway or kept.

### Runtime smoke test (maestro, live in browser)
- Arm A (AHA): start works; writes 7 localStorage keys immediately; after reload restored
24:48 -> 24:41 (elapsed-time reconstruction works) and stayed PAUSED (no auto-start, as
the executable prompt required). 4-dot long-break cycle and sound toggle visible. Only
console "error" is a favicon 404 (not an app bug).
- Arm B (bare): start works; writes 0 localStorage keys; after reload reset to 25:00 (running
state lost). Confirms the referee's "no persistence" finding at runtime. Polished coral card,
Work/Break tabs (the tabs are also the source of the destructive-click bug).
- Verdict holds at runtime: the referee's static read matches observed behavior.

### Methodology caveats to weigh before the full battery
1. Arm A's ~4x cost is partly an artifact of my harness: it runs 4 ISOLATED subagents, each
re-reading context cold (cache_read dominates). A real `/aha` user runs ask->align->...
inline in ONE warm session, so true AHA overhead is LOWER than 4x. Treat 3.93x as an
upper bound on cost, not a typical figure.
2. The +7 quality gap includes scope the human injected at the ask gate (persistence,
long-break, mute). That is the AHA treatment by design, but it means the test rewards
alignment-elicited requirements, not just "better building."

## Task 2 — To-do (moderately specified)
| metric | Arm A (AHA) | Arm B (bare) |
|---|---|---|
| cost (USD) | $1.2906 (ask $0.218 + align $0.753 + build $0.319) | $0.2961 |
| ratio | 4.36x | — |
| tools | 17 | 4 |
| time | ~233s | ~62s |
| quality (blind Opus /25) | 24 | 21 |
| LOC | 485 | 358 |

Quality gap = robustness. AHA guards `Array.isArray(parsed)` on load; bare does `JSON.parse(...) || []`
with no shape check. Live-verified: with a corrupt non-array under the real storage key, the BARE app
renders 0 items, throws on load, and a 2nd error fires on add — fully bricked. AHA shrugs it off (add still
works). Bare did ship a nicer clear-completed + items-left footer (referee gave bare ux 5 vs AHA 4).

## Task 3 — Tip splitter (moderately specified) <-- AHA LOSES
| metric | Arm A (AHA) | Arm B (bare) |
|---|---|---|
| cost (USD) | $1.1886 (ask $0.219 + align $0.645 + build $0.325) | $0.1986 |
| ratio | 5.98x | — |
| tools | 19 | 3 |
| time | ~234s | ~52s |
| quality (blind Opus /25) | 19 | 22 |
| LOC | 481 | 305 |

The maestro gate answer demanded "exact penny reconciliation — distribute the remainder." The AHA align
phase encoded a self-contradiction: display only the floored base amount AND claim shares sum to total.
The AHA build implemented floor-and-DROP the remainder -> displayed shares under-collect the shown total
(silent money loss). Live-verified: bill 100 / 18% / 3 people shows Total $118.00 but "Each pays $39.33"
(x3 = $117.99). The bare arm never attempted reconciliation, used a plain rounded share, and the referee
judged honestly-wrong-but-simple > ambitiously-wrong-but-complex. AHA over-engineered and lost.

## Aggregate (3 tasks)
| metric | Arm A (AHA) | Arm B (bare) | A vs B |
|---|---|---|---|
| total cost (USD) | $3.835 | $0.839 | 4.57x more |
| total agent wall-clock | 786s (13.1m) | 176s (2.9m) | 4.47x more |
| total tool calls | 58 | 10 | 5.8x more |
| quality total (/75) | 68 | 61 | +7 |
| quality avg (/25) | 22.7 | 20.3 | +2.4 (+12%) |
| tasks won | 2 (Pomodoro, To-do) | 1 (Tip) | — |
| worst single defect | tip money under-collection | to-do corrupt-storage crash | — |

Where AHA's cost goes (avg per task): ask ~$0.22, align ~$0.67, build ~$0.34. The align phase alone is
~50% of AHA's spend and produces no product — it is the alignment tax. A-ask + A-align (~$0.89/task) is
pure overhead vs the bare build (~$0.28/task).

## Verdict

AHA cost 4.6x the money and 4.5x the time to deliver +12% average quality, winning 2 of 3 tasks.

When AHA paid off (tasks 1, 2): the wins were entirely in robustness and scope that the raw prompt left
unstated — persistence, input validation, graceful degradation, the long-break protocol. The ask gate
surfaced these; the bare arm guessed and guessed wrong. This is AHA's real mechanism: structured
elicitation of human intent, not better raw coding.

When AHA backfired (task 3): alignment over-specified. It (and the maestro) raised a bar — exact penny
reconciliation — that was internally contradictory and the builder implemented as a real defect. Simplicity
avoided the trap. Adding ambition without matching execution produced a worse product at 6x the cost.

Honest takeaways:
1. AHA wins when the task hides requirements that matter and the human knows them. It converts tacit intent
into explicit spec. On truly simple, fully-specified tasks the bare arm is "good enough" at a fraction of cost.
2. AHA can lose by over-engineering: alignment can encode contradictions or demand sophistication the builder
botches. More spec is not strictly safer.
3. The 4.6x cost is an UPPER BOUND. The harness runs AHA as isolated subagents that re-read context cold
(cache_read dominates). A real /aha runs warm in one session; true overhead is lower. Quality numbers are
unaffected.
4. The quality delta is elicitation, not coding skill. The fairest read: AHA is a requirements amplifier. Its
value tracks how much unstated-but-important intent exists, and how well the human answers the gates.
5. Economics: ~$3 extra across 3 throwaway toys is a bad trade. The same $3 on kept/extended software, where
the avoided rework and the bricked-on-corrupt-storage class of bug actually bite, is likely a good trade.

## Scope and external validity (added after maestro review)

This test measured AHA in the regime where it is weakest: tiny, zero-context, single-file tasks that
a one-shot prompt already builds well. That is, by construction, the wrong place to see AHA's main mechanism.

Model the cost as two parts:
- Alignment cost (ask + align): roughly FIXED per task (~$0.9 here), nearly independent of codebase size.
- Build cost: on a real project, dominated by the builder's blind exploration — reading many files,
rebuilding context across turns, wandering down wrong paths, verbose dead-ends, scope creep.

On a toy task, exploration is ~0, so alignment is pure overhead and AHA looks 4.6x more expensive. On a
real multi-layer pre-production codebase, exploration is the dominant token sink, and a bounded handoff
(contract, what-not-to-touch, acceptance checks) caps it. Past a CROSSOVER point — the codebase
size/complexity where exploration savings exceed the fixed alignment cost — AHA REDUCES total tokens and
cost rather than raising them. This experiment sits far below that crossover; real projects sit far above it.

The four token sinks AHA attacks in the real regime: (1) blind file/context discovery, (2) context
rebuild/re-reading across turns, (3) wandering and rework down wrong paths, (4) verbosity and scope creep.
ask-me + align-me convert tacit intent into an explicit contract so the builder reads with intent instead
of crawling the repo at random.

Caveat that does not vanish at scale: AHA's win scales with exploration savings AND with alignment
correctness. A wrong or contradictory brief (see task 3) sends the builder confidently down a bounded wrong
path — cheaper to run, but the error is structural and costlier to unwind in a real codebase. AHA lowers
exploration cost and raises the stakes on getting alignment right; that is where the human's gate answers
carry the weight.

Refined verdict: AHA is a poor trade for quick, one-shot, well-understood tasks (toys, scripts). It is
spot-on for ongoing, multi-layer, real projects in or near production, where it shifts from cost-multiplier
to cost-reducer by replacing blind exploration with an explicit contract. The toy battery here is a lower
bound on AHA's value, not a representative sample.

## Production validation (operator-reported, 2026-06) — the real-regime ceiling

Not lab-controlled. Reported by the operator (Ozzy) from production use. Included as the above-crossover
data point the toy battery could not produce. It is the empirical counterpart to the "Scope and external
validity" model above.
Comment on lines +162 to +166

Environment: a large production Playwright test framework with a multi-engineer team and a deep automated suite (smoke, regression, UI, e2e). Far
above the toy battery's scale and well past the AHA crossover point.

Workflow (AHA pipeline with deliberate cross-model routing):
1. Run the full /aha pipeline on a cheap fast model (Cursor Composer 2.5, agent mode) -> get the TLDR and
the delivered executable prompt.
2. Open a new session on Opus 4.8; have Opus build a plan from AHA's delivered prompt (one high-leverage step).
3. Return to Composer 2.5 to execute the plan and create the Playwright tests.

Compared configurations:
- Bare: Opus + Sonnet
- Bare: Opus + Composer
- AHA pipeline: AHA(Composer) -> Opus plan -> Composer execute

Reported result: the AHA pipeline ran ~30% cheaper overall (test creation + validation) than the AVERAGE
of the two bare configurations.

Quality, concretely (operator-observed, this is an SDET's read of generated test code):
- AHA arm: generated tests matched the actual acceptance criteria (the real intent) and ran correctly.
- Bare arm: tests did not run on the first attempt; on review the generated test steps were missing; and
one test case was a FALSE POSITIVE — the cheap executor (Composer) shortcut the AC, writing a plain
boolean true/false assertion instead of extracting the real value and cross-asserting it (which the AC
required). A vacuous test that passes without validating behavior.
- Why AHA prevented it: AHA's handoff carries explicit acceptance checks / AC, so the executor cannot
shortcut to a trivially-true assertion. The contract forces real value extraction and cross-assertion.
Without that contract the bare agent took the path of least resistance (reward-hacked the task to green).

This is the highest-value catch in test automation: a false-positive regression test is a silent hole in a
large regression suite, worse than no test because it manufactures false confidence the whole team then trusts.
The AC in AHA's aligned prompt is an anti-shortcut guardrail. It is the same mechanism that gave the toy
AHA builds their robustness/correctness edge, operating here at production stakes.

Why the cost result is consistent with the model: above the crossover, the bare configs spend premium-model
tokens on blind exploration and rework inside a single session. AHA produces a portable, context-free
executable prompt that decouples thinking from doing, so each step runs on the cost-appropriate model: cheap
model for the high-volume alignment and execution, premium model (Opus) for the single high-leverage planning
step. Premium tokens are spent surgically instead of burned wandering. Net: lower cost AND better output.

The model-routing point generalizes: AHA's clean handoff is what makes cheap-model substitution viable for
the bulk work. A bare agent keeps its context in one expensive session and cannot route cheaply.
56 changes: 56 additions & 0 deletions Repos/ab-aha-vs-bare-2026-06-02/arm-a/task1/00-aha/critique.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,56 @@
# Premortem critique

Draft critiqued: the aligned brief from ledger.md (single-file Pomodoro timer with Web Audio, localStorage persistence, SVG ring countdown, dark polished UI).

---

## What this gets right

1. The delivery format is unambiguous: "one `.html` file with all CSS and JS inlined — no CDN links, no external fonts" removes every dependency question a build agent would otherwise need to ask.
2. The session count persistence rule is concrete: "Persist `completedSessions`, `currentPhase`, and `remainingSeconds` to `localStorage` on every tick" gives the agent an exact write strategy, not a vague "save it somehow."
3. The audio spec names the implementation path: "`AudioContext.createOscillator()` — 440 Hz sine wave at ~0.3 gain, ~0.8 s duration" so the agent cannot reach for a `<audio src>` tag or an npm package.

---

## Where this fails

### 1. AudioContext is blocked on first user gesture in all modern browsers (H)

The brief specifies playing a tone "at session end" but does not require the AudioContext to be created or resumed inside a user-gesture handler. Browsers block AudioContext creation (or leave it in `suspended` state) until the user interacts with the page. If the timer auto-starts from a restored localStorage state, the very first tone will be silenced and the browser console will log an uncaught rejection.

Fix: add an explicit rule — "Create the AudioContext lazily, on the first user click on any button. Before playing any tone call `audioCtx.resume()` and await the promise."

Severity: H

### 2. localStorage restore can resurrect a phantom running timer (H)

The brief says "save phase and remaining seconds so reloading mid-session resumes." But if the tab closed while the timer was running, there is no way to know whether 10 seconds or 10 minutes have elapsed since the save. Restoring `remainingSeconds` directly would show a time that is wrong by the gap, and auto-resuming the timer would silently skip break triggers. The brief has no rule for handling elapsed wall-clock time.

Fix: add the rule — "On page load, if a saved `lastSavedAt` timestamp exists, compute elapsed wall-clock seconds since that timestamp, subtract from `remainingSeconds`, clamp to zero, and advance phase if the result is negative. Do not auto-start the timer on load; show a 'Resume' button."

Severity: H

### 3. The reset behavior is ambiguous for the in-progress session (M)

The brief says "reset clears current phase timer back to its starting duration and stops the clock." It does not state whether pressing reset during a break resets the break or resets the entire sequence back to a work session. An agent will have to guess. If it resets to work always, a user who accidentally hits reset during a break gets dropped out of their rhythm unexpectedly.

Fix: state "Reset returns the current phase to its starting duration without changing the phase type. A separate 'Restart' button (or long-press) returns to work-phase 1 if that behavior is wanted; it is not required here."

Severity: M

---

## The single change with the highest leverage

Merge the wall-clock drift rule into the brief. The localStorage restore is the only piece of behavior that can silently corrupt the timer state in a way the user cannot see or recover from. If `remainingSeconds` is restored naively, a user who closes the tab for 20 minutes during a work session will reopen to a timer showing 15 minutes remaining, skip the break they were owed, and accumulate incorrect session counts. The AudioContext gesture problem is also real, but a muted tone is merely annoying; a wrong session count is a data integrity failure. Adding the `lastSavedAt` timestamp and the elapsed-time correction at load takes fewer than 10 lines of JavaScript and closes both the drift problem and the auto-start problem at once.

---

## Fixes merged into brief_v2

The following High-severity findings are merged directly into the executable brief (brief_v2):

- **H1 (AudioContext gesture block)**: added rule — create AudioContext lazily on first button click; call `audioCtx.resume()` and await before every tone play.
- **H2 (localStorage wall-clock drift)**: added rule — save `lastSavedAt` timestamp on every localStorage write; on page load, compute elapsed seconds since `lastSavedAt`, subtract from `remainingSeconds`, clamp to zero, advance phases as needed if the result is negative; do not auto-start on load, show a Resume button instead.

Medium finding M3 (reset phase ambiguity) is carried into the brief as a clarifying constraint: reset returns the current phase to its full starting duration without switching phase type.
Loading