A/B evaluation: align-then-build vs build-direct (with production case study) #2

Open
omerakben wants to merge 3 commits into main from experiment/aha-ab-test

Conversation

@omerakben (Owner) commented Jun 2, 2026

Adds the AHA A/B evaluation as committed evidence under Repos/ab-aha-vs-bare-2026-06-02/. The kit pattern (docs/orchestrator.md cross-model handoff) and the README evaluation pointer are already on main; this PR brings in the underlying study.

What this measures

Same fuzzy prompt, two arms, model held at claude-sonnet-4-6, blind Opus 4.8 referee, runtime-verified in-browser:

  • Arm A (AHA): /aha produces an aligned executable prompt, then a builder runs it.
  • Arm B (bare): a builder runs the raw prompt directly.

Three tasks: Pomodoro, to-do, tip splitter.

Findings

  • AHA cost ~4.6x as much and won 2 of 3 tasks on quality (68 vs 61 out of 75 total, 25 points per task). The wins came from robustness and scope decisions surfaced at the ask gate (persistence, validation, graceful degradation).
  • AHA lost the tip splitter: alignment over-specified penny reconciliation into a self-contradictory spec, the build dropped the remainder (silent money loss), and the simpler bare app scored higher. More spec is not strictly safer.
  • The toy cost is a lower bound on AHA's value. Above the crossover (large multi-layer repos) the alignment toll is fixed while blind-exploration cost dominates, so AHA flips to a cost reducer. A production case study (sanitized) is in RESULTS.md.
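
The remainder-dropping failure in the tip-splitter finding can be illustrated with a minimal sketch of an equal split in integer cents, which is the simplest case of the largest-remainder distribution that the alignment ledger calls for. This is not the code from either arm; `splitCents` and its parameters are hypothetical names chosen for illustration.

```javascript
// Hypothetical helper, not taken from either arm's source.
// Splits an integer number of cents across `people` so the shares sum exactly to the total.
function splitCents(totalCents, people) {
  if (!Number.isInteger(totalCents) || totalCents < 0) {
    throw new RangeError('totalCents must be a non-negative integer');
  }
  if (!Number.isInteger(people) || people < 1) {
    throw new RangeError('people must be a positive integer');
  }
  const base = Math.floor(totalCents / people); // floor share that everyone pays
  const remainder = totalCents % people;        // leftover cents still to be assigned
  // The first `remainder` people pay one extra cent, so no cent is dropped.
  return Array.from({ length: people }, (_, i) => base + (i < remainder ? 1 : 0));
}

const shares = splitCents(1150, 3);
const sum = shares.reduce((a, b) => a + b, 0);
console.log(shares, sum); // [ 384, 383, 383 ] 1150
```

Showing only `base` for every person would under-collect by `remainder` cents, which is the kind of silent loss described above.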

Contents

Both arms' source per task, AHA alignment artifacts (00-aha/), maestro gate answers, blind judge cards + keys, raw token table, runtime screenshots, and RESULTS.md (scorecard + verdict + scope + production case).

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Documentation
    • Added comprehensive experimental results and analysis documentation
    • Included implementation guides and build notes for three web applications: Pomodoro timer, to-do list, and tip splitter
    • Added detailed specifications, methodology documentation, and metrics for comparison evaluation
    • Included test setup details, gate answers, and blind referee assessments

omerakben and others added 3 commits June 2, 2026 14:58
Controlled experiment measuring what the AHA framework buys. 3 fuzzy frontend
tasks (Pomodoro, to-do, tip splitter), both arms on claude-sonnet-4-6, blind
Opus 4.8 referee, runtime-verified in-browser. Tokens summed from per-subagent
transcripts, USD-weighted at Sonnet 4.6 rates.

Result: AHA cost 4.6x and won 2 of 3 on quality (68 vs 61 /75). Wins were
robustness/scope surfaced at the ask gate (persistence, validation, graceful
degradation). Loss (tip splitter): alignment over-specified penny reconciliation
into a self-contradictory spec; the build dropped the remainder (silent money
loss) and the simpler bare app scored higher. AHA reads as a requirements
amplifier, not a coding upgrade; cost figure is an upper bound (isolated
subagents re-read context cold vs a real warm /aha session).

Includes both arms' source per task, AHA alignment artifacts, maestro gate
answers, blind judge cards + keys, raw token table, screenshots, and RESULTS.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The toy battery measured AHA in its weakest regime. Reframe cost as fixed
alignment toll vs variable build-time exploration: on real multi-layer
codebases exploration dominates, so a bounded handoff caps it and AHA flips
from cost-multiplier to cost-reducer past a crossover. The 4.6x figure is a
lower bound on AHA's value, not representative. Win still scales with
alignment correctness.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Operator-reported real-regime result: an AHA cross-model pipeline on a large
production Playwright test suite ran ~30% cheaper than the average of two bare
configurations and produced better tests, including catching a false-positive
test (boolean assertion instead of value-extract + cross-assert) that the bare
run shipped. The AC contract in AHA's handoff is the anti-shortcut guardrail.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 2, 2026 21:00
coderabbitai Bot commented Jun 2, 2026

📝 Walkthrough

This PR documents a comprehensive A/B test comparing an "AHA align-then-build" workflow against direct implementation across three web UI tasks. It includes experimental design, aligned specifications and implementations, bare implementations, blind evaluations, and detailed cost/quality analysis.

Changes

AHA vs. Bare A/B Test Experiment

| Layer | File(s) | Summary |
|---|---|---|
| Experiment Framework & Setup | `README.md`, `gate-answers/task*.md`, `prompts/task*.md` | README introduces the 2026-06-02 A/B test design comparing AHA alignment-first versus direct build. Gate answers define maestro intent for each task (Pomodoro persistence/audio/polish, to-do UX/styling/offline, tip splitter reconciliation/presets). Base prompts state minimal requirements for each task. |
| Task 1 Pomodoro: AHA Alignment Documents | `arm-a/task1/00-aha/*.md` | Pre-build alignment documents: critique identifies AudioContext gesture-blocking, localStorage wall-clock drift, and reset ambiguities; ledger specifies phase timing, SVG rendering, and persistence keys; executable prompt details eval rubric, localStorage example, and no-auto-start rule; handoff packet documents lazy AudioContext resume and elapsed-time correction as AHA delta. |
| Task 1 Pomodoro: AHA Implementation | `arm-a/task1/app/BUILD-NOTES.md`, `arm-a/task1/app/index.html` | Single-file Pomodoro timer implementing alignment spec: localStorage session persistence across reloads, Web Audio lazy-init beeps with mute toggle, SVG circular countdown synced to remaining time, automatic long-break every 4 work sessions, elapsed-time correction on load (no auto-start), and reset behavior preserving session count. |
| Task 1 Pomodoro: Bare Implementation | `arm-b/task1/app/BUILD-NOTES.md`, `arm-b/task1/app/index.html` | Minimal Pomodoro timer: work/break mode tabs, SVG ring countdown, session counter with emoji, toast notifications with auto-switch, no audio/persistence/long-breaks, and no wall-clock drift correction. |
| Task 2 To-do: AHA Alignment Documents | `arm-a/task2/00-aha/*.md` | AHA specification: critique identifies blank-input and Enter-key ambiguities plus completed-task styling gaps; ledger defines UUID-per-task stable identity and filter-not-persisted rule; executable prompt specifies WCAG AA contrast and per-filter empty states; handoff packet adds acceptance checklist for 10 observable conditions. |
| Task 2 To-do: AHA Implementation | `arm-a/task2/app/BUILD-NOTES.md`, `arm-a/task2/app/index.html` | UUID-based to-do app with localStorage validation (parse-error fallback and per-item shape checks), filter tabs with aria-selected sync, Enter-key submission, strikethrough + 0.5 opacity done states, empty-state messaging per filter, and safe textContent-based rendering. |
| Task 2 To-do: Bare Implementation | `arm-b/task2/app/BUILD-NOTES.md`, `arm-b/task2/app/index.html` | Direct-build to-do: task CRUD, All/Active/Completed filters, localStorage persistence without validation, clear-completed button, and active-item count footer without UUID identity guarantees or corrupt-storage safeguards. |
| Task 3 Tip Splitter: AHA Alignment Documents | `arm-a/task3/00-aha/*.md` | AHA pre-build spec: critique merges validation rules (finite/positive/bounded checks) and per-person labeling ("You pay" vs "Each person pays"), ledger defines largest-remainder cent distribution for exact closure, executable prompt details rounding edge cases and input caps (bill ≤ 999,999.99, people ≤ 100), handoff packet provides acceptance-check arithmetic proof. |
| Task 3 Tip Splitter: AHA Implementation | `arm-a/task3/app/BUILD-NOTES.md`, `arm-a/task3/app/index.html` | Bill/tip calculator with preset buttons (15/18/20/25), editable tip field, per-person share via largest-remainder in cents, inline per-field validation (highlights offending fields), error messages, output hidden until valid, and live recalculation on input changes. |
| Task 3 Tip Splitter: Bare Implementation | `arm-b/task3/app/BUILD-NOTES.md`, `arm-b/task3/app/index.html` | Direct-build tip splitter: preset buttons with active styling, custom tip input, non-negative validation, per-person calculation formatted to 2 decimals, error display, and live updates without cent-level reconciliation or output-hiding on invalid state. |
| Results Analysis & Blind Evaluations | `RESULTS.md`, `metrics/judge/*.md`, `metrics/judge/*-blind-key.txt`, `metrics/raw/tokens.md` | Per-task cost/resource tables, blind referee scorecards (Opus 4.8) comparing coverage/correctness/code/UX/robustness, bug observations and verdicts, aggregate metrics (USD, token counts, tool calls), verdict on when AHA alignment pays off (Pomodoro robustness, to-do validation) vs backfires (tip splitter reconciliation defect under-collection), methodology caveats, and production validation showing ~30% cost savings for AHA pipeline in real workloads. |

Estimated Code Review Effort

🎯 2 (Simple) | ⏱️ ~12 minutes


A/B test hops through three task dens,
Align-then-build against bare build's lens,
Pomodoro ticks, to-dos persist with care,
While tip math stumbles—but spreadsheets tell the tale fair! 🥕

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically summarizes the main change: an A/B evaluation comparing two development approaches (align-then-build vs build-direct) with a production case study.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.


@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a comparative study of the AHA framework (Arm A) versus direct building (Arm B) across three toy applications: a Pomodoro timer, a To-do list, and a Tip splitter, including design specs, metrics, and source code. The review feedback highlights several critical issues: double-escaping of HTML entities in accessibility labels in the Arm A To-do app, a potential crash in the Arm B To-do app due to unvalidated localStorage data, a timing bug in the Arm B Pomodoro timer causing phases to run one second too long, and premature validation errors displayed on initial load in the Arm B Tip splitter.


checkbox.type = 'checkbox';
checkbox.className = 'task-checkbox';
checkbox.checked = task.done;
checkbox.setAttribute('aria-label', 'Mark "' + escapeText(task.text) + '" as ' + (task.done ? 'active' : 'complete'));

high

Calling escapeText on task.text before passing it to setAttribute escapes the accessibility label unnecessarily. setAttribute treats its value as a plain string and does not parse it as HTML, so screen readers will announce the literal entity text (e.g., 'A & B' is read as 'A &amp; B'). Passing task.text directly is safe and correct.

Suggested change
-checkbox.setAttribute('aria-label', 'Mark "' + escapeText(task.text) + '" as ' + (task.done ? 'active' : 'complete'));
+checkbox.setAttribute('aria-label', 'Mark "' + task.text + '" as ' + (task.done ? 'active' : 'complete'));
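
To see why the escaping step hurts here, the sketch below compares the two label strings without touching the DOM. The `escapeText` shown is a hypothetical stand-in that assumes a conventional HTML-entity escaper, since the app's real helper is not shown in this review; `checkboxLabel` is likewise an illustrative name.

```javascript
// Hypothetical stand-in for the app's escapeText; the real helper is not shown in this review.
function escapeText(s) {
  return String(s)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Builds the label the way the suggested change does: plain text, no escaping.
function checkboxLabel(task) {
  return 'Mark "' + task.text + '" as ' + (task.done ? 'active' : 'complete');
}

const task = { text: 'A & B', done: false };
const escapedLabel = 'Mark "' + escapeText(task.text) + '" as complete';

console.log(escapedLabel);        // Mark "A &amp; B" as complete
console.log(checkboxLabel(task)); // Mark "A & B" as complete
```

Because setAttribute stores whichever string it receives verbatim, the first string would reach assistive technology with the entity text intact.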

delBtn.type = 'button';
delBtn.className = 'btn-delete';
delBtn.textContent = 'Delete';
delBtn.setAttribute('aria-label', 'Delete "' + escapeText(task.text) + '"');

high

Calling escapeText on task.text before passing it to setAttribute escapes the accessibility label unnecessarily. setAttribute treats its value as a plain string and does not parse it as HTML, so screen readers will announce the literal entity text (e.g., 'A & B' is read as 'A &amp; B'). Passing task.text directly is safe and correct.

Suggested change
-delBtn.setAttribute('aria-label', 'Delete "' + escapeText(task.text) + '"');
+delBtn.setAttribute('aria-label', 'Delete "' + task.text + '"');

Comment on lines +225 to +228
function load() {
  try { tasks = JSON.parse(localStorage.getItem(STORAGE_KEY)) || []; }
  catch { tasks = []; }
}

high

If localStorage contains a non-array JSON value (such as an object or boolean), JSON.parse will succeed but tasks will not be an array. This will cause a hard crash when tasks.filter is called during rendering. Validating that the parsed value is an array ensures robustness.

Suggested change
-function load() {
-  try { tasks = JSON.parse(localStorage.getItem(STORAGE_KEY)) || []; }
-  catch { tasks = []; }
-}
+function load() {
+  try {
+    const parsed = JSON.parse(localStorage.getItem(STORAGE_KEY));
+    tasks = Array.isArray(parsed) ? parsed : [];
+  } catch {
+    tasks = [];
+  }
+}
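
The array check above could be taken one step further with per-item shape checks, as the Arm A to-do app is described as doing. The sketch below is a hypothetical pure helper rather than code from either arm: `parseTasks` and the task shape `{ id, text, done }` are assumptions made for illustration.

```javascript
// Hypothetical pure helper: takes the raw string read from storage and returns a safe task array.
// The task shape { id, text, done } is an assumption for illustration only.
function parseTasks(raw) {
  let parsed;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return []; // malformed JSON
  }
  if (!Array.isArray(parsed)) return []; // object, boolean, number, null, ...
  // Keep only entries that look like tasks; silently drop anything else.
  return parsed.filter(
    (t) =>
      t !== null &&
      typeof t === 'object' &&
      typeof t.id === 'string' &&
      typeof t.text === 'string' &&
      typeof t.done === 'boolean'
  );
}

console.log(parseTasks('{"a":1}').length); // 0
console.log(parseTasks('[{"id":"1","text":"x","done":false},{"text":5}]').length); // 1
```

Keeping the parser free of localStorage access makes it straightforward to exercise with corrupt inputs outside a browser.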

Comment on lines +265 to +275
function tick() {
  if (remaining <= 0) {
    clearInterval(intervalId);
    running = false;
    onPhaseEnd();
    return;
  }
  remaining--;
  renderCountdown();
  renderRing();
}

high

In the current implementation, the timer runs 1 second too long for each phase because the condition remaining <= 0 is checked before remaining is decremented. Decrementing remaining first ensures that the phase transitions exactly when the duration has elapsed.

Suggested change
-function tick() {
-  if (remaining <= 0) {
-    clearInterval(intervalId);
-    running = false;
-    onPhaseEnd();
-    return;
-  }
-  remaining--;
-  renderCountdown();
-  renderRing();
-}
+function tick() {
+  remaining--;
+  if (remaining <= 0) {
+    clearInterval(intervalId);
+    running = false;
+    onPhaseEnd();
+    return;
+  }
+  renderCountdown();
+  renderRing();
+}
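
The off-by-one can be checked with a small simulation that counts ticks instead of waiting on a real interval. This is a sketch under the assumption that the first tick fires one second after the phase starts, as it would with setInterval; `ticksUntilPhaseEnd` is a hypothetical name and the function stands in for both orderings of `tick()`.

```javascript
// Counts how many 1-second ticks elapse before the phase ends, for a phase of `duration` seconds.
// decrementFirst = true models the suggested ordering; false models the original ordering.
function ticksUntilPhaseEnd(duration, decrementFirst) {
  let remaining = duration;
  let ticks = 0;
  while (true) {
    ticks++;
    if (decrementFirst) {
      remaining--;
      if (remaining <= 0) return ticks; // phase ends on the tick that reaches zero
    } else {
      if (remaining <= 0) return ticks; // phase ends one tick after reaching zero
      remaining--;
    }
  }
}

console.log(ticksUntilPhaseEnd(3, false)); // 4
console.log(ticksUntilPhaseEnd(3, true));  // 3
```

For a 3-second phase the original ordering needs four ticks and the suggested ordering needs three, matching the one-second overrun described in the comment.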

Comment on lines +239 to +253
function calculate() {
  billErr.textContent = '';
  tipErr.textContent = '';
  peopleErr.textContent = '';

  const bill = parseFloat(billInput.value);
  const tipPct = parseFloat(tipInput.value);
  const people = parseInt(peopleInput.value, 10);

  let valid = true;

  if (isNaN(bill) || bill < 0) {
    billErr.textContent = 'Enter a valid bill amount.';
    valid = false;
  }

medium

On initial page load, the bill input field is empty, which causes parseFloat(billInput.value) to return NaN. This triggers the validation error 'Enter a valid bill amount.' immediately upon opening the page, which is a poor user experience. Checking if any of the inputs are empty before running the validation logic prevents this issue.

  function calculate() {
    billErr.textContent = '';
    tipErr.textContent  = '';
    peopleErr.textContent = '';

    const billRaw = billInput.value.trim();
    const tipRaw = tipInput.value.trim();
    const peopleRaw = peopleInput.value.trim();

    if (billRaw === '' || tipRaw === '' || peopleRaw === '') {
      outTip.textContent       = '—';
      outTotal.textContent     = '—';
      outPerPerson.textContent = '—';
      return;
    }

    const bill   = parseFloat(billRaw);
    const tipPct = parseFloat(tipRaw);
    const people = parseInt(peopleRaw, 10);

    let valid = true;

    if (isNaN(bill) || bill < 0) {
      billErr.textContent = 'Enter a valid bill amount.';
      valid = false;
    }
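
The empty-before-invalid ordering in the suggestion above can also be expressed as a pure function that is testable without the DOM. This is a hypothetical sketch, not code from the PR: `classifyInputs` is an invented name, and the tip and people rules are assumptions, since only the bill rule appears in the excerpt.

```javascript
// Hypothetical pure helper mirroring the suggested flow: empty fields first, then validation.
// Only the bill rule is taken from the excerpt; the tip and people rules are assumed.
function classifyInputs(billRaw, tipRaw, peopleRaw) {
  const fields = [billRaw, tipRaw, peopleRaw].map((v) => String(v).trim());
  // Any empty field means the user has not finished typing: show placeholders, not errors.
  if (fields.some((v) => v === '')) return { state: 'empty' };

  const bill = parseFloat(fields[0]);
  const tipPct = parseFloat(fields[1]);
  const people = parseInt(fields[2], 10);

  const errors = [];
  if (isNaN(bill) || bill < 0) errors.push('bill');
  if (isNaN(tipPct) || tipPct < 0) errors.push('tip');
  if (isNaN(people) || people < 1) errors.push('people');

  return errors.length > 0
    ? { state: 'invalid', errors }
    : { state: 'ok', bill, tipPct, people };
}

console.log(classifyInputs('', '18', '2').state);    // empty
console.log(classifyInputs('abc', '18', '2').state); // invalid
console.log(classifyInputs('40', '18', '2').state);  // ok
```

A caller would render placeholders for `empty`, per-field messages for `invalid`, and computed amounts for `ok`.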

Copilot AI left a comment

Pull request overview

This PR adds a complete, self-contained A/B evaluation bundle under Repos/ab-aha-vs-bare-2026-06-02/ comparing “align-then-build” (AHA) vs “build-direct” (bare) across three small single-file web-app tasks, including raw metrics, blind-judge cards/keys, runtime notes, and an aggregated results write-up (plus a sanitized production case study narrative).

Changes:

  • Add RESULTS.md and a folder README.md that summarize methodology, per-task outcomes, and aggregate cost/quality findings.
  • Add the exact prompts, AHA ask-gate answers, and raw token/cost tables used for the experiment.
  • Check in both arms’ generated app artifacts (Pomodoro / To-do / Tip splitter) plus the AHA alignment artifacts (00-aha/) for traceability.

Reviewed changes

Copilot reviewed 39 out of 43 changed files in this pull request and generated 5 comments.

File Description
Repos/ab-aha-vs-bare-2026-06-02/RESULTS.md Consolidated per-task + aggregate scorecard and narrative, including “scope/external validity” and production case study.
Repos/ab-aha-vs-bare-2026-06-02/README.md Entry-point documentation for layout, headline findings, and reproduction steps.
Repos/ab-aha-vs-bare-2026-06-02/prompts/task1-pomodoro.md Shared starting prompt for Task 1.
Repos/ab-aha-vs-bare-2026-06-02/prompts/task2-todo.md Shared starting prompt for Task 2.
Repos/ab-aha-vs-bare-2026-06-02/prompts/task3-tip-splitter.md Shared starting prompt for Task 3.
Repos/ab-aha-vs-bare-2026-06-02/gate-answers/task1.md Maestro answers to AHA ask gate for Task 1.
Repos/ab-aha-vs-bare-2026-06-02/gate-answers/task2.md Maestro answers to AHA ask gate for Task 2.
Repos/ab-aha-vs-bare-2026-06-02/gate-answers/task3.md Maestro answers to AHA ask gate for Task 3.
Repos/ab-aha-vs-bare-2026-06-02/metrics/raw/tokens.md Raw per-agent token/cost breakdown used in the write-up.
Repos/ab-aha-vs-bare-2026-06-02/metrics/judge/task1.md Blind referee card for Task 1 (with unsealed mapping).
Repos/ab-aha-vs-bare-2026-06-02/metrics/judge/task1-blind-key.txt Task 1 blind key (X/Y → arm mapping).
Repos/ab-aha-vs-bare-2026-06-02/metrics/judge/task2.md Blind referee card for Task 2 (with mapping).
Repos/ab-aha-vs-bare-2026-06-02/metrics/judge/task2-blind-key.txt Task 2 blind key (X/Y → arm mapping).
Repos/ab-aha-vs-bare-2026-06-02/metrics/judge/task3.md Blind referee card for Task 3 (with mapping).
Repos/ab-aha-vs-bare-2026-06-02/metrics/judge/task3-blind-key.txt Task 3 blind key (X/Y → arm mapping).
Repos/ab-aha-vs-bare-2026-06-02/arm-a/task1/app/index.html Arm A Task 1 app artifact (Pomodoro).
Repos/ab-aha-vs-bare-2026-06-02/arm-a/task1/app/BUILD-NOTES.md Build notes for Arm A Task 1 artifact.
Repos/ab-aha-vs-bare-2026-06-02/arm-a/task1/00-aha/ledger.md AHA ledger (restatement/decisions/assumptions) for Task 1.
Repos/ab-aha-vs-bare-2026-06-02/arm-a/task1/00-aha/critique.md AHA premortem critique for Task 1.
Repos/ab-aha-vs-bare-2026-06-02/arm-a/task1/00-aha/executable-prompt.md AHA executable prompt delivered to the builder for Task 1.
Repos/ab-aha-vs-bare-2026-06-02/arm-a/task1/00-aha/handoff-packet.md AHA handoff packet summarizing Task 1 deliverable/constraints.
Repos/ab-aha-vs-bare-2026-06-02/arm-a/task2/app/index.html Arm A Task 2 app artifact (To-do).
Repos/ab-aha-vs-bare-2026-06-02/arm-a/task2/app/BUILD-NOTES.md Build notes for Arm A Task 2 artifact.
Repos/ab-aha-vs-bare-2026-06-02/arm-a/task2/00-aha/ledger.md AHA ledger for Task 2.
Repos/ab-aha-vs-bare-2026-06-02/arm-a/task2/00-aha/critique.md AHA premortem critique for Task 2.
Repos/ab-aha-vs-bare-2026-06-02/arm-a/task2/00-aha/executable-prompt.md AHA executable prompt delivered to the builder for Task 2.
Repos/ab-aha-vs-bare-2026-06-02/arm-a/task2/00-aha/handoff-packet.md AHA handoff packet summarizing Task 2 deliverable/constraints.
Repos/ab-aha-vs-bare-2026-06-02/arm-a/task3/app/index.html Arm A Task 3 app artifact (Tip splitter).
Repos/ab-aha-vs-bare-2026-06-02/arm-a/task3/app/BUILD-NOTES.md Build notes for Arm A Task 3 artifact.
Repos/ab-aha-vs-bare-2026-06-02/arm-a/task3/00-aha/ledger.md AHA ledger for Task 3.
Repos/ab-aha-vs-bare-2026-06-02/arm-a/task3/00-aha/critique.md AHA premortem critique for Task 3.
Repos/ab-aha-vs-bare-2026-06-02/arm-a/task3/00-aha/executable-prompt.md AHA executable prompt delivered to the builder for Task 3.
Repos/ab-aha-vs-bare-2026-06-02/arm-a/task3/00-aha/handoff-packet.md AHA handoff packet summarizing Task 3 deliverable/constraints.
Repos/ab-aha-vs-bare-2026-06-02/arm-b/task1/app/index.html Arm B Task 1 app artifact (Pomodoro).
Repos/ab-aha-vs-bare-2026-06-02/arm-b/task1/app/BUILD-NOTES.md Build notes for Arm B Task 1 artifact.
Repos/ab-aha-vs-bare-2026-06-02/arm-b/task2/app/index.html Arm B Task 2 app artifact (To-do).
Repos/ab-aha-vs-bare-2026-06-02/arm-b/task2/app/BUILD-NOTES.md Build notes for Arm B Task 2 artifact.
Repos/ab-aha-vs-bare-2026-06-02/arm-b/task3/app/index.html Arm B Task 3 app artifact (Tip splitter).
Repos/ab-aha-vs-bare-2026-06-02/arm-b/task3/app/BUILD-NOTES.md Build notes for Arm B Task 3 artifact.

Comment on lines +40 to +42
## Known gaps

None identified. The rounding display shows the floor (base) per-person value as specified; the prompt does not require showing the split breakdown (some pay +1 cent).
Comment on lines +162 to +166
## Production validation (operator-reported, 2026-06) — the real-regime ceiling

Not lab-controlled. Reported by the operator (Ozzy) from production use. Included as the above-crossover
data point the toy battery could not produce. It is the empirical counterpart to the "Scope and external
validity" model above.
Comment on lines +9 to +11
```
open /Users/ozzy-mac/Projects/aha/Repos/ab-aha-vs-bare-2026-06-02/arm-a/task1/app/index.html
```
Comment on lines +21 to +23
```
open /Users/ozzy-mac/Projects/aha/Repos/ab-aha-vs-bare-2026-06-02/arm-a/task3/app/index.html
```
Comment on lines +29 to +32
## Reproduce
Each agent's exact prompt is in RESULTS.md / this folder. Tokens were summed from each subagent's
transcript JSONL (~/.claude/projects/<proj>/<session>/subagents/agent-*.jsonl), USD-weighted at
Sonnet 4.6 rates. To view an app: `python3 -m http.server` in this dir, open arm-a|b/taskN/app/index.html.
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 10

🧹 Nitpick comments (6)
Repos/ab-aha-vs-bare-2026-06-02/README.md (1)

30-30: ⚡ Quick win

Clarify "this folder" reference.

The statement "Each agent's exact prompt is in RESULTS.md / this folder" is ambiguous. The base prompts are in prompts/, but AHA artifacts (executable prompts, handoff packets) are scattered across arm-a/taskN/00-aha/ subdirectories. Consider specifying the exact locations more precisely for reproducibility, e.g., "base prompts in prompts/, AHA artifacts in arm-a/taskN/00-aha/, and build notes in arm-*/taskN/app/BUILD-NOTES.md."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Repos/ab-aha-vs-bare-2026-06-02/README.md` at line 30, Update the ambiguous
sentence in README.md that reads "Each agent's exact prompt is in RESULTS.md /
this folder" to explicitly list where different artifacts live: reference
RESULTS.md for summarized prompts, the prompts/ directory for base prompts,
arm-a/taskN/00-aha/ for AHA artifacts (executable prompts and handoff packets),
and arm-*/taskN/app/BUILD-NOTES.md for build notes so readers can find each item
unambiguously.
Repos/ab-aha-vs-bare-2026-06-02/RESULTS.md (2)

5-5: 💤 Low value

Clarify the cache-write pricing notation.

The notation "cache-write $3.75/5m or $6/1h" is ambiguous. Does this mean different rates based on request duration, cache TTL, or something else? Consider rephrasing for clarity, e.g., "cache-write $3.75 per MTok (5-minute TTL) or $6 per MTok (1-hour TTL)".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Repos/ab-aha-vs-bare-2026-06-02/RESULTS.md` at line 5, Clarify the ambiguous
"cache-write $3.75/5m or $6/1h" phrase in RESULTS.md by specifying the unit and
what 5m/1h refer to (e.g., TTL or billing interval); replace it with a clear
phrase such as "cache-write $3.75 per MToken (5‑minute TTL) or $6 per MToken
(1‑hour TTL)" or otherwise indicate if the rates apply per request duration,
per-token, or per TTL so readers understand the billing basis.

10-10: ⚡ Quick win

Add blank lines before tables to improve Markdown rendering.

Markdown best practices recommend blank lines before and after tables for consistent rendering across parsers.

📝 Proposed fix

Add a blank line before each table. For example, before line 10:

 ### Cost and resource use (all agents verified claude-sonnet-4-6)
+
 | agent        | USD     | input | output | cache-wr | cache-rd | tools | time |

Repeat for lines 23, 58, 73, and 90.

Also applies to: 23-23, 58-58, 73-73, 90-90

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Repos/ab-aha-vs-bare-2026-06-02/RESULTS.md` at line 10, The Markdown tables
lack a preceding blank line which can break rendering; insert a single blank
line immediately before each table row starting with the pipe header (the line
containing "| agent        | USD     | input | output | cache-wr | cache-rd |
tools | time |") and similarly add a blank line before the other table header
lines referenced (those at the other occurrences) so each table is separated by
an empty line above it.
Repos/ab-aha-vs-bare-2026-06-02/arm-b/task2/app/BUILD-NOTES.md (1)

20-24: 💤 Low value

Optional: Add language identifier to code block.

The static analysis tool flagged the fenced code block for missing a language specifier. Adding shell or bash after the opening fence would improve Markdown rendering, but since this is experiment documentation, the impact is minimal.

Suggested diff
 ## How to open
 
-```
+```shell
 open index.html          # macOS
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Repos/ab-aha-vs-bare-2026-06-02/arm-b/task2/app/BUILD-NOTES.md` around lines
20 - 24, Update the fenced code block in BUILD-NOTES.md by adding a language
identifier (e.g., shell or bash) after the opening triple backticks so it reads
```shell (or ```bash) before the commands `open index.html`, `start index.html`,
and `xdg-open index.html`; this will satisfy the static analysis tool and
improve Markdown rendering.
Repos/ab-aha-vs-bare-2026-06-02/arm-a/task2/00-aha/executable-prompt.md (1)

1-124: 💤 Low value

Optional: Add language identifier to example code block.

The static analysis tool flagged the fenced code block at line 38 for missing a language specifier. Adding html after the opening fence would improve Markdown rendering and satisfy linters, but since this is experiment documentation (not production), the impact is minimal.

Suggested diff
 ### Worked example (sketch only — not the full output)
 
-```
+```html
 <!DOCTYPE html>
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Repos/ab-aha-vs-bare-2026-06-02/arm-a/task2/00-aha/executable-prompt.md`
around lines 1 - 124, The fenced example code block missing a language specifier
should be updated by changing the opening fence from ``` to ```html so the
example (the block that begins with <!DOCTYPE html> in the "Worked example"
section) is explicitly marked as HTML; locate the triple-backtick fence
enclosing that example and add "html" immediately after the opening backticks to
satisfy the linter and improve Markdown rendering.
Repos/ab-aha-vs-bare-2026-06-02/arm-a/task3/00-aha/executable-prompt.md (1)

71-90: 💤 Low value

Add language identifier to fenced code block.

The code block showing the expected HTML structure should specify html as the language identifier for proper syntax highlighting and accessibility.

📝 Suggested fix
-```
+```html
 <!DOCTYPE html>
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Repos/ab-aha-vs-bare-2026-06-02/arm-a/task3/00-aha/executable-prompt.md`
around lines 71 - 90, The fenced code block that shows the HTML snippet needs a
language identifier for proper highlighting; update the opening fence from ```
to ```html in the block containing the DOCTYPE/HTML sample so the snippet is
treated as HTML (the block that begins with the <!DOCTYPE html> line).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@Repos/ab-aha-vs-bare-2026-06-02/arm-a/task3/00-aha/critique.md`:
- Line 37: The spec/behavior that labels the line "Each person pays" and shows
only the base per-person (floor) amount when people > 1 is misleading when a
remainder exists; change the output logic so that when total % people != 0 you
either display a per-person range like "$X–$Y" (e.g., "$3.83–$3.84") or present
an explicit per-person breakdown, replacing the current single "Each person pays
$floor" text; update the code/path that emits the "Each person pays" label and
its amount formatting to compute remainder = total - floor*people and
conditionally render the range or breakdown instead of the floor-only value.

In `@Repos/ab-aha-vs-bare-2026-06-02/arm-b/task1/app/BUILD-NOTES.md`:
- Line 23: Update the documentation and implementation to reflect the required
long-break behavior: change the BUILD-NOTES.md line that currently says "No
long-break support" to state that a 15-minute long break is scheduled after
every 4 completed work sessions, matching the gate-answers in task1.md; then
check and, if missing, implement long-break logic in the scheduling code by
updating the session tracking flow (functions like recordSessionCompletion,
getNextBreakDuration or scheduleBreaks/startSession) to increment a work-session
counter and return a 15-minute break after every 4th completed work session
while resetting the counter.

In `@Repos/ab-aha-vs-bare-2026-06-02/arm-b/task1/app/index.html`:
- Around line 265-275: The tick() function has an off-by-one: the remaining
counter is checked before decrementing, so a phase shows 00:00 for one extra
second; fix by decrementing remaining first, then calling renderCountdown() and
renderRing(), and only then checking remaining <= 0, at which point tick()
should call clearInterval(intervalId), set running = false and call
onPhaseEnd(). Changing the condition to remaining < 0 is not a fix; it would
delay the phase end by one more tick.
- Around line 362-369: renderSessions() currently clears tomatoWrap and
recreates every 'tomato' span each call, causing the pop animation to replay for
all items; change it to only append newly added tomatoes by tracking the last
rendered count (e.g., a module-scoped lastRenderedSessions variable) and in
renderSessions() loop from lastRenderedSessions to sessions to create and append
only new span elements (set className 'tomato' and textContent '🍅'), then
update lastRenderedSessions; alternatively, if you prefer to keep full
re-rendering, locate the newly created element after adding and add/remove the
animation class only on that element instead of re-adding it to all existing
spans.
- Around line 323-341: The setMode function currently calls resetTimer()
whenever a tab is clicked, which silently discards progress; change it so it
does not reset when the clicked tab is the already-active mode and, when
switching modes while a timer is running, require an explicit confirmation
(e.g., showConfirm) before calling resetTimer(); use the existing mode variable
to compare the new mode (m) to the current mode and only proceed with
resetTimer() after user confirmation when m !== mode and running is true,
otherwise simply update UI state without discarding remaining progress.

In `@Repos/ab-aha-vs-bare-2026-06-02/arm-b/task3/app/index.html`:
- Around line 270-276: The per-person value uses floating-point division
(perPerson = total / people) which loses cents; change to cent-based integer
arithmetic: convert bill and tip to cents (use Math.round) to compute
tipAmtCents, totalCents, then compute baseShareCents = Math.floor(totalCents /
people) and remainder = totalCents % people and produce per-person shares that
distribute the remainder (e.g., add 1 cent to the first remainder people) so the
sum of per-person shares equals totalCents; update the code paths that reference
tipAmt, total and outPerPerson (and the fmt function output) to use these
cent-based values (divide by 100 only for display) so rounding discrepancies are
eliminated.
- Line 190: Remove the hardcoded default values that cause calculations on load:
delete the value="18" from the tip input (id="tip-pct") and delete the default
value on the people input (id for people input in the same form). Also remove
the active state/class from the 18% preset button so no preset appears selected
on load. Finally update the initial render/validation logic (the function that
runs calculations on input — e.g., calculateTip/renderResults/validateInputs) to
treat empty inputs as invalid and skip calculations until all three fields
contain valid numbers, so the results area stays blank on page load.
- Around line 250-253: The bill validation currently allows zero; update the
check that uses the bill variable so it rejects zero as well (change the
condition from "bill < 0" to "bill <= 0") and adjust the error text on billErr
to indicate the bill must be greater than zero (e.g., "Enter a bill amount
greater than zero."); ensure the valid flag is still set to false when this
check fails so the form behaves correctly.
- Around line 183-187: The preset buttons include an extra 10% button that
violates the experiment protocol; remove the <button> element with class
"preset-btn" and data-pct="10" so only the four presets remain (data-pct="15",
"18", "20", "25"), and ensure the "active" class is applied to the intended
default preset (e.g., the element with data-pct="18" if that should be default).
Locate elements by class "preset-btn" and data-pct attributes to update the
markup and any JavaScript that selects/initializes the active preset.
- Around line 258-261: The people validation currently accepts truncated
decimals because parseInt is used; modify the check to reject non-integer
values by validating the raw input rather than the parsed number (e.g., use
Number.isInteger(Number(peopleInput.value)) or a /^\d+$/ check against
peopleInput.value.trim(); Number.isInteger(people) is always true once parseInt
has succeeded) in addition to the existing isNaN/positive checks, and update
peopleErr.textContent to state "At least 1 person required and must be an
integer." Ensure references to people, peopleInput, and peopleErr are used so
the change is applied in the same validation block.

---

Nitpick comments:
In `@Repos/ab-aha-vs-bare-2026-06-02/arm-a/task2/00-aha/executable-prompt.md`:
- Around line 1-124: The fenced example code block missing a language specifier
should be updated by changing the opening fence from ``` to ```html so the
example (the block that begins with <!DOCTYPE html> in the "Worked example"
section) is explicitly marked as HTML; locate the triple-backtick fence
enclosing that example and add "html" immediately after the opening backticks to
satisfy the linter and improve Markdown rendering.

In `@Repos/ab-aha-vs-bare-2026-06-02/arm-a/task3/00-aha/executable-prompt.md`:
- Around line 71-90: The fenced code block that shows the HTML snippet needs a
language identifier for proper highlighting; update the opening fence from ```
to ```html in the block containing the DOCTYPE/HTML sample so the snippet is
treated as HTML (the block that begins with the <!DOCTYPE html> line).

In `@Repos/ab-aha-vs-bare-2026-06-02/arm-b/task2/app/BUILD-NOTES.md`:
- Around line 20-24: Update the fenced code block in BUILD-NOTES.md by adding a
language identifier (e.g., shell or bash) after the opening triple backticks so
it reads ```shell (or ```bash) before the commands `open index.html`, `start
index.html`, and `xdg-open index.html`; this will satisfy the static analysis
tool and improve Markdown rendering.

In `@Repos/ab-aha-vs-bare-2026-06-02/README.md`:
- Line 30: Update the ambiguous sentence in README.md that reads "Each agent's
exact prompt is in RESULTS.md / this folder" to explicitly list where different
artifacts live: reference RESULTS.md for summarized prompts, the prompts/
directory for base prompts, arm-a/taskN/00-aha/ for AHA artifacts (executable
prompts and handoff packets), and arm-*/taskN/app/BUILD-NOTES.md for build notes
so readers can find each item unambiguously.

In `@Repos/ab-aha-vs-bare-2026-06-02/RESULTS.md`:
- Line 5: Clarify the ambiguous "cache-write $3.75/5m or $6/1h" phrase in
RESULTS.md by specifying the unit and what 5m/1h refer to (e.g., TTL or billing
interval); replace it with a clear phrase such as "cache-write $3.75 per MToken
(5‑minute TTL) or $6 per MToken (1‑hour TTL)" or otherwise indicate if the rates
apply per request duration, per-token, or per TTL so readers understand the
billing basis.
- Line 10: The Markdown tables lack a preceding blank line which can break
rendering; insert a single blank line immediately before each table row starting
with the pipe header (the line containing "| agent        | USD     | input |
output | cache-wr | cache-rd | tools | time |") and similarly add a blank line
before the other table header lines referenced (those at the other occurrences)
so each table is separated by an empty line above it.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3753aaa9-a496-4ae1-89a8-dc075fc3dcd0

📥 Commits

Reviewing files that changed from the base of the PR and between 8ab2e10 and f192870.

⛔ Files ignored due to path filters (4)
  • Repos/ab-aha-vs-bare-2026-06-02/metrics/screenshots/abtest-task1-armA-aha.png is excluded by !**/*.png
  • Repos/ab-aha-vs-bare-2026-06-02/metrics/screenshots/abtest-task1-armB-bare.png is excluded by !**/*.png
  • Repos/ab-aha-vs-bare-2026-06-02/metrics/screenshots/abtest-task3-armA-aha.png is excluded by !**/*.png
  • Repos/ab-aha-vs-bare-2026-06-02/metrics/screenshots/abtest-task3-armB-bare.png is excluded by !**/*.png
📒 Files selected for processing (39)
  • Repos/ab-aha-vs-bare-2026-06-02/README.md
  • Repos/ab-aha-vs-bare-2026-06-02/RESULTS.md
  • Repos/ab-aha-vs-bare-2026-06-02/arm-a/task1/00-aha/critique.md
  • Repos/ab-aha-vs-bare-2026-06-02/arm-a/task1/00-aha/executable-prompt.md
  • Repos/ab-aha-vs-bare-2026-06-02/arm-a/task1/00-aha/handoff-packet.md
  • Repos/ab-aha-vs-bare-2026-06-02/arm-a/task1/00-aha/ledger.md
  • Repos/ab-aha-vs-bare-2026-06-02/arm-a/task1/app/BUILD-NOTES.md
  • Repos/ab-aha-vs-bare-2026-06-02/arm-a/task1/app/index.html
  • Repos/ab-aha-vs-bare-2026-06-02/arm-a/task2/00-aha/critique.md
  • Repos/ab-aha-vs-bare-2026-06-02/arm-a/task2/00-aha/executable-prompt.md
  • Repos/ab-aha-vs-bare-2026-06-02/arm-a/task2/00-aha/handoff-packet.md
  • Repos/ab-aha-vs-bare-2026-06-02/arm-a/task2/00-aha/ledger.md
  • Repos/ab-aha-vs-bare-2026-06-02/arm-a/task2/app/BUILD-NOTES.md
  • Repos/ab-aha-vs-bare-2026-06-02/arm-a/task2/app/index.html
  • Repos/ab-aha-vs-bare-2026-06-02/arm-a/task3/00-aha/critique.md
  • Repos/ab-aha-vs-bare-2026-06-02/arm-a/task3/00-aha/executable-prompt.md
  • Repos/ab-aha-vs-bare-2026-06-02/arm-a/task3/00-aha/handoff-packet.md
  • Repos/ab-aha-vs-bare-2026-06-02/arm-a/task3/00-aha/ledger.md
  • Repos/ab-aha-vs-bare-2026-06-02/arm-a/task3/app/BUILD-NOTES.md
  • Repos/ab-aha-vs-bare-2026-06-02/arm-a/task3/app/index.html
  • Repos/ab-aha-vs-bare-2026-06-02/arm-b/task1/app/BUILD-NOTES.md
  • Repos/ab-aha-vs-bare-2026-06-02/arm-b/task1/app/index.html
  • Repos/ab-aha-vs-bare-2026-06-02/arm-b/task2/app/BUILD-NOTES.md
  • Repos/ab-aha-vs-bare-2026-06-02/arm-b/task2/app/index.html
  • Repos/ab-aha-vs-bare-2026-06-02/arm-b/task3/app/BUILD-NOTES.md
  • Repos/ab-aha-vs-bare-2026-06-02/arm-b/task3/app/index.html
  • Repos/ab-aha-vs-bare-2026-06-02/gate-answers/task1.md
  • Repos/ab-aha-vs-bare-2026-06-02/gate-answers/task2.md
  • Repos/ab-aha-vs-bare-2026-06-02/gate-answers/task3.md
  • Repos/ab-aha-vs-bare-2026-06-02/metrics/judge/task1-blind-key.txt
  • Repos/ab-aha-vs-bare-2026-06-02/metrics/judge/task1.md
  • Repos/ab-aha-vs-bare-2026-06-02/metrics/judge/task2-blind-key.txt
  • Repos/ab-aha-vs-bare-2026-06-02/metrics/judge/task2.md
  • Repos/ab-aha-vs-bare-2026-06-02/metrics/judge/task3-blind-key.txt
  • Repos/ab-aha-vs-bare-2026-06-02/metrics/judge/task3.md
  • Repos/ab-aha-vs-bare-2026-06-02/metrics/raw/tokens.md
  • Repos/ab-aha-vs-bare-2026-06-02/prompts/task1-pomodoro.md
  • Repos/ab-aha-vs-bare-2026-06-02/prompts/task2-todo.md
  • Repos/ab-aha-vs-bare-2026-06-02/prompts/task3-tip-splitter.md


Display rules:
- When people = 1, show the per-person line but label it "You pay" instead of "Per person."
- When people > 1, label the line "Each person pays" and show the per-person amount (note that amounts may differ by $0.01 between people due to remainder distribution; display only the base per-person amount unless showing a breakdown is requested — it is not requested here, so show the single rounded-down amount).

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Specification creates misleading per-person display.

The merged brief specifies displaying "only the base per-person amount" (the floor value), but when a remainder exists, some people pay floor + $0.01. For the example in executable-prompt.md (total=$11.50, 3 people): floor=$3.83, but the actual distribution is [$3.84, $3.83, $3.83]. Displaying "Each person pays $3.83" is accurate for two of the three payers and wrong for the third. This contradicts the promise of "exact reconciliation" from the user's perspective: they see $3.83, but 3 × $3.83 = $11.49, so the total requires someone to pay $3.84.

This aligns with the PR summary noting that AHA "lost the tip-splitter task due to an over-specified, self-contradictory alignment that caused silent money loss when the build dropped the remainder."

For an experiment documenting failure modes, this is acceptable as-is. For production, the display should either show the range ("$3.83-$3.84") or present a per-person breakdown.
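
The range display suggested above could look roughly like the following. This is a minimal sketch, assuming the total is already held as integer cents; `formatCents` and `perPersonLabel` are hypothetical helper names and do not come from either arm's source.

```javascript
// Hypothetical helpers; neither name comes from the arm-a or arm-b source.

// Format an integer number of cents as a dollar string, e.g. 383 -> "$3.83".
function formatCents(cents) {
  const dollars = Math.floor(cents / 100);
  const rest = String(cents % 100).padStart(2, "0");
  return `$${dollars}.${rest}`;
}

// Build the per-person line for a total (integer cents) split among `people`.
function perPersonLabel(totalCents, people) {
  if (people === 1) return `You pay ${formatCents(totalCents)}`;
  const floorCents = Math.floor(totalCents / people);
  const remainder = totalCents - floorCents * people; // cents left after the floor split
  if (remainder === 0) return `Each person pays ${formatCents(floorCents)}`;
  // Some payers owe one extra cent, so show both amounts as a range.
  return `Each person pays ${formatCents(floorCents)}–${formatCents(floorCents + 1)}`;
}

console.log(perPersonLabel(1150, 3)); // Each person pays $3.83–$3.84
```

The range only appears when a remainder exists, so evenly divisible totals keep the single-amount line the spec already describes.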

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Repos/ab-aha-vs-bare-2026-06-02/arm-a/task3/00-aha/critique.md` at line 37,
The spec/behavior that labels the line "Each person pays" and shows only the
base per-person (floor) amount when people > 1 is misleading when a remainder
exists; change the output logic so that when total % people != 0 you either
display a per-person range like "$X–$Y" (e.g., "$3.83–$3.84") or present an
explicit per-person breakdown, replacing the current single "Each person pays
$floor" text; update the code/path that emits the "Each person pays" label and
its amount formatting to compute remainder = total - floor*people and
conditionally render the range or breakdown instead of the floor-only value.


- No audio alert when a session ends (browser autoplay policy blocks it without user interaction; a future version could use the Web Audio API to generate a tone on the first user click).
- Session count is not persisted across page reloads (localStorage could be added trivially).
- No long-break support (every 4th session conventionally triggers a 15-minute break — omitted to match the prompt spec).

⚠️ Potential issue | 🟠 Major

Spec interpretation discrepancy.

Line 23 states long-break was "omitted to match the prompt spec," but the gate-answers (task1.md) explicitly require: "Long break after 4 sessions? -> Yes. Standard 15-minute long break automatically after every 4 completed work sessions."

This represents a spec interpretation gap where the bare builder did not fully capture the stated requirements.

Note: This is an observational comment documenting a key experimental finding—that the alignment phase (AHA) successfully captured requirements that the bare approach missed. It is not a fix request for this research artifact.
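
For reference, the gate-answer rule (a 15-minute long break after every 4 completed work sessions) can be sketched as a small pure function. The 15-minute value comes from the gate answer; the 5-minute short break and all names below are assumptions for illustration, not taken from the app.

```javascript
// 15 minutes comes from the gate answer; the 5-minute short break is a
// conventional Pomodoro value and an assumption here.
const SHORT_BREAK_SECS = 5 * 60;
const LONG_BREAK_SECS = 15 * 60;
const SESSIONS_PER_LONG_BREAK = 4;

// Given the number of work sessions completed so far (including the one that
// just ended), return the length in seconds of the break that should follow.
function nextBreakSecs(completedSessions) {
  const isLongBreak =
    completedSessions > 0 && completedSessions % SESSIONS_PER_LONG_BREAK === 0;
  return isLongBreak ? LONG_BREAK_SECS : SHORT_BREAK_SECS;
}

console.log([1, 2, 3, 4, 5, 8].map(nextBreakSecs)); // [300, 300, 300, 900, 300, 900]
```

Using the running session count with a modulo avoids a separate counter that has to be reset after each long break.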

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Repos/ab-aha-vs-bare-2026-06-02/arm-b/task1/app/BUILD-NOTES.md` at line 23,
Update the documentation and implementation to reflect the required long-break
behavior: change the BUILD-NOTES.md line that currently says "No long-break
support" to state that a 15-minute long break is scheduled after every 4
completed work sessions, matching the gate-answers in task1.md; then check and,
if missing, implement long-break logic in the scheduling code by updating the
session tracking flow (functions like recordSessionCompletion,
getNextBreakDuration or scheduleBreaks/startSession) to increment a work-session
counter and return a 15-minute break after every 4th completed work session
while resetting the counter.

Comment on lines +265 to +275
function tick() {
if (remaining <= 0) {
clearInterval(intervalId);
running = false;
onPhaseEnd();
return;
}
remaining--;
renderCountdown();
renderRing();
}

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Timing bug: phase runs ~1 second long.

The check if (remaining <= 0) occurs before decrementing, so when remaining=1, the timer decrements to 0, displays 00:00 for one second, then fires onPhaseEnd() on the next tick. Each phase runs approximately 1 second longer than specified.

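This is an off-by-one error. The decrement (and re-render) should occur before the zero check, so that the tick that reaches zero also ends the phase. Changing the check to < 0 alone would not help; it would delay onPhaseEnd() by a further tick.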

Note: This comment documents a behavioral difference between the bare and AHA arms that supports the experiment's findings. It is not a fix request for this research artifact.
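
A decrement-first version of the loop might look like the sketch below. It is written as a self-contained factory so it can be driven by hand; `createCountdown` and its parameters are hypothetical, whereas the real `tick()` works on module-level variables and is scheduled with `setInterval`.

```javascript
// Hypothetical factory; the real tick() in arm-b uses module-level variables.
function createCountdown(totalSecs, onPhaseEnd, render = () => {}) {
  let remaining = totalSecs;
  let running = true;

  // Called once per second by setInterval in the real page.
  function tick() {
    if (!running) return;
    remaining--;          // decrement first
    render(remaining);    // show the new value
    if (remaining <= 0) { // end the phase on the same tick that reaches zero
      running = false;
      onPhaseEnd();
    }
  }

  return { tick, isRunning: () => running };
}

// Drive the countdown by hand instead of with setInterval.
let ticks = 0;
let ended = false;
const timer = createCountdown(3, () => { ended = true; });
while (timer.isRunning()) {
  timer.tick();
  ticks++;
}
console.log(ticks, ended); // 3 true
```

A 3-second phase ends after exactly 3 ticks here, where the check-then-decrement order would need a fourth.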

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Repos/ab-aha-vs-bare-2026-06-02/arm-b/task1/app/index.html` around lines 265
- 275, The tick() function has an off-by-one: the remaining counter is checked
before decrementing, so a phase shows 00:00 for one extra second; fix by
decrementing remaining first, then calling renderCountdown() and renderRing(),
and only then checking remaining <= 0, at which point tick() should call
clearInterval(intervalId), set running = false and call onPhaseEnd(). Changing
the condition to remaining < 0 is not a fix; it would delay the phase end by one
more tick.

Comment on lines +323 to +341
function setMode(m) {
if (running) resetTimer(); // stop current run when switching
mode = m;
totalSecs = (m === 'work') ? WORK_SECS : BREAK_SECS;
remaining = totalSecs;

phaseLabel.textContent = (m === 'work') ? 'Work' : 'Break';

tabWork.className = 'mode-tab' + (m === 'work' ? ' active-work' : '');
tabBreak.className = 'mode-tab' + (m === 'break' ? ' active-break' : '');

updateRingColor();
renderCountdown();
renderRing();

btnStart.textContent = 'Start';
btnStart.classList.remove('btn-pause');
btnStart.classList.add('btn-start');
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Destructive mode switch without confirmation.

Line 324 resets the running timer when the user clicks a mode tab, silently discarding progress without confirmation. This includes clicking the currently-active tab, which unexpectedly resets the session.

This violates the principle of least astonishment: users expect mode tabs to indicate state, not destructively reset progress.

Note: This comment documents a robustness gap that distinguishes the bare implementation from the AHA implementation (which explicitly clarified reset behavior during critique). It is part of the experiment's documented findings and not a fix request.
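
The decision described above can be isolated from the DOM code as a small pure function. This is only a sketch: `modeSwitchAction` and its return values are hypothetical and are not part of either arm's implementation.

```javascript
// Hypothetical pure helper: decide what a click on a mode tab should do.
// Returns "ignore", "confirm", or "switch".
function modeSwitchAction(currentMode, requestedMode, running) {
  if (requestedMode === currentMode) return "ignore"; // active tab: keep progress
  if (running) return "confirm"; // switching would discard a running session
  return "switch";
}

console.log(modeSwitchAction("work", "work", true));   // ignore
console.log(modeSwitchAction("work", "break", true));  // confirm
console.log(modeSwitchAction("work", "break", false)); // switch
```

In `setMode`, a `"confirm"` result could gate the call to `resetTimer()` behind the browser's `window.confirm()` dialog, while `"ignore"` returns early without touching `remaining`.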

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Repos/ab-aha-vs-bare-2026-06-02/arm-b/task1/app/index.html` around lines 323
- 341, The setMode function currently calls resetTimer() whenever a tab is
clicked, which silently discards progress; change it so it does not reset when
the clicked tab is the already-active mode and, when switching modes while a
timer is running, require an explicit confirmation (e.g., showConfirm) before
calling resetTimer(); use the existing mode variable to compare the new mode (m)
to the current mode and only proceed with resetTimer() after user confirmation
when m !== mode and running is true, otherwise simply update UI state without
discarding remaining progress.

Comment on lines +362 to +369
function renderSessions() {
tomatoWrap.innerHTML = '';
for (let i = 0; i < sessions; i++) {
const span = document.createElement('span');
span.className = 'tomato';
span.textContent = '🍅';
tomatoWrap.appendChild(span);
}

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Animation replays on all tomatoes, not just new ones.

renderSessions() recreates all tomato spans on each call, triggering the pop animation on every existing tomato instead of only the newly added one.

To fix, either:

  1. Only append the new tomato (track last rendered count)
  2. Remove and re-add animation class only on the new element

Note: Minor visual polish issue documented for completeness. Part of bare-arm evaluation, not a fix request.
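
Option 1 could be sketched as follows. The factory shape and the names `createSessionRenderer`, `makeTomato` and `lastRendered` are assumptions for illustration; the container only needs the standard DOM methods `appendChild` and `replaceChildren`.

```javascript
// Hypothetical factory. `container` needs appendChild and replaceChildren
// (both standard DOM Element methods); `makeTomato` builds one new element.
function createSessionRenderer(container, makeTomato) {
  let lastRendered = 0; // how many tomatoes are already in the container

  return function renderSessions(sessions) {
    if (sessions < lastRendered) {
      // Count went down (for example after a reset): start from empty.
      container.replaceChildren();
      lastRendered = 0;
    }
    // Append only the tomatoes that are not on screen yet, so the CSS pop
    // animation runs once per new element.
    for (let i = lastRendered; i < sessions; i++) {
      container.appendChild(makeTomato());
    }
    lastRendered = sessions;
  };
}

// In the page this might be wired up as:
// const renderSessions = createSessionRenderer(tomatoWrap, () => {
//   const span = document.createElement('span');
//   span.className = 'tomato';
//   span.textContent = '🍅';
//   return span;
// });
```

Existing spans are never removed and re-added while the count grows, so their animation does not restart.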

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Repos/ab-aha-vs-bare-2026-06-02/arm-b/task1/app/index.html` around lines 362
- 369, renderSessions() currently clears tomatoWrap and recreates every 'tomato'
span each call, causing the pop animation to replay for all items; change it to
only append newly added tomatoes by tracking the last rendered count (e.g., a
module-scoped lastRenderedSessions variable) and in renderSessions() loop from
lastRenderedSessions to sessions to create and append only new span elements
(set className 'tomato' and textContent '🍅'), then update lastRenderedSessions;
alternatively, if you prefer to keep full re-rendering, locate the newly created
element after adding and add/remove the animation class only on that element
instead of re-adding it to all existing spans.

Comment on lines +183 to +187
<button class="preset-btn" data-pct="10">10%</button>
<button class="preset-btn" data-pct="15">15%</button>
<button class="preset-btn active" data-pct="18">18%</button>
<button class="preset-btn" data-pct="20">20%</button>
<button class="preset-btn" data-pct="25">25%</button>

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Preset buttons deviate from the experiment protocol.

The bare implementation renders 5 preset buttons (10%, 15%, 18%, 20%, 25%), but the source prompt specifies exactly 4 presets: 15%, 18%, 20%, and 25%. This inconsistency breaks the A/B test comparison, as arm-a implements 4 presets per spec while arm-b implements 5.

🐛 Proposed fix to match protocol
     <div class="presets" id="presets">
-      <button class="preset-btn" data-pct="10">10%</button>
       <button class="preset-btn" data-pct="15">15%</button>
       <button class="preset-btn active" data-pct="18">18%</button>
       <button class="preset-btn" data-pct="20">20%</button>
       <button class="preset-btn" data-pct="25">25%</button>
     </div>
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- <button class="preset-btn" data-pct="10">10%</button>
- <button class="preset-btn" data-pct="15">15%</button>
- <button class="preset-btn active" data-pct="18">18%</button>
- <button class="preset-btn" data-pct="20">20%</button>
- <button class="preset-btn" data-pct="25">25%</button>
+ <button class="preset-btn" data-pct="15">15%</button>
+ <button class="preset-btn active" data-pct="18">18%</button>
+ <button class="preset-btn" data-pct="20">20%</button>
+ <button class="preset-btn" data-pct="25">25%</button>
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Repos/ab-aha-vs-bare-2026-06-02/arm-b/task3/app/index.html` around lines 183
- 187, The preset buttons include an extra 10% button that violates the
experiment protocol; remove the <button> element with class "preset-btn" and
data-pct="10" so only the four presets remain (data-pct="15", "18", "20", "25"),
and ensure the "active" class is applied to the intended default preset (e.g.,
the element with data-pct="18" if that should be default). Locate elements by
class "preset-btn" and data-pct attributes to update the markup and any
JavaScript that selects/initializes the active preset.

<button class="preset-btn" data-pct="25">25%</button>
</div>
<div class="tip-row">
<input id="tip-pct" type="number" min="0" max="100" step="1" value="18" />

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Default input values contradict specification.

The tip input defaults to value="18" and people defaults to value="2", but the specification states "Inputs default to empty; calculations run on every keystroke once all three fields are valid numbers" (ledger.md line 12). This causes the results section to display calculated values on page load, before the user provides a bill amount.

While defaults may improve UX by showing an example calculation, they deviate from the experimental protocol and make the comparison with arm-a (which uses empty defaults) inconsistent.

🔄 Proposed fix to match protocol
-      <input id="tip-pct" type="number" min="0" max="100" step="1" value="18" />
+      <input id="tip-pct" type="number" min="0" max="100" step="1" placeholder="0" />
-    <input id="people" type="number" min="1" step="1" value="2" />
+    <input id="people" type="number" min="1" step="1" placeholder="1" />

Also remove the active state from the 18% preset button:

-      <button class="preset-btn active" data-pct="18">18%</button>
+      <button class="preset-btn" data-pct="18">18%</button>

And update the initial render logic to handle empty inputs:

   billInput.addEventListener('input', calculate);
   peopleInput.addEventListener('input', calculate);
 
-  // Initial render
-  calculate();
+  // Initial render (only once every field has a value)
+  if (billInput.value && tipInput.value && peopleInput.value) {
+    calculate();
+  }

Also applies to: 199-199

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Repos/ab-aha-vs-bare-2026-06-02/arm-b/task3/app/index.html` at line 190,
Remove the hardcoded default values that cause calculations on load: delete the
value="18" from the tip input (id="tip-pct") and delete the default value on the
people input (id for people input in the same form). Also remove the active
state/class from the 18% preset button so no preset appears selected on load.
Finally update the initial render/validation logic (the function that runs
calculations on input — e.g., calculateTip/renderResults/validateInputs) to
treat empty inputs as invalid and skip calculations until all three fields
contain valid numbers, so the results area stays blank on page load.

Comment on lines +250 to +253
if (isNaN(bill) || bill < 0) {
billErr.textContent = 'Enter a valid bill amount.';
valid = false;
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Bill validation allows zero, violating specification.

The validation accepts bill >= 0, but the specification requires bill to be strictly greater than zero. A bill of $0.00 is nonsensical for a tip calculator.

🔧 Proposed fix
-    if (isNaN(bill) || bill < 0) {
+    if (isNaN(bill) || bill <= 0) {
       billErr.textContent = 'Enter a valid bill amount.';
       valid = false;
     }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- if (isNaN(bill) || bill < 0) {
+ if (isNaN(bill) || bill <= 0) {
  billErr.textContent = 'Enter a valid bill amount.';
  valid = false;
  }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Repos/ab-aha-vs-bare-2026-06-02/arm-b/task3/app/index.html` around lines 250
- 253, The bill validation currently allows zero; update the check that uses the
bill variable so it rejects zero as well (change the condition from "bill < 0"
to "bill <= 0") and adjust the error text on billErr to indicate the bill must
be greater than zero (e.g., "Enter a bill amount greater than zero."); ensure
the valid flag is still set to false when this check fails so the form behaves
correctly.

Comment on lines +258 to +261
if (isNaN(people) || people < 1) {
peopleErr.textContent = 'At least 1 person required.';
valid = false;
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

People validation lacks integer check.

The validation uses parseInt(peopleInput.value, 10) which truncates decimals. Input "1.5" would be accepted as 1 person. The specification requires a positive integer, and arm-a implements a robust integer check (lines 435-436) that rejects decimal inputs.

✅ Proposed fix to enforce integer input
     const people = parseInt(peopleInput.value, 10);
 
-    if (isNaN(people) || people < 1) {
+    if (isNaN(people) || people < 1 || people > 100 || 
+        String(people) !== peopleInput.value.trim() || 
+        peopleInput.value.includes('.')) {
       peopleErr.textContent = 'At least 1 person required.';
       valid = false;
     }

Update the error message to reflect the full validation:

-      peopleErr.textContent = 'At least 1 person required.';
+      peopleErr.textContent = 'Enter a whole number between 1 and 100.';
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- if (isNaN(people) || people < 1) {
- peopleErr.textContent = 'At least 1 person required.';
- valid = false;
- }
+ if (isNaN(people) || people < 1 || people > 100 ||
+ String(people) !== peopleInput.value.trim() ||
+ peopleInput.value.includes('.')) {
+ peopleErr.textContent = 'Enter a whole number between 1 and 100.';
+ valid = false;
+ }
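
An alternative to comparing the parsed value against the raw string is to validate the raw text directly. The sketch below assumes a helper named `parsePeople` (hypothetical, not in either arm) and leaves out the upper bound of 100 used in the proposed fix.

```javascript
// Hypothetical helper: parse the raw text of the people field.
// Returns a positive integer, or null when the input is not a whole number >= 1.
function parsePeople(raw) {
  const text = String(raw).trim();
  if (!/^\d+$/.test(text)) return null; // rejects "", "1.5", "-2", "2e3", "abc"
  const people = Number(text);
  return Number.isSafeInteger(people) && people >= 1 ? people : null;
}

console.log(parsePeople("3"), parsePeople("1.5"), parsePeople("0")); // 3 null null
```

Checking the string before converting avoids the `parseInt` behavior where "1.5" silently becomes 1.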
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Repos/ab-aha-vs-bare-2026-06-02/arm-b/task3/app/index.html` around lines 258
- 261, The people validation currently accepts truncated decimals because
parseInt is used; modify the check to reject non-integer values by validating
the raw input rather than the parsed number (e.g., use
Number.isInteger(Number(peopleInput.value)) or a /^\d+$/ check against
peopleInput.value.trim(); Number.isInteger(people) is always true once parseInt
has succeeded) in addition to the existing isNaN/positive checks, and update
peopleErr.textContent to state "At least 1 person required and must be an
integer." Ensure references to people, peopleInput, and peopleErr are used so
the change is applied in the same validation block.

Comment on lines +270 to +276
const tipAmt = bill * (tipPct / 100);
const total = bill + tipAmt;
const perPerson = total / people;

outTip.textContent = fmt(tipAmt);
outTotal.textContent = fmt(total);
outPerPerson.textContent = fmt(perPerson);

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Floating-point division loses cents in per-person calculation.

The implementation uses total / people and formats with toFixed(2), which can create rounding discrepancies. For example, with bill=$10.00, tip=15% ($1.50), total=$11.50, people=3:

  • perPerson = 11.50 / 3 = 3.8333...
  • fmt(3.8333...) = "$3.83"
  • But 3 × $3.83 = $11.49 (1 cent lost)

This is noted as a "Known gap" in BUILD-NOTES.md line 29, and represents a key difference from arm-a's cent-based reconciliation approach. However, users splitting a real bill would notice the missing cent.

For the experiment as-documented, this is acceptable. For production, consider adopting arm-a's cent-based arithmetic or clearly communicating the rounding behavior to users.
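
For illustration, cent-based reconciliation for the example above might be sketched like this. The function name `splitBill` and its return shape are assumptions, not arm-a's actual code; the remainder is given to the first payers, which is one of several reasonable policies.

```javascript
// Hypothetical helper; names are illustrative, not taken from arm-a's code.
// Splits bill + tip into integer-cent shares that sum exactly to the total.
function splitBill(bill, tipPct, people) {
  const billCents = Math.round(bill * 100);
  const tipCents = Math.round((billCents * tipPct) / 100);
  const totalCents = billCents + tipCents;

  const baseShare = Math.floor(totalCents / people);
  const remainder = totalCents % people; // leftover cents to hand out one by one
  const shares = Array.from({ length: people }, (_, i) =>
    i < remainder ? baseShare + 1 : baseShare
  );
  return { tipCents, totalCents, shares };
}

const result = splitBill(10.0, 15, 3);
console.log(result.shares); // [384, 383, 383]
```

The shares sum to 1150 cents, so the cent that `total / people` plus `toFixed(2)` drops is accounted for. Values are divided by 100 only for display.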

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Repos/ab-aha-vs-bare-2026-06-02/arm-b/task3/app/index.html` around lines 270
- 276, The per-person value uses floating-point division (perPerson = total /
people) which loses cents; change to cent-based integer arithmetic: convert bill
and tip to cents (use Math.round) to compute tipAmtCents, totalCents, then
compute baseShareCents = Math.floor(totalCents / people) and remainder =
totalCents % people and produce per-person shares that distribute the remainder
(e.g., add 1 cent to the first remainder people) so the sum of per-person shares
equals totalCents; update the code paths that reference tipAmt, total and
outPerPerson (and the fmt function output) to use these cent-based values
(divide by 100 only for display) so rounding discrepancies are eliminated.
