[Platform] Research harness for A/B hackathon experiments by mariagorskikh · Pull Request #15 · projnanda/nandatown

mariagorskikh · 2026-05-26T20:08:29Z

Summary

Ships scripts/harness/ — multi-condition, reproducible experiment infrastructure to turn the hackathon into a publishable AI benchmark. Lets you run N>=100 agents per (model x brief_specificity x pre_push_checklist) cell and produce a dataset + plots the same way every time, from a plain shell or from inside Claude Code.

Everything is exercised by a default-suite pytest gate using mocked fixtures, so this PR adds zero live-agent cost while wiring up the path to spend that cost later.

What lands

| file | purpose |
| --- | --- |
| `scripts/harness/conditions.yaml` | Factor schema: `model` x `brief_specificity` x `pre_push_checklist`, plus a `skip:` list and per-cell defaults. The Cartesian product yields 17 live cells (18 - 1 skipped). Each cell has a stable `cell_id` (12-char sha256 of canonical `(conditions_version, factors)` JSON), so the same factor combo always maps to the same id, on any machine, in any process. |
| `scripts/harness/conditions.py` | YAML loader + cell expansion + `compute_cell_id()`. |
| `scripts/harness/agent_runner.py` | Two transports: `FixtureAgentRunner` (dry-run, replays mocked submissions deterministically) and `ClaudeCLIAgentRunner` (live; shells out to `claude -p ... --output-format stream-json`). The CLI path was chosen over a hand-rolled Anthropic SDK tool-loop because it's simpler and reuses the CLI's existing file-edit / branch-hygiene plumbing; the rationale is documented in `README.md`. |
| `scripts/harness/run_condition.py` | CLI: `--cell <cell_id> --n <N>`. Spawns N agents in isolated workdirs (`worktree` / `clone` / `ephemeral` strategies; `clone` is the most reproducible for use outside Claude Code), writes one JSONL line per submission to `data/hackathon-runs/<cell_id>.jsonl`, flushed and `fsync`'d after each row so a crash loses at most a partial line. Best-effort `gh`-CLI enrichment fills `head_sha`, `lines_added/removed`, `first_push_ci_status`, `iterations_to_green`. |
| `scripts/harness/collect.py` | Aggregates per-cell JSONLs into `data/hackathon-runs/all.jsonl`, deterministically sorted by `(cell_id, run_idx)`. Refuses to merge rows whose `schema_version` doesn't match the current harness, which prevents silent schema drift. |
| `scripts/harness/analyze.py` | Three PNG plots: diversity collapse (top-1 / top-3 layer-cluster share per condition), calibration (claimed-CI-green vs actual on first push), iteration efficiency (pushes-to-green distribution). matplotlib is gated behind an optional `harness` extra in `pyproject.toml`, so the core repo stays matplotlib-free. |
| `scripts/harness/briefs/{vague,layer-enumerated,open-problems}.md` | The three brief templates wired to the `brief_specificity` factor. `open-problems.md` renders the `docs/hackathon/problems/` listing if it exists and falls back to the vague brief if the parallel open-problems track hasn't landed yet. |
| `scripts/harness/dry_run/fixtures/*.json` | 5 mocked agent submissions covering green / claimed-green-but-actually-red / iterating-to-green / spawn-failure cases. |
| `scripts/harness/dry_run/test_dry_run.py` | Default-suite pytest gate: runs `run_condition` over fixtures for 2 cells x 4 replicates, asserts the JSONL row schema is exactly the 29 documented fields, runs `collect.py` and asserts sorted ordering + schema-mismatch refusal, runs `analyze.py` and asserts all three PNG files materialise (skipped cleanly when matplotlib isn't installed). |
| `scripts/harness/SCHEMA.md` | Versioned JSONL row schema, plus calibration-regex policy and clustering policy. |
| `scripts/harness/README.md` | Worked example end-to-end (dry-run + live), reproducibility notes (seed derivation, model id pinning, prompt hash), Claude-Code-vs-headless-shell matrix, "how to add a factor / metric" recipes. |
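
To make the cell expansion and stable ids described in the table concrete, here is a minimal sketch. It is not the code in `conditions.py`: the model names, the skipped combination, the version string, the `off` level, and the exact canonical JSON layout are assumptions chosen so the arithmetic (3 x 3 x 2 = 18, minus 1 skipped) matches the description.

```python
import hashlib
import itertools
import json

CONDITIONS_VERSION = "1"  # hypothetical version string

# Placeholder levels: the real ones live in conditions.yaml.
FACTORS = {
    "model": ["model-a", "model-b", "model-c"],  # hypothetical names
    "brief_specificity": ["vague", "layer-enumerated", "open-problems"],
    "pre_push_checklist": ["on", "off"],  # "off" is assumed
}
# Hypothetical skip entry; the real skip list is not shown in this PR text.
SKIP = [
    {"model": "model-c", "brief_specificity": "vague", "pre_push_checklist": "off"},
]


def compute_cell_id(conditions_version: str, factors: dict[str, str]) -> str:
    """Hash a canonical JSON encoding so the id is stable across machines."""
    canonical = json.dumps(
        {"conditions_version": conditions_version, "factors": factors},
        sort_keys=True,  # key order must not affect the hash
        separators=(",", ":"),  # no whitespace differences either
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def expand_cells(factors: dict[str, list[str]], skip: list[dict[str, str]]) -> list[dict]:
    """Cartesian product of all factor levels, minus skipped combinations."""
    names = sorted(factors)
    cells = []
    for combo in itertools.product(*(factors[name] for name in names)):
        cell = dict(zip(names, combo))
        if cell in skip:
            continue
        cells.append({"cell_id": compute_cell_id(CONDITIONS_VERSION, cell), "factors": cell})
    return cells
```

Sorting keys before hashing is what makes the id independent of dict ordering, and including the conditions version means a schema bump changes every id rather than silently reusing old ones.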

Schema, versioned

Every JSONL row carries: schema_version, harness_version, conditions_version, cell_id, factors, run_idx, seed (derived from sha256(seed_base, cell_id, run_idx)), model_id (concrete version-pinned id like claude-opus-4-7), prompt_hash (sha256[:16] of the rendered brief), transport, timestamp_utc, plus all the per-submission outputs (pr_url, branch, head_sha, layer_picked, lines_added/removed, first_push_ci_status/green, iterations_to_green, claimed_ci_green, final_message, transcript_path, description, error). Full table in SCHEMA.md.
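
The seed and prompt-hash fields above could be derived along these lines. This is a sketch under assumptions: the PR text only says the seed comes from `sha256(seed_base, cell_id, run_idx)` and that `prompt_hash` is the first 16 hex characters of a sha256 over the rendered brief, so the `:` separator, the 64-bit truncation, and the function names are illustrative choices.

```python
import hashlib


def derive_seed(seed_base: int, cell_id: str, run_idx: int) -> int:
    """Derive a per-run seed from the three inputs named in the schema.

    Assumption: inputs are joined with ':' and the first 8 digest bytes
    are read as an unsigned big-endian integer.
    """
    material = f"{seed_base}:{cell_id}:{run_idx}".encode("utf-8")
    digest = hashlib.sha256(material).digest()
    return int.from_bytes(digest[:8], "big")


def prompt_hash(rendered_brief: str) -> str:
    """First 16 hex characters of sha256 over the rendered brief text."""
    return hashlib.sha256(rendered_brief.encode("utf-8")).hexdigest()[:16]
```

Deriving the seed from `(seed_base, cell_id, run_idx)` rather than drawing it at random means any single row can be re-run in isolation with the same seed.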

Dry-run vs live

Default = dry-run. run_condition.py uses the fixture transport unless you pass --live. There is no auto-detection — live mode must be explicit.
pytest exercises the full pipeline against fixtures. No real agent is spawned during CI; no network is touched. This is what gates schema drift before you ever spend a real dollar.
Live transport runs from a plain shell. ClaudeCLIAgentRunner shells out to the claude CLI; combined with --workdir-strategy clone it works from any shell on any machine, not just inside Claude Code. README.md documents both paths and the matrix of what works where.
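
The "explicit opt-in" rule above can be sketched with `argparse`. The flag names `--cell`, `--n`, `--live`, and `--workdir-strategy` come from this description; the defaults, the help strings, and the transport labels `"fixture"` and `"claude-cli"` are assumptions, not the actual `run_condition.py` interface.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="run_condition")
    parser.add_argument("--cell", required=True, help="stable cell_id to run")
    parser.add_argument("--n", type=int, default=1, help="agents to spawn (default assumed)")
    # store_true defaults to False, so live mode is never on by accident.
    parser.add_argument("--live", action="store_true", help="spawn real agents")
    parser.add_argument(
        "--workdir-strategy",
        choices=["worktree", "clone", "ephemeral"],
        default="ephemeral",  # assumed default
    )
    return parser


def select_transport(args: argparse.Namespace) -> str:
    """Return a transport label; no environment sniffing, only the flag."""
    return "claude-cli" if args.live else "fixture"
```

For example, `select_transport(build_parser().parse_args(["--cell", "abc", "--n", "4"]))` picks the fixture transport, and adding `--live` is the only way to switch.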

Research questions this harness is designed to answer

Diversity collapse. How much does brief specificity (vague / layer-enumerated / open-problems) reduce variance in which layer an agent picks? analyze.diversity_collapse_metrics returns top-1 and top-3 cluster shares per condition; for finer-grained clustering, description (PR title) is recorded so future versions can swap in an embedding-based clusterer without changing the schema.
Calibration. Per (model, pre_push_checklist), what is the gap between claimed-CI-green (regex-matched against the agent's final message; pattern set frozen in _calibration.py) and actual-CI-green on first push (from gh rollup)?
Iteration efficiency. Pushes-to-green distribution per condition; does pre_push_checklist=on actually shift the mean?
Brief-specificity sensitivity. Cross-cut all three above by brief_specificity to identify which axis dominates each outcome.
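
As a rough illustration of the first two questions, the metrics could be computed as below. This is not `analyze.diversity_collapse_metrics`; the function names are hypothetical, and the row field `first_push_ci_green` is an assumed reading of the `first_push_ci_status/green` entry in the schema list.

```python
from collections import Counter


def top_k_share(layers: list[str], k: int) -> float:
    """Fraction of submissions that fall in the k most-picked layers."""
    if not layers:
        return 0.0
    counts = Counter(layers)
    top = sum(count for _, count in counts.most_common(k))
    return top / len(layers)


def calibration_gap(rows: list[dict]) -> float:
    """Claimed-green rate minus actual first-push green rate.

    Positive values mean agents claim success more often than CI agrees.
    """
    if not rows:
        return 0.0
    claimed = sum(1 for row in rows if row["claimed_ci_green"])
    actual = sum(1 for row in rows if row["first_push_ci_green"])
    return (claimed - actual) / len(rows)
```

A top-1 share close to 1.0 would indicate that nearly every agent in a condition picked the same layer, which is the collapse the first question asks about.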

Test plan

`uv sync && uv run ruff check . && uv run ruff format --check . && uv run pyright && uv run pytest -v`: all exit 0 locally (264 passed, 1 skipped; the skipped test is the matplotlib plot-output test, which auto-skips when the optional harness extra is not installed and passes when it is).
`uv sync --extra harness && uv run pytest scripts/harness/dry_run/test_dry_run.py::test_analyze_produces_plots`: passes, confirming the plot path renders all three PNGs.
`uv run python -m scripts.harness.run_condition --cell <id> --n 4 --dry-run`: writes a JSONL with the documented schema.
`uv run python -m scripts.harness.collect`: aggregates, sorts, refuses mismatched schemas.
(deferred — track scope) Live run of one small cell on a dedicated host to confirm the clone workdir strategy + gh enrichment end-to-end. Not done in this PR by design — this track ships infrastructure only.
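
The write-then-collect path exercised by the test plan can be sketched as follows. It assumes a hypothetical `SCHEMA_VERSION` constant and simplified rows; the real `run_condition.py` and `collect.py` may differ in structure and error handling.

```python
import json
import os
from pathlib import Path

SCHEMA_VERSION = 1  # hypothetical; the real value is defined by the harness


def append_row(path: Path, row: dict) -> None:
    """Append one JSONL row, flushing and fsyncing so a crash loses
    at most the line currently being written."""
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(row, sort_keys=True) + "\n")
        fh.flush()
        os.fsync(fh.fileno())


def collect(run_dir: Path, out_name: str = "all.jsonl") -> list[dict]:
    """Merge per-cell JSONL files, refusing rows from another schema."""
    rows = []
    for path in sorted(run_dir.glob("*.jsonl")):
        if path.name == out_name:
            continue  # never re-ingest a previous aggregate
        for line in path.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            row = json.loads(line)
            if row.get("schema_version") != SCHEMA_VERSION:
                raise ValueError(f"{path.name}: schema_version mismatch")
            rows.append(row)
    # A deterministic order makes the aggregate byte-stable across runs.
    rows.sort(key=lambda r: (r["cell_id"], r["run_idx"]))
    out = run_dir / out_name
    out.write_text(
        "".join(json.dumps(r, sort_keys=True) + "\n" for r in rows),
        encoding="utf-8",
    )
    return rows
```

Failing loudly on a version mismatch, rather than dropping or coercing the row, is what keeps older runs from blending into a dataset with a different column meaning.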

https://claude.ai/code/session_01C5j2D4MgCkPgsjSCqBVpWW

Generated by Claude Code

Ships scripts/harness/, a multi-condition, reproducible experiment harness for turning the hackathon runs into a publishable AI benchmark. Includes:
- conditions.yaml schema + cartesian-product cell expansion with stable hashed cell_ids
- run_condition.py CLI that spawns N agents per cell into isolated workdirs (worktree, clone, or ephemeral) and writes one JSONL row per submission
- agent_runner.py with a fixture transport (dry-run, no cost) and a claude CLI transport (live, runnable from a plain shell outside Claude Code)
- collect.py aggregator that refuses to merge mismatched schema versions
- analyze.py producing diversity-collapse, calibration, and iteration-efficiency PNG plots; matplotlib is gated behind an optional `harness` extra
- briefs for vague / layer-enumerated / open-problems specificity levels
- dry_run fixtures + pytest gate that exercises the full pipeline by default
- SCHEMA.md documenting the versioned JSONL row layout

Integration of 5 platform tracks built in parallel by specialist agents:
- platform/ci-hygiene (PR #12): Makefile + pre-commit + idempotent CI feedback bot + CONTRIBUTING Definition of Done
- platform/open-problems (PR #13): 10 differentiated open problems across 10 layers, charter, judging doc
- platform/judge-panel (PR #14): rubric, anthropic + openai providers, run_all CLI, real-diff fixture, live gpt-5.5 scoreboard for PRs #2-#11
- platform/research-harness (PR #15): conditions matrix, claude-CLI live runner, collect + analyze, dry-run fixtures + tests
- platform/marketplace-ui (PR #16): /hackathon Next.js section with author tags, judge scores, layer browser; Python data adapter

Schema reconciled end-to-end (rubric -> scores.json -> adapter -> TS types -> UI) on the 6-dim 1-5 scale with totals in [6, 30].

Local CI: 341 passed, 1 skipped (matplotlib gated), 1 deselected (live marker).

Live judge scoreboard top:
  #2 harvard-phd trust 26.0/30 (EigenTrust + checkable invariants)
  #7 coinbase-crypto payments 26.0/30 (HTLC escrow)
  #6 stanford-ml-phd trust 25.0/30
  #11 google-staff transport 25.0/30

mariagorskikh · 2026-05-26T22:06:52Z

Superseded by #17 (now merged to main at 1771cdb). Closing — the content of this PR is part of that integration merge.

Generated by Claude Code

sourcery-ai Bot reviewed May 26, 2026


mariagorskikh mentioned this pull request May 26, 2026

[Platform] Integration v2 #17

Merged

mariagorskikh merged commit bf7bf1d into main May 26, 2026
4 checks passed

mariagorskikh merged 1 commit into main from platform/research-harness
