
[Platform] Research harness for A/B hackathon experiments #15

Merged
mariagorskikh merged 1 commit into main from platform/research-harness
May 26, 2026

@mariagorskikh

Collaborator

Summary

Ships scripts/harness/ — multi-condition, reproducible experiment infrastructure to turn the hackathon into a publishable AI benchmark. Lets you run N>=100 agents per (model x brief_specificity x pre_push_checklist) cell and produce a dataset + plots the same way every time, from a plain shell or from inside Claude Code.

Everything is exercised by a default-suite pytest gate using mocked fixtures, so this PR adds zero live-agent cost while wiring up the path to spend that cost later.

What lands

| file | purpose |
| --- | --- |
| `scripts/harness/conditions.yaml` | Factor schema: `model` x `brief_specificity` x `pre_push_checklist`, plus a `skip:` list and per-cell defaults. Cartesian product produces 17 live cells (18 - 1 skipped). Each cell has a stable `cell_id` (12-char sha256 of canonical `(conditions_version, factors)` JSON), so the same factor combo always maps to the same id, on any machine, in any process. |
| `scripts/harness/conditions.py` | YAML loader + cell expansion + `compute_cell_id()`. |
| `scripts/harness/agent_runner.py` | Two transports: `FixtureAgentRunner` (dry-run, replays mocked submissions deterministically) and `ClaudeCLIAgentRunner` (live; shells out to `claude -p ... --output-format stream-json`). The CLI path was chosen over a hand-rolled Anthropic SDK tool-loop because it's simpler and reuses the CLI's existing file-edit / branch-hygiene plumbing; the rationale is documented in `README.md`. |
| `scripts/harness/run_condition.py` | CLI: `--cell <cell_id> --n <N>`. Spawns N agents in isolated workdirs (worktree / clone / ephemeral strategies; clone is the most reproducible for use outside Claude Code), writes one JSONL line per submission to `data/hackathon-runs/<cell_id>.jsonl`, flushed and fsync'd after each row so a crash loses at most a partial line. Best-effort gh-CLI enrichment fills `head_sha`, `lines_added`/`lines_removed`, `first_push_ci_status`, `iterations_to_green`. |
| `scripts/harness/collect.py` | Aggregates per-cell JSONLs into `data/hackathon-runs/all.jsonl`, deterministically sorted by `(cell_id, run_idx)`. Refuses to merge rows whose `schema_version` doesn't match the current harness, which prevents silent schema drift. |
| `scripts/harness/analyze.py` | Three PNG plots: diversity collapse (top-1 / top-3 layer-cluster share per condition), calibration (claimed-CI-green vs actual on first push), iteration efficiency (pushes-to-green distribution). matplotlib is gated behind an optional `harness` extra in `pyproject.toml`, so the core repo stays matplotlib-free. |
| `scripts/harness/briefs/{vague,layer-enumerated,open-problems}.md` | The three brief templates wired to the `brief_specificity` factor. `open-problems.md` renders the `docs/hackathon/problems/` listing if it exists and falls back to the vague brief if the parallel open-problems track hasn't landed yet. |
| `scripts/harness/dry_run/fixtures/*.json` | 5 mocked agent submissions covering green / claimed-green-but-actually-red / iterating-to-green / spawn-failure cases. |
| `scripts/harness/dry_run/test_dry_run.py` | Default-suite pytest gate: runs `run_condition` over fixtures for 2 cells x 4 replicates, asserts the JSONL row schema is exactly the 29 documented fields, runs `collect.py` and asserts sorted ordering + schema-mismatch refusal, runs `analyze.py` and asserts all three PNG files materialise (skipped cleanly when matplotlib isn't installed). |
| `scripts/harness/SCHEMA.md` | Versioned JSONL row schema, plus calibration-regex policy and clustering policy. |
| `scripts/harness/README.md` | Worked example end-to-end (dry-run + live), reproducibility notes (seed derivation, model id pinning, prompt hash), Claude-Code-vs-headless-shell matrix, "how to add a factor / metric" recipes. |
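
For readers who want a feel for the cell expansion and stable `cell_id` described for `conditions.yaml`, here is a minimal sketch. It is not the code in `conditions.py`: the model names, the skipped cell, the `CONDITIONS_VERSION` value, the `expand_cells` helper, and the exact canonical-JSON form are all assumptions made for illustration. Only the factor names, the 18 - 1 = 17 cell count, and the 12-char sha256 prefix come from the description above.

```python
import hashlib
import itertools
import json

CONDITIONS_VERSION = 1  # hypothetical value; the real one lives in conditions.yaml

# Factor names come from the PR; the model level names are placeholders.
FACTORS = {
    "model": ["model-a", "model-b", "model-c"],
    "brief_specificity": ["vague", "layer-enumerated", "open-problems"],
    "pre_push_checklist": ["off", "on"],
}

# Hypothetical skip entry; the PR does not say which cell is skipped.
SKIP = [
    {"model": "model-c", "brief_specificity": "vague", "pre_push_checklist": "on"},
]


def compute_cell_id(conditions_version, factors):
    """Hash a canonical JSON encoding so key order and whitespace never matter."""
    canonical = json.dumps(
        [conditions_version, factors], sort_keys=True, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def expand_cells(factors, skip):
    """Cartesian product of all factor levels, minus any explicitly skipped cells."""
    names = sorted(factors)
    cells = []
    for levels in itertools.product(*(factors[name] for name in names)):
        cell = dict(zip(names, levels))
        if cell in skip:
            continue
        cells.append(
            {"cell_id": compute_cell_id(CONDITIONS_VERSION, cell), "factors": cell}
        )
    return cells


cells = expand_cells(FACTORS, SKIP)
print(len(cells), cells[0]["cell_id"])
```

Because the id depends only on the canonical JSON bytes, two processes that agree on the conditions version and the factor values will agree on the id without sharing any state.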

Schema, versioned

Every JSONL row carries: schema_version, harness_version, conditions_version, cell_id, factors, run_idx, seed (derived from sha256(seed_base, cell_id, run_idx)), model_id (concrete version-pinned id like claude-opus-4-7), prompt_hash (sha256[:16] of the rendered brief), transport, timestamp_utc, plus all the per-submission outputs (pr_url, branch, head_sha, layer_picked, lines_added/removed, first_push_ci_status/green, iterations_to_green, claimed_ci_green, final_message, transcript_path, description, error). Full table in SCHEMA.md.
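
The seed and prompt-hash derivations mentioned above could look roughly like the sketch below. The PR only states that the seed comes from sha256 over `(seed_base, cell_id, run_idx)` and that `prompt_hash` is the first 16 hex characters of the brief's sha256; the way the three inputs are joined, the number of digest bytes turned into an integer, and the function names `derive_seed` and `prompt_hash` are assumptions here.

```python
import hashlib


def derive_seed(seed_base: int, cell_id: str, run_idx: int) -> int:
    """Deterministic per-run seed; the ':' join and 8-byte truncation are assumptions."""
    material = f"{seed_base}:{cell_id}:{run_idx}".encode("utf-8")
    digest = hashlib.sha256(material).digest()
    return int.from_bytes(digest[:8], "big")


def prompt_hash(rendered_brief: str) -> str:
    """First 16 hex chars of the sha256 of the rendered brief text."""
    return hashlib.sha256(rendered_brief.encode("utf-8")).hexdigest()[:16]


seed = derive_seed(0, "0123456789ab", 3)
print(seed, prompt_hash("Pick a layer and open a PR."))
```

Deriving the seed from the cell and run index, rather than drawing it at runtime, means a single replicate can be re-run in isolation and still receive the seed it had in the full sweep.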

Dry-run vs live

  • Default = dry-run. run_condition.py uses the fixture transport unless you pass --live. There is no auto-detection — live mode must be explicit.
  • pytest exercises the full pipeline against fixtures. No real agent is spawned during CI; no network is touched. This is what gates schema drift before you ever spend a real dollar.
  • Live transport runs from a plain shell. ClaudeCLIAgentRunner shells out to the claude CLI; combined with --workdir-strategy clone it works from any shell on any machine, not just inside Claude Code. README.md documents both paths and the matrix of what works where.
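
The "live must be explicit" rule can be illustrated with a small `argparse` sketch. The flag names `--cell`, `--n`, `--live`, `--dry-run`, and `--workdir-strategy` appear in this PR; the defaults, the mutually exclusive grouping, and the transport labels returned by the hypothetical `select_transport` helper are assumptions, not the actual `run_condition.py` code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="run_condition")
    parser.add_argument("--cell", required=True, help="cell_id to run")
    parser.add_argument("--n", type=int, default=1, help="number of replicates")
    parser.add_argument(
        "--workdir-strategy",
        choices=["worktree", "clone", "ephemeral"],
        default="clone",  # assumed default
    )
    # Accept both flags but never both at once; neither flag means dry-run.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--live", action="store_true", help="spawn real agents")
    mode.add_argument("--dry-run", action="store_true", help="replay fixtures")
    return parser


def select_transport(args: argparse.Namespace) -> str:
    """No auto-detection: only an explicit --live selects the live transport."""
    return "claude-cli" if args.live else "fixture"


args = build_parser().parse_args(["--cell", "0123456789ab", "--n", "4"])
print(select_transport(args))
```

Keeping the decision in one function that looks only at the parsed flag makes it hard for an environment variable or a detected CLI binary to switch on paid runs by accident.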

Research questions this harness is designed to answer

  1. Diversity collapse. How much does brief specificity (vague / layer-enumerated / open-problems) reduce variance in which layer an agent picks? analyze.diversity_collapse_metrics returns top-1 and top-3 cluster shares per condition; for finer-grained clustering, description (PR title) is recorded so future versions can swap in an embedding-based clusterer without changing the schema.
  2. Calibration. Per (model, pre_push_checklist), what is the gap between claimed-CI-green (regex-matched against the agent's final message; pattern set frozen in _calibration.py) and actual-CI-green on first push (from gh rollup)?
  3. Iteration efficiency. Pushes-to-green distribution per condition; does pre_push_checklist=on actually shift the mean?
  4. Brief-specificity sensitivity. Cross-cut all three above by brief_specificity to identify which axis dominates each outcome.
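
As a rough illustration of the first two questions, the sketch below computes top-k layer shares and a claimed-vs-actual calibration gap from a handful of made-up rows. The field names follow the schema section; the functions `top_k_share` and `calibration_gap`, the sample rows, and the definition of the gap as a difference of rates are assumptions, and `analyze.diversity_collapse_metrics` may well be implemented differently.

```python
from collections import Counter


def top_k_share(layers, k):
    """Fraction of submissions that fall in the k most common layers."""
    if not layers:
        return 0.0
    counts = Counter(layers)
    covered = sum(count for _, count in counts.most_common(k))
    return covered / len(layers)


def calibration_gap(rows):
    """Claimed-green rate minus actual first-push-green rate (positive = overclaiming)."""
    if not rows:
        raise ValueError("need at least one row")
    claimed = sum(1 for row in rows if row["claimed_ci_green"])
    actual = sum(1 for row in rows if row["first_push_ci_green"])
    return (claimed - actual) / len(rows)


# Invented rows for illustration only; not results from any run.
rows = [
    {"layer_picked": "trust", "claimed_ci_green": True, "first_push_ci_green": True},
    {"layer_picked": "trust", "claimed_ci_green": True, "first_push_ci_green": False},
    {"layer_picked": "payments", "claimed_ci_green": True, "first_push_ci_green": True},
    {"layer_picked": "transport", "claimed_ci_green": False, "first_push_ci_green": False},
]

layers = [row["layer_picked"] for row in rows]
top1 = top_k_share(layers, 1)
top3 = top_k_share(layers, 3)
gap = calibration_gap(rows)
print(top1, top3, gap)
```

A top-1 share near 1.0 would indicate that almost every agent in the condition picked the same layer, which is the collapse the first question is probing.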

Test plan

  • uv sync && uv run ruff check . && uv run ruff format --check . && uv run pyright && uv run pytest -v all exit 0 locally (264 passed, 1 skipped — the skipped test is the matplotlib plot-output test which auto-skips when the optional harness extra is not installed; passes when it is).
  • uv sync --extra harness && uv run pytest scripts/harness/dry_run/test_dry_run.py::test_analyze_produces_plots passes — confirms the plot path renders all three PNGs.
  • uv run python -m scripts.harness.run_condition --cell <id> --n 4 --dry-run writes a JSONL with the documented schema.
  • uv run python -m scripts.harness.collect aggregates, sorts, refuses mismatched schemas.
  • (deferred — track scope) Live run of one small cell on a dedicated host to confirm the clone workdir strategy + gh enrichment end-to-end. Not done in this PR by design — this track ships infrastructure only.
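
The write-then-collect behaviour checked in the test plan (fsync'd JSONL rows, deterministic sort, refusal on schema mismatch) can be sketched as follows. This is an assumption-laden illustration rather than the code in `run_condition.py` or `collect.py`: `SCHEMA_VERSION`, `append_row`, `collect`, and the choice of `ValueError` are all hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path

SCHEMA_VERSION = 1  # hypothetical; the real value is defined by the harness


def append_row(path, row):
    """Append one JSON line, then flush and fsync so a crash loses at most a partial line."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(row, sort_keys=True) + "\n")
        fh.flush()
        os.fsync(fh.fileno())


def collect(run_dir, out_name="all.jsonl"):
    """Merge per-cell JSONL files, refusing any row with a foreign schema_version."""
    rows = []
    for path in sorted(Path(run_dir).glob("*.jsonl")):
        if path.name == out_name:
            continue  # never re-ingest a previous aggregate
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                if not line.strip():
                    continue
                row = json.loads(line)
                if row.get("schema_version") != SCHEMA_VERSION:
                    raise ValueError(f"schema mismatch in {path.name}")
                rows.append(row)
    rows.sort(key=lambda row: (row["cell_id"], row["run_idx"]))
    with open(Path(run_dir) / out_name, "w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row, sort_keys=True) + "\n")
    return rows


with tempfile.TemporaryDirectory() as tmp:
    append_row(Path(tmp) / "bbb.jsonl", {"schema_version": 1, "cell_id": "bbb", "run_idx": 1})
    append_row(Path(tmp) / "bbb.jsonl", {"schema_version": 1, "cell_id": "bbb", "run_idx": 0})
    append_row(Path(tmp) / "aaa.jsonl", {"schema_version": 1, "cell_id": "aaa", "run_idx": 0})
    merged = collect(tmp)

print([(row["cell_id"], row["run_idx"]) for row in merged])
```

Failing loudly on the first mismatched row, instead of dropping it, keeps a stale per-cell file from quietly shrinking the dataset.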

https://claude.ai/code/session_01C5j2D4MgCkPgsjSCqBVpWW


Generated by Claude Code

Ships scripts/harness/ — a multi-condition, reproducible experiment harness
for turning the hackathon runs into a publishable AI benchmark.

Includes:
- conditions.yaml schema + cartesian-product cell expansion with stable hashed cell_ids
- run_condition.py CLI that spawns N agents per cell into isolated workdirs
  (worktree, clone, or ephemeral) and writes one JSONL row per submission
- agent_runner.py with a fixture transport (dry-run, no cost) and a claude CLI
  transport (live, runnable from a plain shell outside Claude Code)
- collect.py aggregator that refuses to merge mismatched schema versions
- analyze.py producing diversity-collapse, calibration, and iteration-efficiency
  PNG plots; matplotlib is gated behind an optional `harness` extra
- briefs for vague / layer-enumerated / open-problems specificity levels
- dry_run fixtures + pytest gate that exercises the full pipeline by default
- SCHEMA.md documenting the versioned JSONL row layout

@sourcery-ai sourcery-ai Bot left a comment

Sorry @mariagorskikh, you have reached your weekly rate limit of 500000 diff characters.

Please try again later or upgrade to continue using Sourcery

@mariagorskikh mariagorskikh merged commit bf7bf1d into main May 26, 2026
4 checks passed
mariagorskikh added a commit that referenced this pull request May 26, 2026
Integration of 5 platform tracks built in parallel by specialist agents:

- platform/ci-hygiene (PR #12): Makefile + pre-commit + idempotent CI feedback bot + CONTRIBUTING Definition of Done
- platform/open-problems (PR #13): 10 differentiated open problems across 10 layers, charter, judging doc
- platform/judge-panel (PR #14): rubric, anthropic + openai providers, run_all CLI, real-diff fixture, live gpt-5.5 scoreboard for PRs #2-#11
- platform/research-harness (PR #15): conditions matrix, claude-CLI live runner, collect + analyze, dry-run fixtures + tests
- platform/marketplace-ui (PR #16): /hackathon Next.js section with author tags, judge scores, layer browser; Python data adapter

Schema reconciled end-to-end (rubric -> scores.json -> adapter -> TS types -> UI) on the 6-dim 1-5 scale with totals in [6, 30].

Local CI: 341 passed, 1 skipped (matplotlib gated), 1 deselected (live marker).

Live judge scoreboard top:
  #2  harvard-phd     trust       26.0/30  (EigenTrust + checkable invariants)
  #7  coinbase-crypto payments    26.0/30  (HTLC escrow)
  #6  stanford-ml-phd trust       25.0/30
  #11 google-staff    transport   25.0/30

Collaborator Author

Superseded by #17 (now merged to main at 1771cdb). Closing — the content of this PR is part of that integration merge.


Generated by Claude Code
