[Platform] Research harness for A/B hackathon experiments #15
Merged
Ships `scripts/harness/`, a multi-condition, reproducible experiment harness for turning the hackathon runs into a publishable AI benchmark. Includes:
- `conditions.yaml` schema + cartesian-product cell expansion with stable hashed `cell_id`s
- `run_condition.py` CLI that spawns N agents per cell into isolated workdirs (worktree, clone, or ephemeral) and writes one JSONL row per submission
- `agent_runner.py` with a fixture transport (dry-run, no cost) and a `claude` CLI transport (live, runnable from a plain shell outside Claude Code)
- `collect.py` aggregator that refuses to merge mismatched schema versions
- `analyze.py` producing diversity-collapse, calibration, and iteration-efficiency PNG plots; matplotlib is gated behind an optional `harness` extra
- briefs for vague / layer-enumerated / open-problems specificity levels
- dry_run fixtures + pytest gate that exercises the full pipeline by default
- `SCHEMA.md` documenting the versioned JSONL row layout
mariagorskikh added a commit that referenced this pull request May 26, 2026
Integration of 5 platform tracks built in parallel by specialist agents:
- platform/ci-hygiene (PR #12): Makefile + pre-commit + idempotent CI feedback bot + CONTRIBUTING Definition of Done
- platform/open-problems (PR #13): 10 differentiated open problems across 10 layers, charter, judging doc
- platform/judge-panel (PR #14): rubric, anthropic + openai providers, run_all CLI, real-diff fixture, live gpt-5.5 scoreboard for PRs #2-#11
- platform/research-harness (PR #15): conditions matrix, claude-CLI live runner, collect + analyze, dry-run fixtures + tests
- platform/marketplace-ui (PR #16): /hackathon Next.js section with author tags, judge scores, layer browser; Python data adapter

Schema reconciled end-to-end (rubric -> scores.json -> adapter -> TS types -> UI) on the 6-dim 1-5 scale with totals in [6, 30].

Local CI: 341 passed, 1 skipped (matplotlib gated), 1 deselected (live marker).

Live judge scoreboard top:
- #2 harvard-phd trust 26.0/30 (EigenTrust + checkable invariants)
- #7 coinbase-crypto payments 26.0/30 (HTLC escrow)
- #6 stanford-ml-phd trust 25.0/30
- #11 google-staff transport 25.0/30
Superseded by #17 (now merged to main at …).
Generated by Claude Code
Summary
Ships `scripts/harness/`: multi-condition, reproducible experiment infrastructure to turn the hackathon into a publishable AI benchmark. Lets you run N>=100 agents per (`model` x `brief_specificity` x `pre_push_checklist`) cell and produce a dataset + plots the same way every time, from a plain shell or from inside Claude Code.

Everything is exercised by a default-suite pytest gate using mocked fixtures, so this PR adds zero live-agent cost while wiring up the path to spend that cost later.
What lands
- `scripts/harness/conditions.yaml`: `model` x `brief_specificity` x `pre_push_checklist`, plus a `skip:` list and per-cell defaults. Cartesian product produces 17 live cells (18 - 1 skipped). Each cell has a stable `cell_id` (12-char sha256 of canonical `(conditions_version, factors)` JSON), so the same factor combo always maps to the same id, on any machine, in any process.
- `scripts/harness/conditions.py`: `compute_cell_id()`.
- `scripts/harness/agent_runner.py`: `FixtureAgentRunner` (dry-run, replays mocked submissions deterministically) and `ClaudeCLIAgentRunner` (live; shells out to `claude -p ... --output-format stream-json`). The CLI path was chosen over a hand-rolled Anthropic SDK tool-loop because it's simpler and reuses the CLI's existing file-edit / branch-hygiene plumbing; the choice is documented in `README.md`.
- `scripts/harness/run_condition.py`: `--cell <cell_id> --n <N>`. Spawns N agents in isolated workdirs (`worktree` / `clone` / `ephemeral` strategies; `clone` is the most reproducible for use outside Claude Code), writes one JSONL line per submission to `data/hackathon-runs/<cell_id>.jsonl`, flushed and `fsync`'d after each row so a crash loses at most a partial line. Best-effort `gh`-CLI enrichment fills `head_sha`, `lines_added/removed`, `first_push_ci_status`, `iterations_to_green`.
- `scripts/harness/collect.py`: aggregates into `data/hackathon-runs/all.jsonl`, deterministically sorted by `(cell_id, run_idx)`. Refuses to merge rows whose `schema_version` doesn't match the current harness, which prevents silent schema drift.
- `scripts/harness/analyze.py`: diversity-collapse, calibration, and iteration-efficiency PNG plots; matplotlib is gated behind the optional `harness` extra in `pyproject.toml`, so the core repo stays matplotlib-free.
- `scripts/harness/briefs/{vague,layer-enumerated,open-problems}.md`: briefs for the levels of the `brief_specificity` factor. `open-problems.md` renders the `docs/hackathon/problems/` listing if it exists and falls back to the vague brief if the parallel open-problems track hasn't landed yet.
- `scripts/harness/dry_run/fixtures/*.json`: mocked submissions for the fixture transport.
- `scripts/harness/dry_run/test_dry_run.py`: runs `run_condition` over fixtures for 2 cells x 4 replicates, asserts the JSONL row schema is exactly the 29 documented fields, runs `collect.py` and asserts sorted ordering + schema-mismatch refusal, runs `analyze.py` and asserts all three PNG files materialise (skipped cleanly when matplotlib isn't installed).
- `scripts/harness/SCHEMA.md`: documents the versioned JSONL row layout.
- `scripts/harness/README.md`: documents the CLI-transport choice, both the dry-run and live paths, and the matrix of what works where.

Schema, versioned
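Several fields in this schema are hashes (`cell_id`, `seed`, `prompt_hash`). The following is a minimal sketch of how the cell expansion and those derivations could work, not the harness's actual code. The factor levels, the skipped combination, `CONDITIONS_VERSION`, the function signatures, and the exact byte encoding fed to sha256 are all assumptions made for illustration; only the hash lengths and the inputs to each hash come from the description.

```python
import hashlib
import itertools
import json

CONDITIONS_VERSION = "1"  # hypothetical version string

# Hypothetical factor levels: 3 x 3 x 2 = 18 combinations.
FACTORS = {
    "model": ["model-a", "model-b", "model-c"],
    "brief_specificity": ["vague", "layer-enumerated", "open-problems"],
    "pre_push_checklist": ["off", "on"],
}
# Hypothetical skip list with one entry, leaving 17 live cells.
SKIP = [
    {"model": "model-c", "brief_specificity": "vague", "pre_push_checklist": "on"},
]


def compute_cell_id(conditions_version: str, factors: dict) -> str:
    """12-char sha256 prefix of a canonical JSON encoding (assumed layout)."""
    canonical = json.dumps(
        [conditions_version, factors], sort_keys=True, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def expand_cells(factors: dict, skip: list) -> list:
    """Cartesian product of factor levels, minus any skipped combinations."""
    names = sorted(factors)
    cells = []
    for values in itertools.product(*(factors[name] for name in names)):
        combo = dict(zip(names, values))
        if combo in skip:
            continue
        cells.append(
            {"cell_id": compute_cell_id(CONDITIONS_VERSION, combo), "factors": combo}
        )
    return cells


def derive_seed(seed_base: int, cell_id: str, run_idx: int) -> int:
    """Per-run seed from sha256 over (seed_base, cell_id, run_idx); encoding assumed."""
    payload = f"{seed_base}:{cell_id}:{run_idx}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(payload).digest()[:8], "big")


def prompt_hash(rendered_brief: str) -> str:
    """First 16 hex chars of sha256 over the rendered brief text."""
    return hashlib.sha256(rendered_brief.encode("utf-8")).hexdigest()[:16]


cells = expand_cells(FACTORS, SKIP)
```

Because the JSON is serialised with sorted keys and fixed separators, the id depends only on the factor values and the conditions version, not on dict ordering, the machine, or the process.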
Every JSONL row carries: `schema_version`, `harness_version`, `conditions_version`, `cell_id`, `factors`, `run_idx`, `seed` (derived from `sha256(seed_base, cell_id, run_idx)`), `model_id` (concrete version-pinned id like `claude-opus-4-7`), `prompt_hash` (sha256[:16] of the rendered brief), `transport`, `timestamp_utc`, plus all the per-submission outputs (`pr_url`, `branch`, `head_sha`, `layer_picked`, `lines_added/removed`, `first_push_ci_status/green`, `iterations_to_green`, `claimed_ci_green`, `final_message`, `transcript_path`, `description`, `error`). Full table in `SCHEMA.md`.

Dry-run vs live
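The explicit opt-in to live mode described in this section could be wired up roughly as follows. This is an argparse sketch under stated assumptions: the flag names `--cell`, `--n`, `--live`, `--dry-run`, and `--workdir-strategy` appear in this PR description, but the parser layout, the defaults, the transport labels `"fixture"` and `"claude-cli"`, and the helper `select_transport` are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="run_condition")
    parser.add_argument("--cell", required=True, help="stable cell_id to run")
    parser.add_argument("--n", type=int, default=1, help="replicates to spawn")
    parser.add_argument(
        "--workdir-strategy",
        choices=["worktree", "clone", "ephemeral"],
        default="clone",  # arbitrary default chosen for this sketch
    )
    # Live mode is never inferred from the environment: it has to be requested.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--live", action="store_true", help="spawn real agents")
    mode.add_argument("--dry-run", action="store_true", help="replay fixtures")
    return parser


def select_transport(args: argparse.Namespace) -> str:
    """Fixture transport unless --live was passed explicitly."""
    return "claude-cli" if args.live else "fixture"


default_args = build_parser().parse_args(["--cell", "abc123", "--n", "4"])
live_args = build_parser().parse_args(["--cell", "abc123", "--live"])
```

Keeping the decision in one small function makes it easy to assert in tests that nothing reaches the live path without the flag.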
- `run_condition.py` uses the fixture transport unless you pass `--live`. There is no auto-detection; live mode must be explicit.
- `pytest` exercises the full pipeline against fixtures. No real agent is spawned during CI; no network is touched. This is what gates schema drift before you ever spend a real dollar.
- `ClaudeCLIAgentRunner` shells out to the `claude` CLI; combined with `--workdir-strategy clone` it works from any shell on any machine, not just inside Claude Code. `README.md` documents both paths and the matrix of what works where.

Research questions this harness is designed to answer
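One way to compute the top-1 and top-3 cluster shares mentioned for the diversity-collapse question is sketched below. It assumes, purely for illustration, that submissions are clustered by their `layer_picked` value and grouped by `cell_id`; the function name `cluster_shares` and the output keys are hypothetical and are not taken from `analyze.py`.

```python
from collections import Counter, defaultdict


def cluster_shares(rows: list) -> dict:
    """Share of submissions in the largest and three largest clusters per cell."""
    by_cell = defaultdict(Counter)
    for row in rows:
        by_cell[row["cell_id"]][row["layer_picked"]] += 1

    result = {}
    for cell_id, counts in by_cell.items():
        total = sum(counts.values())
        ranked = [count for _, count in counts.most_common()]
        result[cell_id] = {
            "n": total,
            "top1_share": ranked[0] / total,
            "top3_share": sum(ranked[:3]) / total,
        }
    return result


example_rows = [
    {"cell_id": "a", "layer_picked": "trust"},
    {"cell_id": "a", "layer_picked": "trust"},
    {"cell_id": "a", "layer_picked": "payments"},
    {"cell_id": "a", "layer_picked": "transport"},
    {"cell_id": "b", "layer_picked": "trust"},
    {"cell_id": "b", "layer_picked": "trust"},
]
shares = cluster_shares(example_rows)
```

A top-1 share near 1.0 means almost every agent in that condition converged on the same cluster, which is the collapse signal; an embedding-based clusterer could replace the `layer_picked` key without changing the output shape.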
- […] `analyze.diversity_collapse_metrics` returns top-1 and top-3 cluster shares per condition; for finer-grained clustering, `description` (PR title) is recorded so future versions can swap in an embedding-based clusterer without changing the schema.
- […] `_calibration.py`) and actual-CI-green on first push (from `gh` rollup)?
- […] `pre_push_checklist=on` actually shift the mean?
- […] `brief_specificity` to identify which axis dominates each outcome.

Test plan
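The write-then-collect path that this test plan exercises can be sketched as follows. This is a simplified illustration rather than the contents of `run_condition.py` or `collect.py`: `SCHEMA_VERSION`, `append_row`, and `collect` are hypothetical names, and error handling for a partially written final line is left out.

```python
import json
import os
from pathlib import Path

SCHEMA_VERSION = "1"  # hypothetical current schema version


def append_row(path: Path, row: dict) -> None:
    """Append one JSONL row, then flush and fsync so it survives a crash."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(row, sort_keys=True) + "\n")
        fh.flush()
        os.fsync(fh.fileno())


def collect(run_dir: Path, out_name: str = "all.jsonl") -> list:
    """Merge per-cell files into one file, sorted by (cell_id, run_idx)."""
    rows = []
    for path in sorted(run_dir.glob("*.jsonl")):
        if path.name == out_name:
            continue  # never re-ingest a previous aggregate
        for line in path.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            row = json.loads(line)
            if row.get("schema_version") != SCHEMA_VERSION:
                # Refuse outright rather than silently mixing schema versions.
                raise ValueError(
                    f"{path.name}: schema_version {row.get('schema_version')!r} "
                    f"does not match {SCHEMA_VERSION!r}"
                )
            rows.append(row)

    rows.sort(key=lambda r: (r["cell_id"], r["run_idx"]))
    with (run_dir / out_name).open("w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row, sort_keys=True) + "\n")
    return rows
```

Raising before anything is written means a mismatched row can never end up in the aggregate, and the sort key keeps the output byte-identical across reruns over the same inputs.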
- `uv sync && uv run ruff check . && uv run ruff format --check . && uv run pyright && uv run pytest -v` all exit 0 locally (264 passed, 1 skipped; the skipped test is the matplotlib plot-output test, which auto-skips when the optional `harness` extra is not installed and passes when it is).
- `uv sync --extra harness && uv run pytest scripts/harness/dry_run/test_dry_run.py::test_analyze_produces_plots` passes, confirming the plot path renders all three PNGs.
- `uv run python -m scripts.harness.run_condition --cell <id> --n 4 --dry-run` writes a JSONL with the documented schema.
- `uv run python -m scripts.harness.collect` aggregates, sorts, refuses mismatched schemas.
- […] `clone` workdir strategy + `gh` enrichment end-to-end. Not done in this PR by design; this track ships infrastructure only.

https://claude.ai/code/session_01C5j2D4MgCkPgsjSCqBVpWW
Generated by Claude Code