[Platform] Integration v2#17
Merged
Merged
Conversation
…n of Done Closes the gap where contributors run "ruff check" + pytest, call that "the tests", and ship PRs that fail CI on ruff format --check and/or pyright. - Makefile: `ci-local` runs the exact CI sequence (uv sync, ruff check, ruff format --check, pyright, pytest -v) and hard-fails on the first red command. `hooks` installs pre-commit. `help` is the default goal. - .pre-commit-config.yaml: ruff-check + ruff-format (auto-fix locally) and pyright in strict mode (versions pinned to what `uv sync` resolves). - CONTRIBUTING.md: Definition of Done section at the very top with the five required pre-push commands and a one-line rationale for each. - README.md: "Before you push" callout pointing at `make ci-local` and the Definition of Done. - .github/workflows/ci-feedback.yml: triggers on the existing CI workflow's failure, downloads logs, extracts per-check excerpts (ruff format diff, pyright errors, pytest summary; ~40 lines each), and posts/edits a single PR comment keyed off the marker `<!-- ci-feedback-bot -->`. Permissions scoped to pull-requests:write, actions:read, contents:read. Verified locally: `make ci-local` exits 0 on this branch (5/5 green).
Ship a participant-facing brief for the month-long NEST hackathon: a charter, 10 differentiated open problems spanning 10 of the 12 layers, and a six-dimension judging rubric. The 10 problems are chosen to avoid the obvious-pick collapse seen in the first round (3x EigenTrust, 4x latency transport): no plain EigenTrust problem, no plain in_memory-transport-latency problem, and every problem cites the specific reference file lines that prove the gap is real.
Builds the scoring system for the month-long NEST hackathon: - scripts/judge/rubric.md: versioned rubric (v1), six 1-5 dimensions (correctness, test_rigor, api_fit, docs_quality, novelty, persona_fidelity) with anchored 1/3/5 examples. - scripts/judge/judge_pr.py: async judge_pr(pr_number, n_judges=3, model="claude-opus-4-7"). N parallel judges via anthropic.AsyncAnthropic with the rubric block marked cache_control: ephemeral; temperature 0.0; GitHub PR fetched via stdlib urllib (optional GITHUB_TOKEN). Aggregator returns median per dimension (statistics.median_low tie-break), total median, and a deterministic 3-sentence consensus. Per-file diffs above 5000 lines are truncated with a marker. - scripts/judge/run_all.py: CLI that scores every open hackathon/* PR and writes docs/hackathon/scores.json. Idempotent on HEAD SHA. Falls back to a deterministic MockJudgeClient (seeded by head_sha + judge_id) when ANTHROPIC_API_KEY is unset or --mock is passed. --prs-cache reads a pre-fetched PR list for offline smoke runs. - docs/hackathon/scores.json: bootstrap scoreboard for PRs #2-#11 generated with mock judges (mock: true). Re-run with a live key to replace. - scripts/judge/tests/test_judge.py: 24 unit tests covering median aggregation (incl. tie-breaking + missing-judge handling), JSON schema round-trip, parse_verdict fault tolerance, diff truncation, persona inference, and a fake JudgeClient driving judge_pr end-to-end. Live API tests behind @pytest.mark.live, skipped by default. - pyproject.toml: register scripts/ under pytest testpaths and the "live" marker. - README.md: Hackathon section linking to docs/hackathon/scores.json. https://claude.ai/code/session_01C5j2D4MgCkPgsjSCqBVpWW
Ships scripts/harness/ — a multi-condition, reproducible experiment harness for turning the hackathon runs into a publishable AI benchmark. Includes: - conditions.yaml schema + cartesian-product cell expansion with stable hashed cell_ids - run_condition.py CLI that spawns N agents per cell into isolated workdirs (worktree, clone, or ephemeral) and writes one JSONL row per submission - agent_runner.py with a fixture transport (dry-run, no cost) and a claude CLI transport (live, runnable from a plain shell outside Claude Code) - collect.py aggregator that refuses to merge mismatched schema versions - analyze.py producing diversity-collapse, calibration, and iteration-efficiency PNG plots; matplotlib is gated behind an optional `harness` extra - briefs for vague / layer-enumerated / open-problems specificity levels - dry_run fixtures + pytest gate that exercises the full pipeline by default - SCHEMA.md documenting the versioned JSONL row layout
- New Next.js routes under /hackathon: landing (stats + featured), /layers (12-card grid), /layers/[layer] (sortable list), and /submissions/[id] (full detail with judge breakdown). - Server components read a static dataset built by nest-marketplace, a new workspace package that loads docs/hackathon/scores.json, ingests the open hackathon/* PRs, tags agent vs human authors by branch slug, and writes apps/nest-dashboard/public/hackathon-data.json. - Adapter is pure-Python and fully typed (pyright strict). 34 new pytest tests cover the missing-scores fallback, agent-handle classification, layer routing, and route-file smoke checks.
There was a problem hiding this comment.
Sorry @mariagorskikh, your pull request is larger than the review limit of 150000 diff characters
The judge panel now supports an OpenAIProvider alongside the existing
AnthropicJudgeClient (aliased as AnthropicProvider). Selection is via
the new `--provider {anthropic,openai}` flag on run_all.py and a
`make_provider()` factory inside judge_pr.py. The Anthropic path is
unchanged when --provider is left at its default, so existing scores
are bit-for-bit identical. OpenAI calls chat.completions with JSON mode,
temperature 0.0, default model gpt-5.5, and an OPENAI_API_KEY env var.
The rubric, JSON output schema, scores.json shape, and median-low
aggregation are shared across both providers.
…dian `_build_consensus` was reporting `sum(per-dim medians)` as the score, while `total_median` in the JSON used `median_low(per-judge totals)`. In any non-degenerate case these diverge — e.g., PR #2 was showing "scored 20.0/30" in prose while `median: 21.0` in JSON, confusing downstream consumers. - Plumb `total_median` (the value that actually gets written to JSON) through to `_build_consensus` so the prose uses the same aggregation. - Add a regression test (`test_consensus_uses_total_median_not_sum_of_medians`) that constructs three judges with divergent score patterns where the old buggy computation (20) and the new correct computation (21) differ; it asserts the consensus string contains `f"{total_median:.1f}/30"`. - Re-bootstrap `docs/hackathon/scores.json` with the fixed code; all 10 entries now have the median field and consensus prose in agreement.
The adapter was reading an invented `{<pr>: {correctness, realism, design,
docs, total, notes}}` shape with totals on a 0-10 scale. The judge panel
(PR #14) actually writes the PR #14 scoreboard shape: a top-level
`{version, generated_at, mock, submissions: [{pr, scores: {6 dims},
median, consensus, ...}]}` with six dimensions (correctness, test_rigor,
api_fit, docs_quality, novelty, persona_fidelity) each on a 1-5 scale and
`median` in [6, 30]. The mismatch made every submission render as
"unscored" in the marketplace UI.
- Rewrite `load_scores` to walk `submissions[]`, project each entry into
`{pr_number: JudgeScore}`, copy the canonical `median` field into
`JudgeScore.total`, and stash `consensus` on `notes` for the detail
view to quote. The old flat shape now degrades to `{}` instead of
smuggling stale numbers.
- Update `JudgeScore` (Python + TS) to carry all six real dimensions on
the 1-5 scale; total stays in [6, 30].
- Update the submission detail page `ScoreBar` to render `N/5` per
dimension, and the headline total as `X/30`. Score badge tooltip and
hackathon-card titles updated to `/30`.
- Update the marketplace tests for the new shape + a regression test
that the old flat shape is rejected.
- Rebuild `apps/nest-dashboard/public/hackathon-data.json` against the
fixed `docs/hackathon/scores.json` from `platform/judge-panel`; the
trust layer (3 submissions) now reports a non-null `top_score` (23.0).
The participant-facing judging doc described a 0-10 per-dimension scale with `final = sum/6`, and an "adversarial validator fails → zero on correctness" hard floor. Neither matches what the judge panel (`scripts/judge/judge_pr.py` + `scripts/judge/rubric.md`) actually does. The rubric scores six dimensions each on a 1-5 integer scale; the headline total is the sum (in [6, 30]). The judges read the PR body, diff, and checks summary — they have no mechanism to evaluate an "adversarial validator", so the zero-on-correctness rule was writing a check the code can't cash. - Rewrite to declare `scripts/judge/rubric.md` as the source of truth and explicitly defer to it for anchor descriptions. - Replace the 0-10 / sum-divided-by-six description with the actual 1-5 per-dimension scale and `[6, 30]` sum total, matching what gets written to `docs/hackathon/scores.json` as the `median` field. - Drop the "adversarial validator zero" claim and the cross-runner / seed-bank description, both of which describe machinery that does not exist in the judging pipeline. - Document the real flow: CI gates → N independent LLM judges via `judge_pr.py` → per-dim medians + `median_low(per-judge totals)` → consensus narrative → `scores.json` → marketplace UI.
…l-diff fixture - judge_pr.py: skip temperature kwarg for gpt-5.x models (they only accept default=1) - fixtures/hackathon-prs-2026-05-26.json: replace stub diffs with real git diffs (688-2002 lines per PR, fetched via git diff origin/main...origin/<head_ref>) - docs/hackathon/scores.json: 10 PRs scored by 3 gpt-5.5 judges each via OpenAI.
The temperature gating in judge_pr.py omits the kwarg for gpt-5.x models (they reject temperature=0). Test was asserting the now-absent kwarg. Update to assert absence, and add a parallel test for gpt-4o to lock in that the determinism path still works for older models.
53036f7 to
a64bc29
Compare
This was referenced May 26, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Purpose
Rebuild post-schema-recon + post-live-judge. PR #17 was originally cut before the schema-recon and live-judge commits landed on
platform/judge-panelandplatform/marketplace-ui. This is a clean rebuild on top of the latest HEAD of each source branch so reviewers can see the integrated state of all five platform tracks as it would look onmain.Integration is built from
origin/mainand merges in PRs #12, #13, #14, #15, #16 via--no-ffmerge commits (no rebase, no history rewrite), with one optional README-dedupe polish commit on top.Source branch HEADs (rebuild input)
origin/platform/ci-hygiene0de32f2origin/platform/open-problems228a697origin/platform/judge-paneld1150b9origin/platform/research-harnessa530450origin/platform/marketplace-ui02d49f9New integration HEAD:
a64bc29(onplatform/integration, force-pushed).Merge log
Merged in this order, each as a
--no-ffmerge commit:origin/platform/ci-hygiene(PR [Platform] CI hygiene: Makefile, pre-commit, feedback bot, Definition of Done #12) — clean auto-merge.origin/platform/open-problems(PR [Platform] Open problems + charter + judging doc #13) — clean auto-merge (README appended).origin/platform/judge-panel(PR [Platform] Judge panel + scoreboard for hackathon PRs #14) — clean auto-merge (now includes the live-judge scoreboard + OpenAI provider + temperature gating + gpt-4o test coverage).origin/platform/research-harness(PR [Platform] Research harness for A/B hackathon experiments #15) — 1 conflict inpyproject.toml, resolved by union.origin/platform/marketplace-ui(PR [Platform] Hackathon marketplace UI: /hackathon section with submissions, layers, authors #16) — 1 conflict inpyproject.toml, resolved by union (now consumes the realscores.jsonshape post schema-recon).Plus one post-merge polish commit (
a64bc29) to dedupe a duplicate## Hackathonheader in README.md.Conflicts hit and how each was resolved
pyproject.toml(mergingplatform/research-harness) — UNION[tool.pytest.ini_options]and[project.optional-dependencies]conflicted. Both sides additive, resolved by union per the integration brief:testpaths = ["packages", "scripts"]—scriptsis a superset ofscripts/harness/dry_run, collects both judge and harness tests.addopts = ["--import-mode=importlib", "-m", "not live"]— kept judge-panel'snot livefilter so CI doesn't hit the network.pythonpath = ["."]— from research-harness.markers = ["live: ..."]— from judge-panel.[project.optional-dependencies]now contains bothjudge(anthropic + openai) andharness(matplotlib + PyYAML) extras.pyproject.toml(mergingplatform/marketplace-ui) — UNION[tool.pyright].extraPathsconflicted. HEAD had"."(from research-harness), incoming had"packages/nest-marketplace". Both kept:README.md duplicate
## Hackathonheader — UNION + dedupe (post-merge polish)Both
platform/open-problemsandplatform/judge-panelappended## Hackathonsections to README.md in different regions; git auto-merged both without conflict, leaving two identically-titled H2s. Per the brief's README rule (deduplicate literal overlap), the second section was renamed to## Scoreboard, the overview was lightly augmented to mention the scoreboard, and the TOC was updated to match.This is the only editorial change on the integration branch; everything else is a true merge commit.
Total conflicts resolved: 2 file-level conflicts (both in
pyproject.toml), plus 1 README dedupe in the polish commit.CI evidence
make ci-localis green on the head of this rebuilt branch:(1 skipped =
test_analyze_produces_plots, gated on the optionalharnessmatplotlib extra; 1 deselected = thelivemarker filter.)Test count rose from 328 (previous integration) to 341 because
platform/judge-panelpicked up new gpt-5.5 temperature, gpt-4o, and OpenAI-provider tests as part of the live-judge work.Merged scoreboard — top entries from
docs/hackathon/scores.jsonGenerated by
scripts/judge/run_all.pywith 3 live judges per PR (model:gpt-5.5, rubric v1):(Tie at the top: PRs #2 and #7 both score 26.0; next is PR #11 google-staff at 25.0 on transport.)
Recommended action
Merge PRs #12–#16 individually instead — this branch is for review only.
Rationale:
mainwill look like after all five PRs land, so reviewers can sanity-check the integrated surface (README ordering,pyproject.tomlunion, no cross-PR breakage, live scoreboard rendered correctly by the marketplace adapter) in one place.Close this PR after the five source PRs have all merged to
main.Branches included
platform/ci-hygiene@0de32f2(PR [Platform] CI hygiene: Makefile, pre-commit, feedback bot, Definition of Done #12)platform/open-problems@228a697(PR [Platform] Open problems + charter + judging doc #13)platform/judge-panel@d1150b9(PR [Platform] Judge panel + scoreboard for hackathon PRs #14)platform/research-harness@a530450(PR [Platform] Research harness for A/B hackathon experiments #15)platform/marketplace-ui@02d49f9(PR [Platform] Hackathon marketplace UI: /hackathon section with submissions, layers, authors #16)https://claude.ai/code/session_01C5j2D4MgCkPgsjSCqBVpWW
Generated by Claude Code