[Platform] Integration v2 by mariagorskikh · Pull Request #17 · projnanda/nandatown

mariagorskikh · 2026-05-26T20:40:57Z

Purpose

Rebuild post-schema-recon + post-live-judge. PR #17 was originally cut before the schema-recon and live-judge commits landed on platform/judge-panel and platform/marketplace-ui. This is a clean rebuild on top of the latest HEAD of each source branch so reviewers can see the integrated state of all five platform tracks as it would look on main.

Integration is built from origin/main and merges in PRs #12, #13, #14, #15, #16 via --no-ff merge commits (no rebase, no history rewrite), with one optional README-dedupe polish commit on top.

Source branch HEADs (rebuild input)

| Branch | HEAD | Latest commit |
| --- | --- | --- |
| `origin/platform/ci-hygiene` | `0de32f2` | Add CI hygiene tooling: Makefile, pre-commit, feedback bot, Definition of Done |
| `origin/platform/open-problems` | `228a697` | docs(judging): align with canonical rubric and judge panel behavior |
| `origin/platform/judge-panel` | `d1150b9` | test(judge): update gpt-5.5 temperature assertion + add gpt-4o coverage (includes schema-recon + OpenAI provider + live judge scores + temperature fix) |
| `origin/platform/research-harness` | `a530450` | Add research harness for A/B hackathon experiments |
| `origin/platform/marketplace-ui` | `02d49f9` | fix(marketplace): consume judge panel's real scores.json shape (schema-recon adapter rewrite) |

New integration HEAD: a64bc29 (on platform/integration, force-pushed).

Merge log

Merged in this order, each as a --no-ff merge commit:

1. `origin/platform/ci-hygiene` (PR #12, "[Platform] CI hygiene: Makefile, pre-commit, feedback bot, Definition of Done"): clean auto-merge.
2. `origin/platform/open-problems` (PR #13, "[Platform] Open problems + charter + judging doc"): clean auto-merge (README appended).
3. `origin/platform/judge-panel` (PR #14, "[Platform] Judge panel + scoreboard for hackathon PRs"): clean auto-merge (now includes the live-judge scoreboard + OpenAI provider + temperature gating + gpt-4o test coverage).
4. `origin/platform/research-harness` (PR #15, "[Platform] Research harness for A/B hackathon experiments"): 1 conflict in `pyproject.toml`, resolved by union.
5. `origin/platform/marketplace-ui` (PR #16, "[Platform] Hackathon marketplace UI: /hackathon section with submissions, layers, authors"): 1 conflict in `pyproject.toml`, resolved by union (now consumes the real scores.json shape post schema-recon).

Plus one post-merge polish commit (a64bc29) to dedupe a duplicate ## Hackathon header in README.md.

Conflicts hit and how each was resolved

`pyproject.toml` (merging `platform/research-harness`) — UNION

[tool.pytest.ini_options] and [project.optional-dependencies] conflicted. Both sides additive, resolved by union per the integration brief:

- `testpaths = ["packages", "scripts"]`: `scripts` is a superset of `scripts/harness/dry_run`, so the single entry collects both the judge and the harness tests.
- `addopts = ["--import-mode=importlib", "-m", "not live"]`: kept judge-panel's `not live` filter so CI doesn't hit the network.
- `pythonpath = ["."]`: from research-harness.
- `markers = ["live: ..."]`: from judge-panel.
- `[project.optional-dependencies]` now contains both the `judge` (anthropic + openai) and `harness` (matplotlib + PyYAML) extras.

`pyproject.toml` (merging `platform/marketplace-ui`) — UNION

[tool.pyright].extraPaths conflicted. HEAD had "." (from research-harness), incoming had "packages/nest-marketplace". Both kept:

"packages/nest-plugins-reference",
"packages/nest-marketplace",
".",

README.md duplicate `## Hackathon` header — UNION + dedupe (post-merge polish)

Both platform/open-problems and platform/judge-panel appended ## Hackathon sections to README.md in different regions; git auto-merged both without conflict, leaving two identically-titled H2s. Per the brief's README rule (deduplicate literal overlap), the second section was renamed to ## Scoreboard, the overview was lightly augmented to mention the scoreboard, and the TOC was updated to match.

This is the only editorial change on the integration branch; everything else is a true merge commit.

Total conflicts resolved: 2 file-level conflicts (both in pyproject.toml), plus 1 README dedupe in the polish commit.
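Both `pyproject.toml` conflicts came down to merging list-valued settings from two sides without losing or duplicating entries. A minimal sketch of that kind of order-preserving union, using the `[tool.pyright].extraPaths` case as the example. The helper name `union_preserving_order` and the exact contents of each conflict side are assumptions for illustration; the resolution on the branch was done by hand in the merge commit.

```python
def union_preserving_order(*sides: list[str]) -> list[str]:
    """Concatenate the sides, keeping the first occurrence of each entry."""
    seen: set[str] = set()
    merged: list[str] = []
    for side in sides:
        for entry in side:
            if entry not in seen:
                seen.add(entry)
                merged.append(entry)
    return merged


# Assumed reconstruction of the two sides of the extraPaths conflict.
incoming = ["packages/nest-plugins-reference", "packages/nest-marketplace"]
head = ["packages/nest-plugins-reference", "."]

extra_paths = union_preserving_order(incoming, head)
print(extra_paths)
```

Passing `incoming` first reproduces the entry order shown in the resolved `extraPaths` list above.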

CI evidence

make ci-local is green on the head of this rebuilt branch:

=========== 341 passed, 1 skipped, 1 deselected, 1 warning in 10.72s ===========
ci-local: all 5 checks passed. Safe to push.

(1 skipped = test_analyze_produces_plots, gated on the optional harness matplotlib extra; 1 deselected = the live marker filter.)

Test count rose from 328 (previous integration) to 341 because platform/judge-panel picked up new gpt-5.5 temperature, gpt-4o, and OpenAI-provider tests as part of the live-judge work.

Merged scoreboard — top entries from `docs/hackathon/scores.json`

Generated by scripts/judge/run_all.py with 3 live judges per PR (model: gpt-5.5, rubric v1):

| Rank | PR | Score | Layer | Persona | Title |
| --- | --- | --- | --- | --- | --- |
| 1 | #2 | 26.0/30 | trust | harvard-phd | EigenTrust plugin with checkable invariants |
| 1 | #7 | 26.0/30 | payments | coinbase-crypto | htlc_escrow payments plugin (hash- & time-locked) |
| 3 | #6 | 25.0/30 | trust | stanford-ml-phd | EigenTrust plugin for the trust layer |

(Tie at the top: PRs #2 and #7 both score 26.0; next is PR #11 google-staff at 25.0 on transport.)
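The rank column above (1, 1, 3) is consistent with standard competition ranking, where tied entries share a rank and the next rank is skipped. A small sketch of that scheme, using only the three rows shown in the table. Whether `scripts/judge/run_all.py` or the marketplace adapter computes ranks this way is an assumption, and `competition_ranks` is a hypothetical helper name.

```python
def competition_ranks(scores: dict[int, float]) -> dict[int, int]:
    """Rank PRs by score, highest first; ties share a rank and the next rank skips."""
    ordered = sorted(scores.items(), key=lambda item: -item[1])
    ranks: dict[int, int] = {}
    for position, (pr, score) in enumerate(ordered, start=1):
        if position > 1 and score == ordered[position - 2][1]:
            # Same score as the previous entry: reuse its rank.
            ranks[pr] = ranks[ordered[position - 2][0]]
        else:
            ranks[pr] = position
    return ranks


# PR number -> score, copied from the table above.
top_entries = {2: 26.0, 7: 26.0, 6: 25.0}
ranks = competition_ranks(top_entries)
print(ranks)
```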

Recommended action

Merge PRs #12–#16 individually instead — this branch is for review only.

Rationale:

- This branch faithfully reproduces what main will look like after all five PRs land, so reviewers can sanity-check the integrated surface (README ordering, pyproject.toml union, no cross-PR breakage, live scoreboard rendered correctly by the marketplace adapter) in one place.
- Each of the five tracks already has its own focused PR with its own review history and CI run. Merging this integration branch directly would land all five under a single squash/merge commit and lose that per-track history.
- The pytest config and pyright extraPaths unions resolved here are mechanical and can be re-resolved trivially when whichever of PR #15 (research harness) and PR #16 (marketplace UI) lands second.

Close this PR after the five source PRs have all merged to main.

Branches included

- `platform/ci-hygiene` @ `0de32f2` (PR #12, "[Platform] CI hygiene: Makefile, pre-commit, feedback bot, Definition of Done")
- `platform/open-problems` @ `228a697` (PR #13, "[Platform] Open problems + charter + judging doc")
- `platform/judge-panel` @ `d1150b9` (PR #14, "[Platform] Judge panel + scoreboard for hackathon PRs")
- `platform/research-harness` @ `a530450` (PR #15, "[Platform] Research harness for A/B hackathon experiments")
- `platform/marketplace-ui` @ `02d49f9` (PR #16, "[Platform] Hackathon marketplace UI: /hackathon section with submissions, layers, authors")

https://claude.ai/code/session_01C5j2D4MgCkPgsjSCqBVpWW

Generated by Claude Code

…n of Done

Closes the gap where contributors run "ruff check" + pytest, call that "the tests", and ship PRs that fail CI on ruff format --check and/or pyright.

- Makefile: `ci-local` runs the exact CI sequence (uv sync, ruff check, ruff format --check, pyright, pytest -v) and hard-fails on the first red command. `hooks` installs pre-commit. `help` is the default goal.
- .pre-commit-config.yaml: ruff-check + ruff-format (auto-fix locally) and pyright in strict mode (versions pinned to what `uv sync` resolves).
- CONTRIBUTING.md: Definition of Done section at the very top with the five required pre-push commands and a one-line rationale for each.
- README.md: "Before you push" callout pointing at `make ci-local` and the Definition of Done.
- .github/workflows/ci-feedback.yml: triggers on the existing CI workflow's failure, downloads logs, extracts per-check excerpts (ruff format diff, pyright errors, pytest summary; ~40 lines each), and posts/edits a single PR comment keyed off the marker ``. Permissions scoped to pull-requests:write, actions:read, contents:read.

Verified locally: `make ci-local` exits 0 on this branch (5/5 green).

Ship a participant-facing brief for the month-long NEST hackathon: a charter, 10 differentiated open problems spanning 10 of the 12 layers, and a six-dimension judging rubric. The 10 problems are chosen to avoid the obvious-pick collapse seen in the first round (3x EigenTrust, 4x latency transport): no plain EigenTrust problem, no plain in_memory-transport-latency problem, and every problem cites the specific reference file lines that prove the gap is real.

Builds the scoring system for the month-long NEST hackathon:

- scripts/judge/rubric.md: versioned rubric (v1), six 1-5 dimensions (correctness, test_rigor, api_fit, docs_quality, novelty, persona_fidelity) with anchored 1/3/5 examples.
- scripts/judge/judge_pr.py: async judge_pr(pr_number, n_judges=3, model="claude-opus-4-7"). N parallel judges via anthropic.AsyncAnthropic with the rubric block marked cache_control: ephemeral; temperature 0.0; GitHub PR fetched via stdlib urllib (optional GITHUB_TOKEN). Aggregator returns median per dimension (statistics.median_low tie-break), total median, and a deterministic 3-sentence consensus. Per-file diffs above 5000 lines are truncated with a marker.
- scripts/judge/run_all.py: CLI that scores every open hackathon/* PR and writes docs/hackathon/scores.json. Idempotent on HEAD SHA. Falls back to a deterministic MockJudgeClient (seeded by head_sha + judge_id) when ANTHROPIC_API_KEY is unset or --mock is passed. --prs-cache reads a pre-fetched PR list for offline smoke runs.
- docs/hackathon/scores.json: bootstrap scoreboard for PRs #2-#11 generated with mock judges (mock: true). Re-run with a live key to replace.
- scripts/judge/tests/test_judge.py: 24 unit tests covering median aggregation (incl. tie-breaking + missing-judge handling), JSON schema round-trip, parse_verdict fault tolerance, diff truncation, persona inference, and a fake JudgeClient driving judge_pr end-to-end. Live API tests behind @pytest.mark.live, skipped by default.
- pyproject.toml: register scripts/ under pytest testpaths and the "live" marker.
- README.md: Hackathon section linking to docs/hackathon/scores.json.

https://claude.ai/code/session_01C5j2D4MgCkPgsjSCqBVpWW
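The commit message says the fallback MockJudgeClient is deterministic and seeded by head_sha + judge_id. One way such seeding could work is sketched below. The function name `mock_scores`, the hashing scheme, and the uniform 1-5 draw are all assumptions for illustration, not the code in `scripts/judge/`; only the six dimension names and the 1-5 scale come from the rubric description above.

```python
import hashlib
import random

DIMENSIONS = (
    "correctness",
    "test_rigor",
    "api_fit",
    "docs_quality",
    "novelty",
    "persona_fidelity",
)


def mock_scores(head_sha: str, judge_id: int) -> dict[str, int]:
    """Return reproducible 1-5 scores for one (commit, judge) pair."""
    # Hash the pair so the seed is stable across processes and machines
    # (Python's built-in hash() of a str is salted per process).
    digest = hashlib.sha256(f"{head_sha}:{judge_id}".encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return {dim: rng.randint(1, 5) for dim in DIMENSIONS}


first = mock_scores("0de32f2", judge_id=0)
again = mock_scores("0de32f2", judge_id=0)
print(first == again)
```

Because the seed depends only on the commit and the judge index, re-running against an unchanged HEAD yields the same mock scoreboard, which fits the "idempotent on HEAD SHA" behavior described for run_all.py.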

Ships scripts/harness/, a multi-condition, reproducible experiment harness for turning the hackathon runs into a publishable AI benchmark. Includes:

- conditions.yaml schema + cartesian-product cell expansion with stable hashed cell_ids
- run_condition.py CLI that spawns N agents per cell into isolated workdirs (worktree, clone, or ephemeral) and writes one JSONL row per submission
- agent_runner.py with a fixture transport (dry-run, no cost) and a claude CLI transport (live, runnable from a plain shell outside Claude Code)
- collect.py aggregator that refuses to merge mismatched schema versions
- analyze.py producing diversity-collapse, calibration, and iteration-efficiency PNG plots; matplotlib is gated behind an optional `harness` extra
- briefs for vague / layer-enumerated / open-problems specificity levels
- dry_run fixtures + pytest gate that exercises the full pipeline by default
- SCHEMA.md documenting the versioned JSONL row layout
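The "cartesian-product cell expansion with stable hashed cell_ids" item can be illustrated with a short sketch. It assumes the conditions are a mapping from axis name to a list of levels and that a cell_id is a truncated SHA-256 of the cell's canonical JSON; the function name `expand_cells`, the axis names, and the 12-character truncation are hypothetical and are not taken from `conditions.yaml` or `SCHEMA.md`.

```python
import hashlib
import itertools
import json


def expand_cells(conditions: dict[str, list[str]]) -> list[dict[str, str]]:
    """Expand every combination of levels into a cell with a stable id."""
    axes = sorted(conditions)  # sort axes so the result ignores dict ordering
    cells: list[dict[str, str]] = []
    for levels in itertools.product(*(conditions[axis] for axis in axes)):
        cell = dict(zip(axes, levels))
        # Canonical JSON (sorted keys, fixed separators) makes the hash reproducible.
        canonical = json.dumps(cell, sort_keys=True, separators=(",", ":"))
        cell_id = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
        cells.append({"cell_id": cell_id, **cell})
    return cells


# Illustrative axes: the level names appear in the commit message,
# but treating them as condition axes is an assumption.
conditions = {
    "brief": ["vague", "layer-enumerated", "open-problems"],
    "workdir": ["worktree", "clone", "ephemeral"],
}
cells = expand_cells(conditions)
print(len(cells))
```

Hashing the cell contents rather than its position in the product means adding a new level later does not change the ids of existing cells.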

- New Next.js routes under /hackathon: landing (stats + featured), /layers (12-card grid), /layers/[layer] (sortable list), and /submissions/[id] (full detail with judge breakdown).
- Server components read a static dataset built by nest-marketplace, a new workspace package that loads docs/hackathon/scores.json, ingests the open hackathon/* PRs, tags agent vs human authors by branch slug, and writes apps/nest-dashboard/public/hackathon-data.json.
- Adapter is pure-Python and fully typed (pyright strict). 34 new pytest tests cover the missing-scores fallback, agent-handle classification, layer routing, and route-file smoke checks.

sourcery-ai

Sorry @mariagorskikh, your pull request is larger than the review limit of 150000 diff characters

The judge panel now supports an OpenAIProvider alongside the existing AnthropicJudgeClient (aliased as AnthropicProvider). Selection is via the new `--provider {anthropic,openai}` flag on run_all.py and a `make_provider()` factory inside judge_pr.py. The Anthropic path is unchanged when --provider is left at its default, so existing scores are bit-for-bit identical. OpenAI calls chat.completions with JSON mode, temperature 0.0, default model gpt-5.5, and an OPENAI_API_KEY env var. The rubric, JSON output schema, scores.json shape, and median-low aggregation are shared across both providers.

…dian

`_build_consensus` was reporting `sum(per-dim medians)` as the score, while `total_median` in the JSON used `median_low(per-judge totals)`. In any non-degenerate case these diverge: for example, PR #2 was showing "scored 20.0/30" in prose while `median: 21.0` in JSON, confusing downstream consumers.

- Plumb `total_median` (the value that actually gets written to JSON) through to `_build_consensus` so the prose uses the same aggregation.
- Add a regression test (`test_consensus_uses_total_median_not_sum_of_medians`) that constructs three judges with divergent score patterns where the old buggy computation (20) and the new correct computation (21) differ; it asserts the consensus string contains `f"{total_median:.1f}/30"`.
- Re-bootstrap `docs/hackathon/scores.json` with the fixed code; all 10 entries now have the median field and consensus prose in agreement.
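The difference between the two aggregations is easy to reproduce with `statistics.median_low`. The three judge rows below are made-up numbers chosen so that the sum of per-dimension medians is 20 while the median of per-judge totals is 21; they are not the fixture used in `test_consensus_uses_total_median_not_sum_of_medians`.

```python
from statistics import median_low

# Columns follow the rubric order: correctness, test_rigor, api_fit,
# docs_quality, novelty, persona_fidelity. Illustrative scores only.
judges = [
    [4, 3, 3, 3, 4, 4],  # judge 0, total 21
    [3, 4, 3, 3, 4, 4],  # judge 1, total 21
    [3, 3, 4, 3, 4, 4],  # judge 2, total 21
]

# Old prose computation: take the median of each dimension, then sum.
per_dim_medians = [median_low(column) for column in zip(*judges)]
sum_of_medians = sum(per_dim_medians)

# Value written to JSON: total each judge first, then take the median.
total_median = median_low([sum(row) for row in judges])

consensus_score = f"{total_median:.1f}/30"
print(sum_of_medians, total_median, consensus_score)
```

Each judge awards a single 4 in a different one of the first three dimensions, so the per-dimension median discards all three of those points while every judge's total still includes one of them.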

The adapter was reading an invented `{<pr>: {correctness, realism, design, docs, total, notes}}` shape with totals on a 0-10 scale. The judge panel (PR #14) actually writes a different scoreboard shape: a top-level `{version, generated_at, mock, submissions: [{pr, scores: {6 dims}, median, consensus, ...}]}` with six dimensions (correctness, test_rigor, api_fit, docs_quality, novelty, persona_fidelity) each on a 1-5 scale and `median` in [6, 30]. The mismatch made every submission render as "unscored" in the marketplace UI.

- Rewrite `load_scores` to walk `submissions[]`, project each entry into `{pr_number: JudgeScore}`, copy the canonical `median` field into `JudgeScore.total`, and stash `consensus` on `notes` for the detail view to quote. The old flat shape now degrades to `{}` instead of smuggling stale numbers.
- Update `JudgeScore` (Python + TS) to carry all six real dimensions on the 1-5 scale; total stays in [6, 30].
- Update the submission detail page `ScoreBar` to render `N/5` per dimension, and the headline total as `X/30`. Score badge tooltip and hackathon-card titles updated to `/30`.
- Update the marketplace tests for the new shape + a regression test that the old flat shape is rejected.
- Rebuild `apps/nest-dashboard/public/hackathon-data.json` against the fixed `docs/hackathon/scores.json` from `platform/judge-panel`; the trust layer (3 submissions) now reports a non-null `top_score` (23.0).
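A rough sketch of the projection this commit describes, for readers who want to see the two shapes side by side. It assumes the adapter receives the already-parsed JSON payload and that `JudgeScore` stores the dimensions in a dict; the real `load_scores` signature and the real `JudgeScore` fields in nest-marketplace may differ, and the sample PR number and scores are invented.

```python
from dataclasses import dataclass

DIMENSIONS = (
    "correctness",
    "test_rigor",
    "api_fit",
    "docs_quality",
    "novelty",
    "persona_fidelity",
)


@dataclass(frozen=True)
class JudgeScore:
    dimensions: dict[str, int]  # each on the 1-5 scale
    total: float                # copied from the entry's `median`, in [6, 30]
    notes: str                  # the judges' consensus text


def load_scores(payload: object) -> dict[int, JudgeScore]:
    """Project `submissions[]` into {pr_number: JudgeScore}; unknown shapes give {}."""
    if not isinstance(payload, dict) or not isinstance(payload.get("submissions"), list):
        return {}  # includes the old flat {<pr>: {...}} shape
    result: dict[int, JudgeScore] = {}
    for entry in payload["submissions"]:
        if not isinstance(entry, dict):
            continue
        scores = entry.get("scores")
        if not isinstance(scores, dict) or "pr" not in entry or "median" not in entry:
            continue
        if any(dim not in scores for dim in DIMENSIONS):
            continue
        result[int(entry["pr"])] = JudgeScore(
            dimensions={dim: int(scores[dim]) for dim in DIMENSIONS},
            total=float(entry["median"]),
            notes=str(entry.get("consensus", "")),
        )
    return result


# Invented sample entry in the scoreboard shape described above.
sample = {
    "version": 1,
    "mock": True,
    "submissions": [
        {
            "pr": 101,
            "scores": dict(zip(DIMENSIONS, [4, 3, 3, 3, 4, 4])),
            "median": 21.0,
            "consensus": "Example consensus text.",
        }
    ],
}
loaded = load_scores(sample)
old_flat = load_scores({"101": {"correctness": 7, "total": 7.5, "notes": ""}})
print(loaded[101].total, old_flat)
```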

The participant-facing judging doc described a 0-10 per-dimension scale with `final = sum/6`, and an "adversarial validator fails → zero on correctness" hard floor. Neither matches what the judge panel (`scripts/judge/judge_pr.py` + `scripts/judge/rubric.md`) actually does. The rubric scores six dimensions each on a 1-5 integer scale; the headline total is the sum (in [6, 30]). The judges read the PR body, diff, and checks summary. They have no mechanism to evaluate an "adversarial validator", so the zero-on-correctness rule was writing a check the code can't cash.

- Rewrite to declare `scripts/judge/rubric.md` as the source of truth and explicitly defer to it for anchor descriptions.
- Replace the 0-10 / sum-divided-by-six description with the actual 1-5 per-dimension scale and `[6, 30]` sum total, matching what gets written to `docs/hackathon/scores.json` as the `median` field.
- Drop the "adversarial validator zero" claim and the cross-runner / seed-bank description, both of which describe machinery that does not exist in the judging pipeline.
- Document the real flow: CI gates → N independent LLM judges via `judge_pr.py` → per-dim medians + `median_low(per-judge totals)` → consensus narrative → `scores.json` → marketplace UI.

…l-diff fixture

- judge_pr.py: skip temperature kwarg for gpt-5.x models (they only accept default=1)
- fixtures/hackathon-prs-2026-05-26.json: replace stub diffs with real git diffs (688-2002 lines per PR, fetched via git diff origin/main...origin/<head_ref>)
- docs/hackathon/scores.json: 10 PRs scored by 3 gpt-5.5 judges each via OpenAI.

The temperature gating in judge_pr.py omits the kwarg for gpt-5.x models (they reject temperature=0). Test was asserting the now-absent kwarg. Update to assert absence, and add a parallel test for gpt-4o to lock in that the determinism path still works for older models.
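The gating described here amounts to building the request keyword arguments conditionally. A minimal sketch, assuming the gpt-5.x family is detected by a simple name prefix and that JSON mode is requested through `response_format`; the helper name `completion_kwargs` and the prefix check are hypothetical and may not match what `judge_pr.py` does.

```python
def completion_kwargs(model: str, temperature: float = 0.0) -> dict[str, object]:
    """Build chat-completion kwargs, omitting `temperature` for gpt-5.x models."""
    kwargs: dict[str, object] = {
        "model": model,
        "response_format": {"type": "json_object"},  # JSON mode
    }
    # Assumption: the gpt-5.x family is identified by its name prefix.
    if not model.startswith("gpt-5"):
        kwargs["temperature"] = temperature
    return kwargs


gpt5_kwargs = completion_kwargs("gpt-5.5")
gpt4o_kwargs = completion_kwargs("gpt-4o")
print("temperature" in gpt5_kwargs, gpt4o_kwargs["temperature"])
```

Leaving the key out entirely, rather than passing a default value, mirrors what the updated test asserts: absence of the kwarg for gpt-5.5 and an explicit 0.0 for gpt-4o.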

claude added 5 commits May 26, 2026 19:56

sourcery-ai Bot reviewed May 26, 2026


claude and others added 12 commits May 26, 2026 20:45

- `14e59ed` Merge platform/ci-hygiene into integration
- `15d8d1d` Merge platform/open-problems into integration
- `4cde126` Merge platform/judge-panel into integration
- `bf7bf1d` Merge platform/research-harness into integration
- `e5a8beb` Merge platform/marketplace-ui into integration
- `a64bc29` docs(readme): dedupe duplicate Hackathon section post-merge

mariagorskikh force-pushed the platform/integration branch from 53036f7 to a64bc29 Compare May 26, 2026 21:46

mariagorskikh changed the title ~~[Platform] Integration: all 5 platform tracks merged~~ [Platform] Integration v2 May 26, 2026

mariagorskikh merged commit 1771cdb into main May 26, 2026
4 checks passed

mariagorskikh merged 17 commits into main from platform/integration
