Skip to content

[Platform] Integration v2#17

Merged
mariagorskikh merged 17 commits into
mainfrom
platform/integration
May 26, 2026
Merged

[Platform] Integration v2#17
mariagorskikh merged 17 commits into
mainfrom
platform/integration

Conversation

@mariagorskikh

@mariagorskikh mariagorskikh commented May 26, 2026

Copy link
Copy Markdown
Collaborator

Purpose

Rebuild post-schema-recon + post-live-judge. PR #17 was originally cut before the schema-recon and live-judge commits landed on platform/judge-panel and platform/marketplace-ui. This is a clean rebuild on top of the latest HEAD of each source branch so reviewers can see the integrated state of all five platform tracks as it would look on main.

Integration is built from origin/main and merges in PRs #12, #13, #14, #15, #16 via --no-ff merge commits (no rebase, no history rewrite), with one optional README-dedupe polish commit on top.

Source branch HEADs (rebuild input)

Branch HEAD Latest commit
origin/platform/ci-hygiene 0de32f2 Add CI hygiene tooling: Makefile, pre-commit, feedback bot, Definition of Done
origin/platform/open-problems 228a697 docs(judging): align with canonical rubric and judge panel behavior
origin/platform/judge-panel d1150b9 test(judge): update gpt-5.5 temperature assertion + add gpt-4o coverage (includes schema-recon + OpenAI provider + live judge scores + temperature fix)
origin/platform/research-harness a530450 Add research harness for A/B hackathon experiments
origin/platform/marketplace-ui 02d49f9 fix(marketplace): consume judge panel's real scores.json shape (schema-recon adapter rewrite)

New integration HEAD: a64bc29 (on platform/integration, force-pushed).

Merge log

Merged in this order, each as a --no-ff merge commit:

  1. origin/platform/ci-hygiene (PR [Platform] CI hygiene: Makefile, pre-commit, feedback bot, Definition of Done #12) — clean auto-merge.
  2. origin/platform/open-problems (PR [Platform] Open problems + charter + judging doc #13) — clean auto-merge (README appended).
  3. origin/platform/judge-panel (PR [Platform] Judge panel + scoreboard for hackathon PRs #14) — clean auto-merge (now includes the live-judge scoreboard + OpenAI provider + temperature gating + gpt-4o test coverage).
  4. origin/platform/research-harness (PR [Platform] Research harness for A/B hackathon experiments #15) — 1 conflict in pyproject.toml, resolved by union.
  5. origin/platform/marketplace-ui (PR [Platform] Hackathon marketplace UI: /hackathon section with submissions, layers, authors #16) — 1 conflict in pyproject.toml, resolved by union (now consumes the real scores.json shape post schema-recon).

Plus one post-merge polish commit (a64bc29) to dedupe a duplicate ## Hackathon header in README.md.

Conflicts hit and how each was resolved

pyproject.toml (merging platform/research-harness) — UNION

[tool.pytest.ini_options] and [project.optional-dependencies] conflicted. Both sides additive, resolved by union per the integration brief:

  • testpaths = ["packages", "scripts"]scripts is a superset of scripts/harness/dry_run, collects both judge and harness tests.
  • addopts = ["--import-mode=importlib", "-m", "not live"] — kept judge-panel's not live filter so CI doesn't hit the network.
  • pythonpath = ["."] — from research-harness.
  • markers = ["live: ..."] — from judge-panel.
  • [project.optional-dependencies] now contains both judge (anthropic + openai) and harness (matplotlib + PyYAML) extras.

pyproject.toml (merging platform/marketplace-ui) — UNION

[tool.pyright].extraPaths conflicted. HEAD had "." (from research-harness), incoming had "packages/nest-marketplace". Both kept:

"packages/nest-plugins-reference",
"packages/nest-marketplace",
".",

README.md duplicate ## Hackathon header — UNION + dedupe (post-merge polish)

Both platform/open-problems and platform/judge-panel appended ## Hackathon sections to README.md in different regions; git auto-merged both without conflict, leaving two identically-titled H2s. Per the brief's README rule (deduplicate literal overlap), the second section was renamed to ## Scoreboard, the overview was lightly augmented to mention the scoreboard, and the TOC was updated to match.

This is the only editorial change on the integration branch; everything else is a true merge commit.

Total conflicts resolved: 2 file-level conflicts (both in pyproject.toml), plus 1 README dedupe in the polish commit.

CI evidence

make ci-local is green on the head of this rebuilt branch:

=========== 341 passed, 1 skipped, 1 deselected, 1 warning in 10.72s ===========
ci-local: all 5 checks passed. Safe to push.

(1 skipped = test_analyze_produces_plots, gated on the optional harness matplotlib extra; 1 deselected = the live marker filter.)

Test count rose from 328 (previous integration) to 341 because platform/judge-panel picked up new gpt-5.5 temperature, gpt-4o, and OpenAI-provider tests as part of the live-judge work.

Merged scoreboard — top entries from docs/hackathon/scores.json

Generated by scripts/judge/run_all.py with 3 live judges per PR (model: gpt-5.5, rubric v1):

Rank PR Score Layer Persona Title
1 #2 26.0/30 trust harvard-phd EigenTrust plugin with checkable invariants
1 #7 26.0/30 payments coinbase-crypto htlc_escrow payments plugin (hash- & time-locked)
3 #6 25.0/30 trust stanford-ml-phd EigenTrust plugin for the trust layer

(Tie at the top: PRs #2 and #7 both score 26.0; next is PR #11 google-staff at 25.0 on transport.)

Recommended action

Merge PRs #12#16 individually instead — this branch is for review only.

Rationale:

  • This branch faithfully reproduces what main will look like after all five PRs land, so reviewers can sanity-check the integrated surface (README ordering, pyproject.toml union, no cross-PR breakage, live scoreboard rendered correctly by the marketplace adapter) in one place.
  • Each of the five tracks already has its own focused PR with its own review history and CI run. Merging this integration branch directly would land all five under a single squash/merge commit and lose that per-track history.
  • The pytest config and pyright extraPaths unions resolved here are mechanical and can be re-resolved trivially when whichever of [Platform] Research harness for A/B hackathon experiments #15 / [Platform] Hackathon marketplace UI: /hackathon section with submissions, layers, authors #16 lands second.

Close this PR after the five source PRs have all merged to main.

Branches included

https://claude.ai/code/session_01C5j2D4MgCkPgsjSCqBVpWW


Generated by Claude Code

claude added 5 commits May 26, 2026 19:56
…n of Done

Closes the gap where contributors run "ruff check" + pytest, call that "the
tests", and ship PRs that fail CI on ruff format --check and/or pyright.

- Makefile: `ci-local` runs the exact CI sequence (uv sync, ruff check,
  ruff format --check, pyright, pytest -v) and hard-fails on the first red
  command. `hooks` installs pre-commit. `help` is the default goal.
- .pre-commit-config.yaml: ruff-check + ruff-format (auto-fix locally) and
  pyright in strict mode (versions pinned to what `uv sync` resolves).
- CONTRIBUTING.md: Definition of Done section at the very top with the five
  required pre-push commands and a one-line rationale for each.
- README.md: "Before you push" callout pointing at `make ci-local` and the
  Definition of Done.
- .github/workflows/ci-feedback.yml: triggers on the existing CI workflow's
  failure, downloads logs, extracts per-check excerpts (ruff format diff,
  pyright errors, pytest summary; ~40 lines each), and posts/edits a single
  PR comment keyed off the marker `<!-- ci-feedback-bot -->`. Permissions
  scoped to pull-requests:write, actions:read, contents:read.

Verified locally: `make ci-local` exits 0 on this branch (5/5 green).
Ship a participant-facing brief for the month-long NEST hackathon:
a charter, 10 differentiated open problems spanning 10 of the 12
layers, and a six-dimension judging rubric. The 10 problems are
chosen to avoid the obvious-pick collapse seen in the first round
(3x EigenTrust, 4x latency transport): no plain EigenTrust problem,
no plain in_memory-transport-latency problem, and every problem
cites the specific reference file lines that prove the gap is real.
Builds the scoring system for the month-long NEST hackathon:

- scripts/judge/rubric.md: versioned rubric (v1), six 1-5 dimensions
  (correctness, test_rigor, api_fit, docs_quality, novelty,
  persona_fidelity) with anchored 1/3/5 examples.
- scripts/judge/judge_pr.py: async judge_pr(pr_number, n_judges=3,
  model="claude-opus-4-7"). N parallel judges via anthropic.AsyncAnthropic
  with the rubric block marked cache_control: ephemeral; temperature 0.0;
  GitHub PR fetched via stdlib urllib (optional GITHUB_TOKEN). Aggregator
  returns median per dimension (statistics.median_low tie-break), total
  median, and a deterministic 3-sentence consensus. Per-file diffs above
  5000 lines are truncated with a marker.
- scripts/judge/run_all.py: CLI that scores every open hackathon/* PR
  and writes docs/hackathon/scores.json. Idempotent on HEAD SHA. Falls
  back to a deterministic MockJudgeClient (seeded by head_sha + judge_id)
  when ANTHROPIC_API_KEY is unset or --mock is passed. --prs-cache reads
  a pre-fetched PR list for offline smoke runs.
- docs/hackathon/scores.json: bootstrap scoreboard for PRs #2-#11
  generated with mock judges (mock: true). Re-run with a live key to
  replace.
- scripts/judge/tests/test_judge.py: 24 unit tests covering median
  aggregation (incl. tie-breaking + missing-judge handling), JSON schema
  round-trip, parse_verdict fault tolerance, diff truncation, persona
  inference, and a fake JudgeClient driving judge_pr end-to-end. Live
  API tests behind @pytest.mark.live, skipped by default.
- pyproject.toml: register scripts/ under pytest testpaths and the
  "live" marker.
- README.md: Hackathon section linking to docs/hackathon/scores.json.

https://claude.ai/code/session_01C5j2D4MgCkPgsjSCqBVpWW
Ships scripts/harness/ — a multi-condition, reproducible experiment harness
for turning the hackathon runs into a publishable AI benchmark.

Includes:
- conditions.yaml schema + cartesian-product cell expansion with stable hashed cell_ids
- run_condition.py CLI that spawns N agents per cell into isolated workdirs
  (worktree, clone, or ephemeral) and writes one JSONL row per submission
- agent_runner.py with a fixture transport (dry-run, no cost) and a claude CLI
  transport (live, runnable from a plain shell outside Claude Code)
- collect.py aggregator that refuses to merge mismatched schema versions
- analyze.py producing diversity-collapse, calibration, and iteration-efficiency
  PNG plots; matplotlib is gated behind an optional `harness` extra
- briefs for vague / layer-enumerated / open-problems specificity levels
- dry_run fixtures + pytest gate that exercises the full pipeline by default
- SCHEMA.md documenting the versioned JSONL row layout
- New Next.js routes under /hackathon: landing (stats + featured),
  /layers (12-card grid), /layers/[layer] (sortable list), and
  /submissions/[id] (full detail with judge breakdown).
- Server components read a static dataset built by nest-marketplace,
  a new workspace package that loads docs/hackathon/scores.json,
  ingests the open hackathon/* PRs, tags agent vs human authors by
  branch slug, and writes apps/nest-dashboard/public/hackathon-data.json.
- Adapter is pure-Python and fully typed (pyright strict). 34 new
  pytest tests cover the missing-scores fallback, agent-handle
  classification, layer routing, and route-file smoke checks.

@sourcery-ai sourcery-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @mariagorskikh, your pull request is larger than the review limit of 150000 diff characters

claude and others added 12 commits May 26, 2026 20:45
The judge panel now supports an OpenAIProvider alongside the existing
AnthropicJudgeClient (aliased as AnthropicProvider). Selection is via
the new `--provider {anthropic,openai}` flag on run_all.py and a
`make_provider()` factory inside judge_pr.py. The Anthropic path is
unchanged when --provider is left at its default, so existing scores
are bit-for-bit identical. OpenAI calls chat.completions with JSON mode,
temperature 0.0, default model gpt-5.5, and an OPENAI_API_KEY env var.

The rubric, JSON output schema, scores.json shape, and median-low
aggregation are shared across both providers.
…dian

`_build_consensus` was reporting `sum(per-dim medians)` as the score, while
`total_median` in the JSON used `median_low(per-judge totals)`. In any
non-degenerate case these diverge — e.g., PR #2 was showing
"scored 20.0/30" in prose while `median: 21.0` in JSON, confusing
downstream consumers.

- Plumb `total_median` (the value that actually gets written to JSON)
  through to `_build_consensus` so the prose uses the same aggregation.
- Add a regression test
  (`test_consensus_uses_total_median_not_sum_of_medians`) that constructs
  three judges with divergent score patterns where the old buggy
  computation (20) and the new correct computation (21) differ; it
  asserts the consensus string contains `f"{total_median:.1f}/30"`.
- Re-bootstrap `docs/hackathon/scores.json` with the fixed code; all 10
  entries now have the median field and consensus prose in agreement.
The adapter was reading an invented `{<pr>: {correctness, realism, design,
docs, total, notes}}` shape with totals on a 0-10 scale. The judge panel
(PR #14) actually writes the PR #14 scoreboard shape: a top-level
`{version, generated_at, mock, submissions: [{pr, scores: {6 dims},
median, consensus, ...}]}` with six dimensions (correctness, test_rigor,
api_fit, docs_quality, novelty, persona_fidelity) each on a 1-5 scale and
`median` in [6, 30]. The mismatch made every submission render as
"unscored" in the marketplace UI.

- Rewrite `load_scores` to walk `submissions[]`, project each entry into
  `{pr_number: JudgeScore}`, copy the canonical `median` field into
  `JudgeScore.total`, and stash `consensus` on `notes` for the detail
  view to quote. The old flat shape now degrades to `{}` instead of
  smuggling stale numbers.
- Update `JudgeScore` (Python + TS) to carry all six real dimensions on
  the 1-5 scale; total stays in [6, 30].
- Update the submission detail page `ScoreBar` to render `N/5` per
  dimension, and the headline total as `X/30`. Score badge tooltip and
  hackathon-card titles updated to `/30`.
- Update the marketplace tests for the new shape + a regression test
  that the old flat shape is rejected.
- Rebuild `apps/nest-dashboard/public/hackathon-data.json` against the
  fixed `docs/hackathon/scores.json` from `platform/judge-panel`; the
  trust layer (3 submissions) now reports a non-null `top_score` (23.0).
The participant-facing judging doc described a 0-10 per-dimension scale
with `final = sum/6`, and an "adversarial validator fails → zero on
correctness" hard floor. Neither matches what the judge panel
(`scripts/judge/judge_pr.py` + `scripts/judge/rubric.md`) actually
does. The rubric scores six dimensions each on a 1-5 integer scale; the
headline total is the sum (in [6, 30]). The judges read the PR body,
diff, and checks summary — they have no mechanism to evaluate an
"adversarial validator", so the zero-on-correctness rule was writing a
check the code can't cash.

- Rewrite to declare `scripts/judge/rubric.md` as the source of truth
  and explicitly defer to it for anchor descriptions.
- Replace the 0-10 / sum-divided-by-six description with the actual
  1-5 per-dimension scale and `[6, 30]` sum total, matching what gets
  written to `docs/hackathon/scores.json` as the `median` field.
- Drop the "adversarial validator zero" claim and the cross-runner /
  seed-bank description, both of which describe machinery that does
  not exist in the judging pipeline.
- Document the real flow: CI gates → N independent LLM judges via
  `judge_pr.py` → per-dim medians + `median_low(per-judge totals)` →
  consensus narrative → `scores.json` → marketplace UI.
…l-diff fixture

- judge_pr.py: skip temperature kwarg for gpt-5.x models (they only accept default=1)
- fixtures/hackathon-prs-2026-05-26.json: replace stub diffs with real git diffs
  (688-2002 lines per PR, fetched via git diff origin/main...origin/<head_ref>)
- docs/hackathon/scores.json: 10 PRs scored by 3 gpt-5.5 judges each via OpenAI.
The temperature gating in judge_pr.py omits the kwarg for gpt-5.x
models (they reject temperature=0). Test was asserting the now-absent
kwarg. Update to assert absence, and add a parallel test for gpt-4o
to lock in that the determinism path still works for older models.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants