Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
31 changes: 31 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -317,6 +317,37 @@ Two things to know that aren't obvious:

---

## Hackathon

NEST runs a month-long hackathon where engineers (and agents) submit
plugins, scenarios, and platform improvements as PRs against a
`hackathon/*` branch. Every submission is scored by an automated judge
panel along six dimensions (correctness, test rigor, API fit, docs,
novelty, persona fidelity), each on a 1-5 scale.

### Scoreboard

> Live scoreboard: [`docs/hackathon/scores.json`](docs/hackathon/scores.json) — machine-readable scores for every open hackathon PR. A marketplace UI on top of this file is coming.

The judge panel lives in [`scripts/judge/`](scripts/judge/):

- [`scripts/judge/rubric.md`](scripts/judge/rubric.md) — the rubric prompt (versioned).
- [`scripts/judge/judge_pr.py`](scripts/judge/judge_pr.py) — score one PR with N parallel judges via the Anthropic API, with prompt caching on the rubric.
- [`scripts/judge/run_all.py`](scripts/judge/run_all.py) — CLI that scores every open `hackathon/*` PR and writes the scoreboard JSON. Idempotent on HEAD SHA.

Re-run the full scoreboard with three live Opus judges per PR:

```bash
export ANTHROPIC_API_KEY=...
uv run python -m scripts.judge.run_all --output docs/hackathon/scores.json
```

Without an API key the CLI falls back to deterministic mock judges so the
schema is exercised even in CI. See the in-repo PR description for the
full cost model and how to reproduce.

---

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for development setup, coding
Expand Down
667 changes: 667 additions & 0 deletions docs/hackathon/scores.json

Large diffs are not rendered by default.

18 changes: 16 additions & 2 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,17 @@ dependencies = [
"nest-plugins-reference",
]

[project.optional-dependencies]
# Dependencies for the hackathon judge panel under scripts/judge/.
# Install with: uv sync --extra judge
# or install a single provider directly: uv pip install "anthropic>=0.30"
# / "openai>=1.0". The judge_pr module imports providers lazily, so a
# missing SDK only fails at the point a live judge is invoked.
judge = [
"anthropic>=0.30",
"openai>=1.0",
]

[project.urls]
Homepage = "https://github.com/mariagorskikh/nest"
Repository = "https://github.com/mariagorskikh/nest"
Expand Down Expand Up @@ -87,5 +98,8 @@ extraPaths = [

[tool.pytest.ini_options]
asyncio_mode = "auto"
testpaths = ["packages"]
addopts = ["--import-mode=importlib"]
testpaths = ["packages", "scripts"]
addopts = ["--import-mode=importlib", "-m", "not live"]
markers = [
"live: end-to-end tests that hit live external services (skipped by default)",
]
18 changes: 18 additions & 0 deletions scripts/conftest.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
# SPDX-License-Identifier: Apache-2.0
"""Pytest configuration for ``scripts/`` tests.

Makes the project root importable so tests can ``from scripts.judge import ...``.

Example::

# No-op for the user; pytest picks this up automatically.
"""

from __future__ import annotations

import sys
from pathlib import Path

_ROOT = Path(__file__).resolve().parent.parent
if str(_ROOT) not in sys.path:
sys.path.insert(0, str(_ROOT))
76 changes: 76 additions & 0 deletions scripts/judge/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,76 @@
<!-- SPDX-License-Identifier: Apache-2.0 -->

# NEST Hackathon Judge Panel

`scripts/judge/` runs an N-judge LLM panel against every open
`hackathon/*` PR and writes a deterministic scoreboard to
`docs/hackathon/scores.json`.

The panel is provider-pluggable. Two live providers are supported today,
plus a deterministic mock for CI smoke runs.

## Providers

| `--provider` | Default model | API-key env var | SDK package |
| ------------ | ------------------ | -------------------- | -------------- |
| `anthropic` | `claude-opus-4-7` | `ANTHROPIC_API_KEY` | `anthropic` |
| `openai` | `gpt-5.5` | `OPENAI_API_KEY` | `openai` |

- `anthropic` is the default. The rubric is sent as a
`cache_control: ephemeral` system block so rubric tokens are billed
once per 5-min window across N judges.
- `openai` uses `openai.AsyncOpenAI` against `chat.completions` with
`response_format={"type": "json_object"}` so the JSON contract is
enforced server-side. OpenAI's caching is implicit per the docs — we
don't try to be clever.
- Both providers share the same rubric, the same six dimensions, the
same JSON output schema, and the same median-low aggregation. The
`scores.json` shape is identical regardless of provider.

If the selected provider's API key is unset, the CLI falls back to a
deterministic `MockJudgeClient` so the scoreboard shape is exercised
end-to-end without spending budget. Use `--mock` to force that path
regardless of env.

## Install

```bash
uv sync --extra judge # pulls in both anthropic and openai SDKs
# or one provider at a time:
uv pip install "anthropic>=0.30"
uv pip install "openai>=1.0"
```

## Usage

```bash
# Default: Anthropic with claude-opus-4-7
ANTHROPIC_API_KEY=sk-ant-... \
uv run python -m scripts.judge.run_all --output docs/hackathon/scores.json

# OpenAI with the default gpt-5.5
OPENAI_API_KEY=sk-... \
uv run python -m scripts.judge.run_all --provider openai

# OpenAI, pinning a specific model
OPENAI_API_KEY=sk-... \
uv run python -m scripts.judge.run_all --provider openai --model gpt-5.5-pro

# Force mock judges (no API keys required)
uv run python -m scripts.judge.run_all --mock

# Subset of PRs
uv run python -m scripts.judge.run_all --pr 2 --pr 3
```

The scoreboard is idempotent: re-running only re-scores PRs whose HEAD
SHA changed since the last write. Pass `--force` to re-score every PR.

## Tests

```bash
uv run pytest scripts/judge/tests/ -v
```

Live API-touching tests live behind `@pytest.mark.live` and are skipped
unless you pass `-m live` and set the relevant API key.
9 changes: 9 additions & 0 deletions scripts/judge/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
# SPDX-License-Identifier: Apache-2.0
"""NEST hackathon judge panel.

Modules:
judge_pr -- score a single PR with N independent judges
run_all -- batch-score every open hackathon PR into a scoreboard JSON
"""

from __future__ import annotations
102 changes: 102 additions & 0 deletions scripts/judge/fixtures/hackathon-prs-2026-05-26.json

Large diffs are not rendered by default.

Loading
Loading