feat(luce-bench): in-tree bench harness + multi-turn agent_recorded + LLM judge #337
easel wants to merge 5 commits
5ae6880 to 421f852
## What

Containerization stack for lucebox-hub. Dockerfile + docker-bake.hcl build the lucebox-hub image (build-env and runtime stages); scripts/build_image.sh drives local builds; server/scripts/entrypoint.sh emits IMAGE_INFO / HOST_INFO sidecars consumed by /props. GitHub Actions add .github/workflows/docker.yml (build & publish), update ci.yml, and add release-luce-bench.yml for tagging. Workspace-root files (pyproject.toml, uv.lock, Makefile, lefthook.yml, .gitignore, README) live here because the Dockerfile uv-syncs the workspace at build time.

## Why

Provides the reproducible image and CI pipeline every other split PR deploys into. Centralizing build/publish here keeps Dockerfile, entrypoint, and workspace-root pinning in one reviewable change.

## Dependencies

- Luce-Org#335 (lucebox-cli): Dockerfile COPYs lucebox/ into the image
- Luce-Org#337 (lucebench-harness): Dockerfile COPYs luce-bench/ into the image
… adapters

## What

New lucebox/ Python package exposing the hub CLI (autotune, sweep, profile, smoke, models, config, download, host-check, docker_run) plus the lucebox.sh launcher wrapper and install.sh. Adds the harness/ adapter package wrapping external coding agents (claude_code, codex, hermes, openclaw, opencode, pi) that autotune sweeps drive. Ships scripts/check_lucebox_wrapper_sandbox.sh and scripts/test_lucebox_sh.sh for wrapper validation, full pytest coverage under lucebox/tests/, and the bragi autotune profile-sweep protocol docs.

## Why

This is the user-facing surface of lucebox-hub: one CLI to launch the image, tune layer-split / pflash settings against a host, run sweeps, and dispatch bench runs. Splitting it out keeps Python-side review independent of the C++ server and Docker stack reviews.

## Dependencies

- Luce-Org#334 (docker-stack): docker_run.py launches the lucebox-hub image
- Luce-Org#337 (lucebench-harness): lucebox bench delegates to luce-bench (workspace dep)
- Luce-Org#336 (server-layer-split): autotune presumes layer-split build artifacts
da4e316 to c20a0f2
16 issues found across 116 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="luce-bench/src/lucebench/fixtures/agent_prompts/codex_apply_patch.md">
<violation number="1" location="luce-bench/src/lucebench/fixtures/agent_prompts/codex_apply_patch.md:325">
P2: Grammar for `Hunk` does not allow multiple `@@` context-scoping lines, contradicting the documented feature and examples. The production `Hunk := "@@" [ header ] NEWLINE { HunkLine }` expects only HunkLines (starting with space, `-`, or `+`) after the `@@` header, but the text explicitly says "use multiple `@@` statements to jump to the right context" and shows consecutive `@@` lines (e.g., `@@ class BaseClass` then `@@ def method():`). This inconsistency will confuse agent implementors or cause strict parsers to reject valid patch syntax.</violation>
</file>
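One way to reconcile the grammar with the documented behavior is to let a hunk open with one or more `@@` lines before any hunk lines. The sketch below is illustrative only: `parse_hunk` and the revised production `Hunk := "@@" [ header ] NEWLINE { "@@" [ header ] NEWLINE } { HunkLine }` are assumptions made for this example, not the parser or grammar shipped in the fixture.

```python
from typing import List, Tuple


def parse_hunk(lines: List[str], start: int = 0) -> Tuple[List[str], List[str], int]:
    """Parse one hunk that may begin with several consecutive "@@" lines.

    Returns (context_headers, hunk_lines, next_index). Hypothetical helper.
    """
    i = start
    headers: List[str] = []
    # Accept one or more "@@" lines so nested scopes can be narrowed step by step.
    while i < len(lines) and lines[i].startswith("@@"):
        headers.append(lines[i][2:].strip())
        i += 1
    if not headers:
        raise ValueError(f"expected '@@' at line index {start}")
    body: List[str] = []
    # A HunkLine starts with a space (context), "-" (removal), or "+" (addition).
    while i < len(lines) and lines[i][:1] in (" ", "-", "+"):
        body.append(lines[i])
        i += 1
    return headers, body, i


patch = ["@@ class BaseClass", "@@ def method():", " context", "-old", "+new"]
headers, body, next_index = parse_hunk(patch)
```

Under this reading, an `@@` line that directly follows another `@@` line narrows the context instead of starting a new hunk, so a hunk with an empty body cannot be followed immediately by a second hunk. Whether that trade-off is acceptable is a decision for the prompt's authors.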
Note: This PR contains a large number of files. cubic only reviews up to 100 files per PR, so some files may not have been reviewed. cubic prioritizes the most important files to review.
c20a0f2 to 6f27131
6f27131 to 27dd7f1
…i-turn agent_recorded + LLM judge
27dd7f1 to ac972b7
ac972b7 to 126694b
Containerization stack for lucebox-hub. Dockerfile + docker-bake.hcl build the lucebox-hub image (build-env and runtime stages); scripts/build_image.sh drives local builds; server/scripts/entrypoint.sh emits IMAGE_INFO / HOST_INFO sidecars consumed by /props. GitHub Actions add .github/workflows/docker.yml (build & publish), update ci.yml, and add release-luce-bench.yml for tagging. Workspace-root files (pyproject.toml, uv.lock, Makefile, lefthook.yml, .gitignore, README) live here because the Dockerfile uv-syncs the workspace at build time.

Provides the reproducible image and CI pipeline every other split PR deploys into. Centralizing build/publish here keeps Dockerfile, entrypoint, and workspace-root pinning in one reviewable change.

- Luce-Org#335 (lucebox-cli): Dockerfile COPYs lucebox/ into the image
- Luce-Org#337 (lucebench-harness): Dockerfile COPYs luce-bench/ into the image
…R + personal refs

Strip forward-references to lucebox-cli (Luce-Org#335) and luce-bench (Luce-Org#337) plus the contributor's personal repos so the docker stack stands alone:

- delete .github/workflows/release-luce-bench.yml (luce-bench PyPI publish; fires only on luce-bench-v* tags, needs a luce-bench/ dir not in this repo)
- Makefile: drop test/smoke/bench/profile targets (invoke lucebench/lucebox modules absent here) and their now-unused vars
- .gitignore: drop luce-bench/snapshots and external baseline-repo URLs
- pyproject.toml / Dockerfile / docker.yml: de-reference Luce-Org#335/Luce-Org#337/luce-bench in comments

No functional change: deps, workspace members, ruff config, and every build instruction are untouched, so CI stays green. The siblings re-add their own scaffolding when they land.

Co-Authored-By: WOZCODE <contact@withwoz.com>
126694b to 4dbe3a9
… adapters

New lucebox/ Python package exposing the hub CLI (autotune, sweep, profile, smoke, models, config, download, host-check, docker_run) plus the lucebox.sh launcher wrapper and install.sh. Adds the harness/ adapter package wrapping external coding agents (claude_code, codex, hermes, openclaw, opencode, pi) that autotune sweeps drive. Ships scripts/check_lucebox_wrapper_sandbox.sh and scripts/test_lucebox_sh.sh for wrapper validation, full pytest coverage under lucebox/tests/, and the bragi autotune profile-sweep protocol docs.

This is the user-facing surface of lucebox-hub: one CLI to launch the image, tune layer-split / pflash settings against a host, run sweeps, and dispatch bench runs. Splitting it out keeps Python-side review independent of the C++ server and Docker stack reviews.

- Luce-Org#334 (docker-stack): docker_run.py launches the lucebox-hub image
- Luce-Org#337 (lucebench-harness): lucebox bench delegates to luce-bench (workspace dep)
- Luce-Org#336 (server-layer-split): autotune presumes layer-split build artifacts
bce0f51 to 5d24221
5d24221 to 9d00651
2 issues found across 6 files (changes from recent commits).
<file name="luce-bench/src/lucebench/areas/agent_recorded.py">
<violation number="1" location="luce-bench/src/lucebench/areas/agent_recorded.py:583">
P2: Unvalidated restart-timeout env parsing can crash the area with ValueError. Invalid `LUCEBENCH_AGENT_RECORDED_RESTART_TIMEOUT` should degrade to a restart failure/default timeout, not terminate execution.</violation>
<violation number="2" location="luce-bench/src/lucebench/areas/agent_recorded.py:915">
P2: Sequential-mode row stores final response in `model_response_cold`, so cold/final response fields are inconsistent. Use the first turn response for `model_response_cold` and keep final output in `model_response_final`.</violation>
</file>
    "cache_eligible_turns": len(eligible_turns),
    "cache_hit_turns": cache_hit_turns,
    "cache_speedup_x": None,
    "model_response_cold": final_content,
P2: Sequential-mode row stores final response in model_response_cold, so cold/final response fields are inconsistent. Use the first turn response for model_response_cold and keep final output in model_response_final.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At luce-bench/src/lucebench/areas/agent_recorded.py, line 915:
<comment>Sequential-mode row stores final response in `model_response_cold`, so cold/final response fields are inconsistent. Use the first turn response for `model_response_cold` and keep final output in `model_response_final`.</comment>
<file context>
@@ -683,14 +753,211 @@ def _run_one_multi_turn_case(
+ "cache_eligible_turns": len(eligible_turns),
+ "cache_hit_turns": cache_hit_turns,
+ "cache_speedup_x": None,
+ "model_response_cold": final_content,
+ "model_response_final": final_content,
+ "judge": judge_result,
</file context>
Suggested change:

    - "model_response_cold": final_content,
    + "model_response_cold": (cold_row.get("content") or "") + (cold_row.get("reasoning_content") or ""),
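The suggestion above can be read as "take the cold text from the first recorded turn, and leave the final text alone". A minimal sketch of that idea follows, assuming each turn is a dict with optional `content` and `reasoning_content` keys as in the suggested line. The helper names `turn_text` and `cold_response_text` and the `turn_rows` list are hypothetical and do not come from `agent_recorded.py`.

```python
from typing import Any, Dict, List


def turn_text(row: Dict[str, Any]) -> str:
    """Concatenate visible and reasoning text, treating missing or None as empty."""
    return (row.get("content") or "") + (row.get("reasoning_content") or "")


def cold_response_text(turn_rows: List[Dict[str, Any]]) -> str:
    """Return the first (cold) turn's text, or "" when no turns were recorded."""
    return turn_text(turn_rows[0]) if turn_rows else ""


# Illustrative data only: two turns, the second one is the final answer.
turn_rows = [
    {"content": "first answer", "reasoning_content": None},
    {"content": "final answer"},
]
final_content = turn_text(turn_rows[-1])
row = {
    "model_response_cold": cold_response_text(turn_rows),
    "model_response_final": final_content,
}
```

With this split the two fields only coincide for single-turn cases, which is what a reader of the row would expect from their names.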
    """
    if shutil.which("systemctl") is None:
        return (False, "systemctl not on PATH")
    timeout_s = float(os.environ.get("LUCEBENCH_AGENT_RECORDED_RESTART_TIMEOUT", "180"))
P2: Unvalidated restart-timeout env parsing can crash the area with ValueError. Invalid LUCEBENCH_AGENT_RECORDED_RESTART_TIMEOUT should degrade to a restart failure/default timeout, not terminate execution.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At luce-bench/src/lucebench/areas/agent_recorded.py, line 583:
<comment>Unvalidated restart-timeout env parsing can crash the area with ValueError. Invalid `LUCEBENCH_AGENT_RECORDED_RESTART_TIMEOUT` should degrade to a restart failure/default timeout, not terminate execution.</comment>
<file context>
@@ -504,15 +580,16 @@ def _restart_lucebox_service(*, log_prefix: str = "[agent_recorded]") -> tuple[b
"""
if shutil.which("systemctl") is None:
return (False, "systemctl not on PATH")
+ timeout_s = float(os.environ.get("LUCEBENCH_AGENT_RECORDED_RESTART_TIMEOUT", "180"))
try:
result = subprocess.run( # noqa: S603 - operator-approved invocation
</file context>
Suggested change:

    - timeout_s = float(os.environ.get("LUCEBENCH_AGENT_RECORDED_RESTART_TIMEOUT", "180"))
    + try:
    +     timeout_s = float(os.environ.get("LUCEBENCH_AGENT_RECORDED_RESTART_TIMEOUT", "180"))
    + except ValueError:
    +     return (False, "invalid LUCEBENCH_AGENT_RECORDED_RESTART_TIMEOUT")
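The comment also allows falling back to the default timeout instead of reporting a restart failure. A sketch of that alternative is below. The helper name `restart_timeout_s` and the extra rejection of non-finite or non-positive values are assumptions added for illustration; only the environment variable name and the 180-second default come from the diff.

```python
import math
import os
from typing import Mapping, Optional

DEFAULT_RESTART_TIMEOUT_S = 180.0  # matches the "180" default in the reviewed line


def restart_timeout_s(env: Optional[Mapping[str, str]] = None) -> float:
    """Read the restart timeout from the environment, never raising on bad input."""
    env = os.environ if env is None else env
    raw = env.get("LUCEBENCH_AGENT_RECORDED_RESTART_TIMEOUT", "")
    try:
        value = float(raw)
    except ValueError:
        # Unset, empty, or non-numeric: use the default instead of crashing the area.
        return DEFAULT_RESTART_TIMEOUT_S
    # float() accepts "nan" and "inf"; neither is a usable subprocess timeout.
    if not math.isfinite(value) or value <= 0:
        return DEFAULT_RESTART_TIMEOUT_S
    return value
```

Returning the default keeps a typo in the variable from disabling restarts, while the suggested change above surfaces the misconfiguration as an explicit failure. Either satisfies the "do not terminate execution" requirement.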
1 issue found across 2 files (changes from recent commits).
<file name="luce-bench/src/lucebench/areas/agent_recorded.py">
<violation number="1" location="luce-bench/src/lucebench/areas/agent_recorded.py:583">
P2: Unvalidated restart-timeout env parsing can crash the area with ValueError. Invalid `LUCEBENCH_AGENT_RECORDED_RESTART_TIMEOUT` should degrade to a restart failure/default timeout, not terminate execution.</violation>
<violation number="2" location="luce-bench/src/lucebench/areas/agent_recorded.py:915">
P2: Sequential-mode row stores final response in `model_response_cold`, so cold/final response fields are inconsistent. Use the first turn response for `model_response_cold` and keep final output in `model_response_final`.</violation>
</file>
194fdb1 to c4d7a7c
…ge grading

Adds the luce-bench/ Python package as a standalone bench harness:

- Core areas: smoke, gsm8k, hellaswag, humaneval, truthfulqa_mc1, longctx, agent, agent_recorded (multi-turn replay), forge, ds4_eval.
- Multi-turn agent_recorded replay with an LLM-judge grader and per-turn metrics; forge_eval fixture imported under fixtures/forge_eval/_forge/.
- Card sampling, snapshot/submit_baseline, model-card schema, thinking-budget client, normalize+regrade pipeline, hostinfo, and CLI entrypoint.
- Fixtures: agent_cases, agent_recorded (single + multi_turn), ds4_eval_cases, agent_prompts (codex variants).
- Tests: ~25 pytest modules covering every area plus thinking control, normalize/regrade, snapshot, runner, and host-info paths.
- scripts/extract-agentic-fixture.py (loaded by test_extract_agentic_fixture.py via path) and the scripts/check_card_bundle_drift.sh CI gate.

Splits the bench harness out of lucebox-hub's main tree so it can ship as its own installable package and be consumed by lucebox bench without pulling in the C++ server build context. Multi-turn replay + LLM-judge grading is what unblocks the coding-agent-loop sweep workflow.

None - this PR is independent.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Behavior-preserving cleanup of the luce-bench harness surfaced by review:

- Split the 1.9k-line cli.py into focused modules: _display.py (pure formatters, format_row decomposed into throughput/timing/tokens), _preflight.py (model resolution + server probing), _orchestrate.py (run/sweep orchestration). cli.py is now the arg-parse + dispatch shell.
- Dedupe: _resolve_auth_header (6 inline sites), _write_area_json / _terse_rows (7 envelope sites), schema._row_passed (3 pass-rule sites), and the two near-identical agent-replay drivers (_maybe_restart / _judge_or_pending).
- Fix datetime.utcnow() deprecation in probe.py (output unchanged).
- Add unit tests: _display, cli helpers, and probe.py (a shipped entry point that had zero coverage). New test pins the cost-containment invariant that agent_recorded never runs under --areas all.
- Wire the suite into CI (it previously never ran pytest) and add the luce-bench uv.lock that was omitted by oversight.

All public import paths preserved. 320 -> 358 tests, ruff clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
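For readers unfamiliar with the `datetime.utcnow()` deprecation (Python 3.12+), the usual output-preserving replacement looks roughly like the sketch below. The function name `utc_stamp` and the exact timestamp format are assumptions made for illustration; the format actually emitted by probe.py is not shown in this PR page.

```python
from datetime import datetime, timezone


def utc_stamp() -> str:
    """Return a UTC timestamp such as 2025-01-01T12:00:00Z (assumed format)."""
    # datetime.now(timezone.utc) is the supported replacement for utcnow().
    # Dropping tzinfo afterwards keeps isoformat() free of a "+00:00" suffix,
    # so the string matches what a naive utcnow() value would have produced.
    now = datetime.now(timezone.utc).replace(tzinfo=None)
    return now.isoformat(timespec="seconds") + "Z"
```

Calling `.replace(tzinfo=None)` is only needed when existing consumers expect the old naive string; new code can keep the aware datetime.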
c4d7a7c to 429420a
## Summary

Adds the standalone `luce-bench/` Python eval harness package (vendored into the repo), plus a multi-turn `agent_recorded` replay area graded by a Claude Sonnet LLM judge. Also wires up a card-bundle drift check script and CI workflow updates.

## Files

- `luce-bench/`: new top-level package (pyproject, src layout, tests, fixtures, docs)
  - `src/lucebench/areas/`: eval areas: smoke, agent, agent_recorded, ds4_eval, forge, gsm8k, hellaswag, humaneval, longctx, truthfulqa_mc1
  - `src/lucebench/grading/llm_judge.py`: Sonnet judge with on-disk cache
  - `src/lucebench/fixtures/`: vendored eval fixtures (incl. forge_eval scenarios)
  - `tests/`: ~20 test modules covering graders, runners, snapshot, cards
- `scripts/check_card_bundle_drift.sh`: new drift CI helper
- `.github/workflows/ci.yml`: wires luce-bench tests + card bundle drift check

## Dependencies

None. This PR is self-contained: pure-Python harness with no source-level references to the server, lucebox CLI, or docker-stack PRs. Independent of #334, #335, #336, #338, #339, #340, #341.

## Test plan

- `cd luce-bench && pytest` passes locally
- `agent_recorded` against a live server with judge mocked
- `--areas all` does not enable the LLM-judge path (cost containment)

Judge cost estimate: ~$0.30-$1.50 per full 48-case pass on Anthropic Sonnet.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
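To make the "judge with on-disk cache" idea concrete, here is a minimal sketch of how such a cache could work. It is not the code in `llm_judge.py`: the function `cached_judge`, its parameters, the JSON-file layout, and the `judge_fn` callable standing in for the real model call are all hypothetical.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Callable, Dict


def cached_judge(
    cache_dir: Path,
    model: str,
    prompt: str,
    response: str,
    judge_fn: Callable[[str, str], Dict[str, Any]],
) -> Dict[str, Any]:
    """Return a judge verdict, calling judge_fn only on a cache miss."""
    # Key on everything that could change the verdict, so a new model or an
    # edited prompt never reuses a stale entry.
    payload = json.dumps([model, prompt, response]).encode("utf-8")
    path = cache_dir / f"{hashlib.sha256(payload).hexdigest()}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    verdict = judge_fn(prompt, response)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(verdict), encoding="utf-8")
    return verdict
```

At the estimate quoted above, one full pass works out to roughly $0.006 to $0.03 per case ($0.30 / 48 to $1.50 / 48), so a cache of this kind mainly saves money on reruns and regrades where prompts and responses are unchanged.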