feat(luce-bench): in-tree bench harness + multi-turn agent_recorded + LLM judge by easel · Pull Request #337 · Luce-Org/lucebox-hub

easel · 2026-06-03T21:31:41Z

Summary

Adds the standalone luce-bench/ Python eval harness package (vendored into the repo), plus a multi-turn agent_recorded replay area graded by a Claude Sonnet LLM judge. Also wires up a card-bundle drift check script and CI workflow updates.

Files

luce-bench/ — new top-level package (pyproject, src layout, tests, fixtures, docs)
- src/lucebench/areas/ — eval areas: smoke, agent, agent_recorded, ds4_eval, forge, gsm8k, hellaswag, humaneval, longctx, truthfulqa_mc1
- src/lucebench/grading/llm_judge.py — Sonnet judge with on-disk cache
- src/lucebench/fixtures/ — vendored eval fixtures (incl. forge_eval scenarios)
- tests/ — ~20 test modules covering graders, runners, snapshot, cards
scripts/check_card_bundle_drift.sh — new drift CI helper
.github/workflows/ci.yml — wires luce-bench tests + card bundle drift check
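
The on-disk cache mentioned for the judge could work roughly as follows. This is a minimal sketch, not the code in llm_judge.py: `JudgeCache`, `cache_dir`, `get_or_judge`, and the `judge_fn` callable are hypothetical names, and the one-JSON-file-per-verdict layout is an assumption.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


class JudgeCache:
    """Hypothetical on-disk cache: one JSON file per (model, prompt, response)."""

    def __init__(self, cache_dir: Path) -> None:
        self.cache_dir = cache_dir
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _path(self, model: str, prompt: str, response: str) -> Path:
        # Hash every input that influences the verdict, so changing any of
        # them yields a cache miss instead of a stale hit.
        payload = json.dumps([model, prompt, response], ensure_ascii=False)
        digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        return self.cache_dir / f"{digest}.json"

    def get_or_judge(
        self,
        model: str,
        prompt: str,
        response: str,
        judge_fn: Callable[[str, str], dict],
    ) -> dict:
        path = self._path(model, prompt, response)
        if path.exists():
            return json.loads(path.read_text(encoding="utf-8"))
        verdict = judge_fn(prompt, response)  # the only step that costs money
        path.write_text(json.dumps(verdict), encoding="utf-8")
        return verdict


calls = []


def fake_judge(prompt: str, response: str) -> dict:
    # Stand-in for a real judge call; records how often it is invoked.
    calls.append(prompt)
    return {"passed": True, "reason": "stub"}


with tempfile.TemporaryDirectory() as tmp:
    cache = JudgeCache(Path(tmp))
    first = cache.get_or_judge("judge-model", "task", "answer", fake_judge)
    second = cache.get_or_judge("judge-model", "task", "answer", fake_judge)
```

With a cache like this, re-grading an unchanged transcript would not trigger a second judge call, which is what keeps repeated passes cheap.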

Dependencies

None. This PR is self-contained: pure-Python harness with no source-level references to the server, lucebox CLI, or docker-stack PRs. Independent of #334, #335, #336, #338, #339, #340, #341.

Test plan

cd luce-bench && pytest passes locally
CI green on the new card-bundle-drift job
Dry-run multi-turn agent_recorded against a live server with judge mocked
Confirm default --areas all does not enable the LLM-judge path (cost containment)
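
The cost-containment item can be illustrated with a small area resolver. This is a sketch under stated assumptions: `FREE_AREAS`, `JUDGED_AREAS`, and `resolve_areas` are hypothetical names, the free list is only a subset of the areas in this PR, and the real CLI may parse `--areas` differently.

```python
# Hypothetical registries. Only agent_recorded is treated as judge-graded here.
FREE_AREAS = ("smoke", "gsm8k", "humaneval")
JUDGED_AREAS = ("agent_recorded",)


def resolve_areas(spec: str) -> list[str]:
    """Expand a comma-separated --areas value; 'all' never adds judged areas."""
    resolved: list[str] = []
    for name in (part.strip() for part in spec.split(",")):
        if not name:
            continue
        if name == "all":
            resolved.extend(FREE_AREAS)  # judged areas stay opt-in
        elif name in FREE_AREAS or name in JUDGED_AREAS:
            resolved.append(name)
        else:
            raise ValueError(f"unknown area: {name}")
    # Drop duplicates while preserving the requested order.
    return list(dict.fromkeys(resolved))
```

Under this scheme the judge path runs only when the area is named explicitly, for example `resolve_areas("all,agent_recorded")`.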

Judge cost estimate: ~$0.30-$1.50 per full 48-case pass on Anthropic Sonnet.
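
Dividing the quoted range by the 48 cases gives roughly $0.006 to $0.031 per judged case. The sketch below only restates that arithmetic; the constant and function names are illustrative, and it assumes every case is a cache miss.

```python
CASES_PER_PASS = 48
PASS_COST_USD = (0.30, 1.50)  # low and high estimates quoted above

per_case_low, per_case_high = (cost / CASES_PER_PASS for cost in PASS_COST_USD)


def budget_for(passes: int) -> tuple[float, float]:
    """Return the (low, high) judge spend in USD for a number of full passes."""
    return (passes * PASS_COST_USD[0], passes * PASS_COST_USD[1])
```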

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

## What

Containerization stack for lucebox-hub. Dockerfile + docker-bake.hcl build the lucebox-hub image (build-env and runtime stages); scripts/build_image.sh drives local builds; server/scripts/entrypoint.sh emits IMAGE_INFO / HOST_INFO sidecars consumed by /props. GitHub Actions add .github/workflows/docker.yml (build & publish), update ci.yml, and add release-luce-bench.yml for tagging. Workspace-root files (pyproject.toml, uv.lock, Makefile, lefthook.yml, .gitignore, README) live here because the Dockerfile uv-syncs the workspace at build time.

## Why

Provides the reproducible image and CI pipeline every other split PR deploys into. Centralizing build/publish here keeps Dockerfile, entrypoint, and workspace-root pinning in one reviewable change.

## Dependencies

- Luce-Org#335 (lucebox-cli): Dockerfile COPYs lucebox/ into the image
- Luce-Org#337 (lucebench-harness): Dockerfile COPYs luce-bench/ into the image

… adapters

## What

New lucebox/ Python package exposing the hub CLI (autotune, sweep, profile, smoke, models, config, download, host-check, docker_run) plus the lucebox.sh launcher wrapper and install.sh. Adds the harness/ adapter package wrapping external coding agents (claude_code, codex, hermes, openclaw, opencode, pi) that autotune sweeps drive. Ships scripts/check_lucebox_wrapper_sandbox.sh and scripts/test_lucebox_sh.sh for wrapper validation, full pytest coverage under lucebox/tests/, and the bragi autotune profile-sweep protocol docs.

## Why

This is the user-facing surface of lucebox-hub: one CLI to launch the image, tune layer-split / pflash settings against a host, run sweeps, and dispatch bench runs. Splitting it out keeps Python-side review independent of the C++ server and Docker stack reviews.

## Dependencies

- Luce-Org#334 (docker-stack): docker_run.py launches the lucebox-hub image
- Luce-Org#337 (lucebench-harness): lucebox bench delegates to luce-bench (workspace dep)
- Luce-Org#336 (server-layer-split): autotune presumes layer-split build artifacts

cubic-dev-ai

16 issues found across 116 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="luce-bench/src/lucebench/fixtures/agent_prompts/codex_apply_patch.md">

<violation number="1" location="luce-bench/src/lucebench/fixtures/agent_prompts/codex_apply_patch.md:325">
P2: Grammar for `Hunk` does not allow multiple `@@` context-scoping lines, contradicting the documented feature and examples. The production `Hunk := "@@" [ header ] NEWLINE { HunkLine }` expects only HunkLines (starting with space, `-`, or `+`) after the `@@` header, but the text explicitly says "use multiple `@@` statements to jump to the right context" and shows consecutive `@@` lines (e.g., `@@ class BaseClass` then `@@ def method():`). This inconsistency will confuse agent implementors or cause strict parsers to reject valid patch syntax.</violation>
</file>
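
One way to reconcile the grammar with the documented examples is to let a hunk open with one or more `@@` lines, i.e. `Hunk := "@@" [ header ] NEWLINE { "@@" [ header ] NEWLINE } { HunkLine }`. The parser below sketches that reading only. It is not the parser any agent actually ships, and the `Hunk` dataclass and `parse_hunks` function are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Hunk:
    contexts: list[str] = field(default_factory=list)  # one entry per "@@" line
    lines: list[str] = field(default_factory=list)     # " ", "-" or "+" prefixed


def parse_hunks(text: str) -> list[Hunk]:
    """Parse hunk text, allowing consecutive "@@" lines to narrow the context."""
    hunks: list[Hunk] = []
    current: Optional[Hunk] = None
    for raw in text.splitlines():
        if raw.startswith("@@"):
            # Consecutive "@@" lines extend the same hunk's context chain;
            # an "@@" that follows body lines starts a new hunk.
            if current is None or current.lines:
                current = Hunk()
                hunks.append(current)
            current.contexts.append(raw[2:].strip())
        elif raw[:1] in (" ", "-", "+"):
            if current is None:
                raise ValueError(f"hunk line before any '@@' header: {raw!r}")
            current.lines.append(raw)
        else:
            raise ValueError(f"unexpected line in hunk: {raw!r}")
    return hunks


sample = (
    "@@ class BaseClass\n"
    "@@ def method():\n"
    "-        return 1\n"
    "+        return 2\n"
)
hunks = parse_hunks(sample)
```

Here the two `@@` lines from the review example end up as a two-element context chain on a single hunk rather than being rejected.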

Note: This PR contains a large number of files. cubic only reviews up to 100 files per PR, so some files may not have been reviewed. cubic prioritizes the most important files to review.

…i-turn agent_recorded + LLM judge

…R + personal refs

Strip forward-references to lucebox-cli (Luce-Org#335) and luce-bench (Luce-Org#337) plus the contributor's personal repos so the docker stack stands alone:

- delete .github/workflows/release-luce-bench.yml (luce-bench PyPI publish; fires only on luce-bench-v* tags, needs a luce-bench/ dir not in this repo)
- Makefile: drop test/smoke/bench/profile targets (invoke lucebench/lucebox modules absent here) and their now-unused vars
- .gitignore: drop luce-bench/snapshots and external baseline-repo URLs
- pyproject.toml / Dockerfile / docker.yml: de-reference Luce-Org#335/Luce-Org#337/luce-bench in comments

No functional change: deps, workspace members, ruff config, and every build instruction are untouched, so CI stays green. The siblings re-add their own scaffolding when they land.

Co-Authored-By: WOZCODE <contact@withwoz.com>

cubic-dev-ai

2 issues found across 6 files (changes from recent commits).

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="luce-bench/src/lucebench/areas/agent_recorded.py">

<violation number="1" location="luce-bench/src/lucebench/areas/agent_recorded.py:583">
P2: Unvalidated restart-timeout env parsing can crash the area with ValueError. Invalid `LUCEBENCH_AGENT_RECORDED_RESTART_TIMEOUT` should degrade to a restart failure/default timeout, not terminate execution.</violation>

<violation number="2" location="luce-bench/src/lucebench/areas/agent_recorded.py:915">
P2: Sequential-mode row stores final response in `model_response_cold`, so cold/final response fields are inconsistent. Use the first turn response for `model_response_cold` and keep final output in `model_response_final`.</violation>
</file>


cubic-dev-ai · 2026-06-15T01:13:46Z

+        "cache_eligible_turns": len(eligible_turns),
+        "cache_hit_turns": cache_hit_turns,
+        "cache_speedup_x": None,
+        "model_response_cold": final_content,


P2: Sequential-mode row stores final response in model_response_cold, so cold/final response fields are inconsistent. Use the first turn response for model_response_cold and keep final output in model_response_final.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At luce-bench/src/lucebench/areas/agent_recorded.py, line 915:

<comment>Sequential-mode row stores final response in `model_response_cold`, so cold/final response fields are inconsistent. Use the first turn response for `model_response_cold` and keep final output in `model_response_final`.</comment>

<file context>
@@ -683,14 +753,211 @@ def _run_one_multi_turn_case(
+        "cache_eligible_turns": len(eligible_turns),
+        "cache_hit_turns": cache_hit_turns,
+        "cache_speedup_x": None,
+        "model_response_cold": final_content,
+        "model_response_final": final_content,
+        "judge": judge_result,
</file context>

Suggested change

-        "model_response_cold": final_content,
+        "model_response_cold": (cold_row.get("content") or "") + (cold_row.get("reasoning_content") or ""),
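
The intent of the suggestion, taking the cold field from the first turn and the final field from the last, can be sketched in isolation. The helper names `response_text` and `build_sequential_row` are hypothetical, and the assumption that each per-turn row carries `content` and `reasoning_content` keys is taken from the suggested change rather than from the full source.

```python
def response_text(row: dict) -> str:
    """Concatenate content and reasoning_content, treating missing/None as empty."""
    return (row.get("content") or "") + (row.get("reasoning_content") or "")


def build_sequential_row(turn_rows: list[dict]) -> dict:
    """Build the response fields of a result row from ordered per-turn rows."""
    if not turn_rows:
        raise ValueError("need at least one turn")
    cold_row, final_row = turn_rows[0], turn_rows[-1]
    return {
        "model_response_cold": response_text(cold_row),    # first turn
        "model_response_final": response_text(final_row),  # last turn
    }


turns = [
    {"content": "first answer", "reasoning_content": None},
    {"content": "last answer"},
]
row = build_sequential_row(turns)
```

For a single-turn session both fields coincide, which matches what the original code produced in every case.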

cubic-dev-ai · 2026-06-15T01:13:47Z

+    """
+    if shutil.which("systemctl") is None:
+        return (False, "systemctl not on PATH")
+    timeout_s = float(os.environ.get("LUCEBENCH_AGENT_RECORDED_RESTART_TIMEOUT", "180"))


P2: Unvalidated restart-timeout env parsing can crash the area with ValueError. Invalid LUCEBENCH_AGENT_RECORDED_RESTART_TIMEOUT should degrade to a restart failure/default timeout, not terminate execution.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At luce-bench/src/lucebench/areas/agent_recorded.py, line 583:

<comment>Unvalidated restart-timeout env parsing can crash the area with ValueError. Invalid `LUCEBENCH_AGENT_RECORDED_RESTART_TIMEOUT` should degrade to a restart failure/default timeout, not terminate execution.</comment>

<file context>
@@ -504,15 +580,16 @@ def _restart_lucebox_service(*, log_prefix: str = "[agent_recorded]") -> tuple[b
     """
     if shutil.which("systemctl") is None:
         return (False, "systemctl not on PATH")
+    timeout_s = float(os.environ.get("LUCEBENCH_AGENT_RECORDED_RESTART_TIMEOUT", "180"))
     try:
         result = subprocess.run(  # noqa: S603 - operator-approved invocation
</file context>

Suggested change

-    timeout_s = float(os.environ.get("LUCEBENCH_AGENT_RECORDED_RESTART_TIMEOUT", "180"))
+    try:
+        timeout_s = float(os.environ.get("LUCEBENCH_AGENT_RECORDED_RESTART_TIMEOUT", "180"))
+    except ValueError:
+        return (False, "invalid LUCEBENCH_AGENT_RECORDED_RESTART_TIMEOUT")
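
The suggestion above treats a bad value as a restart failure. The comment also allows degrading to the default timeout, which could look like the sketch below. `restart_timeout_s` and `DEFAULT_TIMEOUT_S` are hypothetical names, and rejecting non-finite or non-positive values is an extra assumption beyond what the review asks for.

```python
import math
import os
from typing import Mapping

ENV_VAR = "LUCEBENCH_AGENT_RECORDED_RESTART_TIMEOUT"
DEFAULT_TIMEOUT_S = 180.0


def restart_timeout_s(environ: Mapping[str, str] = os.environ) -> float:
    """Read the restart timeout, falling back to the default on bad input."""
    raw = environ.get(ENV_VAR, "")
    try:
        value = float(raw)
    except ValueError:
        # Unset, empty, or non-numeric: degrade instead of crashing the area.
        return DEFAULT_TIMEOUT_S
    # float() also accepts "nan", "inf", and negatives; none is a usable timeout.
    if not math.isfinite(value) or value <= 0:
        return DEFAULT_TIMEOUT_S
    return value
```

Passing the environment as a parameter keeps the helper testable without mutating `os.environ`.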

cubic-dev-ai

1 issue found across 2 files (changes from recent commits).

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="luce-bench/src/lucebench/areas/agent_recorded.py">

<violation number="1" location="luce-bench/src/lucebench/areas/agent_recorded.py:583">
P2: Unvalidated restart-timeout env parsing can crash the area with ValueError. Invalid `LUCEBENCH_AGENT_RECORDED_RESTART_TIMEOUT` should degrade to a restart failure/default timeout, not terminate execution.</violation>

<violation number="2" location="luce-bench/src/lucebench/areas/agent_recorded.py:915">
P2: Sequential-mode row stores final response in `model_response_cold`, so cold/final response fields are inconsistent. Use the first turn response for `model_response_cold` and keep final output in `model_response_final`.</violation>
</file>


…ge grading

Adds the luce-bench/ Python package as a standalone bench harness:

- Core areas: smoke, gsm8k, hellaswag, humaneval, truthfulqa_mc1, longctx, agent, agent_recorded (multi-turn replay), forge, ds4_eval.
- Multi-turn agent_recorded replay with an LLM-judge grader and per-turn metrics; forge_eval fixture imported under fixtures/forge_eval/_forge/.
- Card sampling, snapshot/submit_baseline, model-card schema, thinking-budget client, normalize+regrade pipeline, hostinfo, and CLI entrypoint.
- Fixtures: agent_cases, agent_recorded (single + multi_turn), ds4_eval_cases, agent_prompts (codex variants).
- Tests: ~25 pytest modules covering every area plus thinking control, normalize/regrade, snapshot, runner, and host-info paths.
- scripts/extract-agentic-fixture.py (loaded by test_extract_agentic_fixture.py via path) and the scripts/check_card_bundle_drift.sh CI gate.

Splits the bench harness out of lucebox-hub's main tree so it can ship as its own installable package and be consumed by lucebox bench without pulling in the C++ server build context. Multi-turn replay + LLM-judge grading is what unblocks the coding-agent-loop sweep workflow.

None - this PR is independent.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Behavior-preserving cleanup of the luce-bench harness surfaced by review:

- Split the 1.9k-line cli.py into focused modules: _display.py (pure formatters, format_row decomposed into throughput/timing/tokens), _preflight.py (model resolution + server probing), _orchestrate.py (run/sweep orchestration). cli.py is now the arg-parse + dispatch shell.
- Dedupe: _resolve_auth_header (6 inline sites), _write_area_json / _terse_rows (7 envelope sites), schema._row_passed (3 pass-rule sites), and the two near-identical agent-replay drivers (_maybe_restart / _judge_or_pending).
- Fix datetime.utcnow() deprecation in probe.py (output unchanged).
- Add unit tests: _display, cli helpers, and probe.py (a shipped entry point that had zero coverage). New test pins the cost-containment invariant that agent_recorded never runs under --areas all.
- Wire the suite into CI (it previously never ran pytest) and add the luce-bench uv.lock that was omitted by oversight.

All public import paths preserved. 320 -> 358 tests, ruff clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
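
The `datetime.utcnow()` fix mentioned in this commit typically takes the shape below. This is a sketch, not the code in probe.py: `utc_timestamp` is a hypothetical name and the timestamp format is an assumption. Formatting explicitly is one way to keep the serialized output unchanged when moving from a naive to a timezone-aware datetime.

```python
from datetime import datetime, timezone


def utc_timestamp() -> str:
    """Return the current UTC time as a fixed-format string."""
    # datetime.utcnow() returns a naive datetime and is deprecated since
    # Python 3.12; datetime.now(timezone.utc) returns an aware one.
    now = datetime.now(timezone.utc)
    # An explicit format keeps the string independent of whether the
    # datetime object is naive or aware (isoformat() would add "+00:00").
    return now.strftime("%Y-%m-%dT%H:%M:%SZ")


stamp = utc_timestamp()
```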

easel mentioned this pull request Jun 3, 2026

feat(luce-bench): multi-turn agent_recorded redesign + LLM-judge grading #333

Closed

easel changed the title ~~feat(luce-bench): standalone bench harness package + forge eval area~~ feat(luce-bench): in-tree bench harness + multi-turn agent_recorded + LLM judge Jun 3, 2026

easel force-pushed the feat/lucebench-harness branch from 5ae6880 to 421f852 Compare June 4, 2026 02:50

This was referenced Jun 4, 2026

build(docker): lucebox-hub container image + CI release pipeline #334

Merged

feat(lucebox): hub CLI + autotune/sweep/profile + harness adapters + shell wrapper #335

Open

easel force-pushed the feat/lucebench-harness branch from da4e316 to c20a0f2 Compare June 4, 2026 05:03

easel marked this pull request as ready for review June 4, 2026 05:03

cubic-dev-ai Bot reviewed Jun 4, 2026

easel mentioned this pull request Jun 4, 2026

feat(lucebox): docker stack + CLI + bench/profile + harness + luce-bench in-tree #285

Closed

easel force-pushed the feat/lucebench-harness branch from c20a0f2 to 6f27131 Compare June 4, 2026 17:14

easel force-pushed the feat/lucebench-harness branch from 6f27131 to 27dd7f1 Compare June 4, 2026 23:23

easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 5, 2026

Merge PR Luce-Org#337: feat(luce-bench): in-tree bench harness + mult…

518b629

…i-turn agent_recorded + LLM judge

easel force-pushed the feat/lucebench-harness branch from 27dd7f1 to ac972b7 Compare June 5, 2026 20:01

easel force-pushed the feat/lucebench-harness branch from ac972b7 to 126694b Compare June 8, 2026 22:08

davide221 mentioned this pull request Jun 9, 2026

Running benchmarks #330

Open

easel force-pushed the feat/lucebench-harness branch from 126694b to 4dbe3a9 Compare June 11, 2026 01:54

easel force-pushed the feat/lucebench-harness branch from bce0f51 to 5d24221 Compare June 11, 2026 15:15

easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 12, 2026

docs: refresh auto-integration manifest after PR Luce-Org#337 merge

0a8580d

easel force-pushed the feat/lucebench-harness branch from 5d24221 to 9d00651 Compare June 13, 2026 04:19

cubic-dev-ai Bot reviewed Jun 15, 2026

Comment thread luce-bench/src/lucebench/areas/agent_recorded.py

easel force-pushed the feat/lucebench-harness branch from 194fdb1 to c4d7a7c Compare June 15, 2026 03:04

easel and others added 5 commits June 16, 2026 11:18

fix(luce-bench): use collections.abc.Iterable to satisfy ruff UP035

6c2ed03

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

feat(luce-bench): replay agent sessions sequentially

bac48b2

fix(luce-bench): summarize exact-repeat cache hits

d1325de

easel force-pushed the feat/lucebench-harness branch from c4d7a7c to 429420a Compare June 16, 2026 17:22

easel wants to merge 5 commits into Luce-Org:main from easel:feat/lucebench-harness
