fix(forge): synthesize tool_use from call:<verb>{} plain-text emissions #320
Closed
easel wants to merge 34 commits into
…nch in-tree
Squashes 78 commits from feat/lucebox-docker (PR Luce-Org#285) onto origin/main. Net: 189 files changed.
Major workstreams folded in:
* Docker prebuild stack: ghcr.io/easel/lucebox-hub:cuda12 image, multi-stage Dockerfile, docker-bake.hcl, .github/workflows/docker.yml with GHA cache, build identity baked into /opt/lucebox-hub/IMAGE_INFO + /opt/lucebox-hub/HOST_INFO.
* Host wrapper (lucebox.sh): probe_host, smart cmd_serve (INVOCATION_ID guard, container-state preflight), cmd_systemctl_passthrough (already-active short-circuit, restart-loop detection), cmd_update (bootstrap-installer pattern), cmd_completion (bash/zsh/fish), config.toml reader (env > toml > default precedence), shellcheck-clean.
* Bootstrap installer (install.sh): bakes LUCEBOX_INSTALLED_FROM into the installed copy so lucebox update keeps tracking the channel; refuses SHA-pinned URLs without LUCEBOX_INSTALL_CHANNEL.
* In-container Python CLI (lucebox/): sparse config.toml persistence, config get/set/unset sub-app, models list/download sub-app (replaces download-models), autotune with --apply / --json / --sweep, profile collapsed onto luce-bench snapshot (1701 → 183 lines).
* luce-bench: snapshot subcommand + canonical HostInfo schema v2 + levels (level0/1/2/3) + report subcommand + submit-baseline + regrade.
* Server (C++): /props.host block + props_schema=4 + host_info read at startup, /props.build identity, GGUF metadata + sha256 sidecars, model card sidecars.
* Harness: client implementations for claude/codex/opencode/hermes/pi.
* Strict 11-field config.toml allowlist for dflash.* runtime tunables.
Deleted (rolled into new structure):
* server/scripts/bench_agent.py, bench_he.py, bench_llm.py: replaced by luce-bench snapshot + areas.
* lucebox configure, lucebox download-models, lucebox benchmark: replaced by config sub-app, models sub-app, autotune --sweep.
* luce-bench --sweep flag: moved to argv-sniff subcommand dispatch.
Conflict resolution:
* server/scripts/bench_{agent,he,llm}.py: modify/delete kept the deletion (feat/lucebox-docker moved bench machinery into luce-bench).
* README.md: took feat-branch version. origin/main had 19 commits' worth of minor README tweaks since the branch base; those need to be folded back in as a follow-up PR.
* docs/specs/openapi-props.yaml + docs/specs/props-endpoint.md: took feat-branch version. origin/main had 1 link-fix commit; feat-branch has the schema-4 + host-block additions that strictly supersede.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`_load_or_build()` returned `config_mod.load()`'s result verbatim when config.toml existed, ignoring `LUCEBOX_*` env vars entirely. That contradicted the precedence lucebox.sh documents (env > toml > default) and bit sindri in production: its config.toml had `[image]` without a `registry` line, so the dataclass default `ghcr.io/luce-org/lucebox-hub` beat the systemd unit's `Environment=LUCEBOX_IMAGE=ghcr.io/easel/...`. Symptom: `lucebox start` brought up the wrong (stale luce-org) image even after explicit `lucebox install` + `lucebox pull` against easel. Fix: overlay env on top of whatever `load()` returns (or `live_config()` falls back to). Only the five top-level scalars have env hooks (LUCEBOX_VARIANT/IMAGE/PORT/CONTAINER/MODELS) — dflash/host/model intentionally don't. Adds two regression tests: - env beats config.toml when toml has no explicit value for that key, - env still wins when toml is absent (covers the live_config fallback). 102 lucebox tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
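The env-over-toml overlay described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the actual `_load_or_build()` code: the `TopLevel` dataclass, its field names, the `overlay_env` helper, and every default except the image registry are assumptions made for the example.

```python
import os
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TopLevel:
    # Hypothetical stand-in for the five top-level scalars. Only the image
    # default is taken from the commit message; the rest are illustrative.
    variant: str = "cuda12"
    image: str = "ghcr.io/luce-org/lucebox-hub"
    port: int = 8080
    container: str = "lucebox"
    models: str = "/models"


# env var name -> (dataclass field, parser). Field names are assumptions.
ENV_HOOKS = {
    "LUCEBOX_VARIANT": ("variant", str),
    "LUCEBOX_IMAGE": ("image", str),
    "LUCEBOX_PORT": ("port", int),
    "LUCEBOX_CONTAINER": ("container", str),
    "LUCEBOX_MODELS": ("models", str),
}


def overlay_env(cfg, env=None):
    """Return a copy of cfg where any set, non-empty env var wins."""
    env = os.environ if env is None else env
    overrides = {}
    for name, (field, parse) in ENV_HOOKS.items():
        raw = env.get(name)
        if raw:  # unset or empty keeps the toml/default value
            overrides[field] = parse(raw)
    return replace(cfg, **overrides)
```

Applying the overlay to whatever the loader returned (parsed toml or the fallback) gives the same env > toml > default precedence on both code paths.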
…g#285 CI
CI's "Lint Python surfaces touched by lucebox tooling" job ran `ruff check .` and found 11 errors across surfaces this branch touches. `ruff --fix` handled 6 (import sorting, unused imports); 5 needed hand-edits:
- luce-bench/src/lucebench/report.py:172 (E741): rename `for l in` → `for lineup in`
- lucebox/tests/test_check.py:39, 95 (E731): lambda → def stub() for the two HostFacts stubs
- lucebox/tests/test_cli.py:95 (E501): wrap the LUCEBOX_HOST_GPU_LIST_CSV setenv
- lucebox/tests/test_sweep.py:174, 177 (E501): wrap two CellResult constructors
22 lucebox tests touched still pass; ruff is clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- test_autotune_candidate_configs.py: sort imports (ruff I001). - download.py: api.repo_info() returns ModelInfo|DatasetInfo|SpaceInfo|KernelInfo and KernelInfo has no .siblings; use api.model_info() which returns ModelInfo (correct — we only query model repos here), resolving the mypy union-attr error. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The job-level `permissions` block replaces the workflow-level default entirely, so `actions/checkout` was running without `contents: read` and would fail on protected refs. Add `contents: read` back alongside the existing `id-token: write`. Addresses cubic #1 on PR Luce-Org#285.
- Dockerfile: keep --frozen on the uv sync fallback so the layer can't silently resolve outside the lockfile.
- harness/clients/run_lucebench.sh: default LUCEBENCH_THINK empty (per-area card defaults govern; --no-think only when explicitly set) and default LUCEBENCH_AREA to the level1 capability gate (smoke,code,gsm8k,agent,longctx) instead of `all`, which was too broad for routine harness runs.
Addresses cubic #2, Luce-Org#3 (P1) and Luce-Org#14 (P2) on PR Luce-Org#285.
…appers
- .github/workflows/{ci,docker,release-luce-bench}.yml: pin
actions/checkout, docker/{setup-buildx,login,metadata,bake}-action,
and astral-sh/setup-uv to immutable commit SHAs with `# vN` comments
so the supply chain is reproducible (Luce-Org#4).
- harness/src/harness/clients/_common.py: replace the external `timeout`
shell-out with `subprocess.run(..., timeout=N)`, return 124 on
TimeoutExpired to match GNU timeout's exit code (Luce-Org#5).
- scripts/build_image.sh: normalize REGISTRY to end in `/` instead of
silently producing `ghcr.io/luce-orglucebox-hub` when the trailing
slash is missing (Luce-Org#6).
- harness/src/harness/clients/pi.py: non-interactive launch now mirrors
run_pi.sh's validated invocation (--provider, --print, --mode json,
--tools, --no-session, --offline) and sets PI_CODING_AGENT_DIR /
PI_CODING_AGENT_SESSION_DIR / PI_OFFLINE (Luce-Org#7).
- docker-bake.hcl: sanitize `+` → `-` in VERSION before composing tags,
since `+` is not a valid Docker tag character (Luce-Org#8).
- harness/src/harness/clients/hermes.py: set HERMES_HOME + the rest of
run_hermes.sh's env wiring and call `chat --provider --model
--accept-hooks --yolo --max-turns --source --query` instead of a bare
positional prompt (Luce-Org#9, Luce-Org#10).
- harness/src/harness/clients/openclaw.py: apply the OpenClaw config
patch via `openclaw config patch --file` before the run, and call
`agent --local --json --model lucebox/<model> --session-id --timeout
--message` instead of a bare positional prompt (Luce-Org#11).
- pyproject.toml: drop the dead dflash/scripts/{prefix_cache,test_server,
tool_memory}.py ruff include pins (those paths were renamed during
the dflash→server rename and then deleted upstream) (Luce-Org#12).
- lefthook.yml: widen the shellcheck/bash-parse glob from `*.sh` to
`**/*.sh` so scripts under nested dirs (harness/clients/*.sh,
scripts/*.sh, server/scripts/*.sh) are linted on commit (Luce-Org#13).
Addresses cubic Luce-Org#4–Luce-Org#13 (P2) on PR Luce-Org#285. Luce-Org#14 was already addressed in
the previous commit alongside the LUCEBENCH_THINK default fix.
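The `_common.py` change in the list above (in-process timeout instead of shelling out to `timeout`) can be sketched as below. The helper name `run_with_timeout` is hypothetical; only `subprocess.run(..., timeout=N)`, `TimeoutExpired`, and the 124 exit code come from the commit message.

```python
import subprocess
import sys

# GNU timeout exits with 124 when the command times out.
GNU_TIMEOUT_EXIT = 124


def run_with_timeout(argv, timeout_s):
    """Run argv; return its exit code, or 124 if it exceeds timeout_s."""
    try:
        return subprocess.run(argv, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills and reaps the child before re-raising.
        return GNU_TIMEOUT_EXIT


if __name__ == "__main__":
    sleeper = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_timeout(sleeper, 0.2))
```

Keeping the 124 convention means callers that already branch on GNU timeout's exit code need no changes.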
- lucebox/README.md: fix the relative link to `cli.py`; resolves to `src/lucebox/cli.py` (the actual location), not the nonexistent `lucebox/cli.py` (Luce-Org#15).
- luce-bench/NOTICE: the bundled forge_eval LICENSE says "Copyright (c) 2025-2026 Antoine Zambelli", not 2024; sync NOTICE with the actual upstream LICENSE (Luce-Org#16).
- luce-bench/src/lucebench/areas/__init__.py: `__all__` was missing agent / agent_recorded / forge / longctx / smoke. Add the imports + list entries so `from lucebench.areas import *` matches the actual area surface (Luce-Org#17).
Addresses cubic Luce-Org#15–Luce-Org#17 (P3) on PR Luce-Org#285.
…nch in-tree
Squashes 8 commits from feat/lucebox-docker (PR Luce-Org#285) into a single commit on top of origin/main (8782d07). Net: 189 files changed.
Workstreams folded in:
* Docker prebuild stack: ghcr.io/easel/lucebox-hub:cuda12 image, multi-stage Dockerfile with reproducible `uv sync --frozen`, docker-bake.hcl with VERSION sanitization for Docker tag charset, .github/workflows/docker.yml with SHA-pinned external actions and GHA cache, build identity baked into /opt/lucebox-hub/IMAGE_INFO + HOST_INFO.
* Host wrapper (lucebox.sh): probe_host, smart cmd_serve (INVOCATION_ID guard against systemd self-defeat, container-state preflight), cmd_systemctl_passthrough (already-active short-circuit, restart-loop detection), cmd_update (bootstrap-installer pattern), cmd_completion (bash/zsh/fish), config.toml reader (env > toml > default), all shellcheck-clean.
* Bootstrap installer (install.sh): bakes LUCEBOX_INSTALLED_FROM into the installed copy so `lucebox update` keeps tracking the channel; refuses SHA-pinned URLs without LUCEBOX_INSTALL_CHANNEL.
* In-container Python CLI (lucebox/): sparse config.toml persistence, config get/set/unset sub-app, models list/download sub-app (replaces download-models), autotune with --apply / --json / --sweep, profile collapsed onto luce-bench snapshot (1701 → ~150 lines). _load_or_build now respects env > toml > default precedence.
* luce-bench: snapshot subcommand + canonical HostInfo schema v2 (multi-GPU lineup, WSL detection, source/collector trust metadata) + levels (level0/1/2/3) + report subcommand (host column + cross-host confounder warnings) + submit-baseline (level3-gated) + regrade.
* Server (C++): /props.host block + props_schema=4 + host_info loader, /props.build identity, GGUF metadata + sha256 sidecars, model card sidecars. Deleted server/scripts/bench_{agent,he,llm}.py; bench machinery moved into luce-bench.
* Harness: client implementations for claude/codex/opencode/hermes/pi pointed at the running lucebox server, matched against the validated run_*.sh shell wrappers.
Cubic AI code review (17 findings) addressed in full:
P0: contents: read on luce-bench release job permissions.
P1: Dockerfile `--frozen` reinstated; LUCEBENCH_THINK default empty so per-area defaults apply.
P2: 6 external actions pinned to immutable SHAs; non-interactive timeout via subprocess.run; REGISTRY trailing-slash normalize; VERSION + Docker tag charset sanitize; harness pi/hermes/openclaw mirrored against run_*.sh wrappers; ruff scan paths corrected to server/scripts/; lefthook glob `**/*.sh`; LUCEBENCH_AREA default level1.
P3: lucebox/README.md cli.py link fixed; NOTICE copyright year 2025-2026; areas/__init__.py __all__ exposes all 10 areas.
CI on PR Luce-Org#285: all 4 checks green (uv workspace, cmake build, cuda12 prebuild, cubic reviewer).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ontent channel
The SseEmitter hard-started in StreamMode::CONTENT and only transitioned to
REASONING when it saw `<think>` in the generated stream. But Qwen3.6 / Laguna
chat templates append `<think>\n` to the prompt suffix when enable_thinking is
honored, so the model emits reasoning tokens directly with no opening tag —
the emitter never transitioned and reasoning text leaked into `content` while
`reasoning_content` stayed empty. ds4-eval pass rate: 14.1% (think) vs 71.7%
(no-think) for Qwen3.6-27B Q4_K_M.
The plumbing was already there: parse_reasoning() supports
started_in_thinking=true (reasoning.h:17-19) but no caller passed it.
Fix:
1. chat_template.h: render_chat_template / render_chat_template_jinja now
return a PromptRenderResult { text, started_in_thinking }. The built-in
QWEN3 and LAGUNA branches set started_in_thinking deterministically when
enable_thinking && add_generation_prompt; GEMMA4 stays false (its
reasoning channel is opened by the model emitting `<|channel>`, which
http_server forwards into the emitter as `<think>`). The Jinja path
suffix-sniffs the rendered prompt for a trailing `<think>` opener and
emits a [WARN] log when sniffing decides true so a template/model-card
mismatch surfaces at runtime.
2. SseEmitter: add `initial_mode = StreamMode::CONTENT` defaulted parameter.
When constructed with REASONING, active_kind_ initializes to "thinking"
so the Anthropic first content_block is `thinking` instead of `text`
(avoids a spurious empty text-block stop+restart on the first reasoning
delta). Deliberately leaves checked_think_prefix_ at its default (false)
so the existing one-time `<think>` strip guard still trips if a
template/model-card mismatch causes the model to emit a redundant opener.
3. http_server.cpp: thread render_result.started_in_thinking through
ParsedRequest into the SseEmitter's initial_mode. Both streaming and
non-streaming paths feed tokens through the same emitter, so the fix
covers both response shapes.
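To make the effect of `started_in_thinking` concrete, here is a minimal non-streaming sketch in Python. It is not the C++ SseEmitter or `parse_reasoning()`; the `Split` type and `split_reasoning` helper are hypothetical and only mirror the behavior described in the three steps above (start in reasoning mode when the template already opened `<think>`, and strip a redundant opener if the model emits one anyway).

```python
from dataclasses import dataclass

OPEN, CLOSE = "<think>", "</think>"


@dataclass
class Split:
    reasoning: str
    content: str


def split_reasoning(text, started_in_thinking=False):
    """Route generated text into reasoning and content channels."""
    if text.startswith(OPEN):
        # Opener present in the output: strip it once, whether it is the
        # only opener or a redundant one after a template-supplied opener.
        text = text[len(OPEN):]
        in_thinking = True
    else:
        in_thinking = started_in_thinking
    if not in_thinking:
        return Split(reasoning="", content=text)
    # Everything up to the terminator is reasoning; an unclosed block
    # leaves content empty.
    reasoning, _, content = text.partition(CLOSE)
    return Split(reasoning=reasoning.strip(), content=content.lstrip())
```

With `started_in_thinking=False`, output such as `plan</think>answer` lands entirely in `content`, which is the leak the fix addresses; passing `True` routes `plan` to the reasoning channel.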
Tests: add 12 unit tests under test_server_unit (assertion count 1608 →
1637): SseEmitter initial_mode=REASONING routing for OPENAI_CHAT and
ANTHROPIC formats (closed, unclosed, redundant-opener-strip cases) plus
PromptRenderResult.started_in_thinking provenance for QWEN3 / LAGUNA /
GEMMA4 (enable/disable/no-gen-prompt) and the Jinja suffix-sniff
positive/negative cases.
Smoke-tested manually against Qwen3.6-27B Q4_K_M; non-streaming
`/v1/chat/completions` with `thinking:{type:enabled}` now populates
reasoning_content and never leaks `</think>` into content.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add three C++ tests that chain render_chat_template + SseEmitter so the wiring between the renderer's started_in_thinking flag and the emitter's initial_mode is exercised end-to-end, not just at each end. The per-unit tests above each verify their half of the contract, but the original bug was a missing call-site wire — both halves were correct in isolation. Also tighten the Python integration test assertions for enable_thinking and reasoning.effort: require non-empty reasoning_content and no raw <think>/</think> in either channel. The prior 'doesn't crash' assertion would have passed on the broken code. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…box-docker) Brings the Qwen3.6/Laguna think-mode reasoning fix (route reasoning into reasoning_content channel instead of content) into the lucebox-docker stack.
…budget
Increment 1 (Tier 1): model-card registry resolvable by normalized model id
(/props.model_card → bundled cards → family fallback), per-model thinking tokens
via the card with a thinking-capability gate, configurable --reasoning-effort
{low,medium,high} (was hardcoded high) and --thinking-budget-tokens N, plus
card_source/card_stem provenance on every row. Cards bundled into the wheel via
hatch force-include from share/model_cards (single source; CI drift guard TODO).
Tier 2: --client-thinking-budget N — client-side thinking termination for
providers that ignore native budget hints. Streams the response, estimates
reasoning tokens (char/4), and when over budget aborts and issues a forced-
</think> re-prompt (a fresh conditioned sample, not decoder continuation) using
the card's terminator + reply reserve, producing a gradable answer. Gated on
reasoning being identifiable in the stream (reasoning_content deltas or <think>
tags); unmarked output is left untouched. client_abort rows are a separate
benchmark mode (never pooled with single-pass), with continuation-failure and
answer-started-before-abort rows excluded from the aggregate and coverage
reported.
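The budget check in the Tier 2 description can be sketched as follows. This is an illustration under stated assumptions, not the luce-bench implementation: `estimate_tokens` and `consume_until_budget` are hypothetical names, the input is assumed to be an iterable of reasoning-channel text deltas, and the forced-`</think>` re-prompt step is not shown.

```python
from typing import Iterable, Tuple


def estimate_tokens(text: str) -> int:
    # char/4 heuristic named in the commit message.
    return len(text) // 4


def consume_until_budget(
    reasoning_deltas: Iterable[str], budget_tokens: int
) -> Tuple[str, bool]:
    """Accumulate reasoning deltas; stop once the estimate exceeds budget.

    Returns (reasoning_so_far, aborted).
    """
    seen = []
    chars = 0
    for delta in reasoning_deltas:
        seen.append(delta)
        chars += len(delta)
        if chars // 4 > budget_tokens:
            # Caller would abort the stream here and issue the re-prompt.
            return "".join(seen), True
    return "".join(seen), False
```

Because the check fires on the first delta that pushes the estimate over the budget, the recorded reasoning length sits just above the configured limit rather than exactly on it.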
Verified live: OpenRouter qwen3.6-27b ignores reasoning_effort/budget_tokens
(reasoning unbounded), but --client-thinking-budget 2000 bounds it precisely
(~2001 reasoning tokens/row, continuation=ok, 8/8 pass on the head subset).
234 tests pass; ruff clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add `--multi-turn` mode to scripts/extract-agentic-fixture.py for the coding-agent-loop autotune profile: walk one session in record order, emit a replay case at each target-token bucket (default 8K/16K/32K/64K/100K/128K). Each case ships an OpenAI-shaped `messages` list and a `prefill-and-decode` verifier so the sweep can score "does this max_ctx cell actually serve a trace of n − reply_budget tokens." Snapshot semantics: case `context_tokens_approx <= target_bucket_tokens` is guaranteed (snapshot taken pre-append for the message that would cross). Also fix a latent bug in `_is_claude_session`: it returned False on the first non-user record, which misrouted any Claude session that led with `permission-mode`, `system`, or `queue-operation` (most real sessions do) — including the one this commit was developed against. Tests cover bucket fit, role collapsing, thinking-block drop, PII scrub on HOME paths + token-looking secrets, Codex record decoding, and the leading-meta-record regression. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
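The pre-append snapshot rule from the commit above can be sketched like this. It is a simplified illustration, not the code in scripts/extract-agentic-fixture.py: `approx_tokens` and `snapshot_at_buckets` are hypothetical names, messages are assumed to be dicts with a string `content`, and the chars/4 estimate is an assumption carried over from elsewhere in this PR.

```python
def approx_tokens(message):
    # chars/4 estimate; an assumption for this sketch.
    return len(message["content"]) // 4


def snapshot_at_buckets(messages, buckets):
    """Walk messages in order and emit one replay case per bucket.

    The snapshot is taken before appending the message that would cross
    the bucket, so context_tokens_approx <= target_bucket_tokens holds.
    """
    cases = []
    pending = sorted(buckets)
    history = []
    total = 0
    for msg in messages:
        cost = approx_tokens(msg)
        while pending and total + cost > pending[0]:
            target = pending.pop(0)
            if history:  # nothing to replay if the first message overshoots
                cases.append(
                    {
                        "target_bucket_tokens": target,
                        "context_tokens_approx": total,
                        "messages": list(history),
                    }
                )
        history.append(msg)
        total += cost
    return cases
```

A session that ends before reaching a bucket simply produces no case for it, so the number of cases can be smaller than the number of buckets.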
…erifier
Add three small surfaces to the ``agent_recorded`` area to support the coding-agent-loop autotune sweep:
* ``load_agent_recorded_multi_turn_cases()``: reads the bucketed replay fixture produced by ``extract-agentic-fixture.py --multi-turn`` and returns cases sorted ascending by ``target_bucket_tokens``. Distinct from the v1 single-prompt fixture; the two coexist.
* ``pick_multi_turn_case_for_budget()``: given a prompt-token budget (typically ``max_ctx − reply_budget``), returns the largest case that fits. ``None`` when no case fits.
* ``grade_prefill_and_decode()``: pass/fail verifier for the sweep: non-empty response within wall budget, no server error. Lighter than tool-schema-coverage on purpose; the sweep is asking "did this max_ctx setting serve a trace of this length", not "did the model do the task well."
Ship a harvested fixture: one Claude Code session sliced into 6 bucketed cases (8K through 128K tokens). Per repo guidance, one long session is enough to cycle with until something breaks; the broader corpus can land later if signal demands.
Tests cover the loader contract (cases fit under their bucket, sorted by bucket), the budget picker (largest-fit, None-on-empty), and the verifier's three failure modes (server error, wall-budget overrun, response-too-short) plus the reasoning_content fallback.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
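A pass/fail verifier with the three failure modes and the reasoning_content fallback named above might look like the sketch below. The function signature, the response dict shape (`error`, `message.content`, `message.reasoning_content`), and the `min_chars` threshold are assumptions for illustration; the real ``grade_prefill_and_decode()`` may differ.

```python
def grade_prefill_and_decode(response, wall_seconds, wall_budget_seconds, min_chars=1):
    """Return (passed, reason) for one sweep cell.

    Only checks that the server produced some output in time; it does not
    judge answer quality.
    """
    if response.get("error"):
        return False, "server-error"
    if wall_seconds > wall_budget_seconds:
        return False, "wall-budget-overrun"
    message = response.get("message", {})
    # Fall back to the reasoning channel when content is empty, so a
    # thinking-mode reply that spent its budget reasoning still counts.
    text = message.get("content") or message.get("reasoning_content") or ""
    if len(text.strip()) < min_chars:
        return False, "response-too-short"
    return True, "ok"
```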
…loop
Add an autotune Profile abstraction so different workloads can sweep different axes with different scorers. Two profiles ship:
* ``heuristic`` (default, backward-compatible): preset-agnostic bracket, scores by mean ``decode_tokens_per_sec`` from a luce-bench level1 snapshot. Identical to the prior behavior.
* ``coding-agent-loop``: architecture-aware. Gemma4's bracket is ``max_ctx × fa_window × budget × pflash_mode`` (KV-quant axis omitted because the gemma4 backend hardcodes F16; verified at gemma4_loader.cpp). Qwen3.6 / laguna keep cache_type as an axis since their loader actually respects it. Scoring is composite: pass-rate on the agent_recorded multi-turn fixture first, then ``completion_tokens / wall_seconds`` as a tps proxy (the longctx-area snapshots ship empty ``decode_tokens_per_sec``).
Wire ``--fa-window`` through to the server end-to-end:
* ``DflashRuntime.fa_window`` (0 = full attention, server default)
* ``DFLASH_FA_WINDOW`` emitted by docker_run.py when nonzero
* entrypoint.sh appends ``--fa-window N`` to the server CLI iff ``DFLASH_FA_WINDOW > 0``; unset env still reproduces stock behavior
* ``dflash.fa_window`` round-trips through config.toml
CLI: ``lucebox autotune --sweep --profile coding-agent-loop``. New ``--list-profiles`` flag prints the registered profile table.
Tests: 318/318 green. New coverage:
* Profile registry + ``get_profile`` error path
* gemma bracket excludes the KV-quant axis (regression for the no-op axis bug)
* gemma bracket varies max_ctx × fa_window × budget
* qwen bracket includes tq3_0 + q8_0
* sub-22 GB tiers fall back to base-only (OOM safety)
* ``_pick_winner`` ranks agent-replay results by pass→speed→ctx
* ``fa_window`` is in the sweep allowlist
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
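The ``--fa-window`` wiring above amounts to "emit the env var only when nonzero, append the flag only when positive". A minimal Python sketch of both ends follows; the real entrypoint is a shell script, and the helper names `fa_window_env` and `server_argv` are hypothetical.

```python
def fa_window_env(fa_window: int) -> dict:
    """Launcher side: emit DFLASH_FA_WINDOW only when nonzero."""
    return {"DFLASH_FA_WINDOW": str(fa_window)} if fa_window else {}


def server_argv(base_argv, env):
    """Entrypoint side: append --fa-window N iff DFLASH_FA_WINDOW > 0."""
    argv = list(base_argv)
    raw = env.get("DFLASH_FA_WINDOW", "")
    if raw.isdigit() and int(raw) > 0:
        argv += ["--fa-window", raw]
    return argv
```

With `fa_window=0` the environment carries nothing and the server argv is unchanged, which is how the unset case keeps reproducing stock behavior.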
When the sweep is invoked directly (e.g. `uv run python -m lucebox autotune --sweep` for development, or any path that bypasses the lucebox.sh wrapper), the LUCEBOX_HOST_* env vars aren't set and ``host_facts.from_env()`` returns a zero-VRAM HostFacts. Every profile bracket then falls through to the <22 GB "base only" branch and the sweep silently degrades to a 1-cell smoke test that overwrites the operator's real config (e.g. dropping max_ctx from 131072 to the DflashRuntime default 16384). Fall back to ``cfg.host`` (populated by an earlier `lucebox check` via the wrapper) when ``from_env()`` yields no signal. Test regresses the original symptom: with LUCEBOX_HOST_* unset, the coding-agent-loop bracket on a 24 GB persisted host must produce a multi-cell sweep, not collapse to one base cell. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
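The fallback described above can be sketched as follows. The `HostFacts` fields, the `LUCEBOX_HOST_VRAM_GB` variable name, and the `resolve_host` helper are all hypothetical stand-ins; the commit only states that ``from_env()`` yields a zero-VRAM HostFacts when LUCEBOX_HOST_* is unset and that ``cfg.host`` is used instead.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class HostFacts:
    vram_gb: float = 0.0  # 0.0 means "no signal"


def from_env(env: Mapping[str, str]) -> HostFacts:
    # LUCEBOX_HOST_VRAM_GB is an illustrative name, not a documented variable.
    raw = env.get("LUCEBOX_HOST_VRAM_GB", "")
    try:
        return HostFacts(vram_gb=float(raw))
    except ValueError:
        return HostFacts()


def resolve_host(env: Mapping[str, str], persisted: Optional[HostFacts]) -> HostFacts:
    """Prefer live env facts; fall back to the persisted host block."""
    facts = from_env(env)
    if facts.vram_gb > 0:
        return facts
    return persisted if persisted is not None else facts
```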
First end-to-end coding-agent-loop sweep on sindri's gemma-4-26b ran 12 cells. Top-line findings (full detail in docs/experiments/gemma4-26b-coding-agent-loop-sweep-2026-05-30.md):
* All six 98K cells pass at 2.8–3.5 tok/s on 90K real prompt tokens
* All six 131K cells fail HTTP 400: the picker's chars/4 estimate undercounts real gemma tokenization by 1.39×, so the 102K-bucket case overshoots the 126976-token effort-tier ceiling at max_ctx=131072 and the server rejects every cell identically
* fa_window=0 (full attention) marginally beat fa_window=2048; budget axis was flat (3.5 / 3.4 / 3.2 at 16 / 22 / 32)
Two changes ride with the doc:
1. Bump the gemma WSL 24 GB heuristic max_ctx from 65536 → 98304. The original 65K cap cited unverified CUDA VMM failures; the empirical run proves 98K runs 90K-token prompts with ~3 GB VRAM headroom. 131K remains plausible as a manual operator override but stays out of the default until we have a fixture sized for the real 126976-token budget.
2. Add a 0.7 safety_factor to ``pick_multi_turn_case_for_budget``. The factor closes the chars/4 → real-tokenizer gap so the sweep no longer picks a case whose actual prompt would overshoot. Operators can pass safety_factor=1.0 when fixtures are accurately tokenized.
Tests updated to reflect the new heuristic ceiling + the safety-factor guard.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
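The arithmetic behind the 0.7 safety factor can be checked with a small sketch. The picker below is an assumption about how ``pick_multi_turn_case_for_budget`` might compare cases (by their chars/4 estimate); the case sizes are idealized to the bucket boundaries, and the 1.39× ratio and 126976-token ceiling are the figures reported above.

```python
def pick_multi_turn_case_for_budget(cases, prompt_token_budget, safety_factor=0.7):
    """Largest case whose approximate size fits the derated budget, else None."""
    effective = prompt_token_budget * safety_factor
    fitting = [c for c in cases if c["context_tokens_approx"] <= effective]
    return max(fitting, key=lambda c: c["context_tokens_approx"], default=None)


# Idealized fixture: one case per bucket, each assumed to fill its bucket.
BUCKETS = [8192, 16384, 32768, 65536, 102400, 131072]
CASES = [{"context_tokens_approx": b} for b in BUCKETS]

CEILING = 126976   # server-side prompt ceiling at max_ctx=131072
UNDERCOUNT = 1.39  # real gemma tokens per chars/4 token

naive = pick_multi_turn_case_for_budget(CASES, CEILING, safety_factor=1.0)
guarded = pick_multi_turn_case_for_budget(CASES, CEILING)

# Without the factor the 102400 case is picked and expands past the ceiling;
# with it the 65536 case is picked and stays under.
naive_real = naive["context_tokens_approx"] * UNDERCOUNT
guarded_real = guarded["context_tokens_approx"] * UNDERCOUNT
```

Under these assumptions the unguarded pick expands to roughly 142K real tokens, while the guarded pick expands to roughly 91K, in line with the "90K real prompt tokens" seen on the passing 98K cells.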
…think-channel) into local
# Conflicts:
# luce-bench/pyproject.toml
# luce-bench/src/lucebench/areas/agent_recorded.py
# luce-bench/src/lucebench/cli.py
# luce-bench/src/lucebench/report.py
# luce-bench/src/lucebench/runner.py
# luce-bench/src/lucebench/schema.py
# luce-bench/tests/test_agent_recorded.py
# luce-bench/tests/test_fixtures.py
# luce-bench/tests/test_runner.py
# luce-bench/tests/test_smoke_area.py
# lucebox/src/lucebox/autotune.py
# lucebox/src/lucebox/cli.py
# lucebox/src/lucebox/config.py
# lucebox/src/lucebox/docker_run.py
# lucebox/src/lucebox/sweep.py
# lucebox/src/lucebox/types.py
# lucebox/tests/test_autotune.py
# lucebox/tests/test_autotune_candidate_configs.py
# lucebox/tests/test_sweep.py
# scripts/extract-agentic-fixture.py
# server/scripts/entrypoint.sh
Two operator-facing docs to land alongside the 2026-05-30 gemma experiment write-up: * autotune-profile-sweep-protocol.md — the procedural how-to for the profile-driven sweep machinery (preconditions, invocation, result reading, known gotchas including the chars/4 undercount and the wrapper-localhost issue). Generalizes the gemma run into something someone can follow without re-deriving steps. * qwen3.6-27b-sweep-runbook-bragi.md — the concrete sequence to repeat the coding-agent-loop sweep against qwen on the RTX 5090 Laptop. Calls out the KV-quant axis difference (qwen35 respects cache_type, gemma4 doesn't) and what to expect / how to roll back / how to document findings. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…bcommands Routes config/smoke/models/check/profile/print-run/print-serve-argv and read-only autotune (no --sweep) to `docker exec` into the running lucebox container instead of `docker run --rm`. Two wins: 1. Shares the live server's network namespace — `lucebox config` / `smoke` etc. can reach localhost:8080 on the running server, which the isolated docker-run container can't. 2. Skips the ~1-3s cold-start of `docker run --rm` per call (config get drops from ~4s to ~1.8s in the field on a mid-sweep sindri). Service-restarting workloads (`autotune --sweep`, `serve`, `pull`, `update`, install/uninstall, systemctl passthrough, client launchers) stay on the host-side / docker-run path — exec'ing those into the very container we'd be restarting would self-destruct. Falls back to docker run when the container is not running so first-run / pre-install flows still work. Add `--no-exec` (and `LUCEBOX_NO_EXEC=1`) escape hatch for debugging the wrapper or when the in-container Python is stale relative to the image. The exec invocation goes through `/opt/lucebox-hub/server/scripts/entrypoint.sh lucebox <args>` because the image has no top-level `lucebox` binary on PATH — the `lucebox` token is a SUBCMD the entrypoint dispatches to `uv run ... python -m lucebox`. Calling the entrypoint explicitly keeps the exec path bit-for-bit equivalent to the docker-run dispatch. Tests: 8 new cases in scripts/test_lucebox_sh.sh covering route-to-exec when running, fall-back-to-run when not running, autotune --sweep sticking to docker run, autotune --list-profiles routing to exec, --no-exec + LUCEBOX_NO_EXEC=1 overrides, smoke routing, and the usage help mentioning the new behavior. Mocks docker via PATH shim that prints its argv so the test asserts on the actual invocation. Total: 54 pass (up from 46). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
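The routing decision described above can be summarized in a short sketch. The real logic lives in the lucebox.sh shell wrapper; this Python version, the `route` helper, and the `"exec"` / `"run"` labels are illustrative only, with `"run"` standing in for the existing host-side or `docker run --rm` path.

```python
# Subcommands the commit lists as safe to exec into the live container.
EXEC_SUBCOMMANDS = {
    "config", "smoke", "models", "check", "profile",
    "print-run", "print-serve-argv",
}


def route(argv, container_running, no_exec=False):
    """Return "exec" for docker exec, "run" for the pre-existing path."""
    if no_exec or not container_running or not argv:
        return "run"
    sub = argv[0]
    if sub == "autotune":
        # Only read-only autotune may exec; --sweep restarts the service.
        return "run" if "--sweep" in argv else "exec"
    return "exec" if sub in EXEC_SUBCOMMANDS else "run"
```

Treating `--no-exec` and a stopped container identically keeps the fallback path the same one that first-run and pre-install flows already use.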
The longctx grader rejected any response that didn't literally begin
with "Risk:" at offset zero. Thinking-mode models on the
``longctx --think`` snapshot routinely emit a one-sentence transition
phrase ("Considering the limited time by the user, I have to give
the solution based on the thinking directly now.") *before* the
required ``Risk:`` line, so the 2026-05-27 gemma longctx-think run
saw 2/6 false fails and the 2026-05-30 qwen3.6 thinking benchmark
saw a 0/1 false fail on frontier-2k.
Switch the primary `graded_pass` and `format_pass` metrics to use a
multiline regex (``Risk:`` at the start of any line in the
response), and surface the literal-prefix result alongside as
``strict_pass`` so snapshots can still distinguish "model complied
exactly" from "model preambled but eventually complied." No change
to the prompt — the instruction still asks for a single sentence —
just the grader stops penalizing models for narrating their
thinking-budget pivot.
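The lenient and strict checks can be sketched with the standard `re` module. The function name `grade_risk_line` is hypothetical and the real grader applies additional checks (for example, a minimum length); the sketch assumes the strict variant means a literal `Risk:` at offset zero.

```python
import re

# With re.MULTILINE, ^ matches at the start of every line, not just offset 0.
RISK_ANY_LINE = re.compile(r"^Risk:", re.MULTILINE)


def grade_risk_line(response: str) -> dict:
    lenient = bool(RISK_ANY_LINE.search(response))
    return {
        "graded_pass": lenient,
        "format_pass": lenient,
        # Literal-prefix result kept alongside for exact-compliance tracking.
        "strict_pass": response.startswith("Risk:"),
    }
```

A response that opens with a transition sentence and then gives the `Risk:` line passes the lenient metrics but not `strict_pass`, which preserves the distinction between exact and eventual compliance.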
Tests: 7 new cases covering pure-prefix pass, leading-whitespace
strict, thinking-preamble lenient (regression for the qwen run),
no-risk-anywhere fail, too-short stub, and the case-wrapper
surfacing strict_pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…kerfile Two bugs found during bragi (RTX 5090 Laptop) autotune sweep setup: 1. `lucebox autotune --sweep --profile coding-agent-loop` failed with "No module named 'lucebench'" because sweep.py's agent_replay scorer imports `lucebench.areas.agent_recorded` but `luce-bench` was not declared as a dependency of the `lucebox` workspace member. Scored cells all returned fail, so the sweep would restore the backup config and exit with no winner. Fix: add `luce-bench` (workspace dep) to `lucebox/pyproject.toml` so `uv run --project lucebox` always has `lucebench` importable. 2. Dockerfile was copying `share/model_cards` to two paths (`/opt/lucebox-hub/server/share/model_cards` for the C++ server and `/opt/lucebox-hub/share/model_cards` for luce-bench's hatchling force-include). Replace the duplicate with a single copy + `ln -s` so the image carries one copy at the canonical luce-bench path and the C++ server resolves it via symlink. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… at 98K, 131K confirmed for gemma
Two coding-agent-loop sweeps on bragi (RTX 5090 Laptop MaxQ, 23 GB VRAM, sm_120):
Qwen3.6-27B sweep findings:
- q8_0 OOMs at max_ctx=98304 on 23 GB (model ~18-19 GB + KV ~5-6 GB = 24-25 GB)
- tq3_0 required at 98K: KV only ~2-3 GB, leaving ~1-2 GB headroom
- budget=32 unreliable at 65K (edge VRAM); fine at 98K with tq3_0
- q8_0 is faster at 65K/b16 (4.0 vs 3.1 tok/s) but not viable for production
Gemma 4 26B sweep findings:
- All 12 cells pass including 131K (sindri's 131K failures were a fixture-picker artifact: the picker selected a 100K case that expanded to >126976 real tokens, triggering HTTP 400; not a VRAM limit)
- fa_window and budget axes flat (~2.0 tok/s across all cells)
- Winner: budget=22, max_ctx=131072, fa_window=0
Code changes:
- autotune.py: 22-31 GB heuristic explicitly sets tq3_0 for qwen (prevents OOM on fresh installs); qwen bracket skips q8_0 at max_ctx>=98304 (saves 3 cells)
- sweep.py: fix winner selection: sort by -max_ctx first, then -speed_metric; prevents the metric artifact where smaller fixture inflates 65K cell speeds
- docs/experiments: two new sweep docs + correction note on sindri gemma doc
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rdering Bragi's commit 3dffb30 inverted the agent_replay winner-pick sort: max_ctx is now the primary key, speed_metric is the tiebreaker within the same max_ctx. The test that asserted speed-first ordering was inherited from the old (buggy) behavior and started failing after the merge. Replace with two tests that pin the new contract: * cross-max_ctx: larger max_ctx wins even when a smaller-ctx cell reports higher speed_metric. The speed gap is a fixture artifact (smaller ctx picks shorter fixture cases). * within same max_ctx: speed_metric breaks the tie as before. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
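The new ordering contract can be expressed as a tuple sort key. This sketch is not the real ``_pick_winner``: the cell dict shape and the choice to filter to passing cells before ranking are assumptions; only "max_ctx first, speed_metric as tiebreaker" comes from the commits.

```python
def pick_winner(cells):
    """Pick the passing cell with the largest max_ctx; break ties on speed."""
    passing = [c for c in cells if c["passed"]]  # assumption: fails never win
    if not passing:
        return None
    # Tuples compare element by element, so max_ctx dominates speed_metric.
    return max(passing, key=lambda c: (c["max_ctx"], c["speed_metric"]))
```

Ranking on context size first avoids rewarding a smaller-context cell whose higher speed figure comes from replaying a shorter fixture case.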
After bragi proved 131K viable on a 23 GB laptop, sindri (3090 Ti, 24 GB) got `max_ctx=131072, budget=22, fa_window=0` and re-ran level2. No quality regression vs the prior 98K config; longctx stays 100% (6/6) through frontier-64k. VRAM 21.1 / 24.6 GiB used at boot, ~3 GiB headroom. The sindri gemma sweep doc now carries the verification table alongside the existing correction note. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… vs 150-175W full perf)
…ted verb synonyms
The 2026-05-30 gemma full bench surfaced a grader-strictness issue:
agent_recorded scored 23% (6/26) because gemma emitted real tool
engagement in a non-Claude format —
``call:execute-bead:read-file{path:...}`` over and over. The grader's
``_tool_mentioned`` looked for ``\bRead\b`` / Claude-named synonyms,
none of which matched ``read-file`` or ``read_file``.
Two changes:
* Expand ``_TOOL_SYNONYMS`` with the hyphen/underscore verb forms
that models emit when given a custom tool namespace in the prompt:
``read_file/read-file``, ``list_files/list-files/ls_files/ls``,
``edit_file/edit-file``, ``write_file/write-file``,
``grep_files/search_code``, ``exec_command/shell-exec``, etc.
* Add a ``call:[namespace:]<verb>{...}`` regex that pulls verbs out
of structured-tool-call emissions and feeds them through the
synonym check. Catches the case where the model never narrates the
tool name in English but does invoke it via the structured format.
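The two changes above can be sketched together: a regex that extracts the verb from `call:[namespace:]<verb>{...}` emissions, and a normalized synonym lookup. The pattern, the `tools_invoked` helper, and the partial synonym table are assumptions for illustration; the commit confirms only the verb forms and that `Read` and `Bash` are Claude tool names in play.

```python
import re

# Optional "<namespace>:" segment, then the verb, then the opening brace.
CALL_VERB = re.compile(r"\bcall:(?:[\w-]+:)?([\w-]+)\{")

# Partial, illustrative table keyed by Claude tool name; entries are stored
# lowercase with underscores so hyphen and case variants collapse.
TOOL_SYNONYMS = {
    "Read": {"read_file"},
    "Edit": {"edit_file"},
    "Write": {"write_file"},
    "Bash": {"exec_command", "shell_exec"},
}


def normalize(verb: str) -> str:
    return verb.lower().replace("-", "_")


def tools_invoked(response: str) -> set:
    """Claude tool names implied by structured call emissions in response."""
    verbs = {normalize(v) for v in CALL_VERB.findall(response)}
    return {tool for tool, synonyms in TOOL_SYNONYMS.items() if verbs & synonyms}
```

Because only the verb is captured, a bare namespace such as `call:execute-bead{...}` matches no synonym and is not credited to any tool.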
Re-grading the 2026-05-30 gemma snapshot: pass rate climbs
**23.1% → 30.8%** (4 cases newly recognized as tool engagement).
Intentionally conservative — ``execute-bead`` as a bare namespace is
NOT a Bash synonym because it wraps many verbs, each of which maps
to its own Claude tool.
Tests: 6 new cases pinning the hyphenated-call-verb, snake_case,
no-namespace, case-insensitive-verb, and end-to-end pass paths.
Suite goes 370 → 376 (all green).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
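A rough sketch of the verb extraction described in the commit above, assuming a much smaller synonym table than the real one. `TOOL_SYNONYMS`, `CALL_VERB_RE`, and `tools_invoked` are illustrative names, not the grader's actual identifiers, and the regex is a guess at the shape rather than the committed pattern.

```python
import re

# Hypothetical, abbreviated synonym table: canonical Claude tool -> verb forms.
TOOL_SYNONYMS = {
    "Read": {"read", "read_file", "read-file"},
    "Edit": {"edit", "edit_file", "edit-file"},
    "Write": {"write", "write_file", "write-file"},
    "Grep": {"grep", "grep_files", "search_code"},
    "Bash": {"bash", "exec_command", "shell-exec"},
}

# call:[namespace:]<verb>{ ; the namespace is optional and is not captured,
# so a bare namespace such as execute-bead never maps to a tool by itself.
CALL_VERB_RE = re.compile(r"\bcall:(?:[\w-]+:)?([\w-]+)\s*\{", re.IGNORECASE)


def tools_invoked(text: str) -> set[str]:
    """Return the canonical tools whose verb forms appear in call:...{ emissions."""
    verbs = {m.group(1).lower() for m in CALL_VERB_RE.finditer(text)}
    return {tool for tool, forms in TOOL_SYNONYMS.items() if verbs & forms}


sample = "call:execute-bead:read-file{path: 'src/main.py'}"
found = tools_invoked(sample)
```

Lower-casing the captured verb before the lookup is what makes the check case-insensitive; the table itself only needs the lower-case forms.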
The synthetic agent area's grader passed responses containing any of:
code fence, JSON tool_use envelope, or apply_patch envelope. The
2026-05-30 gemma full bench showed it missing real agent engagement
from the codex-large-explore case: response was ``call:update_plan{...}\n
call:shell{command: ...}\n`` with no code fence or OpenAI-style
``"name": "Read"`` envelope. That's exactly as agent-shaped as a JSON
tool_use block, just a different serialization.
Add a fourth pass class: ``call:<verb>{`` or ``call:<ns>:<verb>{``.
The agent area pass rate on the same snapshot lifts from 2/4 (50%)
to 3/4 (75%); the remaining fail (codex-mini-read-task) is a
genuine narrative-only response and stays failed.
New tests cover all four pass classes plus the narrative-only and
inline-backtick negative cases. 376 → 384 tests, all green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
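The four pass classes could be sketched as an ordered table of patterns. Every pattern below is an assumption made for illustration (in particular the `*** Begin Patch` marker for the apply_patch envelope and the loose `"name": "..."` check for the JSON envelope); the grader's real expressions are not shown in this log.

```python
import re
from typing import Optional

FENCE = "`" * 3  # built at runtime to keep a literal fence out of this example

# Illustrative patterns only, checked in insertion order.
PASS_CLASSES = {
    "code_fence": re.compile("^" + re.escape(FENCE), re.MULTILINE),
    "json_tool_use": re.compile(r'"name"\s*:\s*"\w+"'),
    "apply_patch": re.compile(r"\*\*\* Begin Patch"),
    "call_invocation": re.compile(r"\bcall:(?:[\w-]+:)?[\w-]+\s*\{"),
}


def pass_class(response: str) -> Optional[str]:
    """Return the first pass class the response matches, or None for narrative-only text."""
    for name, pattern in PASS_CLASSES.items():
        if pattern.search(response):
            return name
    return None


codex_like = "call:update_plan{steps: []}\ncall:shell{command: 'ls'}\n"
result = pass_class(codex_like)
```

Anchoring the fence pattern to the start of a line is what keeps an inline-backtick mention of a tool from counting as a code fence.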
…ring fail
The 2026-05-30 gemma full bench scored code at 10% (1/10). Inspection
showed almost every "fail" was actually a valid function body followed
by chat-template artifacts the model leaked at the tail
(``return Falsestring\n``, ``thought\n``, Chinese transition phrases,
bad-indent fragments). ``ast.parse(prompt + completion)`` rejected the
whole thing on the trailing noise even though the actual code in the
middle parses cleanly.
Extend the grader's existing "try a few separators" loop with a "try
progressive trim from the end" outer loop. Budget capped at 32
truncations so a degenerate 1000-line response can't blow grader
wall-time. Real cases need 0-3 truncations.
Pass rate on the same snapshot: **10% → 80%** (1/10 → 8/10). The
remaining 2 fails are genuinely broken code (no parseable prefix at
all), as intended.
Tests cover clean, trailing-garbage (regression for the gemma
``return Falsestring`` artifact), broken-everywhere, empty, and the
``thought\n`` chat-template leak. 384 → 389 tests, all green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
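A minimal sketch of the progressive-trim idea, assuming line-wise truncation and leaving out the inner "try a few separators" loop. The function name and the sample strings are hypothetical; only the 32-truncation budget and the use of ``ast.parse`` come from the commit message.

```python
import ast

MAX_TRUNCATIONS = 32  # budget quoted in the commit message


def parses_after_trim(prompt: str, completion: str) -> bool:
    """True if prompt + completion parses after dropping at most 32 trailing lines."""
    lines = completion.splitlines()
    if not any(line.strip() for line in lines):
        return False  # an empty completion never counts as a pass
    # Always keep at least one completion line, so a prompt that happens to
    # parse on its own cannot produce a false pass.
    for dropped in range(min(MAX_TRUNCATIONS, len(lines) - 1) + 1):
        kept = lines[: len(lines) - dropped]
        if not any(line.strip() for line in kept):
            break
        try:
            ast.parse(prompt + "\n".join(kept) + "\n")
            return True
        except (SyntaxError, ValueError):
            # IndentationError is a SyntaxError subclass; ValueError covers
            # null bytes on older interpreters.
            continue
    return False


prompt = "def is_even(n):\n"
noisy = "    return n % 2 == 0\n  thought\n}}\n"  # valid body, then leaked junk
ok = parses_after_trim(prompt, noisy)
```

Here two truncations are enough: the stray brace line goes first, then the badly indented fragment, after which the remaining body parses.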
The 2026-05-30 gemma full bench scored forge 0/30 cases with
``error_type=ValidationError`` on every row. Two stacked bugs:
1. The recording client called ``TextResponse(text=...)`` but the forge
``TextResponse`` field is named ``content``, so every send() raised a
pydantic ValidationError, which surfaced as the per-row error_type.
(Independent bug, fixed in one line: text=→content=.)
2. Even with #1 fixed, gemma emits
``call:get_country_info{country: "France"}call:summarize{text: "..."}``
as plain text in a ``text`` content block, not as Anthropic
``tool_use`` structured blocks, so the old client surfaced text-only
responses and forge would have nudged forever waiting for a tool call.
This patch scans the assistant text for ``call:<verb>{args}``
invocations, parses the args as relaxed JSON (json.loads first, then a
permissive pass that quotes bare keys), and synthesizes ``ToolCall``
entries that forge's WorkflowRunner consumes natively. Malformed args
are dropped (per-call, not per-response) so a single mangled invocation
doesn't crash the bench.
The forge LLMResponse contract is ``list[ToolCall] | TextResponse``
(forge_eval._forge.core.workflow), so synthesis stays within the
existing types; no anthropic.types.Message construction needed.
Why client-side: the server's chat_template / SSE emitter could
translate the plain-text shape into Anthropic tool_use blocks upstream
(cleaner long-term), but that's a C++ change with broader scope. The
client-side path also future-proofs the bench for any other model that
uses the same plain-text tool serialization (codex-mini, DDX bead
executor, etc.), the same intent already recognized in
lucebench.areas.agent's _CALL_INVOCATION pattern.
Tests cover the parsing/synthesis helper in isolation: empty input,
single calls, back-to-back calls, snake_case + kebab-case + ns:verb
names, nested braces, strings containing } chars, unbalanced braces,
and unparseable args. Full test suite remains green (291 passed, +16
from this change).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
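The scan-and-synthesize step could look roughly like the sketch below. The `ToolCall` dataclass is a hypothetical stand-in for forge's own type (its real field names are not shown in this PR), and the helper names and regexes are illustrative rather than the committed code.

```python
import json
import re
from dataclasses import dataclass
from typing import Any, Optional

CALL_RE = re.compile(r"call:((?:[\w-]+:)?[\w-]+)\{")
# Permissive pass: quote bare keys that follow '{' or ','. This is naive and
# can mangle string values that themselves contain ', key:' sequences.
BARE_KEY_RE = re.compile(r'([{,]\s*)([A-Za-z_][\w-]*)(\s*:)')


@dataclass
class ToolCall:
    # Hypothetical stand-in for forge's ToolCall.
    tool: str
    args: dict[str, Any]


def _find_close(text: str, open_idx: int) -> int:
    """Index of the '}' matching the '{' at open_idx, or -1 if unbalanced."""
    depth, in_str, escaped = 0, False, False
    for i in range(open_idx, len(text)):
        ch = text[i]
        if in_str:  # braces inside double-quoted strings do not count
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_str = False
        elif ch == '"':
            in_str = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return i
    return -1


def _parse_args(raw: str) -> Optional[dict[str, Any]]:
    """Strict json.loads first, then retry with bare keys quoted."""
    for candidate in (raw, BARE_KEY_RE.sub(r'\1"\2"\3', raw)):
        try:
            value = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        return value if isinstance(value, dict) else None
    return None


def synthesize_tool_calls(text: str) -> list[ToolCall]:
    calls: list[ToolCall] = []
    pos = 0
    while (m := CALL_RE.search(text, pos)) is not None:
        open_idx = m.end() - 1
        close_idx = _find_close(text, open_idx)
        if close_idx == -1:
            break  # unbalanced braces terminate the scan
        args = _parse_args(text[open_idx : close_idx + 1])
        if args is not None:  # malformed args drop this call only
            calls.append(ToolCall(m.group(1), args))
        pos = close_idx + 1
    return calls


calls = synthesize_tool_calls(
    'call:get_country_info{country: "France"}call:summarize{text: "a } b"}'
)
```

An empty result would then fall back to the text-only response, which keeps the return value inside the ``list[ToolCall] | TextResponse`` contract.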
Summary
Restores the forge area from a 0% pass rate on the 2026-05-30 gemma full bench by fixing two stacked bugs in `luce-bench/src/lucebench/areas/forge.py`:
1. `TextResponse(text=...)` → `ValidationError`: the forge-internal `TextResponse` pydantic model's field is `content`, not `text`. Every `send()` call raised this immediately, so every row's `error_type` was `ValidationError` regardless of what the server actually returned. One-line fix (`text=` → `content=`).
2. `call:<verb>{...}` emissions discarded: gemma emits structured tool calls as inline text inside the `text` content block (e.g. `call:get_country_info{country: "France"}call:summarize{text: "..."}`) rather than as Anthropic `tool_use` blocks. Even with bug 1 fixed, forge would never have seen a tool call and would have nudged until max_iterations.
Empirical signal
From `forge.json` in the `d9ecba6cc105-…-gemma-full-2026-05-30-67f4` snapshot: 30/30 rows failed with `error_type = "ValidationError"`. Every iteration's `tool_calls` list is empty and `output` carries the raw `call:<verb>{...}` text. After this fix, that same response shape produces a `list[ToolCall]` that forge's `WorkflowRunner` consumes natively.
Approach
Client-side synthesis: when the response contains no `tool_use` content blocks but the text payload contains `call:<verb>{...}` invocations, parse the args as relaxed JSON (strict `json.loads` first, then a permissive pass that quotes bare keys) and synthesize `ToolCall` objects.
Why client-side, not server-side: the cleaner long-term fix is a server-side translation in `chat_template.cpp` / `sse_emitter.cpp` that converts `call:<verb>{...}` text into `tool_use` content blocks before they leave the server. That's a larger C++ change and out of scope for this PR; filed as a follow-up. The client-side path also future-proofs the bench for any other model that uses the same plain-text tool serialization (codex-mini, DDX bead executor, etc.), the same intent already recognized in `lucebench.areas.agent`'s `_CALL_INVOCATION` regex.
Types
The forge `LLMResponse` contract is `list[ToolCall] | TextResponse` (see `forge_eval/_forge/core/workflow.py`). Synthesis stays within those existing types: no `anthropic.types.Message` construction needed, no SDK shape mimicry.
Stacked on PR #285
This PR's only changed files (`luce-bench/src/lucebench/areas/forge.py` and the new `luce-bench/tests/test_forge_grader.py`) live entirely inside #285's new `luce-bench/` tree, which doesn't exist on `main` yet. The diff against `main` therefore includes all of #285's content; the substantive delta on top of #285 is two files.
Opened as a draft for now. Once #285 lands, this can be marked ready (the diff will collapse to just the two files). Alternative: change base to `feat/lucebox-docker` for a clean stacked-PR view.
Test plan
* Tests in `luce-bench/tests/test_forge_grader.py` cover: no pattern, single call, back-to-back calls preserving order, snake_case + kebab-case + namespaced verbs, strict-JSON args, malformed args dropped (no crash), unbalanced braces terminating scan, nested braces, strings containing `}` chars, and the reasoning-text stripper.
* Re-run on `sindri` once this lands and confirm a non-zero pass rate.