fix(forge): synthesize tool_use from call:<verb>{} plain-text emissions #320

Closed
easel wants to merge 34 commits into Luce-Org:main from easel:fix/forge-plain-text-tool-call-synthesis

easel commented May 31, 2026

Summary

Restores the forge area from a 0% pass rate on the 2026-05-30 gemma full bench by fixing two stacked bugs in luce-bench/src/lucebench/areas/forge.py:

  1. TextResponse(text=...) → ValidationError: the forge-internal TextResponse pydantic model's field is content, not text. Every send() call raised this immediately, so every row's error_type was ValidationError regardless of what the server actually returned. One-keyword fix (text= → content=).
  2. Plain-text call:<verb>{...} emissions discarded: gemma emits structured tool calls as inline text inside the text content block (e.g. call:get_country_info{country: "France"}call:summarize{text: "..."}) rather than as Anthropic tool_use blocks. Even with bug 1 fixed, forge would never have seen a tool call and would have nudged until max_iterations.

Empirical signal

From forge.json in the d9ecba6cc105-…-gemma-full-2026-05-30-67f4 snapshot: 30/30 rows failed with error_type = "ValidationError". Every iteration's tool_calls list is empty and output carries the raw call:<verb>{...} text. After this fix, that same response shape produces a list[ToolCall] that forge's WorkflowRunner consumes natively.

Approach

Client-side synthesis: when the response contains no tool_use content blocks but the text payload contains call:<verb>{...} invocations, parse the args as relaxed JSON (strict json.loads first, then a permissive pass that quotes bare keys) and synthesize ToolCall objects.
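
The synthesis described above could look roughly like the sketch below. It is not the code in forge.py: the ToolCall dataclass is a hypothetical stand-in for forge's own type, the verb character set (letters, digits, underscore, dot, hyphen) is an assumption, and the bare-key quoting is a simple regex pass that can misfire on string values that themselves contain `, key:` text (such calls are dropped rather than crashing).

```python
import json
import re
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class ToolCall:
    """Hypothetical stand-in for forge's ToolCall type."""
    name: str
    args: Dict[str, Any]


# Assumed verb grammar: snake_case, kebab-case, or dotted namespaces.
_CALL_START = re.compile(r"call:([A-Za-z_][\w.\-]*)\{")
# Matches a bare (unquoted) key that follows "{" or ",".
_BARE_KEY = re.compile(r"([{,]\s*)([A-Za-z_]\w*)(\s*:)")


def _match_braces(text: str, start: int) -> Optional[int]:
    """Return the index just past the brace closing text[start] == '{'."""
    depth = 0
    in_string = False
    escaped = False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            # Braces inside string literals must not affect depth.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return i + 1
    return None  # unbalanced


def _parse_args(raw: str) -> Optional[Dict[str, Any]]:
    """Strict json.loads first, then a permissive pass that quotes bare keys."""
    for candidate in (raw, _BARE_KEY.sub(r'\1"\2"\3', raw)):
        try:
            value = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        return value if isinstance(value, dict) else None
    return None


def synthesize_tool_calls(text: str) -> List[ToolCall]:
    calls: List[ToolCall] = []
    pos = 0
    while (m := _CALL_START.search(text, pos)) is not None:
        brace = m.end() - 1
        end = _match_braces(text, brace)
        if end is None:
            break  # unbalanced braces terminate the scan
        args = _parse_args(text[brace:end])
        if args is not None:  # malformed args are dropped, not raised
            calls.append(ToolCall(m.group(1), args))
        pos = end
    return calls
```

Scanning with an explicit brace matcher rather than a single regex is what lets nested objects and strings containing `}` survive, which a non-greedy `\{.*?\}` pattern would truncate.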

Why client-side, not server-side: the cleaner long-term fix is a server-side translation in chat_template.cpp / sse_emitter.cpp that converts call:<verb>{...} text into tool_use content blocks before they leave the server. That's a larger C++ change and out of scope for this PR — filed as follow-up for a future PR. The client-side path also future-proofs the bench for any other model that uses the same plain-text tool serialization (codex-mini, DDX bead executor, etc.) — same intent already recognized in lucebench.areas.agent's _CALL_INVOCATION regex.

Types

The forge LLMResponse contract is list[ToolCall] | TextResponse (see forge_eval/_forge/core/workflow.py). Synthesis stays within those existing types — no anthropic.types.Message construction needed, no SDK shape mimicry.

Stacked on PR #285

This PR's only changed files (luce-bench/src/lucebench/areas/forge.py and the new luce-bench/tests/test_forge_grader.py) live entirely inside #285's new luce-bench/ tree, which doesn't exist on main yet. The diff against main therefore includes all of #285's content; the substantive delta on top of #285 is two files.

Opened as a draft for now. Once #285 lands, this can be marked ready (the diff will collapse to just the two files). Alternative: change base to feat/lucebox-docker for a clean stacked-PR view.

Test plan

  • New unit tests in luce-bench/tests/test_forge_grader.py cover: no pattern, single call, back-to-back calls preserving order, snake_case + kebab-case + namespaced verbs, strict-JSON args, malformed args dropped (no crash), unbalanced braces terminating scan, nested braces, strings containing } chars, and the reasoning-text stripper.
  • Full pytest suite: 291 passed (up from 275 pre-change).
  • Ruff lint clean on both touched files.
  • Manual end-to-end: re-run forge area against gemma on sindri once this lands and confirm a non-zero pass rate.

easel and others added 30 commits May 29, 2026 01:13
…nch in-tree

Squashes 78 commits from feat/lucebox-docker (PR Luce-Org#285) onto origin/main.
Net: 189 files changed.

Major workstreams folded in:

* Docker prebuild stack: ghcr.io/easel/lucebox-hub:cuda12 image, multi-stage
  Dockerfile, docker-bake.hcl, .github/workflows/docker.yml with GHA cache,
  build identity baked into /opt/lucebox-hub/IMAGE_INFO + /opt/lucebox-hub/HOST_INFO.
* Host wrapper (lucebox.sh): probe_host, smart cmd_serve (INVOCATION_ID
  guard, container-state preflight), cmd_systemctl_passthrough (already-
  active short-circuit, restart-loop detection), cmd_update (bootstrap-
  installer pattern), cmd_completion (bash/zsh/fish), config.toml reader
  (env > toml > default precedence), shellcheck-clean.
* Bootstrap installer (install.sh): bakes LUCEBOX_INSTALLED_FROM into the
  installed copy so lucebox update keeps tracking the channel; refuses
  SHA-pinned URLs without LUCEBOX_INSTALL_CHANNEL.
* In-container Python CLI (lucebox/): sparse config.toml persistence,
  config get/set/unset sub-app, models list/download sub-app (replaces
  download-models), autotune with --apply / --json / --sweep, profile
  collapsed onto luce-bench snapshot (1701 → 183 lines).
* luce-bench: snapshot subcommand + canonical HostInfo schema v2 +
  levels (level0/1/2/3) + report subcommand + submit-baseline + regrade.
* Server (C++): /props.host block + props_schema=4 + host_info read at
  startup, /props.build identity, GGUF metadata + sha256 sidecars,
  model card sidecars.
* Harness: client implementations for claude/codex/opencode/hermes/pi.
* Strict 11-field config.toml allowlist for dflash.* runtime tunables.

Deleted (rolled into new structure):
* server/scripts/bench_agent.py, bench_he.py, bench_llm.py — replaced by
  luce-bench snapshot + areas.
* lucebox configure, lucebox download-models, lucebox benchmark — replaced
  by config sub-app, models sub-app, autotune --sweep.
* luce-bench --sweep flag — moved to argv-sniff subcommand dispatch.

Conflict resolution:
* server/scripts/bench_{agent,he,llm}.py — modify/delete kept the deletion
  (feat/lucebox-docker moved bench machinery into luce-bench).
* README.md — took feat-branch version. origin/main had 19 commits worth
  of minor README tweaks since the branch base; those need to be folded
  back in as a follow-up PR.
* docs/specs/openapi-props.yaml + docs/specs/props-endpoint.md — took
  feat-branch version. origin/main had 1 link-fix commit; feat-branch
  has the schema-4 + host-block additions that strictly supersede.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`_load_or_build()` returned `config_mod.load()`'s result verbatim when
config.toml existed, ignoring `LUCEBOX_*` env vars entirely. That
contradicted the precedence lucebox.sh documents (env > toml > default)
and bit sindri in production: its config.toml had `[image]` without a
`registry` line, so the dataclass default `ghcr.io/luce-org/lucebox-hub`
beat the systemd unit's `Environment=LUCEBOX_IMAGE=ghcr.io/easel/...`.
Symptom: `lucebox start` brought up the wrong (stale luce-org) image
even after explicit `lucebox install` + `lucebox pull` against easel.

Fix: overlay env on top of whatever `load()` returns (or `live_config()`
falls back to). Only the five top-level scalars have env hooks
(LUCEBOX_VARIANT/IMAGE/PORT/CONTAINER/MODELS) — dflash/host/model
intentionally don't.
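
A minimal sketch of that overlay, assuming a flat dataclass for the five top-level scalars. The class name, field names, and every default except the `ghcr.io/luce-org/lucebox-hub` image are hypothetical; only the env var names and the precedence rule come from this commit message.

```python
import dataclasses
import os
from typing import Mapping, Optional


@dataclasses.dataclass(frozen=True)
class TopLevelConfig:
    """Hypothetical container for the five env-overridable scalars."""
    variant: str = "cuda12"                       # assumed default
    image: str = "ghcr.io/luce-org/lucebox-hub"   # default named in the commit
    port: int = 8080                              # assumed default
    container: str = "lucebox"                    # assumed default
    models: str = "/models"                       # assumed default


_ENV_HOOKS = {
    "variant": "LUCEBOX_VARIANT",
    "image": "LUCEBOX_IMAGE",
    "port": "LUCEBOX_PORT",
    "container": "LUCEBOX_CONTAINER",
    "models": "LUCEBOX_MODELS",
}


def overlay_env(cfg: TopLevelConfig,
                env: Optional[Mapping[str, str]] = None) -> TopLevelConfig:
    """Apply env > toml > default: env wins over whatever was loaded."""
    env = os.environ if env is None else env
    overrides = {}
    for field, var in _ENV_HOOKS.items():
        raw = env.get(var)
        if raw:  # unset or empty keeps the toml/default value
            overrides[field] = int(raw) if field == "port" else raw
    return dataclasses.replace(cfg, **overrides)
```

Applying the overlay after loading, rather than inside the loader, means the same function covers both the config.toml path and the `live_config()` fallback.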

Adds two regression tests:
- env beats config.toml when toml has no explicit value for that key,
- env still wins when toml is absent (covers the live_config fallback).

102 lucebox tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…g#285 CI

CI's "Lint Python surfaces touched by lucebox tooling" job ran
`ruff check .` and found 11 errors across surfaces this branch touches.
Ruff --fix handled 6 (import sorting, unused imports); 5 needed
hand-edits:

  luce-bench/src/lucebench/report.py:172  E741  rename `for l in` → `for lineup in`
  lucebox/tests/test_check.py:39, 95      E731  lambda → def stub() for the two HostFacts stubs
  lucebox/tests/test_cli.py:95            E501  wrap the LUCEBOX_HOST_GPU_LIST_CSV setenv
  lucebox/tests/test_sweep.py:174, 177    E501  wrap two CellResult constructors

22 lucebox tests touched still pass; ruff is clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- test_autotune_candidate_configs.py: sort imports (ruff I001).
- download.py: api.repo_info() returns ModelInfo|DatasetInfo|SpaceInfo|KernelInfo
  and KernelInfo has no .siblings; use api.model_info() which returns ModelInfo
  (correct — we only query model repos here), resolving the mypy union-attr error.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The job-level `permissions` block replaces the workflow-level default
entirely, so `actions/checkout` was running without `contents: read`
and would fail on protected refs. Add `contents: read` back alongside
the existing `id-token: write`.

Addresses cubic #1 on PR Luce-Org#285.
- Dockerfile: keep --frozen on the uv sync fallback so the layer can't
  silently resolve outside the lockfile.
- harness/clients/run_lucebench.sh: default LUCEBENCH_THINK empty
  (per-area card defaults govern; --no-think only when explicitly set)
  and default LUCEBENCH_AREA to the level1 capability gate
  (smoke,code,gsm8k,agent,longctx) instead of `all`, which was too broad
  for routine harness runs.

Addresses cubic #2, Luce-Org#3 (P1) and Luce-Org#14 (P2) on PR Luce-Org#285.
…appers

- .github/workflows/{ci,docker,release-luce-bench}.yml: pin
  actions/checkout, docker/{setup-buildx,login,metadata,bake}-action,
  and astral-sh/setup-uv to immutable commit SHAs with `# vN` comments
  so the supply chain is reproducible (Luce-Org#4).
- harness/src/harness/clients/_common.py: replace the external `timeout`
  shell-out with `subprocess.run(..., timeout=N)`, return 124 on
  TimeoutExpired to match GNU timeout's exit code (Luce-Org#5).
- scripts/build_image.sh: normalize REGISTRY to end in `/` instead of
  silently producing `ghcr.io/luce-orglucebox-hub` when the trailing
  slash is missing (Luce-Org#6).
- harness/src/harness/clients/pi.py: non-interactive launch now mirrors
  run_pi.sh's validated invocation (--provider, --print, --mode json,
  --tools, --no-session, --offline) and sets PI_CODING_AGENT_DIR /
  PI_CODING_AGENT_SESSION_DIR / PI_OFFLINE (Luce-Org#7).
- docker-bake.hcl: sanitize `+` → `-` in VERSION before composing tags,
  since `+` is not a valid Docker tag character (Luce-Org#8).
- harness/src/harness/clients/hermes.py: set HERMES_HOME + the rest of
  run_hermes.sh's env wiring and call `chat --provider --model
  --accept-hooks --yolo --max-turns --source --query` instead of a bare
  positional prompt (Luce-Org#9, Luce-Org#10).
- harness/src/harness/clients/openclaw.py: apply the OpenClaw config
  patch via `openclaw config patch --file` before the run, and call
  `agent --local --json --model lucebox/<model> --session-id --timeout
  --message` instead of a bare positional prompt (Luce-Org#11).
- pyproject.toml: drop the dead dflash/scripts/{prefix_cache,test_server,
  tool_memory}.py ruff include pins (those paths were renamed during
  the dflash→server rename and then deleted upstream) (Luce-Org#12).
- lefthook.yml: widen the shellcheck/bash-parse glob from `*.sh` to
  `**/*.sh` so scripts under nested dirs (harness/clients/*.sh,
  scripts/*.sh, server/scripts/*.sh) are linted on commit (Luce-Org#13).

Addresses cubic Luce-Org#4 through Luce-Org#13 (P2) on PR Luce-Org#285. Luce-Org#14 was already addressed in
the previous commit alongside the LUCEBENCH_THINK default fix.
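
For the `_common.py` change in the list above, the pattern is small enough to sketch. The function name and constant are hypothetical; `subprocess.run(..., timeout=N)` and `subprocess.TimeoutExpired` are standard library, and 124 is the exit code GNU timeout uses when the command times out.

```python
import subprocess
import sys
from typing import Sequence

GNU_TIMEOUT_EXIT = 124  # what GNU timeout(1) returns when the command times out


def run_with_timeout(argv: Sequence[str], timeout_s: float) -> int:
    """Run argv; return its exit code, or 124 if it exceeds timeout_s."""
    try:
        completed = subprocess.run(list(argv), timeout=timeout_s, check=False)
    except subprocess.TimeoutExpired:
        # subprocess.run kills and reaps the child before re-raising.
        return GNU_TIMEOUT_EXIT
    return completed.returncode


if __name__ == "__main__":
    print(run_with_timeout([sys.executable, "-c", "pass"], 30))
```

Returning 124 keeps callers that already branch on the shell wrapper's exit code working without changes.
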
- lucebox/README.md: fix the relative link to `cli.py`; resolves to
  `src/lucebox/cli.py` (the actual location), not the nonexistent
  `lucebox/cli.py` (Luce-Org#15).
- luce-bench/NOTICE: the bundled forge_eval LICENSE says
  "Copyright (c) 2025-2026 Antoine Zambelli", not 2024 — sync NOTICE
  with the actual upstream LICENSE (Luce-Org#16).
- luce-bench/src/lucebench/areas/__init__.py: `__all__` was missing
  agent / agent_recorded / forge / longctx / smoke. Add the imports +
  list entries so `from lucebench.areas import *` matches the actual
  area surface (Luce-Org#17).

Addresses cubic Luce-Org#15 through Luce-Org#17 (P3) on PR Luce-Org#285.
…nch in-tree

Squashes 8 commits from feat/lucebox-docker (PR Luce-Org#285) into a single
commit on top of origin/main (8782d07). Net: 189 files changed.

Workstreams folded in:

* Docker prebuild stack: ghcr.io/easel/lucebox-hub:cuda12 image,
  multi-stage Dockerfile with reproducible `uv sync --frozen`,
  docker-bake.hcl with VERSION sanitization for Docker tag charset,
  .github/workflows/docker.yml with SHA-pinned external actions and
  GHA cache, build identity baked into /opt/lucebox-hub/IMAGE_INFO +
  HOST_INFO.

* Host wrapper (lucebox.sh): probe_host, smart cmd_serve (INVOCATION_ID
  guard against systemd self-defeat, container-state preflight),
  cmd_systemctl_passthrough (already-active short-circuit, restart-loop
  detection), cmd_update (bootstrap-installer pattern), cmd_completion
  (bash/zsh/fish), config.toml reader (env > toml > default), all
  shellcheck-clean.

* Bootstrap installer (install.sh): bakes LUCEBOX_INSTALLED_FROM into
  the installed copy so `lucebox update` keeps tracking the channel;
  refuses SHA-pinned URLs without LUCEBOX_INSTALL_CHANNEL.

* In-container Python CLI (lucebox/): sparse config.toml persistence,
  config get/set/unset sub-app, models list/download sub-app
  (replaces download-models), autotune with --apply / --json / --sweep,
  profile collapsed onto luce-bench snapshot (1701 → ~150 lines).
  _load_or_build now respects env > toml > default precedence.

* luce-bench: snapshot subcommand + canonical HostInfo schema v2
  (multi-GPU lineup, WSL detection, source/collector trust metadata) +
  levels (level0/1/2/3) + report subcommand (host column + cross-host
  confounder warnings) + submit-baseline (level3-gated) + regrade.

* Server (C++): /props.host block + props_schema=4 + host_info loader,
  /props.build identity, GGUF metadata + sha256 sidecars, model card
  sidecars. Deleted server/scripts/bench_{agent,he,llm}.py — bench
  machinery moved into luce-bench.

* Harness: client implementations for claude/codex/opencode/hermes/pi
  pointed at the running lucebox server, matched against the validated
  run_*.sh shell wrappers.

Cubic AI code review (17 findings) addressed in full:
  P0: contents: read on luce-bench release job permissions.
  P1: Dockerfile `--frozen` reinstated; LUCEBENCH_THINK default empty
      so per-area defaults apply.
  P2: 6 external actions pinned to immutable SHAs; non-interactive
      timeout via subprocess.run; REGISTRY trailing-slash normalize;
      VERSION + Docker tag charset sanitize; harness pi/hermes/openclaw
      mirrored against run_*.sh wrappers; ruff scan paths corrected to
      server/scripts/; lefthook glob `**/*.sh`; LUCEBENCH_AREA default
      level1.
  P3: lucebox/README.md cli.py link fixed; NOTICE copyright year
      2025-2026; areas/__init__.py __all__ exposes all 10 areas.

CI on PR Luce-Org#285: all 4 checks green (uv workspace, cmake build, cuda12
prebuild, cubic reviewer).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ontent channel

The SseEmitter hard-started in StreamMode::CONTENT and only transitioned to
REASONING when it saw `<think>` in the generated stream. But Qwen3.6 / Laguna
chat templates append `<think>\n` to the prompt suffix when enable_thinking is
honored, so the model emits reasoning tokens directly with no opening tag —
the emitter never transitioned and reasoning text leaked into `content` while
`reasoning_content` stayed empty. ds4-eval pass rate: 14.1% (think) vs 71.7%
(no-think) for Qwen3.6-27B Q4_K_M.

The plumbing was already there: parse_reasoning() supports
started_in_thinking=true (reasoning.h:17-19) but no caller passed it.

Fix:

1. chat_template.h: render_chat_template / render_chat_template_jinja now
   return a PromptRenderResult { text, started_in_thinking }. The built-in
   QWEN3 and LAGUNA branches set started_in_thinking deterministically when
   enable_thinking && add_generation_prompt; GEMMA4 stays false (its
   reasoning channel is opened by the model emitting `<|channel>`, which
   http_server forwards into the emitter as `<think>`). The Jinja path
   suffix-sniffs the rendered prompt for a trailing `<think>` opener and
   emits a [WARN] log when sniffing decides true so a template/model-card
   mismatch surfaces at runtime.

2. SseEmitter: add `initial_mode = StreamMode::CONTENT` defaulted parameter.
   When constructed with REASONING, active_kind_ initializes to "thinking"
   so the Anthropic first content_block is `thinking` instead of `text`
   (avoids a spurious empty text-block stop+restart on the first reasoning
   delta). Deliberately leaves checked_think_prefix_ at its default (false)
   so the existing one-time `<think>` strip guard still trips if a
   template/model-card mismatch causes the model to emit a redundant opener.

3. http_server.cpp: thread render_result.started_in_thinking through
   ParsedRequest into the SseEmitter's initial_mode. Both streaming and
   non-streaming paths feed tokens through the same emitter, so the fix
   covers both response shapes.
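
The three steps can be illustrated with a Python sketch. The real change is C++ and operates on a token stream; this version works on a complete string and only shows why the starting mode matters. `sniff_started_in_thinking` and `split_channels` are hypothetical names, and the whitespace stripping is a simplification.

```python
from dataclasses import dataclass
from typing import Tuple

OPEN, CLOSE = "<think>", "</think>"


@dataclass
class PromptRenderResult:
    text: str
    started_in_thinking: bool


def sniff_started_in_thinking(rendered_prompt: str) -> bool:
    """Suffix-sniff: does the prompt end with a trailing <think> opener?"""
    return rendered_prompt.rstrip().endswith(OPEN)


def split_channels(generated: str, started_in_thinking: bool) -> Tuple[str, str]:
    """Return (reasoning_content, content) for a finished generation."""
    if started_in_thinking:
        text = generated.lstrip()
        if text.startswith(OPEN):
            # Redundant opener from a template/model-card mismatch: strip it.
            text = text[len(OPEN):]
        reasoning, _, content = text.partition(CLOSE)
        return reasoning.strip(), content.strip()
    if OPEN in generated:
        before, _, rest = generated.partition(OPEN)
        reasoning, _, after = rest.partition(CLOSE)
        return reasoning.strip(), (before + after).strip()
    # No opener seen and not told we started in thinking: all of it is content.
    return "", generated.strip()
```

With `started_in_thinking=False`, a generation such as `plan</think>Answer` falls through to the last branch and the reasoning leaks into content, which is the pre-fix behavior described above.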

Tests: add 12 unit tests under test_server_unit (assertion count 1608 →
1637): SseEmitter initial_mode=REASONING routing for OPENAI_CHAT and
ANTHROPIC formats (closed, unclosed, redundant-opener-strip cases) plus
PromptRenderResult.started_in_thinking provenance for QWEN3 / LAGUNA /
GEMMA4 (enable/disable/no-gen-prompt) and the Jinja suffix-sniff
positive/negative cases.

Smoke-tested manually against Qwen3.6-27B Q4_K_M; non-streaming
`/v1/chat/completions` with `thinking:{type:enabled}` now populates
reasoning_content and never leaks `</think>` into content.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add three C++ tests that chain render_chat_template + SseEmitter so the
wiring between the renderer's started_in_thinking flag and the emitter's
initial_mode is exercised end-to-end, not just at each end. The per-unit
tests above each verify their half of the contract, but the original bug
was a missing call-site wire — both halves were correct in isolation.

Also tighten the Python integration test assertions for enable_thinking
and reasoning.effort: require non-empty reasoning_content and no raw
<think>/</think> in either channel. The prior 'doesn't crash' assertion
would have passed on the broken code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…box-docker)

Brings the Qwen3.6/Laguna think-mode reasoning fix (route reasoning into
reasoning_content channel instead of content) into the lucebox-docker stack.
…budget

Increment 1 (Tier 1): model-card registry resolvable by normalized model id
(/props.model_card → bundled cards → family fallback), per-model thinking tokens
via the card with a thinking-capability gate, configurable --reasoning-effort
{low,medium,high} (was hardcoded high) and --thinking-budget-tokens N, plus
card_source/card_stem provenance on every row. Cards bundled into the wheel via
hatch force-include from share/model_cards (single source; CI drift guard TODO).

Tier 2: --client-thinking-budget N — client-side thinking termination for
providers that ignore native budget hints. Streams the response, estimates
reasoning tokens (char/4), and when over budget aborts and issues a forced-
</think> re-prompt (a fresh conditioned sample, not decoder continuation) using
the card's terminator + reply reserve, producing a gradable answer. Gated on
reasoning being identifiable in the stream (reasoning_content deltas or <think>
tags); unmarked output is left untouched. client_abort rows are a separate
benchmark mode (never pooled with single-pass), with continuation-failure and
answer-started-before-abort rows excluded from the aggregate and coverage
reported.
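
The abort decision in Tier 2 can be sketched as follows, assuming reasoning arrives as a sequence of string deltas. The function names are hypothetical and the forced-`</think>` re-prompt itself is not shown; only the char/4 estimate and the over-budget check come from the description above.

```python
from typing import Iterable, List, Tuple


def estimate_tokens(text: str) -> int:
    """Rough token estimate: one token per four characters."""
    return len(text) // 4


def consume_reasoning(deltas: Iterable[str], budget_tokens: int) -> Tuple[str, bool]:
    """Accumulate reasoning deltas; stop as soon as the estimate exceeds budget.

    Returns (reasoning_so_far, aborted).
    """
    parts: List[str] = []
    chars = 0
    for delta in deltas:
        parts.append(delta)
        chars += len(delta)
        if chars // 4 > budget_tokens:
            return "".join(parts), True
    return "".join(parts), False
```

Because the check runs per delta, the overshoot is bounded by the size of one delta, which is consistent with the roughly 2001 reasoning tokens per row reported for a 2000-token budget.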

Verified live: OpenRouter qwen3.6-27b ignores reasoning_effort/budget_tokens
(reasoning unbounded), but --client-thinking-budget 2000 bounds it precisely
(~2001 reasoning tokens/row, continuation=ok, 8/8 pass on the head subset).
234 tests pass; ruff clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add `--multi-turn` mode to scripts/extract-agentic-fixture.py for the
coding-agent-loop autotune profile: walk one session in record order,
emit a replay case at each target-token bucket (default
8K/16K/32K/64K/100K/128K). Each case ships an OpenAI-shaped `messages`
list and a `prefill-and-decode` verifier so the sweep can score
"does this max_ctx cell actually serve a trace of n − reply_budget
tokens." Snapshot semantics: case `context_tokens_approx <=
target_bucket_tokens` is guaranteed (snapshot taken pre-append for
the message that would cross).

Also fix a latent bug in `_is_claude_session`: it returned False on
the first non-user record, which misrouted any Claude session that
led with `permission-mode`, `system`, or `queue-operation` (most
real sessions do) — including the one this commit was developed
against.

Tests cover bucket fit, role collapsing, thinking-block drop, PII
scrub on HOME paths + token-looking secrets, Codex record decoding,
and the leading-meta-record regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…erifier

Add three small surfaces to the ``agent_recorded`` area to support
the coding-agent-loop autotune sweep:

* ``load_agent_recorded_multi_turn_cases()`` — reads the bucketed
  replay fixture produced by ``extract-agentic-fixture.py --multi-turn``
  and returns cases sorted ascending by ``target_bucket_tokens``.
  Distinct from the v1 single-prompt fixture; the two coexist.
* ``pick_multi_turn_case_for_budget()`` — given a prompt-token budget
  (typically ``max_ctx − reply_budget``), returns the largest case
  that fits. ``None`` when no case fits.
* ``grade_prefill_and_decode()`` — pass/fail verifier for the sweep:
  non-empty response within wall budget, no server error. Lighter
  than tool-schema-coverage on purpose — the sweep is asking "did
  this max_ctx setting serve a trace of this length", not "did the
  model do the task well."
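
A sketch of the largest-fit selection, assuming each case carries the `target_bucket_tokens` and `context_tokens_approx` fields named in the fixture description. The `ReplayCase` type and `pick_largest_fit` name are hypothetical and are not the signature of `pick_multi_turn_case_for_budget()`.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ReplayCase:
    """Hypothetical view of one bucketed replay case."""
    target_bucket_tokens: int
    context_tokens_approx: int


def pick_largest_fit(cases: Sequence[ReplayCase],
                     prompt_budget_tokens: int) -> Optional[ReplayCase]:
    """Return the largest case whose approximate context fits the budget."""
    fitting = [c for c in cases if c.context_tokens_approx <= prompt_budget_tokens]
    return max(fitting, key=lambda c: c.context_tokens_approx, default=None)
```

A caller would typically pass `max_ctx - reply_budget` as the budget and treat `None` as "skip this cell".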

Ship a harvested fixture: one Claude Code session sliced into 6
bucketed cases (8K through 128K tokens). Per repo guidance, one
long session is enough to cycle with until something breaks; the
broader corpus can land later if signal demands.

Tests cover the loader contract (cases fit under their bucket,
sorted by bucket), the budget picker (largest-fit, None-on-empty),
and the verifier's three failure modes (server error, wall-budget
overrun, response-too-short) plus the reasoning_content fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…loop

Add an autotune Profile abstraction so different workloads can sweep
different axes with different scorers. Two profiles ship:

* ``heuristic`` (default, backward-compatible) — preset-agnostic
  bracket, scores by mean ``decode_tokens_per_sec`` from a luce-bench
  level1 snapshot. Identical to the prior behavior.
* ``coding-agent-loop`` — architecture-aware. Gemma4's bracket is
  ``max_ctx × fa_window × budget × pflash_mode`` (KV-quant axis
  omitted because the gemma4 backend hardcodes F16 — verified at
  gemma4_loader.cpp). Qwen3.6 / laguna keep cache_type as an axis
  since their loader actually respects it. Scoring is composite:
  pass-rate on the agent_recorded multi-turn fixture first, then
  ``completion_tokens / wall_seconds`` as a tps proxy (the
  longctx-area snapshots ship empty ``decode_tokens_per_sec``).

Wire ``--fa-window`` through to the server end-to-end:

* ``DflashRuntime.fa_window`` (0 = full attention, server default)
* ``DFLASH_FA_WINDOW`` emitted by docker_run.py when nonzero
* entrypoint.sh appends ``--fa-window N`` to the server CLI iff
  ``DFLASH_FA_WINDOW > 0`` — unset env still reproduces stock behavior
* ``dflash.fa_window`` round-trips through config.toml
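
The entrypoint rule above ("append iff greater than zero") is implemented in shell in this commit; the same logic in Python, as an illustrative sketch with a hypothetical function name, looks like this:

```python
from typing import List, Mapping


def fa_window_args(env: Mapping[str, str]) -> List[str]:
    """Return the extra server argv for DFLASH_FA_WINDOW, or [] when unset/0."""
    raw = env.get("DFLASH_FA_WINDOW", "").strip()
    window = int(raw) if raw else 0
    # 0 means full attention (the server default), so nothing is appended.
    return ["--fa-window", str(window)] if window > 0 else []
```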

CLI: ``lucebox autotune --sweep --profile coding-agent-loop``. New
``--list-profiles`` flag prints the registered profile table.

Tests: 318/318 green. New coverage:

* Profile registry + ``get_profile`` error path
* gemma bracket excludes the KV-quant axis (regression for the
  no-op axis bug)
* gemma bracket varies max_ctx × fa_window × budget
* qwen bracket includes tq3_0 + q8_0
* sub-22 GB tiers fall back to base-only (OOM safety)
* ``_pick_winner`` ranks agent-replay results by pass→speed→ctx
* ``fa_window`` is in the sweep allowlist

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When the sweep is invoked directly (e.g. `uv run python -m lucebox
autotune --sweep` for development, or any path that bypasses the
lucebox.sh wrapper), the LUCEBOX_HOST_* env vars aren't set and
``host_facts.from_env()`` returns a zero-VRAM HostFacts. Every profile
bracket then falls through to the <22 GB "base only" branch and the
sweep silently degrades to a 1-cell smoke test that overwrites the
operator's real config (e.g. dropping max_ctx from 131072 to the
DflashRuntime default 16384).

Fall back to ``cfg.host`` (populated by an earlier `lucebox check`
via the wrapper) when ``from_env()`` yields no signal. Test regresses
the original symptom: with LUCEBOX_HOST_* unset, the coding-agent-loop
bracket on a 24 GB persisted host must produce a multi-cell sweep,
not collapse to one base cell.
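
The fallback can be sketched as below. `HostFacts` here is a hypothetical one-field stand-in, and "no signal" is assumed to mean zero VRAM; the real type and check may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HostFacts:
    """Hypothetical stand-in; the real HostFacts carries more fields."""
    vram_gb: float = 0.0


def resolve_host(from_env: HostFacts, persisted: Optional[HostFacts]) -> HostFacts:
    """Prefer env-derived facts; fall back to the persisted cfg.host."""
    if from_env.vram_gb > 0:
        return from_env
    if persisted is not None and persisted.vram_gb > 0:
        return persisted
    return from_env  # nothing better known; caller sees the zero-VRAM facts
```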

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First end-to-end coding-agent-loop sweep on sindri's gemma-4-26b ran
12 cells. Top-line findings (full detail in
docs/experiments/gemma4-26b-coding-agent-loop-sweep-2026-05-30.md):

* All six 98K cells pass at 2.8–3.5 tok/s on 90K real prompt tokens
* All six 131K cells fail HTTP 400 — the picker's chars/4 estimate
  undercounts real gemma tokenization by 1.39×, so the 102K-bucket
  case overshoots the 126976-token effort-tier ceiling at max_ctx=
  131072 and the server rejects every cell identically
* fa_window=0 (full attention) marginally beat fa_window=2048;
  budget axis was flat (3.5 / 3.4 / 3.2 at 16 / 22 / 32)

Two changes ride with the doc:

1. Bump the gemma WSL 24 GB heuristic max_ctx from 65536 → 98304.
   The original 65K cap cited unverified CUDA VMM failures; the
   empirical run proves 98K runs 90K-token prompts with ~3 GB VRAM
   headroom. 131K remains plausible as a manual operator override
   but stays out of the default until we have a fixture sized for
   the real 126976-token budget.

2. Add a 0.7 safety_factor to ``pick_multi_turn_case_for_budget``.
   The factor closes the chars/4 → real-tokenizer gap so the sweep
   no longer picks a case whose actual prompt would overshoot.
   Operators can pass safety_factor=1.0 when fixtures are
   accurately tokenized.
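
The arithmetic behind the 0.7 factor can be checked with a short worked example. The 1.39× undercount and the 126976-token ceiling are taken from the findings above; treating a bucket label of N K as N × 1024 estimated tokens, and assuming each case fills its bucket, are simplifications made here.

```python
from typing import List, Optional

UNDERCOUNT = 1.39     # real gemma tokens per chars/4-estimated token (measured)
CEILING = 126_976     # effort-tier ceiling at max_ctx=131072 (from the sweep)
BUCKETS_K: List[int] = [8, 16, 32, 64, 100, 128]  # default target buckets


def largest_fit(budget_tokens: int, safety_factor: float) -> Optional[int]:
    """Largest bucket (in estimated tokens) under budget * safety_factor."""
    limit = budget_tokens * safety_factor
    fitting = [k * 1024 for k in BUCKETS_K if k * 1024 <= limit]
    return max(fitting, default=None)


without_sf = largest_fit(CEILING, 1.0)  # picks the 100K bucket
with_sf = largest_fit(CEILING, 0.7)     # picks the 64K bucket

# Real prompt size once the tokenizer gap is applied.
real_without = without_sf * UNDERCOUNT  # overshoots the ceiling -> HTTP 400
real_with = with_sf * UNDERCOUNT        # stays under the ceiling

print(without_sf, round(real_without), with_sf, round(real_with))
```

Since 1 / 1.39 is about 0.72, a factor of 0.7 is slightly conservative relative to the measured gap.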

Tests updated to reflect the new heuristic ceiling + the
safety-factor guard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…think-channel) into local

# Conflicts:
#	luce-bench/pyproject.toml
#	luce-bench/src/lucebench/areas/agent_recorded.py
#	luce-bench/src/lucebench/cli.py
#	luce-bench/src/lucebench/report.py
#	luce-bench/src/lucebench/runner.py
#	luce-bench/src/lucebench/schema.py
#	luce-bench/tests/test_agent_recorded.py
#	luce-bench/tests/test_fixtures.py
#	luce-bench/tests/test_runner.py
#	luce-bench/tests/test_smoke_area.py
#	lucebox/src/lucebox/autotune.py
#	lucebox/src/lucebox/cli.py
#	lucebox/src/lucebox/config.py
#	lucebox/src/lucebox/docker_run.py
#	lucebox/src/lucebox/sweep.py
#	lucebox/src/lucebox/types.py
#	lucebox/tests/test_autotune.py
#	lucebox/tests/test_autotune_candidate_configs.py
#	lucebox/tests/test_sweep.py
#	scripts/extract-agentic-fixture.py
#	server/scripts/entrypoint.sh
Two operator-facing docs to land alongside the 2026-05-30 gemma
experiment write-up:

* autotune-profile-sweep-protocol.md — the procedural how-to for the
  profile-driven sweep machinery (preconditions, invocation, result
  reading, known gotchas including the chars/4 undercount and the
  wrapper-localhost issue). Generalizes the gemma run into something
  someone can follow without re-deriving steps.

* qwen3.6-27b-sweep-runbook-bragi.md — the concrete sequence to repeat
  the coding-agent-loop sweep against qwen on the RTX 5090 Laptop.
  Calls out the KV-quant axis difference (qwen35 respects cache_type,
  gemma4 doesn't) and what to expect / how to roll back / how to
  document findings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…bcommands

Routes config/smoke/models/check/profile/print-run/print-serve-argv and
read-only autotune (no --sweep) to `docker exec` into the running lucebox
container instead of `docker run --rm`. Two wins:

1. Shares the live server's network namespace — `lucebox config` / `smoke`
   etc. can reach localhost:8080 on the running server, which the isolated
   docker-run container can't.
2. Skips the ~1-3s cold-start of `docker run --rm` per call (config get
   drops from ~4s to ~1.8s in the field on a mid-sweep sindri).

Service-restarting workloads (`autotune --sweep`, `serve`, `pull`,
`update`, install/uninstall, systemctl passthrough, client launchers) stay
on the host-side / docker-run path — exec'ing those into the very
container we'd be restarting would self-destruct.

Falls back to docker run when the container is not running so first-run /
pre-install flows still work. Add `--no-exec` (and `LUCEBOX_NO_EXEC=1`)
escape hatch for debugging the wrapper or when the in-container Python is
stale relative to the image.

The exec invocation goes through `/opt/lucebox-hub/server/scripts/entrypoint.sh
lucebox <args>` because the image has no top-level `lucebox` binary on
PATH — the `lucebox` token is a SUBCMD the entrypoint dispatches to
`uv run ... python -m lucebox`. Calling the entrypoint explicitly keeps
the exec path bit-for-bit equivalent to the docker-run dispatch.

Tests: 8 new cases in scripts/test_lucebox_sh.sh covering route-to-exec
when running, fall-back-to-run when not running, autotune --sweep
sticking to docker run, autotune --list-profiles routing to exec,
--no-exec + LUCEBOX_NO_EXEC=1 overrides, smoke routing, and the usage
help mentioning the new behavior. Mocks docker via PATH shim that prints
its argv so the test asserts on the actual invocation. Total: 54 pass
(up from 46).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The longctx grader rejected any response that didn't literally begin
with "Risk:" at offset zero. Thinking-mode models on the
``longctx --think`` snapshot routinely emit a one-sentence transition
phrase ("Considering the limited time by the user, I have to give
the solution based on the thinking directly now.") *before* the
required ``Risk:`` line, so the 2026-05-27 gemma longctx-think run
saw 2/6 false fails and the 2026-05-30 qwen3.6 thinking benchmark
saw a 0/1 false fail on frontier-2k.

Switch the primary `graded_pass` and `format_pass` metrics to use a
multiline regex (``Risk:`` at the start of any line in the
response), and surface the literal-prefix result alongside as
``strict_pass`` so snapshots can still distinguish "model complied
exactly" from "model preambled but eventually complied." No change
to the prompt — the instruction still asks for a single sentence —
just the grader stops penalizing models for narrating their
thinking-budget pivot.
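
A minimal sketch of the two checks, using the metric names from this commit. The function name is hypothetical, the minimum-length check implied by the "too-short stub" test is omitted, and `strict_pass` is assumed to be a plain prefix test with no whitespace stripping.

```python
import re
from typing import Dict

# "Risk:" at the start of any line, not only at offset zero.
_RISK_LINE = re.compile(r"^Risk:", re.MULTILINE)


def grade_risk_format(response: str) -> Dict[str, bool]:
    lenient = _RISK_LINE.search(response) is not None
    strict = response.startswith("Risk:")
    return {
        "graded_pass": lenient,
        "format_pass": lenient,
        "strict_pass": strict,
    }
```

Note that `re.MULTILINE` only anchors at line starts, so a sentence such as `The main Risk: ...` still fails both checks.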

Tests: 7 new cases covering pure-prefix pass, leading-whitespace
strict, thinking-preamble lenient (regression for the qwen run),
no-risk-anywhere fail, too-short stub, and the case-wrapper
surfacing strict_pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…kerfile

Two bugs found during bragi (RTX 5090 Laptop) autotune sweep setup:

1. `lucebox autotune --sweep --profile coding-agent-loop` failed with
   "No module named 'lucebench'" because sweep.py's agent_replay scorer
   imports `lucebench.areas.agent_recorded` but `luce-bench` was not
   declared as a dependency of the `lucebox` workspace member. Scored
   cells all returned fail, so the sweep would restore the backup config
   and exit with no winner. Fix: add `luce-bench` (workspace dep) to
   `lucebox/pyproject.toml` so `uv run --project lucebox` always has
   `lucebench` importable.

2. Dockerfile was copying `share/model_cards` to two paths
   (`/opt/lucebox-hub/server/share/model_cards` for the C++ server and
   `/opt/lucebox-hub/share/model_cards` for luce-bench's hatchling
   force-include). Replace the duplicate with a single copy +
   `ln -s` so the image carries one copy at the canonical luce-bench
   path and the C++ server resolves it via symlink.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… at 98K, 131K confirmed for gemma

Two coding-agent-loop sweeps on bragi (RTX 5090 Laptop MaxQ, 23 GB VRAM, sm_120):

Qwen3.6-27B sweep findings:
- q8_0 OOMs at max_ctx=98304 on 23 GB (model ~18-19 GB + KV ~5-6 GB = 24-25 GB)
- tq3_0 required at 98K: KV only ~2-3 GB, leaving ~1-2 GB headroom
- budget=32 unreliable at 65K (edge VRAM); fine at 98K with tq3_0
- q8_0 is faster at 65K/b16 (4.0 vs 3.1 tok/s) but not viable for production

Gemma 4 26B sweep findings:
- All 12 cells pass including 131K (sindri's 131K failures were a fixture-picker
  artifact — the picker selected a 100K case that expanded to >126976 real tokens,
  triggering HTTP 400; not a VRAM limit)
- fa_window and budget axes flat (~2.0 tok/s across all cells)
- Winner: budget=22, max_ctx=131072, fa_window=0

Code changes:
- autotune.py: 22-31 GB heuristic explicitly sets tq3_0 for qwen (prevents OOM
  on fresh installs); qwen bracket skips q8_0 at max_ctx>=98304 (saves 3 cells)
- sweep.py: fix winner selection: sort by -max_ctx first, then -speed_metric;
  prevents the metric artifact where a smaller fixture inflates 65K cell speeds
- docs/experiments: two new sweep docs + correction note on sindri gemma doc

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rdering

Bragi's commit 3dffb30 inverted the agent_replay winner-pick sort:
max_ctx is now the primary key, speed_metric is the tiebreaker within
the same max_ctx. The test that asserted speed-first ordering was
inherited from the old (buggy) behavior and started failing after the
merge.

Replace with two tests that pin the new contract:

* cross-max_ctx: larger max_ctx wins even when a smaller-ctx cell
  reports higher speed_metric. The speed gap is a fixture artifact
  (smaller ctx picks shorter fixture cases).
* within same max_ctx: speed_metric breaks the tie as before.
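
The ordering contract can be sketched as below. ``Cell`` and ``pick_winner`` are hypothetical names, not sweep.py's actual types; only the sort key (``-max_ctx`` first, then ``-speed_metric``) comes from the commit:

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Cell:
    """Hypothetical stand-in for one passing sweep cell."""
    max_ctx: int
    speed_metric: float


def pick_winner(cells: Sequence[Cell]) -> Optional[Cell]:
    """Largest max_ctx wins; speed_metric only breaks ties within one max_ctx."""
    if not cells:
        return None
    return min(cells, key=lambda c: (-c.max_ctx, -c.speed_metric))
```

With made-up numbers, ``pick_winner([Cell(65536, 4.0), Cell(98304, 3.1)])`` returns the 98304 cell even though the 65536 cell reports the higher speed.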

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After bragi proved 131K viable on a 23 GB laptop, sindri (3090 Ti, 24 GB)
got `max_ctx=131072, budget=22, fa_window=0` and re-ran level2. No
quality regression vs the prior 98K config; longctx stays 100% (6/6)
through frontier-64k. VRAM 21.1 / 24.6 GiB used at boot, ~3 GiB
headroom. The sindri gemma sweep doc now carries the verification
table alongside the existing correction note.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ted verb synonyms

The 2026-05-30 gemma full bench surfaced a grader-strictness issue:
agent_recorded scored 23% (6/26) because gemma emitted real tool
engagement in a non-Claude format —
``call:execute-bead:read-file{path:...}`` over and over. The grader's
``_tool_mentioned`` looked for ``\bRead\b`` / Claude-named synonyms,
none of which matched ``read-file`` or ``read_file``.

Two changes:

* Expand ``_TOOL_SYNONYMS`` with the hyphen/underscore verb forms
  that models emit when given a custom tool namespace in the prompt:
  ``read_file/read-file``, ``list_files/list-files/ls_files/ls``,
  ``edit_file/edit-file``, ``write_file/write-file``,
  ``grep_files/search_code``, ``exec_command/shell-exec``, etc.
* Add a ``call:[namespace:]<verb>{...}`` regex that pulls verbs out
  of structured-tool-call emissions and feeds them through the
  synonym check. Catches the case where the model never narrates the
  tool name in English but does invoke it via the structured format.

Re-grading the 2026-05-30 gemma snapshot: pass rate climbs
**23.1% → 30.8%** (4 cases newly recognized as tool engagement).
Intentionally conservative — ``execute-bead`` as a bare namespace is
NOT a Bash synonym because it wraps many verbs, each of which maps
to its own Claude tool.

Tests: 6 new cases pinning the hyphenated-call-verb, snake_case,
no-namespace, case-insensitive-verb, and end-to-end pass paths.
Suite goes 370 → 376 (all green).
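
A rough sketch of the verb extraction plus synonym lookup. The regex, the table contents, and the verb-to-tool mapping here are illustrative assumptions; the real ``_TOOL_SYNONYMS`` table is not reproduced in this message:

```python
import re

# Optional "<namespace>:" prefix, then the verb, then the opening brace.
_CALL_VERB = re.compile(r"call:(?:[A-Za-z_][\w-]*:)?([A-Za-z_][\w-]*)\s*\{")

# Assumed mapping from normalized verbs to Claude tool names (illustrative only).
_VERB_TO_TOOL = {
    "read_file": "Read",
    "edit_file": "Edit",
    "write_file": "Write",
    "grep_files": "Grep",
    "search_code": "Grep",
    "exec_command": "Bash",
    "shell_exec": "Bash",
}


def tools_invoked(text: str) -> set[str]:
    """Return the tool names implied by call:[ns:]<verb>{...} emissions."""
    tools = set()
    for match in _CALL_VERB.finditer(text):
        # Case-insensitive; hyphen and underscore forms are treated alike.
        verb = match.group(1).lower().replace("-", "_")
        tool = _VERB_TO_TOOL.get(verb)
        if tool is not None:
            tools.add(tool)
    return tools
```

A bare ``call:execute-bead{...}`` yields no tool under this sketch, since the namespace itself is not in the table.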

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
easel and others added 4 commits May 30, 2026 19:03
The synthetic agent area's grader passed responses containing any of:
code fence, JSON tool_use envelope, or apply_patch envelope. The
2026-05-30 gemma full bench showed it missing real agent engagement
from the codex-large-explore case: response was ``call:update_plan{...}\n
call:shell{command: ...}\n`` with no code fence or OpenAI-style
``"name": "Read"`` envelope. That's exactly as agent-shaped as a JSON
tool_use block, just a different serialization.

Add a fourth pass class: ``call:<verb>{`` or ``call:<ns>:<verb>{``.
The agent area pass rate on the same snapshot lifts from 2/4 (50%)
to 3/4 (75%); the remaining fail (codex-mini-read-task) is a
genuine narrative-only response and stays failed.

New tests cover all four pass classes plus the narrative-only and
inline-backtick negative cases. 376 → 384 tests, all green.
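
For illustration, the four pass classes might be checked roughly as follows. Every pattern here is an assumption (the grader's real regexes are not shown in this message), including the ``*** Begin Patch`` marker used to detect an apply_patch envelope:

```python
import re

_FENCE = "`" * 3  # built at runtime so this example contains no literal fence

# Hypothetical patterns, one per pass class.
_CODE_FENCE = re.compile("^" + re.escape(_FENCE), re.MULTILINE)
_JSON_TOOL = re.compile(r'"name"\s*:\s*"[A-Za-z_][\w-]*"')
_APPLY_PATCH = re.compile(r"^\*\*\* Begin Patch", re.MULTILINE)
_CALL_SHAPE = re.compile(r"\bcall:(?:[A-Za-z_][\w-]*:)?[A-Za-z_][\w-]*\{")


def agent_shaped(response: str) -> bool:
    """True if the response matches any of the four pass classes."""
    patterns = (_CODE_FENCE, _JSON_TOOL, _APPLY_PATCH, _CALL_SHAPE)
    return any(p.search(response) for p in patterns)
```

A narrative-only answer, or one that merely mentions a command in inline backticks, matches none of the patterns and stays failed.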

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ring fail

The 2026-05-30 gemma full bench scored code at 10% (1/10). Inspection
showed almost every "fail" was actually a valid function body
followed by chat-template artifacts the model leaked at the tail
(``return Falsestring\n``, ``thought\n``, Chinese transition phrases,
bad-indent fragments). ``ast.parse(prompt + completion)`` rejected
the whole thing on the trailing noise even though the actual code
in the middle parses cleanly.

Extend the grader's existing "try a few separators" loop with a
"try progressive trim from the end" outer loop. Budget capped at 32
truncations so a degenerate 1000-line response can't blow grader
wall-time. Real cases need 0-3 truncations.

Pass rate on the same snapshot: **10% → 80%** (1/10 → 8/10).
The remaining 2 fails are genuinely broken code (no parseable
prefix at all), as intended.

Tests cover clean, trailing-garbage (regression for the gemma
``return Falsestring`` artifact), broken-everywhere, empty, and the
``thought\n`` chat-template leak. 384 → 389 tests, all green.
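
A minimal sketch of the progressive-trim loop, assuming line-granular truncation and a two-entry separator list. Both are assumptions; only ``ast.parse(prompt + completion)`` and the budget of 32 come from this message:

```python
import ast

MAX_TRUNCATIONS = 32  # cap from the commit message


def parses_with_trim(prompt: str, completion: str,
                     separators: tuple[str, ...] = ("", "\n")) -> bool:
    """Drop trailing lines one at a time until prompt + completion parses."""
    lines = completion.splitlines()
    for dropped in range(min(MAX_TRUNCATIONS, len(lines)) + 1):
        kept = lines[: len(lines) - dropped]
        if not kept:
            break  # nothing left of the completion: no parseable prefix
        body = "\n".join(kept)
        for sep in separators:
            try:
                ast.parse(prompt + sep + body)
                return True
            except (SyntaxError, ValueError):
                continue  # try the next separator, then trim one more line
    return False
```

An empty completion returns ``False`` under this sketch, and so does a completion in which no prefix parses.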

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 2026-05-30 gemma full bench scored forge 0/30 cases with
``error_type=ValidationError`` on every row. Two stacked bugs:

1. The recording client called ``TextResponse(text=...)`` but the
   forge ``TextResponse`` field is named ``content`` — every send()
   raised a pydantic ValidationError, which surfaced as the per-row
   error_type. (Independent bug, fixed in one line: text=→content=.)

2. Even with #1 fixed, gemma emits ``call:get_country_info{country:
   "France"}call:summarize{text: "..."}`` as plain text in a ``text``
   content block — not as Anthropic ``tool_use`` structured blocks —
   so the old client surfaced text-only responses and forge would
   have nudged until max_iterations without ever seeing a tool call.

This patch scans the assistant text for ``call:<verb>{args}``
invocations, parses the args as relaxed JSON (json.loads first, then
a permissive pass that quotes bare keys), and synthesizes
``ToolCall`` entries that forge's WorkflowRunner consumes natively.
Malformed args are dropped (per-call, not per-response) so a single
mangled invocation doesn't crash the bench.

The forge LLMResponse contract is ``list[ToolCall] | TextResponse``
(forge_eval._forge.core.workflow), so synthesis stays within the
existing types — no anthropic.types.Message construction needed.

Why client-side: the server's chat_template / SSE emitter could
translate the plain-text shape into Anthropic tool_use blocks
upstream (cleaner long-term), but that's a C++ change with broader
scope. The client-side path also future-proofs the bench for any
other model that uses the same plain-text tool serialization
(codex-mini, DDX bead executor, etc.) — same intent already
recognized in lucebench.areas.agent's _CALL_INVOCATION pattern.

Tests cover the parsing/synthesis helper in isolation: empty input,
single calls, back-to-back calls, snake_case + kebab-case + ns:verb
names, nested braces, strings containing } chars, unbalanced
braces, and unparseable args. Full test suite remains green (291
passed, +16 from this change).
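
The parse-and-synthesize step could look roughly like the sketch below. ``SketchToolCall`` is a hypothetical stand-in for forge's ``ToolCall``, and the regexes and helper names are assumptions rather than the code in forge.py:

```python
import json
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SketchToolCall:
    """Hypothetical stand-in for forge's ToolCall."""
    name: str
    args: dict = field(default_factory=dict)


_CALL_HEAD = re.compile(r"call:((?:[A-Za-z_][\w-]*:)?[A-Za-z_][\w-]*)\{")
# Match whole JSON strings first so bare-key quoting never fires inside them.
_STRING_OR_BARE_KEY = re.compile(
    r'"(?:\\.|[^"\\])*"|([{,]\s*)([A-Za-z_][\w-]*)(\s*:)')


def _balanced_end(text: str, start: int) -> int:
    """Index just past the '}' matching the '{' at text[start], else -1."""
    depth, in_string, escaped = 0, False, False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return i + 1
    return -1


def _quote_bare_keys(raw: str) -> str:
    def fix(m: re.Match) -> str:
        if m.group(2) is None:  # a quoted string: leave untouched
            return m.group(0)
        return f'{m.group(1)}"{m.group(2)}"{m.group(3)}'
    return _STRING_OR_BARE_KEY.sub(fix, raw)


def _parse_relaxed(raw: str) -> Optional[dict]:
    """Strict json.loads first, then a pass that quotes bare keys."""
    for candidate in (raw, _quote_bare_keys(raw)):
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        return parsed if isinstance(parsed, dict) else None
    return None


def synthesize_tool_calls(text: str) -> list[SketchToolCall]:
    calls, pos = [], 0
    while (m := _CALL_HEAD.search(text, pos)) is not None:
        brace = m.end() - 1
        end = _balanced_end(text, brace)
        if end == -1:  # unbalanced braces: skip this invocation only
            pos = m.end()
            continue
        args = _parse_relaxed(text[brace:end])
        if args is not None:  # malformed args are dropped per call
            calls.append(SketchToolCall(m.group(1), args))
        pos = end
    return calls
```

Scanning braces by hand instead of with a single regex is what lets nested objects and strings containing ``}`` survive; an empty result would leave the caller on the plain ``TextResponse`` path.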

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@easel
Collaborator Author

easel commented May 31, 2026

Folded into PR #285 — the forge fix (commit deba2fd) is now part of feat/lucebox-docker (current tip deb5adb). Closing this stacked PR; reviewable as part of #285.

@easel easel closed this May 31, 2026