fix(server): split soft-close probe ids from inject ids by easel · Pull Request #331 · Luce-Org/lucebox-hub

easel · 2026-06-03T03:24:37Z

Summary

Soft-close (PR #326) shipped with an empirically inert configuration on qwen3.6-27b. Root cause: BudgetHook::close_token_ids was used for both the soft-close peek probe AND the inject sequence. For qwen3.6-27b, the configured thinking_terminator_hint is a 16+ token English directive starting with "Considering the limited time by the user, I have to give the solution based on the thinking directly now.\n</think>\n\n" — so the peek was checking the logit of token 79939 ("Considering"), a mid-sentence content token the model rarely promotes. Trajectory data showed prob_ratio < 1e-8 across 12,888 reasoning steps; the dial was dead at any sampled ratio in {0.1, 0.3, 0.5, 0.7, 0.9}.

Fix

Split the probe ids (short marker sequence) from the inject ids (full directive). New BudgetHook::soft_close_probe_ids field; soft_close_probe_token() accessor with empty-falls-back-to-close_token_ids legacy behavior so models without the split see zero diff.
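The fallback rule above can be sketched as follows. This is a minimal illustration, not the code in model_backend.h: the struct is a simplified stand-in, and the `-1` sentinel for "nothing configured" is an assumption.

```cpp
#include <cstdint>
#include <vector>

// Simplified stand-in for BudgetHook; the real struct has more fields.
struct BudgetHook {
    std::vector<int32_t> close_token_ids;       // full inject sequence (the directive)
    std::vector<int32_t> soft_close_probe_ids;  // short marker sequence; may be empty

    // Token whose logit the soft-close peek inspects.
    // Prefers the probe ids; falls back to the inject ids when no probe ids
    // are configured (legacy behavior). Returns -1 when both are empty
    // (the sentinel value is an assumption of this sketch).
    int32_t soft_close_probe_token() const {
        if (!soft_close_probe_ids.empty()) return soft_close_probe_ids.front();
        if (!close_token_ids.empty()) return close_token_ids.front();
        return -1;
    }
};
```

With the qwen3.6-27b ids quoted in this PR, a hook carrying probe id 248069 peeks at `</think>`, while a hook with only inject ids still peeks at 79939 as before.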

server_main.cpp now tokenizes the marker substring (`</think>`) separately when the hint contains it, and logs both probe and inject id vectors at startup. qwen35_backend.cpp::maybe_soft_close peeks probe_ids.front() instead of close_token_ids.front(); the [soft-trace] close0= field reports the probe id so trajectory CSVs stay interpretable. The hard-close path (maybe_force_close) is untouched and still injects the full directive.

Empirical validation

Re-ran /tmp/probe_soft_close_trajectory.sh against an image built from this branch (lucebox-hub:175c8a72-cuda12) on sindri (qwen3.6-27b, RTX 3090 Ti).

Phase 2 trajectory (ratio=0, debug logits on): </think> (id 248069) reliably becomes argmax-competitive (diff >= log(0.1) = -2.30) at 66-94% of natural reasoning length across 5 diverse prompts. max_diff reaches 0.000 (prob_ratio = 1.0) on every prompt — vs prior baseline max_diff = -9.69 on token 79939. 9.7 nat improvement, restoring the mechanism to its designed regime.
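For readers unfamiliar with the diff / prob_ratio relationship used above: under softmax, p(probe)/p(argmax) = exp(logit(probe) - logit(argmax)), so comparing the logit difference against log(ratio) is equivalent to comparing the probability ratio against the dial. A minimal sketch follows; the function names are hypothetical, and treating ratio <= 0 as "disabled" is an assumption.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// diff = logit(probe) - logit(argmax). Always <= 0; exactly 0 when the probe
// token is itself the argmax. Precondition: logits is non-empty.
inline float probe_logit_diff(const std::vector<float>& logits, std::size_t probe_id) {
    const float top = *std::max_element(logits.begin(), logits.end());
    return logits.at(probe_id) - top;
}

// Fires when p(probe) / p(argmax) >= ratio, i.e. diff >= log(ratio).
// ratio <= 0 is treated as "soft-close disabled" (assumption of this sketch).
inline bool soft_close_should_fire(const std::vector<float>& logits,
                                   std::size_t probe_id, float ratio) {
    if (ratio <= 0.0f || logits.empty()) return false;
    return probe_logit_diff(logits, probe_id) >= std::log(ratio);
}
```

A probe sitting 9.69 nats below the argmax never clears log(0.1) = -2.30, which matches the dead dial seen on token 79939.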

| prompt | n_steps | fire@0.1 | fire@0.9 |
|---|---|---|---|
| 0 (arithmetic) | 1723 | step 1135 (66%) | step 1393 (81%) |
| 1 (Python) | 2081 | step 1950 (94%) | step 1950 (94%) |
| 2 (logic puzzle) | 6232 | step 5714 (92%) | step 5714 (92%) |
| 3 (train meet) | 3894 | step 3341 (86%) | step 3341 (86%) |
| 4 (influenza) | 5771 | step 4993 (87%) | step 4993 (87%) |

Phase 1 live firing: soft-close fires reliably at ratios 0.1-0.9 with stop_reason=end_turn and coherent text outputs across all configurations. Single-sample thinking-token savings are noisy (sampling non-determinism is ±30%); multi-seed sweeps are deferred to a follow-up.

Tests

3 new unit tests in test_server_unit.cpp: probe-uses-probe-ids-not-inject-ids, probe-ids-empty-falls-back-to-close-token-ids, inject-sequence-unchanged-when-fires.
Fixed pre-existing OOB write in test_soft_close_determinism_when_disabled (vocab 1000 → 250000) — UB-silent until new tests perturbed heap layout.
Suite: 1985 assertions, 2 pre-existing failures unrelated to soft-close. Both are emitter parser tests from PR #329 ("fix(server): support gemma-4's plain-text call:<verb>{} tool-call format") and reproduce on the unmodified feat/lucebox-docker tip.

Files changed (+259/-28)

server/src/common/model_backend.h — new soft_close_probe_ids + accessor.
server/src/qwen35/qwen35_backend.cpp — peek probe, inject full sequence.
server/src/server/http_server.h — ServerConfig::think_close_probe_token_ids.
server/src/server/http_server.cpp — wire probe ids into per-request BudgetHook.
server/src/server/server_main.cpp — split-tokenize marker substring; startup logging.
server/test/test_server_unit.cpp — 3 new tests + OOB fix.

Test plan

Local unit tests pass (all soft-close tests green).
Smoke test: close0 in [soft-trace] now reports 248069 (`</think>`), not 79939.
Phase 2 trajectory validates `</think>` reaches argmax across 5 prompts.
CI build + cmake + cubic review.

🤖 Generated with Claude Code

…g-42 tail-capture guard ee7 truncates drafter forward at layer 7 of 28, scoring only those layers. 9.3× drafter wall at 128K (RTX 3090, Qwen3.6-27B-Q4_K_M target + Qwen2.5-0.5B-BF16 drafter). Anchor-transitive cascade rescues multi-hop on bimodal-density prompts (gated, default OFF). Bug Luce-Org#42 fix: tail-capture view-bounds guard at S%4096 in {1..7}. 5 unit tests included. Bench scripts split to follow-up PR.

…de env vars)

At >=32K context the needle text is more likely to straddle multiple chunks (chunk_size=32), and the fixed anchor_radius=2 window (5 chunks ~160 tokens) loses the back half of the needle digits: the model retrieves '...is 4' but truncates/hallucinates the continuation.

Adaptive scaling based on n_chunks:
  <32K context (<1024 chunks): radius=2, max_anchor_hits=8 (unchanged)
  32-64K (1024-2047 chunks): radius=4, max_anchor_hits=16
  >=64K (>=2048 chunks): radius=8, max_anchor_hits=32

Override via PFLASH_COMPRESS_ANCHOR_RADIUS / PFLASH_COMPRESS_MAX_ANCHOR_HITS env vars (legacy DFLASH_COMPRESS_* names still accepted).

Validated at 49K context: NIAH needle 'kowefada 1596346' correctly retrieved (was: '1594' or hallucinated 'is 048394839483' before fix). Resolves the long-standing 'project_64k_quality_cliff' memory entry.
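The tier table in this commit message can be sketched as a pure function of n_chunks. The names below are hypothetical, and the env-var overrides are deliberately left out.

```cpp
#include <cstddef>

// Hypothetical container for the two scaled parameters.
struct AnchorParams {
    int radius;
    int max_anchor_hits;
};

// Tiers as listed in the commit message (chunk_size = 32 tokens per chunk).
inline AnchorParams adaptive_anchor_params(std::size_t n_chunks) {
    if (n_chunks >= 2048) return {8, 32};  // >=64K context
    if (n_chunks >= 1024) return {4, 16};  // 32-64K context
    return {2, 8};                         // <32K context (unchanged)
}

// Tokens covered by the window around one anchor: (2 * radius + 1) chunks.
inline std::size_t anchor_window_tokens(int radius, std::size_t chunk_size = 32) {
    return (2 * static_cast<std::size_t>(radius) + 1) * chunk_size;
}
```

With radius=2 the window is 5 chunks, i.e. the ~160 tokens cited above; a 49K-token prompt falls into the middle tier.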

Mirror the gemma4_backend.cpp:75-104 defensive pattern for the qwen35 target loader and the dflash decode draft loader. After loading weight tensors, derive head_dim / n_head / n_head_kv from wq->ne[1] / wk->ne[1] and compare against GGUF-declared values; set_last_error and return false on mismatch. Makes the 'stale scalar at graph-build time' bug class structurally impossible. Load-time only, no runtime cost. Existing well-formed GGUFs are unaffected (smoke verified).

When pflash compresses, set gen_req.fa_window_override = effective_prompt + 256 so spec-decode verify sees the entire compressed prompt. Pflash already paid compute to pick which tokens matter; verify never throws any of them away. When the override would exceed 2 * cfg_.fa_window (spec-decode's drafter cost stops earning its tok/J), the C2 gate in qwen35_backend's generate() falls back to AR (fa_window=0, full attention). AR sees every kept token at every context; we choose mechanism, not visibility. Zero new CLI flags. --draft remains the only knob for composition; all per-request adaptation is internal.

…scade default-on Adds backwards-compat fallback wrappers for 6 cascade env vars in both standard and bandit code paths, so harness scripts using either spelling work against this binary. Emits one-time WARN to stderr when the legacy DFLASH_* spelling is honored. Also flips the default for `use_transitive` from `false` to `true` because the gated rare-token bridge improves multi-hop F1 with zero downside in the cascade-already-firing case.

…th drift Single helper reads all 10 PFLASH_*/DFLASH_* env vars once. Both qwen35_score_and_compress and drafter_score_and_compress call it. Removes two 70-LOC duplicate env-reading blocks and the duplicated anchor-radius comment. Also removes dead force_chunk_neighborhood (no callers) and collapses the 4-overload load_drafter pyramid to one canonical implementation + 3 thin forwarders.

- qwen3_graph.cpp: collapse 18-line alg-note, trim VRAM prose (3 blocks), remove early_exit_n alias (inline early_exit_pre at call site)
- qwen35_backend.cpp: C2 gate 9-line → 2-line + docs ref; do_ar_decode budget-hook 15-line → 4-line + docs ref
- http_server.cpp: Design 1 rationale 13-line → 2-line + docs ref
- model_backend.h: BudgetHook 23-line essay → 3-line + docs ref
- gguf_target_loader.cpp: 4-line prose tail → 1-line
- .gitignore: ignore *.git-head / *.pre-pflash-rename workdir artifacts
- docs/: pflash-compress-cfg.md, pflash-adaptive-composition.md, anchor-transitive.md (consolidated rationale)

…nking is off The hard-coded renderer appends a closed think prefill when thinking is disabled. Some Qwen3.6 Jinja templates omit that final assistant suffix, leaving the model in the wrong decoding state for tool use. Mirror the hard-coded behavior here when the rendered prompt ends with a bare assistant generation prompt; tolerate trailing-whitespace variants (single \n, double \n\n, trailing space). Diagnosed by Round 5b D peer-chat showing dflash drafter accept_rate=0.0%: the drafter was distilled with the closed-think suffix in its training distribution; the Unsloth Qwen3-Coder template doesn't emit it, so target and drafter disagree on what comes after <|im_start|>assistant\n.

… only The previous commit applied the closed-think suffix to all Jinja-rendered prompts. Add arch_hint (ChatFormat) parameter to render_chat_template_jinja, defaulting to QWEN3, and guard the post-processing block with arch_hint == ChatFormat::QWEN3. Call site in http_server.cpp passes chat_format_ so other archs (Laguna, Gemma4) are unaffected. qwen35moe inherits ChatFormat::QWEN3 by design (matches drafter distillation). 5 unit tests cover: thinking-off appends, thinking-on no-append, non-Qwen3 arch no-append (Laguna + Gemma4), qwen35moe inherits QWEN3, no double-append when template already closes the think block. Diagnosis + verification protocol in docs/pflash-drafter-template-alignment.md.

Extract the C2 spec-decode gate from an inline expression in qwen35_backend.cpp into a pure predicate header c2_gate.h. Zero behavior change. Identical math:

  (fa_window_override == 0) || (fa_window_override <= 2 * fa_window_cfg)

The new header documents the empirically-derived rationale: at compressed KV sizes (pflash compression of long prompts), T_draft/T_target ratio approaches 1, eliminating spec-decode's profit margin over AR. Empirical at D_composition 128K replay: AR=27.5 tok/s vs forced spec-decode=5.74 tok/s. The gate correctly blocks spec-decode when eff_fa_window > 2*fa_window_cfg.

Adds 5 unit tests locking in the predicate's behavior with explicit Round 5 4-arm matrix bench citations.

Files:
- server/src/qwen35/c2_gate.h (new)
- server/src/qwen35/qwen35_backend.cpp (+1 include, inline -> call)
- server/test/test_server_unit.cpp (+60 LOC, 5 tests)
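As a rough illustration of the predicate quoted in this commit, together with the `effective_prompt + 256` override described in the earlier pflash commit: the function names below are hypothetical and need not match what c2_gate.h actually exports.

```cpp
#include <cstdint>

// Override chosen when pflash compresses: the whole compressed prompt plus
// a 256-token margin, as described in the earlier commit message.
inline int32_t fa_window_override_for(int32_t effective_prompt) {
    return effective_prompt + 256;
}

// true  -> spec-decode is allowed
// false -> fall back to AR (fa_window = 0, full attention)
inline bool c2_gate_allows_spec_decode(int32_t fa_window_override,
                                       int32_t fa_window_cfg) {
    return (fa_window_override == 0) ||
           (fa_window_override <= 2 * fa_window_cfg);
}
```

An override of 0 means "no override requested", so the gate passes through unchanged; the boundary case `override == 2 * fa_window_cfg` still allows spec-decode.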

…nch in-tree Squashes 78 commits from feat/lucebox-docker (PR Luce-Org#285) onto origin/main. Net: 189 files changed. Major workstreams folded in: * Docker prebuild stack: ghcr.io/easel/lucebox-hub:cuda12 image, multi-stage Dockerfile, docker-bake.hcl, .github/workflows/docker.yml with GHA cache, build identity baked into /opt/lucebox-hub/IMAGE_INFO + /opt/lucebox-hub/HOST_INFO. * Host wrapper (lucebox.sh): probe_host, smart cmd_serve (INVOCATION_ID guard, container-state preflight), cmd_systemctl_passthrough (already- active short-circuit, restart-loop detection), cmd_update (bootstrap- installer pattern), cmd_completion (bash/zsh/fish), config.toml reader (env > toml > default precedence), shellcheck-clean. * Bootstrap installer (install.sh): bakes LUCEBOX_INSTALLED_FROM into the installed copy so lucebox update keeps tracking the channel; refuses SHA-pinned URLs without LUCEBOX_INSTALL_CHANNEL. * In-container Python CLI (lucebox/): sparse config.toml persistence, config get/set/unset sub-app, models list/download sub-app (replaces download-models), autotune with --apply / --json / --sweep, profile collapsed onto luce-bench snapshot (1701 → 183 lines). * luce-bench: snapshot subcommand + canonical HostInfo schema v2 + levels (level0/1/2/3) + report subcommand + submit-baseline + regrade. * Server (C++): /props.host block + props_schema=4 + host_info read at startup, /props.build identity, GGUF metadata + sha256 sidecars, model card sidecars. * Harness: client implementations for claude/codex/opencode/hermes/pi. * Strict 11-field config.toml allowlist for dflash.* runtime tunables. Deleted (rolled into new structure): * server/scripts/bench_agent.py, bench_he.py, bench_llm.py — replaced by luce-bench snapshot + areas. * lucebox configure, lucebox download-models, lucebox benchmark — replaced by config sub-app, models sub-app, autotune --sweep. * luce-bench --sweep flag — moved to argv-sniff subcommand dispatch. 
Conflict resolution: * server/scripts/bench_{agent,he,llm}.py — modify/delete kept the deletion (feat/lucebox-docker moved bench machinery into luce-bench). * README.md — took feat-branch version. origin/main had 19 commits worth of minor README tweaks since the branch base; those need to be folded back in as a follow-up PR. * docs/specs/openapi-props.yaml + docs/specs/props-endpoint.md — took feat-branch version. origin/main had 1 link-fix commit; feat-branch has the schema-4 + host-block additions that strictly supersede. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`_load_or_build()` returned `config_mod.load()`'s result verbatim when config.toml existed, ignoring `LUCEBOX_*` env vars entirely. That contradicted the precedence lucebox.sh documents (env > toml > default) and bit sindri in production: its config.toml had `[image]` without a `registry` line, so the dataclass default `ghcr.io/luce-org/lucebox-hub` beat the systemd unit's `Environment=LUCEBOX_IMAGE=ghcr.io/easel/...`. Symptom: `lucebox start` brought up the wrong (stale luce-org) image even after explicit `lucebox install` + `lucebox pull` against easel. Fix: overlay env on top of whatever `load()` returns (or `live_config()` falls back to). Only the five top-level scalars have env hooks (LUCEBOX_VARIANT/IMAGE/PORT/CONTAINER/MODELS) — dflash/host/model intentionally don't. Adds two regression tests: - env beats config.toml when toml has no explicit value for that key, - env still wins when toml is absent (covers the live_config fallback). 102 lucebox tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…g#285 CI

CI's "Lint Python surfaces touched by lucebox tooling" job ran `ruff check .` and found 11 errors across surfaces this branch touches. Ruff --fix handled 6 (import sorting, unused imports); 5 needed hand-edits:

- luce-bench/src/lucebench/report.py:172 E741: rename `for l in` → `for lineup in`
- lucebox/tests/test_check.py:39, 95 E731: lambda → def stub() for the two HostFacts stubs
- lucebox/tests/test_cli.py:95 E501: wrap the LUCEBOX_HOST_GPU_LIST_CSV setenv
- lucebox/tests/test_sweep.py:174, 177 E501: wrap two CellResult constructors

22 lucebox tests touched still pass; ruff is clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- test_autotune_candidate_configs.py: sort imports (ruff I001). - download.py: api.repo_info() returns ModelInfo|DatasetInfo|SpaceInfo|KernelInfo and KernelInfo has no .siblings; use api.model_info() which returns ModelInfo (correct — we only query model repos here), resolving the mypy union-attr error. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The job-level `permissions` block replaces the workflow-level default entirely, so `actions/checkout` was running without `contents: read` and would fail on protected refs. Add `contents: read` back alongside the existing `id-token: write`. Addresses cubic #1 on PR Luce-Org#285.

- Dockerfile: keep --frozen on the uv sync fallback so the layer can't silently resolve outside the lockfile. - harness/clients/run_lucebench.sh: default LUCEBENCH_THINK empty (per-area card defaults govern; --no-think only when explicitly set) and default LUCEBENCH_AREA to the level1 capability gate (smoke,code,gsm8k,agent,longctx) instead of `all`, which was too broad for routine harness runs. Addresses cubic #2, Luce-Org#3 (P1) and Luce-Org#14 (P2) on PR Luce-Org#285.

…appers

- .github/workflows/{ci,docker,release-luce-bench}.yml: pin actions/checkout, docker/{setup-buildx,login,metadata,bake}-action, and astral-sh/setup-uv to immutable commit SHAs with `# vN` comments so the supply chain is reproducible (Luce-Org#4).
- harness/src/harness/clients/_common.py: replace the external `timeout` shell-out with `subprocess.run(..., timeout=N)`, return 124 on TimeoutExpired to match GNU timeout's exit code (Luce-Org#5).
- scripts/build_image.sh: normalize REGISTRY to end in `/` instead of silently producing `ghcr.io/luce-orglucebox-hub` when the trailing slash is missing (Luce-Org#6).
- harness/src/harness/clients/pi.py: non-interactive launch now mirrors run_pi.sh's validated invocation (--provider, --print, --mode json, --tools, --no-session, --offline) and sets PI_CODING_AGENT_DIR / PI_CODING_AGENT_SESSION_DIR / PI_OFFLINE (Luce-Org#7).
- docker-bake.hcl: sanitize `+` → `-` in VERSION before composing tags, since `+` is not a valid Docker tag character (Luce-Org#8).
- harness/src/harness/clients/hermes.py: set HERMES_HOME + the rest of run_hermes.sh's env wiring and call `chat --provider --model --accept-hooks --yolo --max-turns --source --query` instead of a bare positional prompt (Luce-Org#9, Luce-Org#10).
- harness/src/harness/clients/openclaw.py: apply the OpenClaw config patch via `openclaw config patch --file` before the run, and call `agent --local --json --model lucebox/<model> --session-id --timeout --message` instead of a bare positional prompt (Luce-Org#11).
- pyproject.toml: drop the dead dflash/scripts/{prefix_cache,test_server,tool_memory}.py ruff include pins (those paths were renamed during the dflash→server rename and then deleted upstream) (Luce-Org#12).
- lefthook.yml: widen the shellcheck/bash-parse glob from `*.sh` to `**/*.sh` so scripts under nested dirs (harness/clients/*.sh, scripts/*.sh, server/scripts/*.sh) are linted on commit (Luce-Org#13).

Addresses cubic Luce-Org#4–Luce-Org#13 (P2) on PR Luce-Org#285. Luce-Org#14 was already addressed in the previous commit alongside the LUCEBENCH_THINK default fix.

- lucebox/README.md: fix the relative link to `cli.py`; resolves to `src/lucebox/cli.py` (the actual location), not the nonexistent `lucebox/cli.py` (Luce-Org#15). - luce-bench/NOTICE: the bundled forge_eval LICENSE says "Copyright (c) 2025-2026 Antoine Zambelli", not 2024 — sync NOTICE with the actual upstream LICENSE (Luce-Org#16). - luce-bench/src/lucebench/areas/__init__.py: `__all__` was missing agent / agent_recorded / forge / longctx / smoke. Add the imports + list entries so `from lucebench.areas import *` matches the actual area surface (Luce-Org#17). Addresses cubic Luce-Org#15–Luce-Org#17 (P3) on PR Luce-Org#285.

…nch in-tree Squashes 8 commits from feat/lucebox-docker (PR Luce-Org#285) into a single commit on top of origin/main (8782d07). Net: 189 files changed. Workstreams folded in: * Docker prebuild stack: ghcr.io/easel/lucebox-hub:cuda12 image, multi-stage Dockerfile with reproducible `uv sync --frozen`, docker-bake.hcl with VERSION sanitization for Docker tag charset, .github/workflows/docker.yml with SHA-pinned external actions and GHA cache, build identity baked into /opt/lucebox-hub/IMAGE_INFO + HOST_INFO. * Host wrapper (lucebox.sh): probe_host, smart cmd_serve (INVOCATION_ID guard against systemd self-defeat, container-state preflight), cmd_systemctl_passthrough (already-active short-circuit, restart-loop detection), cmd_update (bootstrap-installer pattern), cmd_completion (bash/zsh/fish), config.toml reader (env > toml > default), all shellcheck-clean. * Bootstrap installer (install.sh): bakes LUCEBOX_INSTALLED_FROM into the installed copy so `lucebox update` keeps tracking the channel; refuses SHA-pinned URLs without LUCEBOX_INSTALL_CHANNEL. * In-container Python CLI (lucebox/): sparse config.toml persistence, config get/set/unset sub-app, models list/download sub-app (replaces download-models), autotune with --apply / --json / --sweep, profile collapsed onto luce-bench snapshot (1701 → ~150 lines). _load_or_build now respects env > toml > default precedence. * luce-bench: snapshot subcommand + canonical HostInfo schema v2 (multi-GPU lineup, WSL detection, source/collector trust metadata) + levels (level0/1/2/3) + report subcommand (host column + cross-host confounder warnings) + submit-baseline (level3-gated) + regrade. * Server (C++): /props.host block + props_schema=4 + host_info loader, /props.build identity, GGUF metadata + sha256 sidecars, model card sidecars. Deleted server/scripts/bench_{agent,he,llm}.py — bench machinery moved into luce-bench. 
* Harness: client implementations for claude/codex/opencode/hermes/pi pointed at the running lucebox server, matched against the validated run_*.sh shell wrappers. Cubic AI code review (17 findings) addressed in full: P0: contents: read on luce-bench release job permissions. P1: Dockerfile `--frozen` reinstated; LUCEBENCH_THINK default empty so per-area defaults apply. P2: 6 external actions pinned to immutable SHAs; non-interactive timeout via subprocess.run; REGISTRY trailing-slash normalize; VERSION + Docker tag charset sanitize; harness pi/hermes/openclaw mirrored against run_*.sh wrappers; ruff scan paths corrected to server/scripts/; lefthook glob `**/*.sh`; LUCEBENCH_AREA default level1. P3: lucebox/README.md cli.py link fixed; NOTICE copyright year 2025-2026; areas/__init__.py __all__ exposes all 10 areas. CI on PR Luce-Org#285: all 4 checks green (uv workspace, cmake build, cuda12 prebuild, cubic reviewer). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…default 0.10)

- Gate context-window admission on post-compression effective size, not raw, so >128K-raw prompts compress to fit max_ctx instead of 400 / oversized KV reservation.
- Pre-compression keep-ratio sanity guard (raw*keep+max_out>max_ctx); the real effective-size gate runs post-compression in worker_loop.
- Default prefill-keep-ratio 0.05 -> 0.10: real ~2x compression on agentic content (0.25 over-forces anchor-transitive to ~100% = no-op + rejects >128K).
- Evidence (RTX3090, agentic replay, keep=0.10): 167K raw admitted -> 71K eff (42.6%), prefill 145s vs 845s forced; 32-128K real compression; tool-parse intact; 1629 unit asserts green; 14-cell P/PD sweep zero crashes.
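The two-stage admission described above might look roughly like this. Only the pre-compression inequality (raw*keep+max_out>max_ctx) comes from the commit message; the function names and the exact form of the post-compression check, including whether max_out is counted there, are assumptions of this sketch.

```cpp
#include <cstdint>

// Stage 1: cheap sanity guard before compression. Rejects only when even the
// projected compressed size plus the output budget cannot fit in max_ctx.
inline bool precompress_guard_rejects(int64_t raw_tokens, double keep_ratio,
                                      int64_t max_out, int64_t max_ctx) {
    const double projected =
        static_cast<double>(raw_tokens) * keep_ratio + static_cast<double>(max_out);
    return projected > static_cast<double>(max_ctx);
}

// Stage 2 (assumed form): the real gate, applied to the measured effective
// size once compression has run.
inline bool postcompress_admits(int64_t effective_tokens, int64_t max_out,
                                int64_t max_ctx) {
    return effective_tokens + max_out <= max_ctx;
}
```

The point of the split is that the first check uses only the nominal keep ratio, while the second uses the size compression actually produced (71K effective from 167K raw in the evidence line, well above the nominal 10%).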

…ontent channel

The SseEmitter hard-started in StreamMode::CONTENT and only transitioned to REASONING when it saw `<think>` in the generated stream. But Qwen3.6 / Laguna chat templates append `<think>\n` to the prompt suffix when enable_thinking is honored, so the model emits reasoning tokens directly with no opening tag: the emitter never transitioned and reasoning text leaked into `content` while `reasoning_content` stayed empty. ds4-eval pass rate: 14.1% (think) vs 71.7% (no-think) for Qwen3.6-27B Q4_K_M.

The plumbing was already there: parse_reasoning() supports started_in_thinking=true (reasoning.h:17-19) but no caller passed it.

Fix:

1. chat_template.h: render_chat_template / render_chat_template_jinja now return a PromptRenderResult { text, started_in_thinking }. The built-in QWEN3 and LAGUNA branches set started_in_thinking deterministically when enable_thinking && add_generation_prompt; GEMMA4 stays false (its reasoning channel is opened by the model emitting `<|channel>`, which http_server forwards into the emitter as `<think>`). The Jinja path suffix-sniffs the rendered prompt for a trailing `<think>` opener and emits a [WARN] log when sniffing decides true so a template/model-card mismatch surfaces at runtime.
2. SseEmitter: add `initial_mode = StreamMode::CONTENT` defaulted parameter. When constructed with REASONING, active_kind_ initializes to "thinking" so the Anthropic first content_block is `thinking` instead of `text` (avoids a spurious empty text-block stop+restart on the first reasoning delta). Deliberately leaves checked_think_prefix_ at its default (false) so the existing one-time `<think>` strip guard still trips if a template/model-card mismatch causes the model to emit a redundant opener.
3. http_server.cpp: thread render_result.started_in_thinking through ParsedRequest into the SseEmitter's initial_mode. Both streaming and non-streaming paths feed tokens through the same emitter, so the fix covers both response shapes.
Tests: add 12 unit tests under test_server_unit (assertion count 1608 → 1637): SseEmitter initial_mode=REASONING routing for OPENAI_CHAT and ANTHROPIC formats (closed, unclosed, redundant-opener-strip cases) plus PromptRenderResult.started_in_thinking provenance for QWEN3 / LAGUNA / GEMMA4 (enable/disable/no-gen-prompt) and the Jinja suffix-sniff positive/negative cases. Smoke-tested manually against Qwen3.6-27B Q4_K_M; non-streaming `/v1/chat/completions` with `thinking:{type:enabled}` now populates reasoning_content and never leaks `</think>` into content. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
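A minimal sketch of the Jinja-path suffix sniff and how its result could select the emitter's starting mode. The helper names are hypothetical, the enum is a stand-in for the real StreamMode, and tolerating trailing whitespace after the opener is an assumption (the templates described above append `<think>\n`).

```cpp
#include <cstddef>
#include <string>

// Stand-in for the emitter's mode enum.
enum class StreamMode { CONTENT, REASONING };

// True when the rendered prompt ends with a "<think>" opener, ignoring
// trailing whitespace (whitespace tolerance is an assumption of this sketch).
inline bool prompt_ends_in_think_opener(const std::string& prompt) {
    const std::string opener = "<think>";
    const std::size_t last = prompt.find_last_not_of(" \t\r\n");
    if (last == std::string::npos) return false;  // empty or all whitespace
    const std::size_t len = last + 1;             // length without trailing whitespace
    return len >= opener.size() &&
           prompt.compare(len - opener.size(), opener.size(), opener) == 0;
}

// Maps the renderer's flag onto the emitter's starting mode.
inline StreamMode initial_mode_for(bool started_in_thinking) {
    return started_in_thinking ? StreamMode::REASONING : StreamMode::CONTENT;
}
```

A prompt that ends in a closed think block (`<think>\n\n</think>\n\n`) does not match, so the emitter still starts in CONTENT for thinking-off requests.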

Add three C++ tests that chain render_chat_template + SseEmitter so the wiring between the renderer's started_in_thinking flag and the emitter's initial_mode is exercised end-to-end, not just at each end. The per-unit tests above each verify their half of the contract, but the original bug was a missing call-site wire — both halves were correct in isolation. Also tighten the Python integration test assertions for enable_thinking and reasoning.effort: require non-empty reasoning_content and no raw <think>/</think> in either channel. The prior 'doesn't crash' assertion would have passed on the broken code. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…box-docker) Brings the Qwen3.6/Laguna think-mode reasoning fix (route reasoning into reasoning_content channel instead of content) into the lucebox-docker stack.

…budget Increment 1 (Tier 1): model-card registry resolvable by normalized model id (/props.model_card → bundled cards → family fallback), per-model thinking tokens via the card with a thinking-capability gate, configurable --reasoning-effort {low,medium,high} (was hardcoded high) and --thinking-budget-tokens N, plus card_source/card_stem provenance on every row. Cards bundled into the wheel via hatch force-include from share/model_cards (single source; CI drift guard TODO). Tier 2: --client-thinking-budget N — client-side thinking termination for providers that ignore native budget hints. Streams the response, estimates reasoning tokens (char/4), and when over budget aborts and issues a forced- </think> re-prompt (a fresh conditioned sample, not decoder continuation) using the card's terminator + reply reserve, producing a gradable answer. Gated on reasoning being identifiable in the stream (reasoning_content deltas or <think> tags); unmarked output is left untouched. client_abort rows are a separate benchmark mode (never pooled with single-pass), with continuation-failure and answer-started-before-abort rows excluded from the aggregate and coverage reported. Verified live: OpenRouter qwen3.6-27b ignores reasoning_effort/budget_tokens (reasoning unbounded), but --client-thinking-budget 2000 bounds it precisely (~2001 reasoning tokens/row, continuation=ok, 8/8 pass on the head subset). 234 tests pass; ruff clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add `--multi-turn` mode to scripts/extract-agentic-fixture.py for the coding-agent-loop autotune profile: walk one session in record order, emit a replay case at each target-token bucket (default 8K/16K/32K/64K/100K/128K). Each case ships an OpenAI-shaped `messages` list and a `prefill-and-decode` verifier so the sweep can score "does this max_ctx cell actually serve a trace of n − reply_budget tokens." Snapshot semantics: case `context_tokens_approx <= target_bucket_tokens` is guaranteed (snapshot taken pre-append for the message that would cross). Also fix a latent bug in `_is_claude_session`: it returned False on the first non-user record, which misrouted any Claude session that led with `permission-mode`, `system`, or `queue-operation` (most real sessions do) — including the one this commit was developed against. Tests cover bucket fit, role collapsing, thinking-block drop, PII scrub on HOME paths + token-looking secrets, Codex record decoding, and the leading-meta-record regression. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…erifier Add three small surfaces to the ``agent_recorded`` area to support the coding-agent-loop autotune sweep: * ``load_agent_recorded_multi_turn_cases()`` — reads the bucketed replay fixture produced by ``extract-agentic-fixture.py --multi-turn`` and returns cases sorted ascending by ``target_bucket_tokens``. Distinct from the v1 single-prompt fixture; the two coexist. * ``pick_multi_turn_case_for_budget()`` — given a prompt-token budget (typically ``max_ctx − reply_budget``), returns the largest case that fits. ``None`` when no case fits. * ``grade_prefill_and_decode()`` — pass/fail verifier for the sweep: non-empty response within wall budget, no server error. Lighter than tool-schema-coverage on purpose — the sweep is asking "did this max_ctx setting serve a trace of this length", not "did the model do the task well." Ship a harvested fixture: one Claude Code session sliced into 6 bucketed cases (8K through 128K tokens). Per repo guidance, one long session is enough to cycle with until something breaks; the broader corpus can land later if signal demands. Tests cover the loader contract (cases fit under their bucket, sorted by bucket), the budget picker (largest-fit, None-on-empty), and the verifier's three failure modes (server error, wall-budget overrun, response-too-short) plus the reasoning_content fallback. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…loop Add an autotune Profile abstraction so different workloads can sweep different axes with different scorers. Two profiles ship: * ``heuristic`` (default, backward-compatible) — preset-agnostic bracket, scores by mean ``decode_tokens_per_sec`` from a luce-bench level1 snapshot. Identical to the prior behavior. * ``coding-agent-loop`` — architecture-aware. Gemma4's bracket is ``max_ctx × fa_window × budget × pflash_mode`` (KV-quant axis omitted because the gemma4 backend hardcodes F16 — verified at gemma4_loader.cpp). Qwen3.6 / laguna keep cache_type as an axis since their loader actually respects it. Scoring is composite: pass-rate on the agent_recorded multi-turn fixture first, then ``completion_tokens / wall_seconds`` as a tps proxy (the longctx-area snapshots ship empty ``decode_tokens_per_sec``). Wire ``--fa-window`` through to the server end-to-end: * ``DflashRuntime.fa_window`` (0 = full attention, server default) * ``DFLASH_FA_WINDOW`` emitted by docker_run.py when nonzero * entrypoint.sh appends ``--fa-window N`` to the server CLI iff ``DFLASH_FA_WINDOW > 0`` — unset env still reproduces stock behavior * ``dflash.fa_window`` round-trips through config.toml CLI: ``lucebox autotune --sweep --profile coding-agent-loop``. New ``--list-profiles`` flag prints the registered profile table. Tests: 318/318 green. New coverage: * Profile registry + ``get_profile`` error path * gemma bracket excludes the KV-quant axis (regression for the no-op axis bug) * gemma bracket varies max_ctx × fa_window × budget * qwen bracket includes tq3_0 + q8_0 * sub-22 GB tiers fall back to base-only (OOM safety) * ``_pick_winner`` ranks agent-replay results by pass→speed→ctx * ``fa_window`` is in the sweep allowlist Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

When the sweep is invoked directly (e.g. `uv run python -m lucebox autotune --sweep` for development, or any path that bypasses the lucebox.sh wrapper), the LUCEBOX_HOST_* env vars aren't set and ``host_facts.from_env()`` returns a zero-VRAM HostFacts. Every profile bracket then falls through to the <22 GB "base only" branch and the sweep silently degrades to a 1-cell smoke test that overwrites the operator's real config (e.g. dropping max_ctx from 131072 to the DflashRuntime default 16384). Fall back to ``cfg.host`` (populated by an earlier `lucebox check` via the wrapper) when ``from_env()`` yields no signal. Test regresses the original symptom: with LUCEBOX_HOST_* unset, the coding-agent-loop bracket on a 24 GB persisted host must produce a multi-cell sweep, not collapse to one base cell. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ow one-shot forge batches sse_emitter.cpp: extend find_tool_start() to detect Gemma4's call:<verb>{ format. Previously find_tool_start only matched <tool_call>, <function=, <tool_code> XML patterns, so the emitter never entered TOOL_BUFFER mode for Gemma4's plain-text tool call emissions (call:verb{args}). Now Pattern B scans for call: preceded by a valid sentinel char and followed by at least one alpha (the verb start), causing the emitter to buffer from that point and parse_tool_calls() to run at emit_finish. Result: server now returns stop_reason=tool_use + tool_use content blocks for Gemma4. step_enforcer.py: allow one-shot batch tool calls where all pending required steps appear before the terminal tool in the batch. Gemma4 emits calls in a single response (e.g. [fetch_data, analyze, report]). The runner executes in order so required steps are satisfied before the terminal executes — the batch is not premature. This is a local modification to the vendored forge-guardrails 0.7.1. Effect: forge basic_2step passes (was 0/5, now 1/5 = 20%). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
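The Pattern B scan described for find_tool_start() can be illustrated as below. This is a sketch under stated assumptions, not the sse_emitter.cpp code: the commit does not say which characters count as a "valid sentinel", so start-of-buffer or ASCII whitespace is assumed here, and the function name is hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Assumption: whitespace (or start of buffer) is a valid sentinel before "call:".
inline bool is_sentinel(char c) {
    return std::isspace(static_cast<unsigned char>(c)) != 0;
}

// Returns the offset of the first "call:" that is preceded by a sentinel and
// followed by at least one alphabetic character (the verb start), or
// std::string::npos when there is no such occurrence.
inline std::size_t find_call_start(const std::string& text) {
    const std::string marker = "call:";
    std::size_t pos = 0;
    while ((pos = text.find(marker, pos)) != std::string::npos) {
        const bool boundary_ok = (pos == 0) || is_sentinel(text[pos - 1]);
        const std::size_t verb = pos + marker.size();
        const bool verb_ok = verb < text.size() &&
            std::isalpha(static_cast<unsigned char>(text[verb])) != 0;
        if (boundary_ok && verb_ok) return pos;
        pos += 1;  // keep scanning past this non-matching occurrence
    }
    return std::string::npos;
}
```

The boundary check keeps ordinary prose such as "recall:foo" from tripping the buffer, and the alpha check rejects "call:" followed by a space or digit.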

…dict

Full 26-case agent_recorded nothink benchmark on image 658d016f-cuda12:

- Gemma4: 19.2% (5/26) vs Qwen3.6: 46.2% (12/26); Qwen3.6 wins by 27pp
- Nothink suppression ineffective for Gemma4 (<|channel>thought bypasses prompt)
- 12/26 cases had non-empty reasoning despite --no-think
- 2 cases returned given=refused (model declined to engage)
- Verdict: Qwen3.6-27B is the preferred model for coding/agent tasks on bragi

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…locks

normalize_chat_messages() only extracted text/input_text/output_text from content arrays, silently dropping tool_use and tool_result blocks. This caused multi-turn tool-call conversations (Anthropic Messages API format) to lose all tool call history: the model never saw tool results and looped infinitely calling the same tool. Manifested as Qwen3.6 forge=0%.

Two cases fixed:

1. Assistant message with tool_use content blocks: look up tool_memory by ID (same as the OpenAI tool_calls path). Fallback for cross-session replay: synthesize <tool_call><function=...></tool_call> XML.
2. User message with tool_result content blocks: push each result as a {"tool", content, tool_use_id} message so the chat template renders <tool_response> blocks. Skip pushing empty user containers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
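The second case (user messages carrying tool_result blocks) can be illustrated with a small Python sketch. The server's normalize_chat_messages() is not reproduced here. `normalize_user_message` is a hypothetical helper, and the ordering of tool messages ahead of leftover user text is an assumption made for the example.

```python
from typing import Any, Dict, List

TEXT_TYPES = {"text", "input_text", "output_text"}


def normalize_user_message(content: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Flatten one Anthropic-style user content array into chat messages."""
    out: List[Dict[str, Any]] = []
    text_parts: List[str] = []
    for block in content:
        kind = block.get("type")
        if kind in TEXT_TYPES:
            text_parts.append(block.get("text", ""))
        elif kind == "tool_result":
            # Each result becomes its own "tool" message instead of being dropped.
            out.append({
                "role": "tool",
                "content": block.get("content", ""),
                "tool_use_id": block.get("tool_use_id", ""),
            })
    text = "".join(text_parts)
    if text:  # skip empty user containers
        out.append({"role": "user", "content": text})
    return out
```

A user turn that holds only a tool_result then yields a single tool message and no empty user message, which matches the "skip pushing empty user containers" rule in the commit.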

Documents the normalize_chat_messages() bug where tool_use and tool_result Anthropic content blocks were silently dropped. Adds root-cause analysis, fix description, and benchmark results showing Qwen3.6 forge 0%→100% (5/5). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add Gemma4 forge results on image dc20057e: unchanged at 20% (1/5). Documents why fix is neutral for Gemma4 (one-shot batch doesn't round-trip tool_results) but critical for Qwen3.6 (turn-by-turn needs proper context). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Comprehensive summary of all autotune results, model comparison, server bug fixes, and configuration recommendations for bragi (RTX 5090 Laptop, 23 GB VRAM): - Qwen3.6-27B at budget=16, max_ctx=98304, tq3_0 KV is the optimal preset - Qwen3.6 forge 100% (5/5) vs Gemma4 20% post-fix - Documents three server fixes in dc20057e-cuda12 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…mary

Full pass-rate sweep on dc20057e-cuda12 (nothink):

- forge 100%, agent 100%, longctx 100%, ds4-eval 77.2%
- code 90%, truthfulqa-mc1 80%, agent_recorded 42.3%
- hellaswag 88%, gsm8k 86%

Update final tuning summary with verified numbers and corrected agent/longctx entries (agent 100% up from 75%, longctx 100% newly verified for Qwen3.6).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Empirical test on bragi: prefix_cache_slots=32 causes -19pp regression on agent_recorded (23.1% vs 42.3% baseline). 5 cases regress, 0 unlock. Update autotune.py comment with measured numbers and doc reference. Smoke test passes 100% — the bug is specific to multi-turn tool convos. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

All major tunables swept and validated: budget=16, max_ctx=98304, tq3_0 KV, fa_window=0, prefix_cache_slots=0 (regression confirmed), pflash off. Includes full nothink/think benchmark table and known limitations for prefix cache, pflash, and Gemma4 issues. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…unch Add speculator_dir field to ModelPreset for directory-based safetensors speculators (distinct from GGUF draft_file). When present on disk, the server launch sets DFLASH_DRAFT to that directory so the entrypoint's glob search finds model.safetensors inside it. For laguna-xs.2: speculator_dir="laguna-xs2-speculator" points to ~/.local/share/lucebox/models/draft/laguna-xs2-speculator/ where the 1.2 GB poolside/Laguna-XS.2-speculator.dflash safetensors live. Also adds pytest to the workspace dev deps so `make test` runs clean. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…aguna characterization Update bragi-tuning-complete to use verified final baseline numbers from bragi-rtx5090laptop-qwen36-27b-dc20057e-nothink-2026-05-31 (9 areas, 100% output). Key changes: forge 100% (30/30 not 5/5), hellaswag 93% (clean run not 88% restart-contaminated), agent 75% (stochastic), gsm8k 81%. Add Laguna-XS.2 characterization: 20.3 GB model, 1.2 GB safetensors speculator (+60% decode), 8 GQA KV heads, ~960 MB KV at 32K tq3_0, ~56K safe max context on 23 GB VRAM. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Architecture, VRAM budget, context window feasibility table, performance vs Qwen3.6-27B comparison. Benchmark results TBD pending running sweep. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Laguna-XS.2 bragi baseline complete. forge=0% (model can't emit tool_use), code=20% (FIM format mismatch), gsm8k=93% (+12pp vs Qwen3.6), agent_recorded=50%. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- 48K context causes 10-70x prefill slowdown vs 32K (different kernel path)
- frontier-16k times out at 300s; optimal max_ctx is 32768
- budget=4/16 crash server when using safetensors speculator (null JSON field bug)
- budget=8 is the only safe value; sweep skipped

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Gemma4-31B: 60-layer dense 30.7B model, 20GB Q4_K_M, 1.6GB DFlash draft. Server confirmed running at 32K/tq3_0/budget=8 on bragi (24GB VRAM). Benchmark in progress. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Models ≥20 GB (gemma-4-31b at 21 GB, qwen3.6-moe at 22 GB) leave only ~2-3 GB for KV on 24 GB VRAM; the previous heuristic suggested max_ctx=98304, which would OOM. Now caps at 32K when approx_total_gb ≥ 20.

- runtime_from_host(host, preset="") accepts optional preset name
- _preset_approx_gb() looks up PRESETS.approx_total_gb for size awareness
- CLI passes cfg.model.preset to autotune
- _coding_agent_loop_candidates seeds from preset-aware base
- Tests: add large-model and unknown-preset coverage

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
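The size-aware cap can be pictured with a short sketch. It is not the project's `runtime_from_host`: `cap_max_ctx` and the `APPROX_TOTAL_GB` table are hypothetical, and only the two preset sizes, the 20 GB threshold, and the 32K cap are taken from the commit message.

```python
from typing import Dict

# Sizes quoted in the commit message; the real PRESETS table is assumed to
# carry more entries and fields than this illustration.
APPROX_TOTAL_GB: Dict[str, float] = {"gemma-4-31b": 21.0, "qwen3.6-moe": 22.0}

LARGE_MODEL_GB = 20.0
LARGE_MODEL_MAX_CTX = 32768


def cap_max_ctx(suggested_max_ctx: int, preset: str = "") -> int:
    """Clamp the heuristic's max_ctx to 32K for presets of 20 GB or more."""
    approx_gb = APPROX_TOTAL_GB.get(preset)
    if approx_gb is not None and approx_gb >= LARGE_MODEL_GB:
        return min(suggested_max_ctx, LARGE_MODEL_MAX_CTX)
    # Unknown or empty preset: keep the suggestion unchanged.
    return suggested_max_ctx
```

Treating an unknown preset as "no cap" mirrors the optional `preset=""` default described above, so callers that pass no preset see the previous behavior.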

…026-05-31

Full nothink 32K sweep results on RTX 5090 Laptop MaxQ:

- gsm8k 95% (+14pp vs Qwen3.6), agent_recorded 38.5% (=Qwen3.6)
- code 70% (-20pp), hellaswag 79% (-14pp), truthfulqa 79% (-3pp)
- longctx 33% (-67pp): Gemma4 template expansion causes HTTP 400 at frontier-8k+

Key operational lessons documented:

- DFlash server hang bug: forge Anthropic-format + kill → infinite GPU loop
- Use --max-tokens 512 for agent_recorded (4096 too slow at 22 tok/s effective)
- Effective context limit ~4K real tokens at max_ctx=32768 for this model

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Reconfirmation sweep after Gemma4-31B session left config in non-optimal state. Winner: budget=22, max_ctx=98304, tq3_0 (applied).

New findings:

- budget=32+65K+q8_0 causes a GPU compute hang (SM=100%, mem=0-1%), not a silent OOM crash as previously attributed. This is the same DFlash hang bug as Gemma4-31B, now reproduced with Qwen3.6-27B
- budget=32 at 98K context is 35% slower in decode than budget=22 (30.3s vs 22.4s) due to verification overhead with 84K KV cache
- budget=16 and budget=22 are functionally equivalent at 98K (within noise); budget=32 is clearly suboptimal
- Winner is budget=22 vs budget=16 on 05-30; difference is within measurement noise (0.912 vs 0.905 tok/s speed_metric)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Comprehensive record of all tuning decisions for bragi (RTX 5090 Laptop, 23 GB VRAM, WSL2) covering sessions 2026-05-30 through 2026-06-01. Documents optimal Qwen3.6-27B config (budget=22, 98K, tq3_0), safe/unsafe parameter combinations, known issues (DFlash hang, prefix cache regression), model matrix, and sweep history. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add comment in _coding_agent_loop_qwen_bracket explaining that budget=32+q8_0 at 65K context is kept in the sweep bracket despite being known to cause a GPU compute hang (SM=100%, mem=0%) on 23 GB cards (observed 2026-06-01). The sweep handles it correctly via 300s timeout + systemd restart which clears the GPU state. Reference: docs/experiments/qwen3.6-27b-coding-agent-loop-sweep-bragi-2026-06-01.md Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…regression) Ran agent_recorded benchmark (26 cases) against Qwen3.6-27B at the winning sweep config (budget=22, max_ctx=98304, tq3_0). Result: 9/26 (34.6%) vs dc20057e baseline 10/26 (38.5%). 7 cases flipped in both directions; 1-case net delta is within noise at n=26 (σ≈9.5pp). No quality regression from the new config. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
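The quoted σ≈9.5pp is consistent with the binomial standard error of a single pass rate at n=26. The commit does not state which formula was used, so the sketch below is an assumption; `binomial_se_pp` is a hypothetical helper name.

```python
import math


def binomial_se_pp(passes: int, n: int) -> float:
    """Standard error of a pass rate, in percentage points: 100*sqrt(p(1-p)/n)."""
    p = passes / n
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


def delta_pp(passes_a: int, passes_b: int, n: int) -> float:
    """Difference between two pass rates on the same n cases, in percentage points."""
    return 100.0 * (passes_a - passes_b) / n


if __name__ == "__main__":
    print(round(binomial_se_pp(10, 26), 1))  # baseline 10/26
    print(round(delta_pp(10, 9, 26), 1))     # baseline minus new config
```

For the baseline of 10/26 this gives roughly 9.5pp, while the observed gap between 10/26 and 9/26 is under 4pp, which is why a one-case delta is read as noise.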

All tunables (budget, max_ctx, KV quant, prefix cache, pflash, fa_window) have been swept. Documents final status and future-work blockers (prefix cache snapshot path bug, Gemma4-31B think mode not wired). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…s active

Previously n_gen_cap = min(think_ceiling + reply_budget, max_tokens) caused immediate force-close (step=0) for any request where max_tokens < reply_budget (e.g. gsm8k at 2048, agent_recorded at 4096, code at 2048). Benchmarks sized their max_tokens for nothink responses, so thinking was silently disabled.

Fix: n_gen = think_ceiling + min(max_tokens, hard_limit_reply_budget), treating max_tokens as the post-thinking response budget rather than the total token cap. Also clamp hard_limit_remaining to min(max_output, eff_reply_budget) so the force-close boundary correctly reflects the available response window.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
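The two formulas can be compared side by side with made-up numbers. The function names and the sample values (think_ceiling=8192, reply budget 4096) are hypothetical; only the two expressions and the max_tokens=2048 case come from the commit message.

```python
def n_gen_old(think_ceiling: int, reply_budget: int, max_tokens: int) -> int:
    """Pre-fix cap: max_tokens bounds thinking and reply together."""
    return min(think_ceiling + reply_budget, max_tokens)


def n_gen_new(think_ceiling: int, hard_limit_reply_budget: int, max_tokens: int) -> int:
    """Post-fix cap: max_tokens bounds only the post-thinking reply."""
    return think_ceiling + min(max_tokens, hard_limit_reply_budget)


if __name__ == "__main__":
    # Illustrative values, e.g. a benchmark that sends max_tokens=2048.
    print(n_gen_old(8192, 4096, 2048))
    print(n_gen_new(8192, 4096, 2048))
```

Under the old formula the whole generation is capped at 2048 tokens, less than the reply budget alone, which leaves no room for thinking. Under the new one the cap is 8192 + 2048 = 10240, so the thinking ceiling is preserved and max_tokens limits only the reply.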

PR Luce-Org#326 wired soft-close (Level 2 voluntary close) into the Qwen3.5 AR loop, but on qwen3.6-27b the comparator never fired across 1085 steps of a sample trajectory (prob_ratio < 1e-8 every step).

Root cause: the field `BudgetHook::close_token_ids` was used for BOTH

(a) the peek probe id read by `soft_close::should_fire(..., close0)`
(b) the inject sequence written when the hook fires.

For the qwen3.6-27b model card the `thinking_terminator_hint` is the ~16-token English directive

"Considering the limited time by the user, I have to give the solution based on the thinking directly now.\n</think>\n\n"

so close_token_ids[0] tokenizes to the id for "Considering" (~79939), a mid-sentence content token whose logit sits 19-35 nats below the chosen token at every step. The peek therefore reported a perpetually near-zero prob_ratio and the soft-close dial (min_ratio 0.1..0.9) was empirically inert.

Fix (path α): split probe-vs-inject in `BudgetHook`

- close_token_ids: unchanged role. Full inject sequence written on hard close or when soft-close fires. Multi-token directive for trained-hint sidecars (Qwen3.6); single marker token for bare-marker arches.
- soft_close_probe_ids: NEW. Short sequence (typically one token) used only for the comparator peek. When the operator card has a distinct marker substring inside the hint, server_main tokenizes just that marker and ships it via this field. When empty, `BudgetHook::soft_close_probe_token()` falls back to close_token_ids.front() (legacy behavior, zero churn for sidecars without a separate marker).

server_main detects the marker substring inside the hint and tokenizes it in isolation; on miss it warns and leaves the probe field empty (legacy peek path stays in force).

The AR-loop soft-close lambda in qwen35_backend.cpp now peeks `budget_hook.soft_close_probe_token()` and writes `close_token_ids.front()` on fire; the inject sequence is unchanged downstream.

`[soft-trace]` lines now report the probe token id under `close0=...` so trajectory CSVs remain interpretable.

Hard-close path is untouched: it continues to use close_token_ids verbatim, matching the contract that the operator-resolved directive is what's emitted at the budget boundary.

Tests
-----

+ test_soft_close_probe_uses_probe_ids_not_inject_ids: verifies the peek reads probe[0] when set, NOT inject[0]. Builds a logit row where inject[0]'s logit is far below chosen but probe[0]'s logit is close to chosen; asserts soft fires and the WRITTEN token is inject[0] (not probe[0]).
+ test_soft_close_probe_ids_empty_falls_back_to_close_token_ids: guarantees pre-split behavior when the probe field is left empty (no churn for legacy sidecars / unit-test BudgetHook construction).
+ test_soft_close_inject_sequence_unchanged_when_fires: multi-token inject case. On fire we stream inject[0], inject[1], inject[2] verbatim regardless of what's in soft_close_probe_ids.

Also fix a pre-existing OOB in test_soft_close_determinism_when_disabled (vocab=1000 row indexed at 248069). The UB was silently passing in Release builds before, but the adjacent test additions perturbed glibc heap layout enough to crash; widen the row to vocab=250000.

15 soft-close tests pass (12 existing + 3 new). 1985 total assertions; the two remaining failures are pre-existing `test_emitter_content_mode_*` unrelated to soft-close (PR Luce-Org#329 emitter work).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
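The probe/inject split and its legacy fallback can be modeled in a few lines of Python. This is a toy model rather than the C++ in qwen35_backend.cpp: the `should_fire` signature here is an assumption (the commit elides the real arguments), the comparator is assumed to test the logit difference against log(min_ratio), and the four-token vocabulary is invented for the example.

```python
import math
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class BudgetHook:
    close_token_ids: List[int]                      # full inject sequence
    soft_close_probe_ids: List[int] = field(default_factory=list)

    def soft_close_probe_token(self) -> int:
        """Probe id for the peek; falls back to inject[0] when no split is set."""
        if self.soft_close_probe_ids:
            return self.soft_close_probe_ids[0]
        return self.close_token_ids[0]


def should_fire(logits: Sequence[float], chosen: int, probe: int,
                min_ratio: float) -> bool:
    # p(probe)/p(chosen) = exp(logit difference): the softmax denominators cancel.
    return logits[probe] - logits[chosen] >= math.log(min_ratio)


if __name__ == "__main__":
    # Toy row: id 0 is the sampled token, id 1 the directive's first token,
    # id 2 the short end-of-thinking marker.
    logits = [10.0, -15.0, 9.5, 0.0]
    legacy = BudgetHook(close_token_ids=[1, 2])
    split = BudgetHook(close_token_ids=[1, 2], soft_close_probe_ids=[2])
    print(should_fire(logits, 0, legacy.soft_close_probe_token(), 0.5))
    print(should_fire(logits, 0, split.soft_close_probe_token(), 0.5))
```

In the toy row the legacy hook peeks at a token 25 nats below the chosen one and never fires, while the split hook peeks at the marker 0.5 nats below and does. In both cases the sequence written on fire would still be `close_token_ids`.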

cubic-dev-ai

9 issues found across 253 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="docs/experiments/qwen3.6-27b-prefix-cache-regression-bragi-2026-05-31.md">

<violation number="1" location="docs/experiments/qwen3.6-27b-prefix-cache-regression-bragi-2026-05-31.md:49">
P2: Pass/fail table contains contradictory entries for cases 9 and 17 — ranges 7–11 and 14–18 show them as FAIL/FAIL but individual rows and the summary show them as regressions (PASS→FAIL). Readers cannot tell which data is authoritative.</violation>
</file>

<file name="server/src/qwen35/c2_gate.h">

<violation number="1" location="server/src/qwen35/c2_gate.h:28">
P2: The C2 gate uses overflow-prone `int` multiplication (`2 * fa_window_cfg`), which can misroute decode mode for large configured `fa_window` values.</violation>
</file>

<file name="harness/src/harness/clients/pi.py">

<violation number="1" location="harness/src/harness/clients/pi.py:80">
P3: `--tools` is ignored in interactive mode, causing inconsistent behavior and a misleading CLI contract.</violation>
</file>

<file name="harness/clients/README.md">

<violation number="1" location="harness/clients/README.md:118">
P2: Inaccurate documentation: the default sweep runs 5 level1 areas (smoke/code/gsm8k/agent/longctx), not "all 4 stdlib areas". Also "HumanEval" is not a luce-bench area name — it's a dataset inside the `code` area. This will mislead users about which areas to set and what the default covers.</violation>
</file>

<file name="Makefile">

<violation number="1" location="Makefile:73">
P2: `MODELS_DIR` is unquoted in the Docker bind mount, so paths with spaces/special characters break `serve` and can mount the wrong source path.</violation>

<violation number="2" location="Makefile:112">
P1: `clean-models` uses an unquoted, unguarded `rm -rf $(MODELS_DIR)/*`, which can delete unintended files when the path is malformed or overridden unsafely.</violation>
</file>

<file name="harness/src/harness/clients/codex.py">

<violation number="1" location="harness/src/harness/clients/codex.py:64">
P2: `launch()` does not create a user-supplied `work_dir`, so writing `config.toml` can fail with `FileNotFoundError`.</violation>
</file>

<file name="server/src/draft/draft_gguf_loader.cpp">

<violation number="1" location="server/src/draft/draft_gguf_loader.cpp:363">
P1: New strict metadata-vs-shape assertions can reject valid mismatched GGUF drafts (notably Gemma4) before downstream shape-based correction runs.</violation>
</file>

<file name=".github/workflows/ci.yml">

<violation number="1" location=".github/workflows/ci.yml:23">
P2: Lint and typecheck steps using `uv run --frozen --extra dev` will trigger a full re-sync that installs the cu128 torch wheel (~2 GB), defeating the `--no-install-package torch` optimization in `check_uv_workspace.sh` that was explicitly designed to keep this job fast.</violation>
</file>

Note: This PR contains a large number of files. cubic only reviews up to 100 files per PR, so some files may not have been reviewed. cubic prioritizes the most important files to review.

cubic-dev-ai · 2026-06-03T03:31:08Z

+.PHONY: clean-models
+clean-models:  ## Remove downloaded models from $(MODELS_DIR). Destructive.
+	@echo "WARN: about to rm -rf $(MODELS_DIR)/*"
+	@read -p "Continue? [y/N] " ans && [ "$$ans" = "y" ] && rm -rf $(MODELS_DIR)/*


P1: clean-models uses an unquoted, unguarded rm -rf $(MODELS_DIR)/*, which can delete unintended files when the path is malformed or overridden unsafely.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At Makefile, line 112: <comment>`clean-models` uses an unquoted, unguarded `rm -rf $(MODELS_DIR)/*`, which can delete unintended files when the path is malformed or overridden unsafely.</comment> <file context> @@ -0,0 +1,112 @@ +.PHONY: clean-models +clean-models: ## Remove downloaded models from $(MODELS_DIR). Destructive. + @echo "WARN: about to rm -rf $(MODELS_DIR)/*" + @read -p "Continue? [y/N] " ans && [ "$$ans" = "y" ] && rm -rf $(MODELS_DIR)/* </file context>

cubic-dev-ai · 2026-06-03T03:31:08Z

+        const int64_t derived_kv_dim = L0.wk->ne[1];
+        const int64_t expected_q_dim  = (int64_t)out.n_head * out.head_dim;
+        const int64_t expected_kv_dim = (int64_t)out.n_head_kv * out.head_dim;
+        if (derived_q_dim != expected_q_dim) {


P1: New strict metadata-vs-shape assertions can reject valid mismatched GGUF drafts (notably Gemma4) before downstream shape-based correction runs.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At server/src/draft/draft_gguf_loader.cpp, line 363: <comment>New strict metadata-vs-shape assertions can reject valid mismatched GGUF drafts (notably Gemma4) before downstream shape-based correction runs.</comment> <file context> @@ -349,6 +349,63 @@ bool load_draft_gguf(const std::string & path, + const int64_t derived_kv_dim = L0.wk->ne[1]; + const int64_t expected_q_dim = (int64_t)out.n_head * out.head_dim; + const int64_t expected_kv_dim = (int64_t)out.n_head_kv * out.head_dim; + if (derived_q_dim != expected_q_dim) { + char buf[256]; + std::snprintf(buf, sizeof(buf), </file context>

cubic-dev-ai · 2026-06-03T03:31:08Z

+| 4 | FAIL | FAIL |
+| 5 | PASS | PASS |
+| 6 | FAIL | FAIL |
+| 7–11 | FAIL | FAIL |


P2: Pass/fail table contains contradictory entries for cases 9 and 17 — ranges 7–11 and 14–18 show them as FAIL/FAIL but individual rows and the summary show them as regressions (PASS→FAIL). Readers cannot tell which data is authoritative.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At docs/experiments/qwen3.6-27b-prefix-cache-regression-bragi-2026-05-31.md, line 49: <comment>Pass/fail table contains contradictory entries for cases 9 and 17 — ranges 7–11 and 14–18 show them as FAIL/FAIL but individual rows and the summary show them as regressions (PASS→FAIL). Readers cannot tell which data is authoritative.</comment> <file context> @@ -0,0 +1,84 @@ +| 4 | FAIL | FAIL | +| 5 | PASS | PASS | +| 6 | FAIL | FAIL | +| 7–11 | FAIL | FAIL | +| 12 | PASS | PASS | +| 13 | PASS | FAIL (regression) | </file context>

cubic-dev-ai · 2026-06-03T03:31:08Z

+                                     int kv_committed) {
+    (void)kv_committed;
+    return (fa_window_override == 0)
+        || (fa_window_override <= 2 * fa_window_cfg);


P2: The C2 gate uses overflow-prone int multiplication (2 * fa_window_cfg), which can misroute decode mode for large configured fa_window values.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35/c2_gate.h, line 28: <comment>The C2 gate uses overflow-prone `int` multiplication (`2 * fa_window_cfg`), which can misroute decode mode for large configured `fa_window` values.</comment> <file context> @@ -0,0 +1,31 @@ + int kv_committed) { + (void)kv_committed; + return (fa_window_override == 0) + || (fa_window_override <= 2 * fa_window_cfg); +} + </file context>

Suggested change

-        || (fa_window_override <= 2 * fa_window_cfg);
+        || (static_cast<long long>(fa_window_override) <=
+            2LL * static_cast<long long>(fa_window_cfg));

cubic-dev-ai · 2026-06-03T03:31:08Z

+it would break a real-client launcher above.
+
+```bash
+# Full sweep (default — runs all 4 stdlib areas)


P2: Inaccurate documentation: the default sweep runs 5 level1 areas (smoke/code/gsm8k/agent/longctx), not "all 4 stdlib areas". Also "HumanEval" is not a luce-bench area name — it's a dataset inside the code area. This will mislead users about which areas to set and what the default covers.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At harness/clients/README.md, line 118: <comment>Inaccurate documentation: the default sweep runs 5 level1 areas (smoke/code/gsm8k/agent/longctx), not "all 4 stdlib areas". Also "HumanEval" is not a luce-bench area name — it's a dataset inside the `code` area. This will mislead users about which areas to set and what the default covers.</comment> <file context> @@ -102,6 +103,29 @@ OpenAI Chat Completions clients can call llama.cpp directly. Claude Code and +it would break a real-client launcher above. + +```bash +# Full sweep (default — runs all 4 stdlib areas) +harness/clients/run_lucebench.sh + </file context>

cubic-dev-ai · 2026-06-03T03:31:08Z

+.PHONY: serve
+serve:  ## Run the local image, foreground. Models bind-mounted from $(MODELS_DIR).
+	docker run --rm --gpus all -p 8080:8080 \
+		-v $(MODELS_DIR):/opt/lucebox-hub/server/models:ro \


P2: MODELS_DIR is unquoted in the Docker bind mount, so paths with spaces/special characters break serve and can mount the wrong source path.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At Makefile, line 73: <comment>`MODELS_DIR` is unquoted in the Docker bind mount, so paths with spaces/special characters break `serve` and can mount the wrong source path.</comment> <file context> @@ -0,0 +1,112 @@ +.PHONY: serve +serve: ## Run the local image, foreground. Models bind-mounted from $(MODELS_DIR). + docker run --rm --gpus all -p 8080:8080 \ + -v $(MODELS_DIR):/opt/lucebox-hub/server/models:ro \ + --name lucebox-gemma \ + $(IMAGE) serve </file context>

Suggested change

-		-v $(MODELS_DIR):/opt/lucebox-hub/server/models:ro \
+		-v "$(MODELS_DIR)":/opt/lucebox-hub/server/models:ro \

cubic-dev-ai · 2026-06-03T03:31:08Z

+    """
+    codex_bin = find_bin("codex", env_var="CODEX_BIN",
+                         work_dir_hint="clients/codex/npm/bin/codex")
+    home = work_dir or mktempdir("codex")


P2: launch() does not create a user-supplied work_dir, so writing config.toml can fail with FileNotFoundError.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At harness/src/harness/clients/codex.py, line 64: <comment>`launch()` does not create a user-supplied `work_dir`, so writing `config.toml` can fail with `FileNotFoundError`.</comment> <file context> @@ -0,0 +1,129 @@ + """ + codex_bin = find_bin("codex", env_var="CODEX_BIN", + work_dir_hint="clients/codex/npm/bin/codex") + home = work_dir or mktempdir("codex") + write_config(home, base_url=base_url, model=model, + sandbox=sandbox, wire_api=wire_api) </file context>
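One way to address this finding is to create the directory before anything is written into it. The sketch below is illustrative only: `ensure_home` is a hypothetical helper, and `tempfile.mkdtemp` stands in for the harness's own `mktempdir`, whose behavior is not shown on this page.

```python
import tempfile
from pathlib import Path
from typing import Optional


def ensure_home(work_dir: Optional[str] = None) -> Path:
    """Return a directory that exists, creating a user-supplied path if needed."""
    home = Path(work_dir) if work_dir else Path(tempfile.mkdtemp(prefix="codex-"))
    # parents=True handles nested paths; exist_ok=True makes repeat calls safe.
    home.mkdir(parents=True, exist_ok=True)
    return home
```

Calling this before the config write would let a caller pass a not-yet-existing `work_dir` without hitting `FileNotFoundError`.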

cubic-dev-ai · 2026-06-03T03:31:08Z

        run: bash scripts/check_uv_workspace.sh

+      - name: Lint Python surfaces touched by lucebox tooling
+        run: uv run --frozen --extra dev ruff check .


P2: Lint and typecheck steps using uv run --frozen --extra dev will trigger a full re-sync that installs the cu128 torch wheel (~2 GB), defeating the --no-install-package torch optimization in check_uv_workspace.sh that was explicitly designed to keep this job fast.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At .github/workflows/ci.yml, line 23: <comment>Lint and typecheck steps using `uv run --frozen --extra dev` will trigger a full re-sync that installs the cu128 torch wheel (~2 GB), defeating the `--no-install-package torch` optimization in `check_uv_workspace.sh` that was explicitly designed to keep this job fast.</comment> <file context> @@ -10,20 +10,46 @@ jobs: run: bash scripts/check_uv_workspace.sh + - name: Lint Python surfaces touched by lucebox tooling + run: uv run --frozen --extra dev ruff check . + + - name: Typecheck lucebox CLI </file context>

cubic-dev-ai · 2026-06-03T03:31:08Z

+        "PI_OFFLINE": "1",
+    }
+    argv: list[str] = [bin_path]
+    if interactive:


P3: --tools is ignored in interactive mode, causing inconsistent behavior and a misleading CLI contract.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At harness/src/harness/clients/pi.py, line 80: <comment>`--tools` is ignored in interactive mode, causing inconsistent behavior and a misleading CLI contract.</comment> <file context> @@ -0,0 +1,130 @@ + "PI_OFFLINE": "1", + } + argv: list[str] = [bin_path] + if interactive: + if extra_args: + argv += extra_args </file context>

easel · 2026-06-03T03:37:19Z

Closing in favor of consolidating the probe/inject split fix directly onto PR #326's branch.

PR #331 had a fundamentally wrong base — it was opened against main with a branch cut from feat/lucebox-docker, so its diff (100 commits) included all of PR #285's umbrella changes plus PR #326's soft-close, plus my probe/inject fix. That made it impossible to review as a soft-close bugfix.

The probe/inject split (commit 175c8a72 here) has been cherry-picked onto feat/soft-close-thinking-termination as c9c410c0, with a follow-up commit 91886a9f adding a min_thinking_tokens floor (false-positive guard motivated by the empirical trajectory data). PR #326 has been updated to describe the full feature surface.

🤖 Generated with Claude Code

dusterbloom and others added 30 commits May 28, 2026 19:44

refactor(pflash): rename DFLASH_COMPRESS_* → PFLASH_COMPRESS_* (cascade env vars)

94907a4

bench: add eval_quality_compare.py for LongBench F1 regression detection

766e46d

Merge PR Luce-Org#308 (qwen-think-channel) into PR Luce-Org#285 (lucebox-docker)

8b48ad8

Brings the Qwen3.6/Laguna think-mode reasoning fix (route reasoning into reasoning_content channel instead of content) into the lucebox-docker stack.

easel and others added 24 commits June 2, 2026 16:17

docs(experiments): add Laguna-XS.2 initial characterization for bragi

6f9b9cd

Architecture, VRAM budget, context window feasibility table, performance vs Qwen3.6-27B comparison. Benchmark results TBD pending running sweep. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

docs(laguna): fill in 32K benchmark results table

e9a53bc

Laguna-XS.2 bragi baseline complete. forge=0% (model can't emit tool_use), code=20% (FIM format mismatch), gsm8k=93% (+12pp vs Qwen3.6), agent_recorded=50%. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

cubic-dev-ai Bot reviewed Jun 3, 2026

View reviewed changes

easel closed this Jun 3, 2026

This was referenced Jun 3, 2026

feat(server): soft-close thinking termination via logit-ratio peek #326

Closed

fix(server): plain-text call:verb spans must survive emit_finish malformed-parse + responses .done easel/lucebox-hub#1

Merged

easel deleted the fix/soft-close-split-probe-from-inject branch June 3, 2026 19:43


fix(server): split soft-close probe ids from inject ids #331

easel wants to merge 111 commits into Luce-Org:main from easel:fix/soft-close-split-probe-from-inject


2 participants


Files changed (+259/-28)
