fix(server): split soft-close probe ids from inject ids#331

Closed
easel wants to merge 111 commits into Luce-Org:main from easel:fix/soft-close-split-probe-from-inject

easel (Collaborator) commented Jun 3, 2026

Summary

Soft-close (PR #326) shipped with an empirically inert configuration on qwen3.6-27b. Root cause: BudgetHook::close_token_ids was used for both the soft-close peek probe AND the inject sequence. For qwen3.6-27b, the configured thinking_terminator_hint is a 16+ token English directive starting with "Considering the limited time by the user, I have to give the solution based on the thinking directly now.\n</think>\n\n" — so the peek was checking the logit of token 79939 ("Considering"), a mid-sentence content token the model rarely promotes. Trajectory data showed prob_ratio < 1e-8 across 12,888 reasoning steps; the dial was dead at any sampled ratio in {0.1, 0.3, 0.5, 0.7, 0.9}.

Fix

Split the probe ids (short marker sequence) from the inject ids (full directive). New BudgetHook::soft_close_probe_ids field; soft_close_probe_token() accessor with empty-falls-back-to-close_token_ids legacy behavior so models without the split see zero diff.

server_main.cpp now tokenizes the marker substring (</think>) separately when the hint contains it, and logs both the probe and inject id vectors at startup. qwen35_backend.cpp::maybe_soft_close peeks probe_ids.front() instead of close_token_ids.front(), and the [soft-trace] close0= field reports the probe id so trajectory CSVs stay interpretable. The hard-close path (maybe_force_close) is untouched and still injects the full directive.
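
The fallback rule described above can be sketched in a few lines. This is a Python illustration of the logic only, not the C++ code in model_backend.h: the class is a stand-in, the `Optional[int]` return type is an assumption, and the ids 11 and 22 are placeholders for the rest of the inject sequence.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BudgetHook:
    """Illustrative stand-in for the C++ BudgetHook; only the two id lists matter here."""
    close_token_ids: List[int] = field(default_factory=list)       # full inject sequence
    soft_close_probe_ids: List[int] = field(default_factory=list)  # short marker sequence

    def soft_close_probe_token(self) -> Optional[int]:
        """Return the token id whose logit the soft-close peek should inspect."""
        if self.soft_close_probe_ids:
            return self.soft_close_probe_ids[0]
        if self.close_token_ids:
            # Legacy behavior: no split configured, so probe the first inject id.
            return self.close_token_ids[0]
        return None


# 79939 ("Considering") and 248069 (</think>) are the ids quoted in this PR;
# 11 and 22 are placeholder ids, not real tokenizer output.
split = BudgetHook(close_token_ids=[79939, 11, 22], soft_close_probe_ids=[248069])
legacy = BudgetHook(close_token_ids=[79939, 11, 22])
```

With the split configured the probe is the marker token; without it the accessor returns the same id the pre-split code peeked at, which is why models without the split should see no change.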

Empirical validation

Re-ran /tmp/probe_soft_close_trajectory.sh against an image built from this branch (lucebox-hub:175c8a72-cuda12) on sindri (qwen3.6-27b, RTX 3090 Ti).

Phase 2 trajectory (ratio=0, debug logits on): </think> (id 248069) reliably becomes argmax-competitive (diff >= log(0.1) = -2.30) at 66-94% of natural reasoning length across 5 diverse prompts. max_diff reaches 0.000 (prob_ratio = 1.0) on every prompt — vs prior baseline max_diff = -9.69 on token 79939. 9.7 nat improvement, restoring the mechanism to its designed regime.

prompt              n_steps   fire@0.1            fire@0.9
0 (arithmetic)      1723      step 1135 (66%)     step 1393 (81%)
1 (Python)          2081      step 1950 (94%)     step 1950 (94%)
2 (logic puzzle)    6232      step 5714 (92%)     step 5714 (92%)
3 (train meet)      3894      step 3341 (86%)     step 3341 (86%)
4 (influenza)       5771      step 4993 (87%)     step 4993 (87%)
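
For readers reproducing the trajectory numbers, the relation between diff and prob_ratio used above can be sketched as follows. This is a Python illustration, not the server code: it assumes diff is the probe token's logit minus the largest logit in the step (which matches max_diff = 0.000 corresponding to prob_ratio = 1.0), and the firing rule `prob_ratio >= ratio` with ratio 0 meaning "disabled" is an assumption.

```python
import math
from typing import Sequence


def prob_ratio(logits: Sequence[float], probe_id: int) -> float:
    """exp(logit[probe] - max logit): 1.0 when the probe token is the argmax."""
    diff = logits[probe_id] - max(logits)
    return math.exp(diff)


def soft_close_would_fire(logits: Sequence[float], probe_id: int, ratio: float) -> bool:
    """Assumed firing rule: enabled (ratio > 0) and probe within `ratio` of the argmax."""
    return ratio > 0.0 and prob_ratio(logits, probe_id) >= ratio


# Toy two-token vocabulary; index 1 plays the role of the probe token.
near_argmax = [0.0, -2.0]    # diff = -2.0, above log(0.1), which is about -2.30
far_from_argmax = [0.0, -9.69]  # the prior baseline max_diff quoted above
```

Under these assumptions a diff of -9.69 gives a ratio far below every sampled threshold, which is consistent with the dial being inert before the split.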

Phase 1 live firing: soft-close fires reliably at ratios 0.1-0.9 with stop_reason=end_turn and coherent text outputs across all configurations. Single-sample thinking-token savings are noisy (sampling non-determinism is ±30%); multi-seed sweeps are deferred to a follow-up.

Tests

  • 3 new unit tests in test_server_unit.cpp: probe-uses-probe-ids-not-inject-ids, probe-ids-empty-falls-back-to-close-token-ids, inject-sequence-unchanged-when-fires.
  • Fixed pre-existing OOB write in test_soft_close_determinism_when_disabled (vocab 1000 → 250000) — UB-silent until new tests perturbed heap layout.
  • Suite: 1985 assertions, 2 pre-existing failures unrelated to soft-close (the emitter parser tests from PR #329, "fix(server): support gemma-4's plain-text call:<verb>{} tool-call format", reproduce on the unmodified feat/lucebox-docker tip).

Files changed (+259/-28)

  • server/src/common/model_backend.h — new soft_close_probe_ids + accessor.
  • server/src/qwen35/qwen35_backend.cpp — peek probe, inject full sequence.
  • server/src/server/http_server.h — ServerConfig::think_close_probe_token_ids.
  • server/src/server/http_server.cpp — wire probe ids into per-request BudgetHook.
  • server/src/server/server_main.cpp — split-tokenize marker substring; startup logging.
  • server/test/test_server_unit.cpp — 3 new tests + OOB fix.

Test plan

  • Local unit tests pass (all soft-close tests green).
  • Smoke test: close0 in [soft-trace] now reports 248069 (</think>), not 79939.
  • Phase 2 trajectory validates </think> reaches argmax across 5 prompts.
  • CI build + cmake + cubic review.

🤖 Generated with Claude Code

dusterbloom and others added 30 commits May 28, 2026 19:44
…g-42 tail-capture guard

ee7 truncates drafter forward at layer 7 of 28, scoring only those layers.
9.3× drafter wall at 128K (RTX 3090, Qwen3.6-27B-Q4_K_M target + Qwen2.5-0.5B-BF16 drafter).
Anchor-transitive cascade rescues multi-hop on bimodal-density prompts (gated, default OFF).
Bug Luce-Org#42 fix: tail-capture view-bounds guard at S%4096 in {1..7}.

5 unit tests included. Bench scripts split to follow-up PR.
At >=32K context the needle text is more likely to straddle multiple
chunks (chunk_size=32), and the fixed anchor_radius=2 window (5 chunks
~160 tokens) loses the back half of the needle digits — the model
retrieves '...is 4' but truncates/hallucinates the continuation.

Adaptive scaling based on n_chunks:
  <32K  context (<1024 chunks): radius=2,  max_anchor_hits=8   (unchanged)
  32-64K (1024-2047 chunks):    radius=4,  max_anchor_hits=16
  >=64K (>=2048 chunks):        radius=8,  max_anchor_hits=32

Override via PFLASH_COMPRESS_ANCHOR_RADIUS / PFLASH_COMPRESS_MAX_ANCHOR_HITS
env vars (legacy DFLASH_COMPRESS_* names still accepted).
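
The scaling table above reads naturally as a small lookup. The following is a Python sketch of that table, not the C++ implementation: the function names are hypothetical, the legacy DFLASH_COMPRESS_* spellings are left out for brevity, and the environment is passed in explicitly so the behavior is easy to check.

```python
from typing import Mapping, Tuple


def anchor_params(n_chunks: int, env: Mapping[str, str]) -> Tuple[int, int]:
    """Return (anchor_radius, max_anchor_hits) for a prompt split into n_chunks chunks."""
    if n_chunks < 1024:        # < 32K context at chunk_size=32
        radius, max_hits = 2, 8
    elif n_chunks < 2048:      # 32-64K
        radius, max_hits = 4, 16
    else:                      # >= 64K
        radius, max_hits = 8, 32
    # Explicit overrides win over the adaptive defaults.
    if "PFLASH_COMPRESS_ANCHOR_RADIUS" in env:
        radius = int(env["PFLASH_COMPRESS_ANCHOR_RADIUS"])
    if "PFLASH_COMPRESS_MAX_ANCHOR_HITS" in env:
        max_hits = int(env["PFLASH_COMPRESS_MAX_ANCHOR_HITS"])
    return radius, max_hits


def window_tokens(radius: int, chunk_size: int = 32) -> int:
    """Tokens covered by an anchor window of `radius` chunks on each side."""
    return (2 * radius + 1) * chunk_size
```

At radius 2 the window is 5 chunks, or 160 tokens, matching the figure quoted above; at radius 8 it grows to 17 chunks.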

Validated at 49K context: NIAH needle 'kowefada 1596346' correctly
retrieved (was: '1594' or hallucinated 'is 048394839483' before fix).
Resolves the long-standing 'project_64k_quality_cliff' memory entry.
Mirror the gemma4_backend.cpp:75-104 defensive pattern for the qwen35
target loader and the dflash decode draft loader. After loading weight
tensors, derive head_dim / n_head / n_head_kv from wq->ne[1] /
wk->ne[1] and compare against GGUF-declared values; set_last_error
and return false on mismatch.

Makes the 'stale scalar at graph-build time' bug class structurally
impossible. Load-time only, no runtime cost. Existing well-formed
GGUFs are unaffected (smoke verified).
When pflash compresses, set gen_req.fa_window_override =
effective_prompt + 256 so spec-decode verify sees the entire
compressed prompt. Pflash already paid compute to pick which tokens
matter; verify never throws any of them away.

When the override would exceed 2 * cfg_.fa_window (spec-decode's
drafter cost stops earning its tok/J), the C2 gate in
qwen35_backend's generate() falls back to AR (fa_window=0, full
attention). AR sees every kept token at every context; we choose
mechanism, not visibility.

Zero new CLI flags. --draft remains the only knob for composition;
all per-request adaptation is internal.
…scade default-on

Adds backwards-compat fallback wrappers for 6 cascade env vars in both
standard and bandit code paths, so harness scripts using either spelling
work against this binary. Emits one-time WARN to stderr when the legacy
DFLASH_* spelling is honored.

Also flips the default for `use_transitive` from `false` to `true` because
the gated rare-token bridge improves multi-hop F1 with zero downside in
the cascade-already-firing case.
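
A backwards-compatible lookup of this kind can be sketched as below. This is a Python illustration rather than the C++ wrappers: the helper name is hypothetical, the warning text is invented for the example, and warning once per variable (rather than once per process) is an assumption.

```python
import io
import sys
from typing import Mapping, Optional, Set, TextIO

_warned: Set[str] = set()  # legacy names that have already produced a warning


def getenv_compat(suffix: str, env: Mapping[str, str],
                  stream: TextIO = sys.stderr) -> Optional[str]:
    """Prefer PFLASH_<suffix>; fall back to DFLASH_<suffix> with a one-time warning."""
    new_name, legacy_name = "PFLASH_" + suffix, "DFLASH_" + suffix
    if new_name in env:
        return env[new_name]
    if legacy_name in env:
        if legacy_name not in _warned:
            _warned.add(legacy_name)
            print(f"WARN: honoring legacy {legacy_name}; prefer {new_name}", file=stream)
        return env[legacy_name]
    return None


# Usage: two reads of the same legacy variable emit a single warning.
log = io.StringIO()
legacy_env = {"DFLASH_COMPRESS_ANCHOR_RADIUS": "4"}
first = getenv_compat("COMPRESS_ANCHOR_RADIUS", legacy_env, log)
second = getenv_compat("COMPRESS_ANCHOR_RADIUS", legacy_env, log)
```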
…th drift

Single helper reads all 10 PFLASH_*/DFLASH_* env vars once. Both
qwen35_score_and_compress and drafter_score_and_compress call it.
Removes two 70-LOC duplicate env-reading blocks and the duplicated
anchor-radius comment. Also removes dead force_chunk_neighborhood
(no callers) and collapses the 4-overload load_drafter pyramid to
one canonical implementation + 3 thin forwarders.
- qwen3_graph.cpp: collapse 18-line alg-note, trim VRAM prose (3 blocks),
  remove early_exit_n alias (inline early_exit_pre at call site)
- qwen35_backend.cpp: C2 gate 9-line → 2-line + docs ref;
  do_ar_decode budget-hook 15-line → 4-line + docs ref
- http_server.cpp: Design 1 rationale 13-line → 2-line + docs ref
- model_backend.h: BudgetHook 23-line essay → 3-line + docs ref
- gguf_target_loader.cpp: 4-line prose tail → 1-line
- .gitignore: ignore *.git-head / *.pre-pflash-rename workdir artifacts
- docs/: pflash-compress-cfg.md, pflash-adaptive-composition.md,
  anchor-transitive.md (consolidated rationale)
…nking is off

The hard-coded renderer appends a closed think prefill when thinking is
disabled. Some Qwen3.6 Jinja templates omit that final assistant suffix,
leaving the model in the wrong decoding state for tool use. Mirror the
hard-coded behavior here when the rendered prompt ends with a bare
assistant generation prompt; tolerate trailing-whitespace variants
(single \n, double \n\n, trailing space).

Diagnosed by Round 5b D peer-chat showing dflash drafter accept_rate=0.0%:
the drafter was distilled with the closed-think suffix in its training
distribution; the Unsloth Qwen3-Coder template doesn't emit it, so target
and drafter disagree on what comes after <|im_start|>assistant\n.
… only

The previous commit applied the closed-think suffix to all Jinja-rendered
prompts. Add arch_hint (ChatFormat) parameter to render_chat_template_jinja,
defaulting to QWEN3, and guard the post-processing block with
arch_hint == ChatFormat::QWEN3. Call site in http_server.cpp passes
chat_format_ so other archs (Laguna, Gemma4) are unaffected. qwen35moe
inherits ChatFormat::QWEN3 by design (matches drafter distillation).

5 unit tests cover: thinking-off appends, thinking-on no-append, non-Qwen3
arch no-append (Laguna + Gemma4), qwen35moe inherits QWEN3, no double-append
when template already closes the think block.

Diagnosis + verification protocol in docs/pflash-drafter-template-alignment.md.
Extract the C2 spec-decode gate from an inline expression in
qwen35_backend.cpp into a pure predicate header c2_gate.h.

Zero behavior change. Identical math:
  (fa_window_override == 0) || (fa_window_override <= 2 * fa_window_cfg)

The new header documents the empirically-derived rationale: at
compressed KV sizes (pflash compression of long prompts), T_draft/T_target
ratio approaches 1, eliminating spec-decode's profit margin over AR.
Empirical at D_composition 128K replay: AR=27.5 tok/s vs forced
spec-decode=5.74 tok/s. The gate correctly blocks spec-decode when
eff_fa_window > 2*fa_window_cfg.
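
For reference, the predicate is small enough to restate directly. This is a Python transcription of the expression quoted above plus the `effective_prompt + 256` override described in the earlier pflash commit; the function names are hypothetical and the real predicate lives in c2_gate.h.

```python
def c2_allows_spec_decode(fa_window_override: int, fa_window_cfg: int) -> bool:
    """True when spec-decode may run; False means fall back to AR decoding."""
    return fa_window_override == 0 or fa_window_override <= 2 * fa_window_cfg


def pflash_fa_window_override(effective_prompt_tokens: int) -> int:
    """Window large enough for verify to see the whole compressed prompt."""
    return effective_prompt_tokens + 256
```

With a configured window of 2048, for example, an override of 4096 still passes the gate while 4097 does not; an override of 0 (no pflash compression) always passes.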

Adds 5 unit tests locking in the predicate's behavior with explicit
Round 5 4-arm matrix bench citations.

Files:
- server/src/qwen35/c2_gate.h (new)
- server/src/qwen35/qwen35_backend.cpp (+1 include, inline -> call)
- server/test/test_server_unit.cpp (+60 LOC, 5 tests)
…nch in-tree

Squashes 78 commits from feat/lucebox-docker (PR Luce-Org#285) onto origin/main.
Net: 189 files changed.

Major workstreams folded in:

* Docker prebuild stack: ghcr.io/easel/lucebox-hub:cuda12 image, multi-stage
  Dockerfile, docker-bake.hcl, .github/workflows/docker.yml with GHA cache,
  build identity baked into /opt/lucebox-hub/IMAGE_INFO + /opt/lucebox-hub/HOST_INFO.
* Host wrapper (lucebox.sh): probe_host, smart cmd_serve (INVOCATION_ID
  guard, container-state preflight), cmd_systemctl_passthrough (already-
  active short-circuit, restart-loop detection), cmd_update (bootstrap-
  installer pattern), cmd_completion (bash/zsh/fish), config.toml reader
  (env > toml > default precedence), shellcheck-clean.
* Bootstrap installer (install.sh): bakes LUCEBOX_INSTALLED_FROM into the
  installed copy so lucebox update keeps tracking the channel; refuses
  SHA-pinned URLs without LUCEBOX_INSTALL_CHANNEL.
* In-container Python CLI (lucebox/): sparse config.toml persistence,
  config get/set/unset sub-app, models list/download sub-app (replaces
  download-models), autotune with --apply / --json / --sweep, profile
  collapsed onto luce-bench snapshot (1701 → 183 lines).
* luce-bench: snapshot subcommand + canonical HostInfo schema v2 +
  levels (level0/1/2/3) + report subcommand + submit-baseline + regrade.
* Server (C++): /props.host block + props_schema=4 + host_info read at
  startup, /props.build identity, GGUF metadata + sha256 sidecars,
  model card sidecars.
* Harness: client implementations for claude/codex/opencode/hermes/pi.
* Strict 11-field config.toml allowlist for dflash.* runtime tunables.

Deleted (rolled into new structure):
* server/scripts/bench_agent.py, bench_he.py, bench_llm.py — replaced by
  luce-bench snapshot + areas.
* lucebox configure, lucebox download-models, lucebox benchmark — replaced
  by config sub-app, models sub-app, autotune --sweep.
* luce-bench --sweep flag — moved to argv-sniff subcommand dispatch.

Conflict resolution:
* server/scripts/bench_{agent,he,llm}.py — modify/delete kept the deletion
  (feat/lucebox-docker moved bench machinery into luce-bench).
* README.md — took feat-branch version. origin/main had 19 commits worth
  of minor README tweaks since the branch base; those need to be folded
  back in as a follow-up PR.
* docs/specs/openapi-props.yaml + docs/specs/props-endpoint.md — took
  feat-branch version. origin/main had 1 link-fix commit; feat-branch
  has the schema-4 + host-block additions that strictly supersede.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`_load_or_build()` returned `config_mod.load()`'s result verbatim when
config.toml existed, ignoring `LUCEBOX_*` env vars entirely. That
contradicted the precedence lucebox.sh documents (env > toml > default)
and bit sindri in production: its config.toml had `[image]` without a
`registry` line, so the dataclass default `ghcr.io/luce-org/lucebox-hub`
beat the systemd unit's `Environment=LUCEBOX_IMAGE=ghcr.io/easel/...`.
Symptom: `lucebox start` brought up the wrong (stale luce-org) image
even after explicit `lucebox install` + `lucebox pull` against easel.

Fix: overlay env on top of whatever `load()` returns (or `live_config()`
falls back to). Only the five top-level scalars have env hooks
(LUCEBOX_VARIANT/IMAGE/PORT/CONTAINER/MODELS) — dflash/host/model
intentionally don't.
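
The overlay can be pictured as follows. This is a minimal Python sketch, not the lucebox code: the flat `Config` dataclass, its field names, and every default except the image default quoted above are hypothetical, and treating an empty env value as unset is an assumption.

```python
from dataclasses import dataclass, replace
from typing import Mapping


@dataclass(frozen=True)
class Config:
    """Hypothetical flattened view of the five top-level scalars."""
    variant: str = "cuda12"
    image: str = "ghcr.io/luce-org/lucebox-hub"
    port: int = 8080
    container: str = "lucebox"
    models: str = "/models"


# Env var name -> (field name, parser). Only these five have env hooks.
_ENV_HOOKS = {
    "LUCEBOX_VARIANT": ("variant", str),
    "LUCEBOX_IMAGE": ("image", str),
    "LUCEBOX_PORT": ("port", int),
    "LUCEBOX_CONTAINER": ("container", str),
    "LUCEBOX_MODELS": ("models", str),
}


def overlay_env(cfg: Config, env: Mapping[str, str]) -> Config:
    """Apply env > toml > default: env values replace whatever the loader produced."""
    updates = {
        field_name: parse(env[name])
        for name, (field_name, parse) in _ENV_HOOKS.items()
        if env.get(name)  # unset or empty env vars leave the loaded value alone
    }
    return replace(cfg, **updates)
```

Calling `overlay_env` on the result of the loader (or on the fallback config when no config.toml exists) covers both regression cases listed below.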

Adds two regression tests:
- env beats config.toml when toml has no explicit value for that key,
- env still wins when toml is absent (covers the live_config fallback).

102 lucebox tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…g#285 CI

CI's "Lint Python surfaces touched by lucebox tooling" job ran
`ruff check .` and found 11 errors across surfaces this branch touches.
Ruff --fix handled 6 (import sorting, unused imports); 5 needed
hand-edits:

  luce-bench/src/lucebench/report.py:172  E741  rename `for l in` → `for lineup in`
  lucebox/tests/test_check.py:39, 95      E731  lambda → def stub() for the two HostFacts stubs
  lucebox/tests/test_cli.py:95            E501  wrap the LUCEBOX_HOST_GPU_LIST_CSV setenv
  lucebox/tests/test_sweep.py:174, 177    E501  wrap two CellResult constructors

22 lucebox tests touched still pass; ruff is clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- test_autotune_candidate_configs.py: sort imports (ruff I001).
- download.py: api.repo_info() returns ModelInfo|DatasetInfo|SpaceInfo|KernelInfo
  and KernelInfo has no .siblings; use api.model_info() which returns ModelInfo
  (correct — we only query model repos here), resolving the mypy union-attr error.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The job-level `permissions` block replaces the workflow-level default
entirely, so `actions/checkout` was running without `contents: read`
and would fail on protected refs. Add `contents: read` back alongside
the existing `id-token: write`.

Addresses cubic #1 on PR Luce-Org#285.
- Dockerfile: keep --frozen on the uv sync fallback so the layer can't
  silently resolve outside the lockfile.
- harness/clients/run_lucebench.sh: default LUCEBENCH_THINK empty
  (per-area card defaults govern; --no-think only when explicitly set)
  and default LUCEBENCH_AREA to the level1 capability gate
  (smoke,code,gsm8k,agent,longctx) instead of `all`, which was too broad
  for routine harness runs.

Addresses cubic #2, Luce-Org#3 (P1) and Luce-Org#14 (P2) on PR Luce-Org#285.
…appers

- .github/workflows/{ci,docker,release-luce-bench}.yml: pin
  actions/checkout, docker/{setup-buildx,login,metadata,bake}-action,
  and astral-sh/setup-uv to immutable commit SHAs with `# vN` comments
  so the supply chain is reproducible (Luce-Org#4).
- harness/src/harness/clients/_common.py: replace the external `timeout`
  shell-out with `subprocess.run(..., timeout=N)`, return 124 on
  TimeoutExpired to match GNU timeout's exit code (Luce-Org#5).
- scripts/build_image.sh: normalize REGISTRY to end in `/` instead of
  silently producing `ghcr.io/luce-orglucebox-hub` when the trailing
  slash is missing (Luce-Org#6).
- harness/src/harness/clients/pi.py: non-interactive launch now mirrors
  run_pi.sh's validated invocation (--provider, --print, --mode json,
  --tools, --no-session, --offline) and sets PI_CODING_AGENT_DIR /
  PI_CODING_AGENT_SESSION_DIR / PI_OFFLINE (Luce-Org#7).
- docker-bake.hcl: sanitize `+` → `-` in VERSION before composing tags,
  since `+` is not a valid Docker tag character (Luce-Org#8).
- harness/src/harness/clients/hermes.py: set HERMES_HOME + the rest of
  run_hermes.sh's env wiring and call `chat --provider --model
  --accept-hooks --yolo --max-turns --source --query` instead of a bare
  positional prompt (Luce-Org#9, Luce-Org#10).
- harness/src/harness/clients/openclaw.py: apply the OpenClaw config
  patch via `openclaw config patch --file` before the run, and call
  `agent --local --json --model lucebox/<model> --session-id --timeout
  --message` instead of a bare positional prompt (Luce-Org#11).
- pyproject.toml: drop the dead dflash/scripts/{prefix_cache,test_server,
  tool_memory}.py ruff include pins (those paths were renamed during
  the dflash→server rename and then deleted upstream) (Luce-Org#12).
- lefthook.yml: widen the shellcheck/bash-parse glob from `*.sh` to
  `**/*.sh` so scripts under nested dirs (harness/clients/*.sh,
  scripts/*.sh, server/scripts/*.sh) are linted on commit (Luce-Org#13).

Addresses cubic Luce-Org#4 through Luce-Org#13 (P2) on PR Luce-Org#285. Luce-Org#14 was already addressed in
the previous commit alongside the LUCEBENCH_THINK default fix.
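
The timeout change in _common.py follows a common pattern, sketched here in Python. The helper name is hypothetical and the real code presumably passes more arguments; the point is only that `subprocess.run(..., timeout=N)` raises `subprocess.TimeoutExpired`, which is mapped to 124, the exit code GNU timeout uses.

```python
import subprocess
import sys
from typing import Sequence

PYTHON = sys.executable  # interpreter used for the usage example below


def run_with_timeout(cmd: Sequence[str], timeout_s: float) -> int:
    """Run cmd; return its exit code, or 124 if it exceeds timeout_s seconds."""
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so nothing is left behind.
        return 124


# Usage: a command that finishes well inside the limit reports its own exit code.
quick_rc = run_with_timeout([PYTHON, "-c", "pass"], timeout_s=30.0)
```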
- lucebox/README.md: fix the relative link to `cli.py`; resolves to
  `src/lucebox/cli.py` (the actual location), not the nonexistent
  `lucebox/cli.py` (Luce-Org#15).
- luce-bench/NOTICE: the bundled forge_eval LICENSE says
  "Copyright (c) 2025-2026 Antoine Zambelli", not 2024 — sync NOTICE
  with the actual upstream LICENSE (Luce-Org#16).
- luce-bench/src/lucebench/areas/__init__.py: `__all__` was missing
  agent / agent_recorded / forge / longctx / smoke. Add the imports +
  list entries so `from lucebench.areas import *` matches the actual
  area surface (Luce-Org#17).

Addresses cubic Luce-Org#15 through Luce-Org#17 (P3) on PR Luce-Org#285.
…nch in-tree

Squashes 8 commits from feat/lucebox-docker (PR Luce-Org#285) into a single
commit on top of origin/main (8782d07). Net: 189 files changed.

Workstreams folded in:

* Docker prebuild stack: ghcr.io/easel/lucebox-hub:cuda12 image,
  multi-stage Dockerfile with reproducible `uv sync --frozen`,
  docker-bake.hcl with VERSION sanitization for Docker tag charset,
  .github/workflows/docker.yml with SHA-pinned external actions and
  GHA cache, build identity baked into /opt/lucebox-hub/IMAGE_INFO +
  HOST_INFO.

* Host wrapper (lucebox.sh): probe_host, smart cmd_serve (INVOCATION_ID
  guard against systemd self-defeat, container-state preflight),
  cmd_systemctl_passthrough (already-active short-circuit, restart-loop
  detection), cmd_update (bootstrap-installer pattern), cmd_completion
  (bash/zsh/fish), config.toml reader (env > toml > default), all
  shellcheck-clean.

* Bootstrap installer (install.sh): bakes LUCEBOX_INSTALLED_FROM into
  the installed copy so `lucebox update` keeps tracking the channel;
  refuses SHA-pinned URLs without LUCEBOX_INSTALL_CHANNEL.

* In-container Python CLI (lucebox/): sparse config.toml persistence,
  config get/set/unset sub-app, models list/download sub-app
  (replaces download-models), autotune with --apply / --json / --sweep,
  profile collapsed onto luce-bench snapshot (1701 → ~150 lines).
  _load_or_build now respects env > toml > default precedence.

* luce-bench: snapshot subcommand + canonical HostInfo schema v2
  (multi-GPU lineup, WSL detection, source/collector trust metadata) +
  levels (level0/1/2/3) + report subcommand (host column + cross-host
  confounder warnings) + submit-baseline (level3-gated) + regrade.

* Server (C++): /props.host block + props_schema=4 + host_info loader,
  /props.build identity, GGUF metadata + sha256 sidecars, model card
  sidecars. Deleted server/scripts/bench_{agent,he,llm}.py — bench
  machinery moved into luce-bench.

* Harness: client implementations for claude/codex/opencode/hermes/pi
  pointed at the running lucebox server, matched against the validated
  run_*.sh shell wrappers.

Cubic AI code review (17 findings) addressed in full:
  P0: contents: read on luce-bench release job permissions.
  P1: Dockerfile `--frozen` reinstated; LUCEBENCH_THINK default empty
      so per-area defaults apply.
  P2: 6 external actions pinned to immutable SHAs; non-interactive
      timeout via subprocess.run; REGISTRY trailing-slash normalize;
      VERSION + Docker tag charset sanitize; harness pi/hermes/openclaw
      mirrored against run_*.sh wrappers; ruff scan paths corrected to
      server/scripts/; lefthook glob `**/*.sh`; LUCEBENCH_AREA default
      level1.
  P3: lucebox/README.md cli.py link fixed; NOTICE copyright year
      2025-2026; areas/__init__.py __all__ exposes all 10 areas.

CI on PR Luce-Org#285: all 4 checks green (uv workspace, cmake build, cuda12
prebuild, cubic reviewer).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…default 0.10)

- Gate context-window admission on post-compression effective size, not raw, so
  >128K-raw prompts compress to fit max_ctx instead of 400 / oversized KV reservation.
- Pre-compression keep-ratio sanity guard (raw*keep+max_out>max_ctx); the real
  effective-size gate runs post-compression in worker_loop.
- Default prefill-keep-ratio 0.05 -> 0.10: real ~2x compression on agentic content
  (0.25 over-forces anchor-transitive to ~100% = no-op + rejects >128K).
- Evidence (RTX3090, agentic replay, keep=0.10): 167K raw admitted -> 71K eff (42.6%),
  prefill 145s vs 845s forced; 32-128K real compression; tool-parse intact; 1629 unit asserts green; 14-cell P/PD sweep zero crashes.
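
The two admission checks can be sketched like this. It is a Python illustration under assumptions: the pre-compression expression is the one quoted above, but the exact form of the post-compression gate (`effective + max_out <= max_ctx`) and the function names are guesses, and the max_out and max_ctx values in the comments are example numbers rather than measured settings.

```python
def precompression_reject(raw_tokens: int, keep_ratio: float,
                          max_out: int, max_ctx: int) -> bool:
    """Cheap sanity guard: reject when even the optimistic estimate cannot fit."""
    return raw_tokens * keep_ratio + max_out > max_ctx


def postcompression_admit(effective_tokens: int, max_out: int, max_ctx: int) -> bool:
    """Real gate (assumed form): admit on the measured post-compression size."""
    return effective_tokens + max_out <= max_ctx


# Example numbers in the spirit of the evidence line: 167K raw tokens at
# keep=0.10 compressing to roughly 71K effective, with a 128K context window.
RAW, EFFECTIVE, MAX_OUT, MAX_CTX = 167_000, 71_000, 4_096, 131_072
```

Under these numbers the raw prompt passes the guard, because the estimate is only about 16.7K tokens, and the measured 71K then passes the real gate; gating on the raw size instead would have rejected the request outright.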
…ontent channel

The SseEmitter hard-started in StreamMode::CONTENT and only transitioned to
REASONING when it saw `<think>` in the generated stream. But Qwen3.6 / Laguna
chat templates append `<think>\n` to the prompt suffix when enable_thinking is
honored, so the model emits reasoning tokens directly with no opening tag —
the emitter never transitioned and reasoning text leaked into `content` while
`reasoning_content` stayed empty. ds4-eval pass rate: 14.1% (think) vs 71.7%
(no-think) for Qwen3.6-27B Q4_K_M.

The plumbing was already there: parse_reasoning() supports
started_in_thinking=true (reasoning.h:17-19) but no caller passed it.

Fix:

1. chat_template.h: render_chat_template / render_chat_template_jinja now
   return a PromptRenderResult { text, started_in_thinking }. The built-in
   QWEN3 and LAGUNA branches set started_in_thinking deterministically when
   enable_thinking && add_generation_prompt; GEMMA4 stays false (its
   reasoning channel is opened by the model emitting `<|channel>`, which
   http_server forwards into the emitter as `<think>`). The Jinja path
   suffix-sniffs the rendered prompt for a trailing `<think>` opener and
   emits a [WARN] log when sniffing decides true so a template/model-card
   mismatch surfaces at runtime.

2. SseEmitter: add `initial_mode = StreamMode::CONTENT` defaulted parameter.
   When constructed with REASONING, active_kind_ initializes to "thinking"
   so the Anthropic first content_block is `thinking` instead of `text`
   (avoids a spurious empty text-block stop+restart on the first reasoning
   delta). Deliberately leaves checked_think_prefix_ at its default (false)
   so the existing one-time `<think>` strip guard still trips if a
   template/model-card mismatch causes the model to emit a redundant opener.

3. http_server.cpp: thread render_result.started_in_thinking through
   ParsedRequest into the SseEmitter's initial_mode. Both streaming and
   non-streaming paths feed tokens through the same emitter, so the fix
   covers both response shapes.

Tests: add 12 unit tests under test_server_unit (assertion count 1608 →
1637): SseEmitter initial_mode=REASONING routing for OPENAI_CHAT and
ANTHROPIC formats (closed, unclosed, redundant-opener-strip cases) plus
PromptRenderResult.started_in_thinking provenance for QWEN3 / LAGUNA /
GEMMA4 (enable/disable/no-gen-prompt) and the Jinja suffix-sniff
positive/negative cases.

Smoke-tested manually against Qwen3.6-27B Q4_K_M; non-streaming
`/v1/chat/completions` with `thinking:{type:enabled}` now populates
reasoning_content and never leaks `</think>` into content.
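
The suffix-sniff and the resulting initial mode can be sketched as below. This is a Python illustration of the idea, not the C++ in chat_template.h or the SseEmitter: the function names are hypothetical, and ignoring any trailing whitespace after the opener is an assumption about how tolerant the real check is.

```python
from enum import Enum


class StreamMode(Enum):
    CONTENT = "content"
    REASONING = "reasoning"


def sniff_started_in_thinking(rendered_prompt: str) -> bool:
    """True when the rendered prompt ends with an open <think> tag."""
    return rendered_prompt.rstrip().endswith("<think>")


def initial_mode(rendered_prompt: str) -> StreamMode:
    """Mode the emitter should start in for a given rendered prompt."""
    if sniff_started_in_thinking(rendered_prompt):
        return StreamMode.REASONING
    return StreamMode.CONTENT


# Three prompt tails: thinking opened by the template, a bare generation
# prompt, and a closed (empty) think block as used when thinking is off.
OPENED = "<|im_start|>assistant\n<think>\n"
BARE = "<|im_start|>assistant\n"
CLOSED = "<|im_start|>assistant\n<think>\n\n</think>\n\n"
```

Only the first tail starts the emitter in reasoning mode; the closed block ends with `</think>`, so generated tokens go to the content channel as before.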

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add three C++ tests that chain render_chat_template + SseEmitter so the
wiring between the renderer's started_in_thinking flag and the emitter's
initial_mode is exercised end-to-end, not just at each end. The per-unit
tests above each verify their half of the contract, but the original bug
was a missing call-site wire — both halves were correct in isolation.

Also tighten the Python integration test assertions for enable_thinking
and reasoning.effort: require non-empty reasoning_content and no raw
<think>/</think> in either channel. The prior 'doesn't crash' assertion
would have passed on the broken code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…box-docker)

Brings the Qwen3.6/Laguna think-mode reasoning fix (route reasoning into
reasoning_content channel instead of content) into the lucebox-docker stack.
…budget

Increment 1 (Tier 1): model-card registry resolvable by normalized model id
(/props.model_card → bundled cards → family fallback), per-model thinking tokens
via the card with a thinking-capability gate, configurable --reasoning-effort
{low,medium,high} (was hardcoded high) and --thinking-budget-tokens N, plus
card_source/card_stem provenance on every row. Cards bundled into the wheel via
hatch force-include from share/model_cards (single source; CI drift guard TODO).

Tier 2: --client-thinking-budget N — client-side thinking termination for
providers that ignore native budget hints. Streams the response, estimates
reasoning tokens (char/4), and when over budget aborts and issues a forced-
</think> re-prompt (a fresh conditioned sample, not decoder continuation) using
the card's terminator + reply reserve, producing a gradable answer. Gated on
reasoning being identifiable in the stream (reasoning_content deltas or <think>
tags); unmarked output is left untouched. client_abort rows are a separate
benchmark mode (never pooled with single-pass), with continuation-failure and
answer-started-before-abort rows excluded from the aggregate and coverage
reported.
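
The client-side cutoff amounts to a running character count. A minimal Python sketch, assuming integer division for the char/4 estimate and an abort as soon as the estimate exceeds the budget; the function names are hypothetical and the forced-</think> re-prompt itself is not shown.

```python
from typing import Iterable, List, Tuple


def estimate_tokens(text: str) -> int:
    """Rough token estimate: four characters per token."""
    return len(text) // 4


def consume_reasoning(deltas: Iterable[str], budget_tokens: int) -> Tuple[str, bool]:
    """Accumulate reasoning deltas; return (text_so_far, aborted)."""
    parts: List[str] = []
    chars = 0
    for delta in deltas:
        parts.append(delta)
        chars += len(delta)
        if chars // 4 > budget_tokens:
            # Over budget: stop reading the stream; the caller would then
            # issue the forced-</think> re-prompt with the text gathered so far.
            return "".join(parts), True
    return "".join(parts), False
```

Aborting on the first delta that pushes the estimate past the budget would explain reasoning lengths that land just above the configured value, as in the live run described below.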

Verified live: OpenRouter qwen3.6-27b ignores reasoning_effort/budget_tokens
(reasoning unbounded), but --client-thinking-budget 2000 bounds it precisely
(~2001 reasoning tokens/row, continuation=ok, 8/8 pass on the head subset).
234 tests pass; ruff clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add `--multi-turn` mode to scripts/extract-agentic-fixture.py for the
coding-agent-loop autotune profile: walk one session in record order,
emit a replay case at each target-token bucket (default
8K/16K/32K/64K/100K/128K). Each case ships an OpenAI-shaped `messages`
list and a `prefill-and-decode` verifier so the sweep can score
"does this max_ctx cell actually serve a trace of n − reply_budget
tokens." Snapshot semantics: case `context_tokens_approx <=
target_bucket_tokens` is guaranteed (snapshot taken pre-append for
the message that would cross).

Also fix a latent bug in `_is_claude_session`: it returned False on
the first non-user record, which misrouted any Claude session that
led with `permission-mode`, `system`, or `queue-operation` (most
real sessions do) — including the one this commit was developed
against.

Tests cover bucket fit, role collapsing, thinking-block drop, PII
scrub on HOME paths + token-looking secrets, Codex record decoding,
and the leading-meta-record regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…erifier

Add three small surfaces to the ``agent_recorded`` area to support
the coding-agent-loop autotune sweep:

* ``load_agent_recorded_multi_turn_cases()`` — reads the bucketed
  replay fixture produced by ``extract-agentic-fixture.py --multi-turn``
  and returns cases sorted ascending by ``target_bucket_tokens``.
  Distinct from the v1 single-prompt fixture; the two coexist.
* ``pick_multi_turn_case_for_budget()`` — given a prompt-token budget
  (typically ``max_ctx − reply_budget``), returns the largest case
  that fits. ``None`` when no case fits.
* ``grade_prefill_and_decode()`` — pass/fail verifier for the sweep:
  non-empty response within wall budget, no server error. Lighter
  than tool-schema-coverage on purpose — the sweep is asking "did
  this max_ctx setting serve a trace of this length", not "did the
  model do the task well."
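
The largest-fit selection can be sketched as follows. This is a Python illustration, not the luce-bench implementation: cases are plain dicts here, comparing the budget against `context_tokens_approx` is an assumption, and the `_sketch` suffix marks the function as hypothetical.

```python
from typing import Any, Dict, Optional, Sequence

Case = Dict[str, Any]


def pick_case_for_budget_sketch(cases: Sequence[Case], prompt_budget: int) -> Optional[Case]:
    """Return the largest case whose approximate context fits the prompt budget."""
    fitting = [c for c in cases if c["context_tokens_approx"] <= prompt_budget]
    return max(fitting, key=lambda c: c["context_tokens_approx"], default=None)


# Toy fixture: three buckets with made-up sizes just under each target.
CASES = [
    {"target_bucket_tokens": 8_000, "context_tokens_approx": 7_900},
    {"target_bucket_tokens": 16_000, "context_tokens_approx": 15_800},
    {"target_bucket_tokens": 32_000, "context_tokens_approx": 31_500},
]
```

A sweep cell would call this with `max_ctx - reply_budget` as the budget and skip the cell when the result is `None`.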

Ship a harvested fixture: one Claude Code session sliced into 6
bucketed cases (8K through 128K tokens). Per repo guidance, one
long session is enough to cycle with until something breaks; the
broader corpus can land later if signal demands.

Tests cover the loader contract (cases fit under their bucket,
sorted by bucket), the budget picker (largest-fit, None-on-empty),
and the verifier's three failure modes (server error, wall-budget
overrun, response-too-short) plus the reasoning_content fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…loop

Add an autotune Profile abstraction so different workloads can sweep
different axes with different scorers. Two profiles ship:

* ``heuristic`` (default, backward-compatible) — preset-agnostic
  bracket, scores by mean ``decode_tokens_per_sec`` from a luce-bench
  level1 snapshot. Identical to the prior behavior.
* ``coding-agent-loop`` — architecture-aware. Gemma4's bracket is
  ``max_ctx × fa_window × budget × pflash_mode`` (KV-quant axis
  omitted because the gemma4 backend hardcodes F16 — verified at
  gemma4_loader.cpp). Qwen3.6 / laguna keep cache_type as an axis
  since their loader actually respects it. Scoring is composite:
  pass-rate on the agent_recorded multi-turn fixture first, then
  ``completion_tokens / wall_seconds`` as a tps proxy (the
  longctx-area snapshots ship empty ``decode_tokens_per_sec``).

Wire ``--fa-window`` through to the server end-to-end:

* ``DflashRuntime.fa_window`` (0 = full attention, server default)
* ``DFLASH_FA_WINDOW`` emitted by docker_run.py when nonzero
* entrypoint.sh appends ``--fa-window N`` to the server CLI iff
  ``DFLASH_FA_WINDOW > 0`` — unset env still reproduces stock behavior
* ``dflash.fa_window`` round-trips through config.toml

CLI: ``lucebox autotune --sweep --profile coding-agent-loop``. New
``--list-profiles`` flag prints the registered profile table.

Tests: 318/318 green. New coverage:

* Profile registry + ``get_profile`` error path
* gemma bracket excludes the KV-quant axis (regression for the
  no-op axis bug)
* gemma bracket varies max_ctx × fa_window × budget
* qwen bracket includes tq3_0 + q8_0
* sub-22 GB tiers fall back to base-only (OOM safety)
* ``_pick_winner`` ranks agent-replay results by pass→speed→ctx
* ``fa_window`` is in the sweep allowlist
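
The pass→speed→ctx ranking can be expressed as a tuple comparison. A Python sketch only: `SweepCell` and its fields are hypothetical stand-ins for the real result type, and preferring the larger max_ctx as the final tie-break is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class SweepCell:
    """Hypothetical per-cell result of an agent-replay sweep."""
    max_ctx: int
    pass_rate: float          # fraction of replay cases that passed
    completion_tokens: int
    wall_seconds: float

    @property
    def tps_proxy(self) -> float:
        """completion_tokens / wall_seconds, the speed proxy named above."""
        return self.completion_tokens / self.wall_seconds if self.wall_seconds > 0 else 0.0


def pick_winner_sketch(cells: Sequence[SweepCell]) -> Optional[SweepCell]:
    """Rank by pass rate, then speed proxy, then context size."""
    return max(cells, key=lambda c: (c.pass_rate, c.tps_proxy, c.max_ctx), default=None)
```

Tuple ordering means a faster cell never beats one with a higher pass rate, which is the property the regression test for the ranking is described as checking.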

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When the sweep is invoked directly (e.g. `uv run python -m lucebox
autotune --sweep` for development, or any path that bypasses the
lucebox.sh wrapper), the LUCEBOX_HOST_* env vars aren't set and
``host_facts.from_env()`` returns a zero-VRAM HostFacts. Every profile
bracket then falls through to the <22 GB "base only" branch and the
sweep silently degrades to a 1-cell smoke test that overwrites the
operator's real config (e.g. dropping max_ctx from 131072 to the
DflashRuntime default 16384).

Fall back to ``cfg.host`` (populated by an earlier `lucebox check`
via the wrapper) when ``from_env()`` yields no signal. A regression test
covers the original symptom: with LUCEBOX_HOST_* unset, the
coding-agent-loop bracket on a 24 GB persisted host must produce a
multi-cell sweep, not collapse to one base cell.
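
A rough sketch of the fallback described above, assuming a simplified `HostFacts` with a single `vram_gb` field and a hypothetical `resolve_host` helper (neither is the real lucebox API):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HostFacts:
    """Stand-in for the real HostFacts; only the VRAM field is modelled."""
    vram_gb: float = 0.0


def resolve_host(env_host: HostFacts, persisted: Optional[HostFacts]) -> HostFacts:
    # Prefer live env-derived facts; use the persisted copy only when the
    # environment carried no VRAM signal at all.
    if env_host.vram_gb > 0:
        return env_host
    if persisted is not None and persisted.vram_gb > 0:
        return persisted
    return env_host


# Direct invocation: LUCEBOX_HOST_* unset, so the env-derived facts are zero.
resolved = resolve_host(HostFacts(), HostFacts(vram_gb=24.0))
```

With the persisted 24 GB host picked up, a bracket keyed on VRAM would no longer fall into the sub-22 GB "base only" branch.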

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
easel and others added 24 commits June 2, 2026 16:17
…ow one-shot forge batches

sse_emitter.cpp: extend find_tool_start() to detect Gemma4's call:<verb>{ format.
Previously find_tool_start only matched <tool_call>, <function=, <tool_code> XML
patterns, so the emitter never entered TOOL_BUFFER mode for Gemma4's plain-text
tool call emissions (call:verb{args}).  Now Pattern B scans for call: preceded by
a valid sentinel char and followed by at least one alpha (the verb start), causing
the emitter to buffer from that point and parse_tool_calls() to run at emit_finish.
Result: server now returns stop_reason=tool_use + tool_use content blocks for Gemma4.

step_enforcer.py: allow one-shot batch tool calls where all pending required steps
appear before the terminal tool in the batch.  Gemma4 emits calls in a single
response (e.g. [fetch_data, analyze, report]).  The runner executes in order so
required steps are satisfied before the terminal executes — the batch is not
premature.  This is a local modification to the vendored forge-guardrails 0.7.1.
Effect: forge basic_2step passes (was 0/5, now 1/5 = 20%).
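
The batch rule described above can be sketched as a single pass over the batch. This is an illustration under assumptions, not the vendored forge-guardrails code; `batch_is_premature` and its parameters are hypothetical names.

```python
def batch_is_premature(batch: list[str], pending_required: set[str], terminal: str) -> bool:
    """Return True if the terminal tool would run before a pending required step."""
    remaining = set(pending_required)
    for tool in batch:
        if tool == terminal:
            # Premature only if some required step has not yet appeared.
            return bool(remaining)
        remaining.discard(tool)
    return False  # no terminal call in this batch


ok_batch = batch_is_premature(
    ["fetch_data", "analyze", "report"], {"fetch_data", "analyze"}, "report"
)
bad_batch = batch_is_premature(
    ["fetch_data", "report", "analyze"], {"fetch_data", "analyze"}, "report"
)
```

Because the runner executes calls in order, the first batch satisfies both required steps before `report` runs, while the second does not.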

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…dict

Full 26-case agent_recorded nothink benchmark on image 658d016f-cuda12:
- Gemma4: 19.2% (5/26) vs Qwen3.6: 46.2% (12/26) — Qwen3.6 wins by 27pp
- Nothink suppression ineffective for Gemma4 (<|channel>thought bypasses prompt)
- 12/26 cases had non-empty reasoning despite --no-think
- 2 cases returned given=refused (model declined to engage)
- Verdict: Qwen3.6-27B is the preferred model for coding/agent tasks on bragi

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…locks

normalize_chat_messages() only extracted text/input_text/output_text from
content arrays, silently dropping tool_use and tool_result blocks. This
caused multi-turn tool-call conversations (Anthropic Messages API format)
to lose all tool call history: the model never saw tool results and looped
infinitely calling the same tool. Manifested as Qwen3.6 forge=0%.

Two cases fixed:
1. Assistant message with tool_use content blocks: look up tool_memory by
   ID (same as the OpenAI tool_calls path). Fallback for cross-session
   replay: synthesize <tool_call><function=...></tool_call> XML.
2. User message with tool_result content blocks: push each result as a
   {"tool", content, tool_use_id} message so the chat template renders
   <tool_response> blocks. Skip pushing empty user containers.
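
As a rough illustration of case 2, here is a Python sketch of splitting a user content array into tool messages plus an optional user message. The real `normalize_chat_messages()` lives in the server and may differ; the function name, the output dict shape, and the assumption that `tool_result` content is a plain string are all illustrative.

```python
def normalize_user_message(content_blocks: list[dict]) -> list[dict]:
    """Split an Anthropic-style user content array into chat messages."""
    out: list[dict] = []
    text_parts: list[str] = []
    for block in content_blocks:
        kind = block.get("type")
        if kind == "text":
            text_parts.append(block.get("text", ""))
        elif kind == "tool_result":
            # Each result becomes its own tool message instead of being dropped.
            out.append({
                "role": "tool",
                "content": block.get("content", ""),
                "tool_use_id": block.get("tool_use_id", ""),
            })
    text = "".join(text_parts)
    if text:  # skip empty user containers
        out.append({"role": "user", "content": text})
    return out


messages = normalize_user_message([
    {"type": "tool_result", "tool_use_id": "toolu_1", "content": "42 rows"},
])
```

A user turn that carries only tool results yields tool messages and no empty user message, which is the behavior the fix describes.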

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Documents the normalize_chat_messages() bug where tool_use and tool_result
Anthropic content blocks were silently dropped. Adds root-cause analysis,
fix description, and benchmark results showing Qwen3.6 forge 0%→100% (5/5).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add Gemma4 forge results on image dc20057e: unchanged at 20% (1/5).
Documents why fix is neutral for Gemma4 (one-shot batch doesn't round-trip
tool_results) but critical for Qwen3.6 (turn-by-turn needs proper context).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comprehensive summary of all autotune results, model comparison, server bug
fixes, and configuration recommendations for bragi (RTX 5090 Laptop, 23 GB VRAM):
- Qwen3.6-27B at budget=16, max_ctx=98304, tq3_0 KV is the optimal preset
- Qwen3.6 forge 100% (5/5) vs Gemma4 20% post-fix
- Documents three server fixes in dc20057e-cuda12

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…mary

Full pass-rate sweep on dc20057e-cuda12 (nothink):
- forge 100%, agent 100%, longctx 100%, ds4-eval 77.2%
- code 90%, truthfulqa-mc1 80%, agent_recorded 42.3%
- hellaswag 88%, gsm8k 86%

Update final tuning summary with verified numbers and corrected agent/longctx
entries (agent 100% up from 75%, longctx 100% newly verified for Qwen3.6).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Empirical test on bragi: prefix_cache_slots=32 causes a 19pp regression
on agent_recorded (23.1% vs 42.3% baseline). 5 cases regress, 0 unlock.

Update autotune.py comment with measured numbers and doc reference.
Smoke test passes 100% — the bug is specific to multi-turn tool convos.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All major tunables swept and validated: budget=16, max_ctx=98304,
tq3_0 KV, fa_window=0, prefix_cache_slots=0 (regression confirmed),
pflash off. Includes full nothink/think benchmark table and known
limitations for prefix cache, pflash, and Gemma4 issues.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…unch

Add speculator_dir field to ModelPreset for directory-based safetensors
speculators (distinct from GGUF draft_file). When present on disk, the
server launch sets DFLASH_DRAFT to that directory so the entrypoint's
glob search finds model.safetensors inside it.

For laguna-xs.2: speculator_dir="laguna-xs2-speculator" points to
~/.local/share/lucebox/models/draft/laguna-xs2-speculator/ where the
1.2 GB poolside/Laguna-XS.2-speculator.dflash safetensors live.

Also adds pytest to the workspace dev deps so `make test` runs clean.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…aguna characterization

Update bragi-tuning-complete to use verified final baseline numbers from
bragi-rtx5090laptop-qwen36-27b-dc20057e-nothink-2026-05-31 (9 areas, 100%
output). Key changes: forge 100% (30/30, not 5/5), hellaswag 93% (clean run,
not the restart-contaminated 88%), agent 75% (stochastic), gsm8k 81%.

Add Laguna-XS.2 characterization: 20.3 GB model, 1.2 GB safetensors
speculator (+60% decode), 8 GQA KV heads, ~960 MB KV at 32K tq3_0,
~56K safe max context on 23 GB VRAM.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Architecture, VRAM budget, context window feasibility table, performance
vs Qwen3.6-27B comparison. Benchmark results TBD pending running sweep.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Laguna-XS.2 bragi baseline complete. forge=0% (model can't emit tool_use),
code=20% (FIM format mismatch), gsm8k=93% (+12pp vs Qwen3.6), agent_recorded=50%.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- 48K context causes 10-70x prefill slowdown vs 32K (different kernel path)
- frontier-16k times out at 300s; optimal max_ctx is 32768
- budget=4/16 crash server when using safetensors speculator (null JSON field bug)
- budget=8 is the only safe value; sweep skipped

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Gemma4-31B: 60-layer dense 30.7B model, 20GB Q4_K_M, 1.6GB DFlash draft.
Server confirmed running at 32K/tq3_0/budget=8 on bragi (24GB VRAM).
Benchmark in progress.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Models ≥20 GB (gemma-4-31b at 21 GB, qwen3.6-moe at 22 GB) leave only
~2-3 GB for KV on 24 GB VRAM; the previous heuristic suggested max_ctx=98304,
which would OOM. The heuristic now caps max_ctx at 32K when approx_total_gb ≥ 20.

- runtime_from_host(host, preset="") accepts optional preset name
- _preset_approx_gb() looks up PRESETS.approx_total_gb for size awareness
- CLI passes cfg.model.preset to autotune
- _coding_agent_loop_candidates seeds from preset-aware base
- Tests: add large-model and unknown-preset coverage
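
A minimal sketch of the size-aware cap, assuming a hypothetical `cap_max_ctx` helper. Treating an unknown preset (`None`) as "no cap" is an assumption here; the commit does not state what the real code does in that case, and the real heuristic may also depend on VRAM.

```python
from typing import Optional

LARGE_MODEL_GB = 20.0        # threshold from the commit message
LARGE_MODEL_MAX_CTX = 32768  # 32K cap from the commit message


def cap_max_ctx(suggested_max_ctx: int, approx_total_gb: Optional[float]) -> int:
    # Large models leave little VRAM for KV, so clamp the suggested context.
    if approx_total_gb is not None and approx_total_gb >= LARGE_MODEL_GB:
        return min(suggested_max_ctx, LARGE_MODEL_MAX_CTX)
    return suggested_max_ctx  # assumed: unknown or small presets are untouched


gemma_ctx = cap_max_ctx(98304, 21.0)
```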

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…026-05-31

Full nothink 32K sweep results on RTX 5090 Laptop MaxQ:
- gsm8k 95% (+14pp vs Qwen3.6), agent_recorded 38.5% (=Qwen3.6)
- code 70% (-20pp), hellaswag 79% (-14pp), truthfulqa 79% (-3pp)
- longctx 33% (-67pp): Gemma4 template expansion causes HTTP 400 at frontier-8k+

Key operational lessons documented:
- DFlash server hang bug: forge Anthropic-format + kill → infinite GPU loop
- Use --max-tokens 512 for agent_recorded (4096 too slow at 22 tok/s effective)
- Effective context limit ~4K real tokens at max_ctx=32768 for this model

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Reconfirmation sweep after Gemma4-31B session left config in non-optimal
state. Winner: budget=22, max_ctx=98304, tq3_0 (applied).

New findings:
- budget=32+65K+q8_0 causes GPU compute hang (SM=100%, mem=0-1%), not
  a silent OOM crash as previously attributed — same DFlash hang bug
  as Gemma4-31B, now reproduced with Qwen3.6-27B
- budget=32 at 98K context is 35% slower in decode than budget=22
  (30.3s vs 22.4s) due to verification overhead with 84K KV cache
- budget=16 and budget=22 are functionally equivalent at 98K (within
  noise); budget=32 is clearly suboptimal
- Winner is budget=22 vs budget=16 on 05-30; difference is within
  measurement noise (0.912 vs 0.905 tok/s speed_metric)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comprehensive record of all tuning decisions for bragi (RTX 5090 Laptop,
23 GB VRAM, WSL2) covering sessions 2026-05-30 through 2026-06-01.

Documents optimal Qwen3.6-27B config (budget=22, 98K, tq3_0), safe/unsafe
parameter combinations, known issues (DFlash hang, prefix cache regression),
model matrix, and sweep history.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add comment in _coding_agent_loop_qwen_bracket explaining that
budget=32+q8_0 at 65K context is kept in the sweep bracket despite
being known to cause a GPU compute hang (SM=100%, mem=0%) on 23 GB
cards (observed 2026-06-01). The sweep handles it correctly via 300s
timeout + systemd restart which clears the GPU state.

Reference: docs/experiments/qwen3.6-27b-coding-agent-loop-sweep-bragi-2026-06-01.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…regression)

Ran agent_recorded benchmark (26 cases) against Qwen3.6-27B at the
winning sweep config (budget=22, max_ctx=98304, tq3_0). Result: 9/26
(34.6%) vs dc20057e baseline 10/26 (38.5%). 7 cases flipped in both
directions; 1-case net delta is within noise at n=26 (σ≈9.5pp).
No quality regression from the new config.
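
One way to arrive at a figure close to the quoted σ is the binomial standard error of a pass rate at n=26. The commit does not say how σ was computed, so this is an assumption offered as a sanity check:

```python
import math


def binomial_sigma_pp(passes: int, n: int) -> float:
    """Standard error of a pass rate, in percentage points."""
    p = passes / n
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


sigma_pp = binomial_sigma_pp(10, 26)   # baseline 10/26
delta_pp = 100.0 * (10 - 9) / 26       # one-case net difference
```

Under this assumption the one-case delta (about 3.8pp) sits well inside one standard error (about 9.5pp).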

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All tunables (budget, max_ctx, KV quant, prefix cache, pflash, fa_window)
have been swept. Documents final status and future-work blockers
(prefix cache snapshot path bug, Gemma4-31B think mode not wired).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…s active

Previously n_gen_cap = min(think_ceiling + reply_budget, max_tokens) caused
immediate force-close (step=0) for any request where max_tokens < reply_budget
(e.g. gsm8k at 2048, agent_recorded at 4096, code at 2048). Benchmarks sized
their max_tokens for nothink responses, so thinking was silently disabled.

Fix: n_gen = think_ceiling + min(max_tokens, hard_limit_reply_budget), treating
max_tokens as the post-thinking response budget rather than the total token cap.
Also clamp hard_limit_remaining to min(max_output, eff_reply_budget) so the
force-close boundary correctly reflects the available response window.
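
The two formulas can be compared numerically. The budget values below are illustrative placeholders, not the configured defaults, and the helper names are hypothetical.

```python
def n_gen_old(think_ceiling: int, reply_budget: int, max_tokens: int) -> int:
    # Pre-fix: max_tokens caps thinking and reply together.
    return min(think_ceiling + reply_budget, max_tokens)


def n_gen_new(think_ceiling: int, hard_limit_reply_budget: int, max_tokens: int) -> int:
    # Post-fix: max_tokens only bounds the post-thinking reply.
    return think_ceiling + min(max_tokens, hard_limit_reply_budget)


THINK_CEILING = 8192   # illustrative value
REPLY_BUDGET = 4096    # illustrative value
MAX_TOKENS = 2048      # e.g. a request sized for a nothink answer

old_cap = n_gen_old(THINK_CEILING, REPLY_BUDGET, MAX_TOKENS)
new_cap = n_gen_new(THINK_CEILING, REPLY_BUDGET, MAX_TOKENS)
```

With these numbers the old cap (2048) is smaller than the reply budget alone, which matches the step-0 force-close described above; the new cap keeps the full thinking ceiling and adds a 2048-token reply window.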

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PR Luce-Org#326 wired soft-close (Level 2 voluntary close) into the Qwen3.5 AR
loop, but on qwen3.6-27b the comparator never fired across 1085 steps
of a sample trajectory (prob_ratio < 1e-8 every step). Root cause: the
field `BudgetHook::close_token_ids` was used for BOTH

  (a) the peek probe id read by `soft_close::should_fire(..., close0)`
  (b) the inject sequence written when the hook fires.

For the qwen3.6-27b model card the `thinking_terminator_hint` is the
~16-token English directive

    "Considering the limited time by the user, I have to give the
     solution based on the thinking directly now.\n</think>\n\n"

so close_token_ids[0] tokenizes to the id for "Considering" (~79939) —
a mid-sentence content token whose logit sits 19-35 nats below the
chosen token at every step. The peek therefore reported a perpetually
near-zero prob_ratio and the soft-close dial (min_ratio 0.1..0.9) was
empirically inert.

Fix (path α): split probe-vs-inject in `BudgetHook`

  - close_token_ids — unchanged role. Full inject sequence written
    on hard close or when soft-close fires. Multi-token directive
    for trained-hint sidecars (Qwen3.6); single marker token for
    bare-marker arches.
  - soft_close_probe_ids — NEW. Short sequence (typically one token)
    used only for the comparator peek. When the operator card has a
    distinct marker substring inside the hint, server_main tokenizes
    just that marker and ships it via this field. When empty,
    `BudgetHook::soft_close_probe_token()` falls back to
    close_token_ids.front() (legacy behavior — zero churn for
    sidecars without a separate marker).

server_main detects the marker substring inside the hint and tokenizes
it in isolation; on miss it warns and leaves the probe field empty
(legacy peek path stays in force). The AR-loop soft-close lambda in
qwen35_backend.cpp now peeks `budget_hook.soft_close_probe_token()`
and writes `close_token_ids.front()` on fire — the inject sequence is
unchanged downstream. `[soft-trace]` lines now report the probe token
id under `close0=...` so trajectory CSVs remain interpretable.

Hard-close path is untouched: it continues to use close_token_ids
verbatim, matching the contract that the operator-resolved directive
is what's emitted at the budget boundary.
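
The split and its legacy fallback can be modelled in a few lines of Python. This is a stand-in for the C++ `BudgetHook`, not the actual implementation: `MARKER_ID` is a placeholder rather than a real token id, and the `prob_ratio` formula is an assumption (the softmax probability ratio of two tokens) about what the comparator sees.

```python
import math
from dataclasses import dataclass, field


@dataclass
class BudgetHook:
    """Python stand-in for the C++ struct; only the two id vectors are modelled."""
    close_token_ids: list[int]
    soft_close_probe_ids: list[int] = field(default_factory=list)

    def soft_close_probe_token(self) -> int:
        # Empty probe vector: legacy behavior, peek the first inject id.
        if self.soft_close_probe_ids:
            return self.soft_close_probe_ids[0]
        return self.close_token_ids[0]


def prob_ratio(logit_probe: float, logit_chosen: float) -> float:
    # Assumed comparator input: p(probe) / p(chosen) = exp(logit difference).
    return math.exp(logit_probe - logit_chosen)


CONSIDERING_ID = 79939  # first inject id, per the commit message
MARKER_ID = 1001        # placeholder for the marker token id (not a real id)

split = BudgetHook([CONSIDERING_ID, 11, 12], soft_close_probe_ids=[MARKER_ID])
legacy = BudgetHook([CONSIDERING_ID, 11, 12])

# A probe logit 19 nats below the chosen token gives a ratio under 1e-8.
dead_ratio = prob_ratio(-19.0, 0.0)
```

Under the assumed formula, a 19-nat gap alone is enough to explain why no `min_ratio` in 0.1..0.9 could fire when the peek read the "Considering" id.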

Tests
-----
+ test_soft_close_probe_uses_probe_ids_not_inject_ids — verifies the
  peek reads probe[0] when set, NOT inject[0]. Builds a logit row
  where inject[0]'s logit is far below chosen but probe[0]'s logit
  is close to chosen; asserts soft fires and the WRITTEN token is
  inject[0] (not probe[0]).
+ test_soft_close_probe_ids_empty_falls_back_to_close_token_ids —
  guarantees pre-split behavior when the probe field is left empty
  (no churn for legacy sidecars / unit-test BudgetHook construction).
+ test_soft_close_inject_sequence_unchanged_when_fires — multi-token
  inject case: on fire we stream inject[0], inject[1], inject[2]
  verbatim regardless of what's in soft_close_probe_ids.

Also fix a pre-existing OOB in test_soft_close_determinism_when_disabled
(vocab=1000 row indexed at 248069). The UB previously passed silently in
Release builds, but the adjacent test additions perturbed the glibc heap
layout enough to crash; widen the row to vocab=250000.

15 soft-close tests pass (12 existing + 3 new). 1985 total assertions;
the two remaining failures are pre-existing `test_emitter_content_mode_*`
unrelated to soft-close (PR Luce-Org#329 emitter work).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment

9 issues found across 253 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="docs/experiments/qwen3.6-27b-prefix-cache-regression-bragi-2026-05-31.md">

<violation number="1" location="docs/experiments/qwen3.6-27b-prefix-cache-regression-bragi-2026-05-31.md:49">
P2: Pass/fail table contains contradictory entries for cases 9 and 17 — ranges 7–11 and 14–18 show them as FAIL/FAIL but individual rows and the summary show them as regressions (PASS→FAIL). Readers cannot tell which data is authoritative.</violation>
</file>

<file name="server/src/qwen35/c2_gate.h">

<violation number="1" location="server/src/qwen35/c2_gate.h:28">
P2: The C2 gate uses overflow-prone `int` multiplication (`2 * fa_window_cfg`), which can misroute decode mode for large configured `fa_window` values.</violation>
</file>

<file name="harness/src/harness/clients/pi.py">

<violation number="1" location="harness/src/harness/clients/pi.py:80">
P3: `--tools` is ignored in interactive mode, causing inconsistent behavior and a misleading CLI contract.</violation>
</file>

<file name="harness/clients/README.md">

<violation number="1" location="harness/clients/README.md:118">
P2: Inaccurate documentation: the default sweep runs 5 level1 areas (smoke/code/gsm8k/agent/longctx), not "all 4 stdlib areas". Also "HumanEval" is not a luce-bench area name — it's a dataset inside the `code` area. This will mislead users about which areas to set and what the default covers.</violation>
</file>

<file name="Makefile">

<violation number="1" location="Makefile:73">
P2: `MODELS_DIR` is unquoted in the Docker bind mount, so paths with spaces/special characters break `serve` and can mount the wrong source path.</violation>

<violation number="2" location="Makefile:112">
P1: `clean-models` uses an unquoted, unguarded `rm -rf $(MODELS_DIR)/*`, which can delete unintended files when the path is malformed or overridden unsafely.</violation>
</file>

<file name="harness/src/harness/clients/codex.py">

<violation number="1" location="harness/src/harness/clients/codex.py:64">
P2: `launch()` does not create a user-supplied `work_dir`, so writing `config.toml` can fail with `FileNotFoundError`.</violation>
</file>

<file name="server/src/draft/draft_gguf_loader.cpp">

<violation number="1" location="server/src/draft/draft_gguf_loader.cpp:363">
P1: New strict metadata-vs-shape assertions can reject valid mismatched GGUF drafts (notably Gemma4) before downstream shape-based correction runs.</violation>
</file>

<file name=".github/workflows/ci.yml">

<violation number="1" location=".github/workflows/ci.yml:23">
P2: Lint and typecheck steps using `uv run --frozen --extra dev` will trigger a full re-sync that installs the cu128 torch wheel (~2 GB), defeating the `--no-install-package torch` optimization in `check_uv_workspace.sh` that was explicitly designed to keep this job fast.</violation>
</file>

Note: This PR contains a large number of files. cubic only reviews up to 100 files per PR, so some files may not have been reviewed. cubic prioritizes the most important files to review.

Comment thread Makefile
.PHONY: clean-models
clean-models: ## Remove downloaded models from $(MODELS_DIR). Destructive.
@echo "WARN: about to rm -rf $(MODELS_DIR)/*"
@read -p "Continue? [y/N] " ans && [ "$$ans" = "y" ] && rm -rf $(MODELS_DIR)/*

P1: clean-models uses an unquoted, unguarded rm -rf $(MODELS_DIR)/*, which can delete unintended files when the path is malformed or overridden unsafely.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At Makefile, line 112:

<comment>`clean-models` uses an unquoted, unguarded `rm -rf $(MODELS_DIR)/*`, which can delete unintended files when the path is malformed or overridden unsafely.</comment>

<file context>
@@ -0,0 +1,112 @@
+.PHONY: clean-models
+clean-models:  ## Remove downloaded models from $(MODELS_DIR). Destructive.
+	@echo "WARN: about to rm -rf $(MODELS_DIR)/*"
+	@read -p "Continue? [y/N] " ans && [ "$$ans" = "y" ] && rm -rf $(MODELS_DIR)/*
</file context>

const int64_t derived_kv_dim = L0.wk->ne[1];
const int64_t expected_q_dim = (int64_t)out.n_head * out.head_dim;
const int64_t expected_kv_dim = (int64_t)out.n_head_kv * out.head_dim;
if (derived_q_dim != expected_q_dim) {

P1: New strict metadata-vs-shape assertions can reject valid mismatched GGUF drafts (notably Gemma4) before downstream shape-based correction runs.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/draft/draft_gguf_loader.cpp, line 363:

<comment>New strict metadata-vs-shape assertions can reject valid mismatched GGUF drafts (notably Gemma4) before downstream shape-based correction runs.</comment>

<file context>
@@ -349,6 +349,63 @@ bool load_draft_gguf(const std::string & path,
+        const int64_t derived_kv_dim = L0.wk->ne[1];
+        const int64_t expected_q_dim  = (int64_t)out.n_head * out.head_dim;
+        const int64_t expected_kv_dim = (int64_t)out.n_head_kv * out.head_dim;
+        if (derived_q_dim != expected_q_dim) {
+            char buf[256];
+            std::snprintf(buf, sizeof(buf),
</file context>

| 4 | FAIL | FAIL |
| 5 | PASS | PASS |
| 6 | FAIL | FAIL |
| 7–11 | FAIL | FAIL |

P2: Pass/fail table contains contradictory entries for cases 9 and 17 — ranges 7–11 and 14–18 show them as FAIL/FAIL but individual rows and the summary show them as regressions (PASS→FAIL). Readers cannot tell which data is authoritative.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At docs/experiments/qwen3.6-27b-prefix-cache-regression-bragi-2026-05-31.md, line 49:

<comment>Pass/fail table contains contradictory entries for cases 9 and 17 — ranges 7–11 and 14–18 show them as FAIL/FAIL but individual rows and the summary show them as regressions (PASS→FAIL). Readers cannot tell which data is authoritative.</comment>

<file context>
@@ -0,0 +1,84 @@
+| 4 | FAIL | FAIL |
+| 5 | PASS | PASS |
+| 6 | FAIL | FAIL |
+| 7–11 | FAIL | FAIL |
+| 12 | PASS | PASS |
+| 13 | PASS | FAIL (regression) |
</file context>

int kv_committed) {
(void)kv_committed;
return (fa_window_override == 0)
|| (fa_window_override <= 2 * fa_window_cfg);

P2: The C2 gate uses overflow-prone int multiplication (2 * fa_window_cfg), which can misroute decode mode for large configured fa_window values.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35/c2_gate.h, line 28:

<comment>The C2 gate uses overflow-prone `int` multiplication (`2 * fa_window_cfg`), which can misroute decode mode for large configured `fa_window` values.</comment>

<file context>
@@ -0,0 +1,31 @@
+                                     int kv_committed) {
+    (void)kv_committed;
+    return (fa_window_override == 0)
+        || (fa_window_override <= 2 * fa_window_cfg);
+}
+
</file context>
Suggested change
-        || (fa_window_override <= 2 * fa_window_cfg);
+        || (static_cast<long long>(fa_window_override) <=
+            2LL * static_cast<long long>(fa_window_cfg));
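
To see why the widening matters, the 32-bit product can be emulated in Python. Signed overflow is undefined behavior in C++, so the wraparound below is only one common outcome, not a guarantee; the function names are hypothetical.

```python
import ctypes


def gate_int32(fa_window_override: int, fa_window_cfg: int) -> bool:
    # Emulates a 32-bit wraparound of 2 * fa_window_cfg.
    doubled = ctypes.c_int32(2 * fa_window_cfg).value
    return fa_window_override == 0 or fa_window_override <= doubled


def gate_wide(fa_window_override: int, fa_window_cfg: int) -> bool:
    # Python ints are arbitrary precision, mirroring the long long cast.
    return fa_window_override == 0 or fa_window_override <= 2 * fa_window_cfg


big_cfg = 1_500_000_000  # 2 * big_cfg exceeds INT32_MAX
narrow = gate_int32(4096, big_cfg)
wide = gate_wide(4096, big_cfg)
```

In the emulation the doubled value wraps negative, so the narrow gate rejects an override that the widened comparison accepts.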

Comment thread harness/clients/README.md
it would break a real-client launcher above.

```bash
# Full sweep (default — runs all 4 stdlib areas)

P2: Inaccurate documentation: the default sweep runs 5 level1 areas (smoke/code/gsm8k/agent/longctx), not "all 4 stdlib areas". Also "HumanEval" is not a luce-bench area name — it's a dataset inside the code area. This will mislead users about which areas to set and what the default covers.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At harness/clients/README.md, line 118:

<comment>Inaccurate documentation: the default sweep runs 5 level1 areas (smoke/code/gsm8k/agent/longctx), not "all 4 stdlib areas". Also "HumanEval" is not a luce-bench area name — it's a dataset inside the `code` area. This will mislead users about which areas to set and what the default covers.</comment>

<file context>
@@ -102,6 +103,29 @@ OpenAI Chat Completions clients can call llama.cpp directly. Claude Code and
+it would break a real-client launcher above.
+
+```bash
+# Full sweep (default — runs all 4 stdlib areas)
+harness/clients/run_lucebench.sh
+
</file context>

Comment thread Makefile
.PHONY: serve
serve: ## Run the local image, foreground. Models bind-mounted from $(MODELS_DIR).
docker run --rm --gpus all -p 8080:8080 \
-v $(MODELS_DIR):/opt/lucebox-hub/server/models:ro \

P2: MODELS_DIR is unquoted in the Docker bind mount, so paths with spaces/special characters break serve and can mount the wrong source path.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At Makefile, line 73:

<comment>`MODELS_DIR` is unquoted in the Docker bind mount, so paths with spaces/special characters break `serve` and can mount the wrong source path.</comment>

<file context>
@@ -0,0 +1,112 @@
+.PHONY: serve
+serve:  ## Run the local image, foreground. Models bind-mounted from $(MODELS_DIR).
+	docker run --rm --gpus all -p 8080:8080 \
+		-v $(MODELS_DIR):/opt/lucebox-hub/server/models:ro \
+		--name lucebox-gemma \
+		$(IMAGE) serve
</file context>
Suggested change
-		-v $(MODELS_DIR):/opt/lucebox-hub/server/models:ro \
+		-v "$(MODELS_DIR)":/opt/lucebox-hub/server/models:ro \

"""
codex_bin = find_bin("codex", env_var="CODEX_BIN",
work_dir_hint="clients/codex/npm/bin/codex")
home = work_dir or mktempdir("codex")

P2: launch() does not create a user-supplied work_dir, so writing config.toml can fail with FileNotFoundError.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At harness/src/harness/clients/codex.py, line 64:

<comment>`launch()` does not create a user-supplied `work_dir`, so writing `config.toml` can fail with `FileNotFoundError`.</comment>

<file context>
@@ -0,0 +1,129 @@
+    """
+    codex_bin = find_bin("codex", env_var="CODEX_BIN",
+                         work_dir_hint="clients/codex/npm/bin/codex")
+    home = work_dir or mktempdir("codex")
+    write_config(home, base_url=base_url, model=model,
+                 sandbox=sandbox, wire_api=wire_api)
</file context>
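
One possible shape for a fix, sketched with the standard library only. `ensure_home` is a hypothetical helper and does not reflect the harness's own `mktempdir` or `write_config`; it just shows creating the directory before anything is written into it.

```python
import tempfile
from pathlib import Path
from typing import Optional


def ensure_home(work_dir: Optional[str]) -> Path:
    # Use the caller's directory when given, else a fresh temp directory.
    home = Path(work_dir) if work_dir else Path(tempfile.mkdtemp(prefix="codex-"))
    # Create missing parents so a later config write cannot hit FileNotFoundError.
    home.mkdir(parents=True, exist_ok=True)
    return home


target = Path(tempfile.mkdtemp()) / "nested" / "home"
home = ensure_home(str(target))
(home / "config.toml").write_text("# placeholder\n")
```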

Comment thread .github/workflows/ci.yml
run: bash scripts/check_uv_workspace.sh

- name: Lint Python surfaces touched by lucebox tooling
run: uv run --frozen --extra dev ruff check .

P2: Lint and typecheck steps using uv run --frozen --extra dev will trigger a full re-sync that installs the cu128 torch wheel (~2 GB), defeating the --no-install-package torch optimization in check_uv_workspace.sh that was explicitly designed to keep this job fast.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .github/workflows/ci.yml, line 23:

<comment>Lint and typecheck steps using `uv run --frozen --extra dev` will trigger a full re-sync that installs the cu128 torch wheel (~2 GB), defeating the `--no-install-package torch` optimization in `check_uv_workspace.sh` that was explicitly designed to keep this job fast.</comment>

<file context>
@@ -10,20 +10,46 @@ jobs:
         run: bash scripts/check_uv_workspace.sh
 
+      - name: Lint Python surfaces touched by lucebox tooling
+        run: uv run --frozen --extra dev ruff check .
+
+      - name: Typecheck lucebox CLI
</file context>

"PI_OFFLINE": "1",
}
argv: list[str] = [bin_path]
if interactive:

P3: --tools is ignored in interactive mode, causing inconsistent behavior and a misleading CLI contract.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At harness/src/harness/clients/pi.py, line 80:

<comment>`--tools` is ignored in interactive mode, causing inconsistent behavior and a misleading CLI contract.</comment>

<file context>
@@ -0,0 +1,130 @@
+        "PI_OFFLINE": "1",
+    }
+    argv: list[str] = [bin_path]
+    if interactive:
+        if extra_args:
+            argv += extra_args
</file context>

easel commented Jun 3, 2026

Closing in favor of consolidating the probe/inject split fix directly onto PR #326's branch.

PR #331 had a fundamentally wrong base — it was opened against main with a branch cut from feat/lucebox-docker, so its diff (100 commits) included all of PR #285's umbrella changes plus PR #326's soft-close, plus my probe/inject fix. That made it impossible to review as a soft-close bugfix.

The probe/inject split (commit 175c8a72 here) has been cherry-picked onto feat/soft-close-thinking-termination as c9c410c0, with a follow-up commit 91886a9f adding a min_thinking_tokens floor (false-positive guard motivated by the empirical trajectory data). PR #326 has been updated to describe the full feature surface.

🤖 Generated with Claude Code
