fix(server): split soft-close probe ids from inject ids #331
…g-42 tail-capture guard ee7 truncates drafter forward at layer 7 of 28, scoring only those layers. 9.3× drafter wall at 128K (RTX 3090, Qwen3.6-27B-Q4_K_M target + Qwen2.5-0.5B-BF16 drafter). Anchor-transitive cascade rescues multi-hop on bimodal-density prompts (gated, default OFF). Bug Luce-Org#42 fix: tail-capture view-bounds guard at S%4096 in {1..7}. 5 unit tests included. Bench scripts split to follow-up PR.
At >=32K context the needle text is more likely to straddle multiple chunks (chunk_size=32), and the fixed anchor_radius=2 window (5 chunks ~160 tokens) loses the back half of the needle digits: the model retrieves '...is 4' but truncates/hallucinates the continuation.
Adaptive scaling based on n_chunks:
- <32K context (<1024 chunks): radius=2, max_anchor_hits=8 (unchanged)
- 32-64K (1024-2047 chunks): radius=4, max_anchor_hits=16
- >=64K (>=2048 chunks): radius=8, max_anchor_hits=32
Override via PFLASH_COMPRESS_ANCHOR_RADIUS / PFLASH_COMPRESS_MAX_ANCHOR_HITS env vars (legacy DFLASH_COMPRESS_* names still accepted).
Validated at 49K context: NIAH needle 'kowefada 1596346' correctly retrieved (was: '1594' or hallucinated 'is 048394839483' before fix). Resolves the long-standing 'project_64k_quality_cliff' memory entry.
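The tiering above can be sketched in a few lines of Python. This is only an illustration of the thresholds quoted in the commit message, not the server's C++ code: the function names, the ceiling division used to count chunks, and the "new name first, legacy name second" lookup order are assumptions.

```python
import os

CHUNK_SIZE = 32  # tokens per chunk, as quoted in the commit message


def _env_int(env, suffix, default):
    # Assumption: the PFLASH_ spelling is consulted before the legacy DFLASH_ one.
    for prefix in ("PFLASH_COMPRESS_", "DFLASH_COMPRESS_"):
        raw = env.get(prefix + suffix)
        if raw:
            return int(raw)
    return default


def anchor_params(n_tokens, env=None):
    """Return (anchor_radius, max_anchor_hits) for a prompt of n_tokens."""
    env = os.environ if env is None else env
    n_chunks = (n_tokens + CHUNK_SIZE - 1) // CHUNK_SIZE  # assumption: round up
    if n_chunks < 1024:        # <32K context
        radius, max_hits = 2, 8
    elif n_chunks < 2048:      # 32-64K
        radius, max_hits = 4, 16
    else:                      # >=64K
        radius, max_hits = 8, 32
    radius = _env_int(env, "ANCHOR_RADIUS", radius)
    max_hits = _env_int(env, "MAX_ANCHOR_HITS", max_hits)
    return radius, max_hits


def window_tokens(radius):
    # A radius-r window covers 2r+1 chunks centred on the anchor chunk.
    return (2 * radius + 1) * CHUNK_SIZE
```

With radius=2 the window is 5 chunks (160 tokens), matching the figure above; at the validated 49K context the sketch selects radius=4, i.e. 9 chunks (288 tokens).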
Mirror the gemma4_backend.cpp:75-104 defensive pattern for the qwen35 target loader and the dflash decode draft loader. After loading weight tensors, derive head_dim / n_head / n_head_kv from wq->ne[1] / wk->ne[1] and compare against GGUF-declared values; set_last_error and return false on mismatch. Makes the 'stale scalar at graph-build time' bug class structurally impossible. Load-time only, no runtime cost. Existing well-formed GGUFs are unaffected (smoke verified).
When pflash compresses, set gen_req.fa_window_override = effective_prompt + 256 so spec-decode verify sees the entire compressed prompt. Pflash already paid compute to pick which tokens matter; verify never throws any of them away. When the override would exceed 2 * cfg_.fa_window (spec-decode's drafter cost stops earning its tok/J), the C2 gate in qwen35_backend's generate() falls back to AR (fa_window=0, full attention). AR sees every kept token at every context; we choose mechanism, not visibility. Zero new CLI flags. --draft remains the only knob for composition; all per-request adaptation is internal.
…scade default-on Adds backwards-compat fallback wrappers for 6 cascade env vars in both standard and bandit code paths, so harness scripts using either spelling work against this binary. Emits one-time WARN to stderr when the legacy DFLASH_* spelling is honored. Also flips the default for `use_transitive` from `false` to `true` because the gated rare-token bridge improves multi-hop F1 with zero downside in the cascade-already-firing case.
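A rough Python sketch of the fallback-wrapper idea follows. The real wrappers live in the C++ server; the helper name `getenv_compat`, the per-variable (rather than global) one-time warning, and the warning text are assumptions made for illustration. The variable used in the usage note is one named in an earlier commit, not necessarily one of the six cascade variables.

```python
import sys

_warned = set()  # legacy names that have already produced a WARN


def getenv_compat(suffix, default=None, env=None, warn=None):
    """Look up PFLASH_<suffix>, falling back to the legacy DFLASH_<suffix>."""
    if env is None:
        import os
        env = os.environ
    if warn is None:
        warn = lambda msg: print(msg, file=sys.stderr)
    new_name = "PFLASH_" + suffix
    old_name = "DFLASH_" + suffix
    if new_name in env:
        return env[new_name]
    if old_name in env:
        if old_name not in _warned:  # warn once per legacy variable (assumption)
            _warned.add(old_name)
            warn(f"WARN: {old_name} is a legacy spelling; prefer {new_name}")
        return env[old_name]
    return default
```

Calling `getenv_compat("COMPRESS_ANCHOR_RADIUS")` twice with only the `DFLASH_` spelling set returns the value both times but warns only on the first call.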
…th drift Single helper reads all 10 PFLASH_*/DFLASH_* env vars once. Both qwen35_score_and_compress and drafter_score_and_compress call it. Removes two 70-LOC duplicate env-reading blocks and the duplicated anchor-radius comment. Also removes dead force_chunk_neighborhood (no callers) and collapses the 4-overload load_drafter pyramid to one canonical implementation + 3 thin forwarders.
- qwen3_graph.cpp: collapse 18-line alg-note, trim VRAM prose (3 blocks), remove early_exit_n alias (inline early_exit_pre at call site)
- qwen35_backend.cpp: C2 gate 9-line → 2-line + docs ref; do_ar_decode budget-hook 15-line → 4-line + docs ref
- http_server.cpp: Design 1 rationale 13-line → 2-line + docs ref
- model_backend.h: BudgetHook 23-line essay → 3-line + docs ref
- gguf_target_loader.cpp: 4-line prose tail → 1-line
- .gitignore: ignore *.git-head / *.pre-pflash-rename workdir artifacts
- docs/: pflash-compress-cfg.md, pflash-adaptive-composition.md, anchor-transitive.md (consolidated rationale)
…nking is off The hard-coded renderer appends a closed think prefill when thinking is disabled. Some Qwen3.6 Jinja templates omit that final assistant suffix, leaving the model in the wrong decoding state for tool use. Mirror the hard-coded behavior here when the rendered prompt ends with a bare assistant generation prompt; tolerate trailing-whitespace variants (single \n, double \n\n, trailing space). Diagnosed by Round 5b D peer-chat showing dflash drafter accept_rate=0.0%: the drafter was distilled with the closed-think suffix in its training distribution; the Unsloth Qwen3-Coder template doesn't emit it, so target and drafter disagree on what comes after <|im_start|>assistant\n.
… only The previous commit applied the closed-think suffix to all Jinja-rendered prompts. Add arch_hint (ChatFormat) parameter to render_chat_template_jinja, defaulting to QWEN3, and guard the post-processing block with arch_hint == ChatFormat::QWEN3. Call site in http_server.cpp passes chat_format_ so other archs (Laguna, Gemma4) are unaffected. qwen35moe inherits ChatFormat::QWEN3 by design (matches drafter distillation). 5 unit tests cover: thinking-off appends, thinking-on no-append, non-Qwen3 arch no-append (Laguna + Gemma4), qwen35moe inherits QWEN3, no double-append when template already closes the think block. Diagnosis + verification protocol in docs/pflash-drafter-template-alignment.md.
Extract the C2 spec-decode gate from an inline expression in qwen35_backend.cpp into a pure predicate header c2_gate.h. Zero behavior change. Identical math:
(fa_window_override == 0) || (fa_window_override <= 2 * fa_window_cfg)
The new header documents the empirically-derived rationale: at compressed KV sizes (pflash compression of long prompts), T_draft/T_target ratio approaches 1, eliminating spec-decode's profit margin over AR. Empirical at D_composition 128K replay: AR=27.5 tok/s vs forced spec-decode=5.74 tok/s. The gate correctly blocks spec-decode when eff_fa_window > 2*fa_window_cfg.
Adds 5 unit tests locking in the predicate's behavior with explicit Round 5 4-arm matrix bench citations.
Files:
- server/src/qwen35/c2_gate.h (new)
- server/src/qwen35/qwen35_backend.cpp (+1 include, inline -> call)
- server/test/test_server_unit.cpp (+60 LOC, 5 tests)
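For readers who want to poke at the predicate outside the C++ tree, here is a minimal Python transcription. The two function names are hypothetical; the expressions are the ones quoted in this commit and in the earlier `effective_prompt + 256` override commit. The window size used in the usage note is an arbitrary example, not a server default.

```python
def pflash_fa_window_override(effective_prompt: int) -> int:
    """Override requested after pflash compression: whole compressed prompt + 256."""
    return effective_prompt + 256


def c2_allows_spec_decode(fa_window_override: int, fa_window_cfg: int) -> bool:
    """True when spec-decode may run; False means fall back to AR decode."""
    return fa_window_override == 0 or fa_window_override <= 2 * fa_window_cfg
```

For example, with a configured window of 2048, an effective prompt of 3840 tokens yields an override of 4096 and still passes the gate, while 3841 tokens yields 4097 and falls back to AR.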
…nch in-tree
Squashes 78 commits from feat/lucebox-docker (PR Luce-Org#285) onto origin/main. Net: 189 files changed.
Major workstreams folded in:
* Docker prebuild stack: ghcr.io/easel/lucebox-hub:cuda12 image, multi-stage Dockerfile, docker-bake.hcl, .github/workflows/docker.yml with GHA cache, build identity baked into /opt/lucebox-hub/IMAGE_INFO + /opt/lucebox-hub/HOST_INFO.
* Host wrapper (lucebox.sh): probe_host, smart cmd_serve (INVOCATION_ID guard, container-state preflight), cmd_systemctl_passthrough (already-active short-circuit, restart-loop detection), cmd_update (bootstrap-installer pattern), cmd_completion (bash/zsh/fish), config.toml reader (env > toml > default precedence), shellcheck-clean.
* Bootstrap installer (install.sh): bakes LUCEBOX_INSTALLED_FROM into the installed copy so lucebox update keeps tracking the channel; refuses SHA-pinned URLs without LUCEBOX_INSTALL_CHANNEL.
* In-container Python CLI (lucebox/): sparse config.toml persistence, config get/set/unset sub-app, models list/download sub-app (replaces download-models), autotune with --apply / --json / --sweep, profile collapsed onto luce-bench snapshot (1701 → 183 lines).
* luce-bench: snapshot subcommand + canonical HostInfo schema v2 + levels (level0/1/2/3) + report subcommand + submit-baseline + regrade.
* Server (C++): /props.host block + props_schema=4 + host_info read at startup, /props.build identity, GGUF metadata + sha256 sidecars, model card sidecars.
* Harness: client implementations for claude/codex/opencode/hermes/pi.
* Strict 11-field config.toml allowlist for dflash.* runtime tunables.
Deleted (rolled into new structure):
* server/scripts/bench_agent.py, bench_he.py, bench_llm.py: replaced by luce-bench snapshot + areas.
* lucebox configure, lucebox download-models, lucebox benchmark: replaced by config sub-app, models sub-app, autotune --sweep.
* luce-bench --sweep flag: moved to argv-sniff subcommand dispatch.
Conflict resolution:
* server/scripts/bench_{agent,he,llm}.py: modify/delete kept the deletion (feat/lucebox-docker moved bench machinery into luce-bench).
* README.md: took feat-branch version. origin/main had 19 commits worth of minor README tweaks since the branch base; those need to be folded back in as a follow-up PR.
* docs/specs/openapi-props.yaml + docs/specs/props-endpoint.md: took feat-branch version. origin/main had 1 link-fix commit; feat-branch has the schema-4 + host-block additions that strictly supersede.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`_load_or_build()` returned `config_mod.load()`'s result verbatim when config.toml existed, ignoring `LUCEBOX_*` env vars entirely. That contradicted the precedence lucebox.sh documents (env > toml > default) and bit sindri in production: its config.toml had `[image]` without a `registry` line, so the dataclass default `ghcr.io/luce-org/lucebox-hub` beat the systemd unit's `Environment=LUCEBOX_IMAGE=ghcr.io/easel/...`. Symptom: `lucebox start` brought up the wrong (stale luce-org) image even after explicit `lucebox install` + `lucebox pull` against easel. Fix: overlay env on top of whatever `load()` returns (or `live_config()` falls back to). Only the five top-level scalars have env hooks (LUCEBOX_VARIANT/IMAGE/PORT/CONTAINER/MODELS) — dflash/host/model intentionally don't. Adds two regression tests: - env beats config.toml when toml has no explicit value for that key, - env still wins when toml is absent (covers the live_config fallback). 102 lucebox tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
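A minimal sketch of the overlay, assuming a flat dataclass. The field names, every default except the image default quoted above, and the `overlay_env` helper are hypothetical stand-ins for the real `config_mod` types; only the five `LUCEBOX_*` variable names come from the commit message.

```python
from dataclasses import dataclass, replace

# Env var -> config field. Field names are assumptions for illustration.
ENV_HOOKS = {
    "LUCEBOX_VARIANT": "variant",
    "LUCEBOX_IMAGE": "image",
    "LUCEBOX_PORT": "port",
    "LUCEBOX_CONTAINER": "container",
    "LUCEBOX_MODELS": "models",
}


@dataclass(frozen=True)
class Config:
    variant: str = "cuda12"                      # hypothetical default
    image: str = "ghcr.io/luce-org/lucebox-hub"  # dataclass default quoted above
    port: int = 8080                             # hypothetical default
    container: str = "lucebox"                   # hypothetical default
    models: str = "models"                       # hypothetical default


def overlay_env(cfg: Config, env: dict) -> Config:
    """Apply env > toml > default: env values win over whatever load() produced."""
    updates = {}
    for var, field_name in ENV_HOOKS.items():
        raw = env.get(var)
        if raw:
            current = getattr(cfg, field_name)
            updates[field_name] = type(current)(raw)  # keep the field's type (int port)
    return replace(cfg, **updates)
```

In the failure described above, `cfg` would be the result of loading a config.toml that omits the image key (so it carries the dataclass default), and the overlay lets `LUCEBOX_IMAGE` from the systemd unit take precedence.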
…g#285 CI
CI's "Lint Python surfaces touched by lucebox tooling" job ran `ruff check .` and found 11 errors across surfaces this branch touches. Ruff --fix handled 6 (import sorting, unused imports); 5 needed hand-edits:
- luce-bench/src/lucebench/report.py:172 (E741): rename `for l in` → `for lineup in`
- lucebox/tests/test_check.py:39, 95 (E731): lambda → def stub() for the two HostFacts stubs
- lucebox/tests/test_cli.py:95 (E501): wrap the LUCEBOX_HOST_GPU_LIST_CSV setenv
- lucebox/tests/test_sweep.py:174, 177 (E501): wrap two CellResult constructors
22 lucebox tests touched still pass; ruff is clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- test_autotune_candidate_configs.py: sort imports (ruff I001).
- download.py: api.repo_info() returns ModelInfo|DatasetInfo|SpaceInfo|KernelInfo and KernelInfo has no .siblings; use api.model_info(), which returns ModelInfo (correct, since we only query model repos here), resolving the mypy union-attr error.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The job-level `permissions` block replaces the workflow-level default entirely, so `actions/checkout` was running without `contents: read` and would fail on protected refs. Add `contents: read` back alongside the existing `id-token: write`. Addresses cubic #1 on PR Luce-Org#285.
- Dockerfile: keep --frozen on the uv sync fallback so the layer can't silently resolve outside the lockfile.
- harness/clients/run_lucebench.sh: default LUCEBENCH_THINK empty (per-area card defaults govern; --no-think only when explicitly set) and default LUCEBENCH_AREA to the level1 capability gate (smoke,code,gsm8k,agent,longctx) instead of `all`, which was too broad for routine harness runs.
Addresses cubic #2, Luce-Org#3 (P1) and Luce-Org#14 (P2) on PR Luce-Org#285.
…appers
- .github/workflows/{ci,docker,release-luce-bench}.yml: pin
actions/checkout, docker/{setup-buildx,login,metadata,bake}-action,
and astral-sh/setup-uv to immutable commit SHAs with `# vN` comments
so the supply chain is reproducible (Luce-Org#4).
- harness/src/harness/clients/_common.py: replace the external `timeout`
shell-out with `subprocess.run(..., timeout=N)`, return 124 on
TimeoutExpired to match GNU timeout's exit code (Luce-Org#5).
- scripts/build_image.sh: normalize REGISTRY to end in `/` instead of
silently producing `ghcr.io/luce-orglucebox-hub` when the trailing
slash is missing (Luce-Org#6).
- harness/src/harness/clients/pi.py: non-interactive launch now mirrors
run_pi.sh's validated invocation (--provider, --print, --mode json,
--tools, --no-session, --offline) and sets PI_CODING_AGENT_DIR /
PI_CODING_AGENT_SESSION_DIR / PI_OFFLINE (Luce-Org#7).
- docker-bake.hcl: sanitize `+` → `-` in VERSION before composing tags,
since `+` is not a valid Docker tag character (Luce-Org#8).
- harness/src/harness/clients/hermes.py: set HERMES_HOME + the rest of
run_hermes.sh's env wiring and call `chat --provider --model
--accept-hooks --yolo --max-turns --source --query` instead of a bare
positional prompt (Luce-Org#9, Luce-Org#10).
- harness/src/harness/clients/openclaw.py: apply the OpenClaw config
patch via `openclaw config patch --file` before the run, and call
`agent --local --json --model lucebox/<model> --session-id --timeout
--message` instead of a bare positional prompt (Luce-Org#11).
- pyproject.toml: drop the dead dflash/scripts/{prefix_cache,test_server,
tool_memory}.py ruff include pins (those paths were renamed during
the dflash→server rename and then deleted upstream) (Luce-Org#12).
- lefthook.yml: widen the shellcheck/bash-parse glob from `*.sh` to
`**/*.sh` so scripts under nested dirs (harness/clients/*.sh,
scripts/*.sh, server/scripts/*.sh) are linted on commit (Luce-Org#13).
Addresses cubic Luce-Org#4–Luce-Org#13 (P2) on PR Luce-Org#285. Luce-Org#14 was already addressed in
the previous commit alongside the LUCEBENCH_THINK default fix.
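Three of the fixes in the list above are small enough to sketch in Python. The helper names are hypothetical and the bodies are illustrations of the described behavior rather than the repository's code (two of the originals are shell/HCL); only `subprocess.run(..., timeout=N)`, the 124 exit code, the trailing-slash rule, and the `+` → `-` substitution come from the commit message.

```python
import subprocess
import sys


def run_with_timeout(argv, timeout_s):
    """Run argv; return its exit code, or 124 on timeout like GNU timeout."""
    try:
        return subprocess.run(argv, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising TimeoutExpired.
        return 124


def normalize_registry(registry: str) -> str:
    """Ensure the registry prefix ends in '/' before the image name is appended."""
    return registry if registry.endswith("/") else registry + "/"


def sanitize_tag(version: str) -> str:
    """Docker tags may not contain '+', so map it to '-'."""
    return version.replace("+", "-")


if __name__ == "__main__":
    sleeper = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_timeout(sleeper, 0.5))
    print(normalize_registry("ghcr.io/luce-org") + "lucebox-hub")
    print(sanitize_tag("1.2.3+abc123"))
```

Returning 124 keeps callers that previously inspected the exit status of the external `timeout` binary working unchanged.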
- lucebox/README.md: fix the relative link to `cli.py`; resolves to `src/lucebox/cli.py` (the actual location), not the nonexistent `lucebox/cli.py` (Luce-Org#15).
- luce-bench/NOTICE: the bundled forge_eval LICENSE says "Copyright (c) 2025-2026 Antoine Zambelli", not 2024; sync NOTICE with the actual upstream LICENSE (Luce-Org#16).
- luce-bench/src/lucebench/areas/__init__.py: `__all__` was missing agent / agent_recorded / forge / longctx / smoke. Add the imports + list entries so `from lucebench.areas import *` matches the actual area surface (Luce-Org#17).
Addresses cubic Luce-Org#15–Luce-Org#17 (P3) on PR Luce-Org#285.
…nch in-tree
Squashes 8 commits from feat/lucebox-docker (PR Luce-Org#285) into a single commit on top of origin/main (8782d07). Net: 189 files changed.
Workstreams folded in:
* Docker prebuild stack: ghcr.io/easel/lucebox-hub:cuda12 image, multi-stage Dockerfile with reproducible `uv sync --frozen`, docker-bake.hcl with VERSION sanitization for Docker tag charset, .github/workflows/docker.yml with SHA-pinned external actions and GHA cache, build identity baked into /opt/lucebox-hub/IMAGE_INFO + HOST_INFO.
* Host wrapper (lucebox.sh): probe_host, smart cmd_serve (INVOCATION_ID guard against systemd self-defeat, container-state preflight), cmd_systemctl_passthrough (already-active short-circuit, restart-loop detection), cmd_update (bootstrap-installer pattern), cmd_completion (bash/zsh/fish), config.toml reader (env > toml > default), all shellcheck-clean.
* Bootstrap installer (install.sh): bakes LUCEBOX_INSTALLED_FROM into the installed copy so `lucebox update` keeps tracking the channel; refuses SHA-pinned URLs without LUCEBOX_INSTALL_CHANNEL.
* In-container Python CLI (lucebox/): sparse config.toml persistence, config get/set/unset sub-app, models list/download sub-app (replaces download-models), autotune with --apply / --json / --sweep, profile collapsed onto luce-bench snapshot (1701 → ~150 lines). _load_or_build now respects env > toml > default precedence.
* luce-bench: snapshot subcommand + canonical HostInfo schema v2 (multi-GPU lineup, WSL detection, source/collector trust metadata) + levels (level0/1/2/3) + report subcommand (host column + cross-host confounder warnings) + submit-baseline (level3-gated) + regrade.
* Server (C++): /props.host block + props_schema=4 + host_info loader, /props.build identity, GGUF metadata + sha256 sidecars, model card sidecars. Deleted server/scripts/bench_{agent,he,llm}.py; bench machinery moved into luce-bench.
* Harness: client implementations for claude/codex/opencode/hermes/pi pointed at the running lucebox server, matched against the validated run_*.sh shell wrappers.
Cubic AI code review (17 findings) addressed in full:
* P0: contents: read on luce-bench release job permissions.
* P1: Dockerfile `--frozen` reinstated; LUCEBENCH_THINK default empty so per-area defaults apply.
* P2: 6 external actions pinned to immutable SHAs; non-interactive timeout via subprocess.run; REGISTRY trailing-slash normalize; VERSION + Docker tag charset sanitize; harness pi/hermes/openclaw mirrored against run_*.sh wrappers; ruff scan paths corrected to server/scripts/; lefthook glob `**/*.sh`; LUCEBENCH_AREA default level1.
* P3: lucebox/README.md cli.py link fixed; NOTICE copyright year 2025-2026; areas/__init__.py __all__ exposes all 10 areas.
CI on PR Luce-Org#285: all 4 checks green (uv workspace, cmake build, cuda12 prebuild, cubic reviewer).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…default 0.10)
- Gate context-window admission on post-compression effective size, not raw, so >128K-raw prompts compress to fit max_ctx instead of 400 / oversized KV reservation.
- Pre-compression keep-ratio sanity guard (raw*keep+max_out>max_ctx); the real effective-size gate runs post-compression in worker_loop.
- Default prefill-keep-ratio 0.05 -> 0.10: real ~2x compression on agentic content (0.25 over-forces anchor-transitive to ~100% = no-op + rejects >128K).
- Evidence (RTX3090, agentic replay, keep=0.10): 167K raw admitted -> 71K eff (42.6%), prefill 145s vs 845s forced; 32-128K real compression; tool-parse intact; 1629 unit asserts green; 14-cell P/PD sweep zero crashes.
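The two-stage admission check can be sketched as follows. The pre-compression inequality is the one quoted above; the exact form of the post-compression gate is not given in the commit message, so `effective + max_out <= max_ctx` is an assumption, as are the function names and the `max_out` value used in the usage note.

```python
def precheck_admits(raw_tokens: int, keep_ratio: float, max_out: int, max_ctx: int) -> bool:
    """Cheap sanity guard before compression: reject if even the kept share cannot fit."""
    return not (raw_tokens * keep_ratio + max_out > max_ctx)


def postcheck_admits(effective_tokens: int, max_out: int, max_ctx: int) -> bool:
    """Real gate after compression, on the effective (compressed) prompt size.

    Assumption: the output reservation is added to the effective size.
    """
    return effective_tokens + max_out <= max_ctx
```

With max_ctx=131072, keep=0.10, and a hypothetical max_out of 4096, the 167K-raw prompt from the evidence line passes the pre-check (about 16.7K + 4096), and its measured 71K effective size passes the post-check as well.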
…ontent channel
The SseEmitter hard-started in StreamMode::CONTENT and only transitioned to
REASONING when it saw `<think>` in the generated stream. But Qwen3.6 / Laguna
chat templates append `<think>\n` to the prompt suffix when enable_thinking is
honored, so the model emits reasoning tokens directly with no opening tag —
the emitter never transitioned and reasoning text leaked into `content` while
`reasoning_content` stayed empty. ds4-eval pass rate: 14.1% (think) vs 71.7%
(no-think) for Qwen3.6-27B Q4_K_M.
The plumbing was already there: parse_reasoning() supports
started_in_thinking=true (reasoning.h:17-19) but no caller passed it.
Fix:
1. chat_template.h: render_chat_template / render_chat_template_jinja now
return a PromptRenderResult { text, started_in_thinking }. The built-in
QWEN3 and LAGUNA branches set started_in_thinking deterministically when
enable_thinking && add_generation_prompt; GEMMA4 stays false (its
reasoning channel is opened by the model emitting `<|channel>`, which
http_server forwards into the emitter as `<think>`). The Jinja path
suffix-sniffs the rendered prompt for a trailing `<think>` opener and
emits a [WARN] log when sniffing decides true so a template/model-card
mismatch surfaces at runtime.
2. SseEmitter: add `initial_mode = StreamMode::CONTENT` defaulted parameter.
When constructed with REASONING, active_kind_ initializes to "thinking"
so the Anthropic first content_block is `thinking` instead of `text`
(avoids a spurious empty text-block stop+restart on the first reasoning
delta). Deliberately leaves checked_think_prefix_ at its default (false)
so the existing one-time `<think>` strip guard still trips if a
template/model-card mismatch causes the model to emit a redundant opener.
3. http_server.cpp: thread render_result.started_in_thinking through
ParsedRequest into the SseEmitter's initial_mode. Both streaming and
non-streaming paths feed tokens through the same emitter, so the fix
covers both response shapes.
Tests: add 12 unit tests under test_server_unit (assertion count 1608 →
1637): SseEmitter initial_mode=REASONING routing for OPENAI_CHAT and
ANTHROPIC formats (closed, unclosed, redundant-opener-strip cases) plus
PromptRenderResult.started_in_thinking provenance for QWEN3 / LAGUNA /
GEMMA4 (enable/disable/no-gen-prompt) and the Jinja suffix-sniff
positive/negative cases.
Smoke-tested manually against Qwen3.6-27B Q4_K_M; non-streaming
`/v1/chat/completions` with `thinking:{type:enabled}` now populates
reasoning_content and never leaks `</think>` into content.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
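To make the before/after behavior concrete, here is a deliberately simplified Python model of the fix. It is non-streaming and ignores the Anthropic block bookkeeping; the function names are hypothetical and the real logic lives in the C++ renderer and SseEmitter.

```python
THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


def sniff_started_in_thinking(rendered_prompt: str) -> bool:
    """Jinja-path heuristic: does the prompt end with a think opener?"""
    return rendered_prompt.rstrip().endswith(THINK_OPEN)


def split_channels(generated: str, started_in_thinking: bool):
    """Return (reasoning_content, content) for one complete generation."""
    text = generated
    if started_in_thinking:
        # One-time strip of a redundant opener emitted by the model.
        stripped = text.lstrip()
        if stripped.startswith(THINK_OPEN):
            text = stripped[len(THINK_OPEN):]
        reasoning, closed, content = text.partition(THINK_CLOSE)
        if not closed:  # unclosed: everything so far is reasoning
            return reasoning.strip(), ""
        return reasoning.strip(), content.strip()
    # Started in content mode: reasoning only begins at an explicit opener.
    before, opened, rest = text.partition(THINK_OPEN)
    if not opened:
        return "", text.strip()
    reasoning, _, after = rest.partition(THINK_CLOSE)
    return reasoning.strip(), (before + after).strip()
```

Calling `split_channels("step 1\n</think>\nanswer", False)` reproduces the original bug (empty reasoning, `</think>` leaked into content), while passing `True` routes "step 1" to reasoning and "answer" to content.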
Add three C++ tests that chain render_chat_template + SseEmitter so the wiring between the renderer's started_in_thinking flag and the emitter's initial_mode is exercised end-to-end, not just at each end. The per-unit tests above each verify their half of the contract, but the original bug was a missing call-site wire — both halves were correct in isolation. Also tighten the Python integration test assertions for enable_thinking and reasoning.effort: require non-empty reasoning_content and no raw <think>/</think> in either channel. The prior 'doesn't crash' assertion would have passed on the broken code. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…box-docker) Brings the Qwen3.6/Laguna think-mode reasoning fix (route reasoning into reasoning_content channel instead of content) into the lucebox-docker stack.
…budget
Increment 1 (Tier 1): model-card registry resolvable by normalized model id
(/props.model_card → bundled cards → family fallback), per-model thinking tokens
via the card with a thinking-capability gate, configurable --reasoning-effort
{low,medium,high} (was hardcoded high) and --thinking-budget-tokens N, plus
card_source/card_stem provenance on every row. Cards bundled into the wheel via
hatch force-include from share/model_cards (single source; CI drift guard TODO).
Tier 2: --client-thinking-budget N — client-side thinking termination for
providers that ignore native budget hints. Streams the response, estimates
reasoning tokens (char/4), and when over budget aborts and issues a forced-
</think> re-prompt (a fresh conditioned sample, not decoder continuation) using
the card's terminator + reply reserve, producing a gradable answer. Gated on
reasoning being identifiable in the stream (reasoning_content deltas or <think>
tags); unmarked output is left untouched. client_abort rows are a separate
benchmark mode (never pooled with single-pass), with continuation-failure and
answer-started-before-abort rows excluded from the aggregate and coverage
reported.
Verified live: OpenRouter qwen3.6-27b ignores reasoning_effort/budget_tokens
(reasoning unbounded), but --client-thinking-budget 2000 bounds it precisely
(~2001 reasoning tokens/row, continuation=ok, 8/8 pass on the head subset).
234 tests pass; ruff clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
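The Tier 2 abort decision reduces to a running character count. A minimal sketch under stated assumptions: the helper names are hypothetical, the input is taken to be an iterable of reasoning-text deltas already separated from answer text, and "over budget" is read as the estimate strictly exceeding N. The re-prompt itself (terminator + reply reserve from the model card) is not modeled.

```python
def estimate_tokens(text: str) -> int:
    """char/4 token estimate, as described in the commit message."""
    return len(text) // 4


def consume_until_budget(reasoning_deltas, budget_tokens: int):
    """Accumulate reasoning deltas; stop at the first delta that crosses the budget.

    Returns (reasoning_so_far, aborted).
    """
    parts = []
    chars = 0
    for delta in reasoning_deltas:
        parts.append(delta)
        chars += len(delta)
        if chars // 4 > budget_tokens:
            return "".join(parts), True
    return "".join(parts), False


def force_close(reasoning: str, terminator: str = "</think>") -> str:
    """Truncated reasoning followed by the card's terminator (assumed default)."""
    return reasoning + terminator
```

Because the check fires on the first delta that crosses the limit, the retained reasoning lands slightly above the budget, which is consistent with the ~2001 tokens/row reported for a budget of 2000.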
Add `--multi-turn` mode to scripts/extract-agentic-fixture.py for the coding-agent-loop autotune profile: walk one session in record order, emit a replay case at each target-token bucket (default 8K/16K/32K/64K/100K/128K). Each case ships an OpenAI-shaped `messages` list and a `prefill-and-decode` verifier so the sweep can score "does this max_ctx cell actually serve a trace of n − reply_budget tokens." Snapshot semantics: case `context_tokens_approx <= target_bucket_tokens` is guaranteed (snapshot taken pre-append for the message that would cross). Also fix a latent bug in `_is_claude_session`: it returned False on the first non-user record, which misrouted any Claude session that led with `permission-mode`, `system`, or `queue-operation` (most real sessions do) — including the one this commit was developed against. Tests cover bucket fit, role collapsing, thinking-block drop, PII scrub on HOME paths + token-looking secrets, Codex record decoding, and the leading-meta-record regression. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…erifier
Add three small surfaces to the ``agent_recorded`` area to support the coding-agent-loop autotune sweep:
* ``load_agent_recorded_multi_turn_cases()``: reads the bucketed replay fixture produced by ``extract-agentic-fixture.py --multi-turn`` and returns cases sorted ascending by ``target_bucket_tokens``. Distinct from the v1 single-prompt fixture; the two coexist.
* ``pick_multi_turn_case_for_budget()``: given a prompt-token budget (typically ``max_ctx − reply_budget``), returns the largest case that fits. ``None`` when no case fits.
* ``grade_prefill_and_decode()``: pass/fail verifier for the sweep: non-empty response within wall budget, no server error. Lighter than tool-schema-coverage on purpose; the sweep is asking "did this max_ctx setting serve a trace of this length", not "did the model do the task well."
Ship a harvested fixture: one Claude Code session sliced into 6 bucketed cases (8K through 128K tokens). Per repo guidance, one long session is enough to cycle with until something breaks; the broader corpus can land later if signal demands.
Tests cover the loader contract (cases fit under their bucket, sorted by bucket), the budget picker (largest-fit, None-on-empty), and the verifier's three failure modes (server error, wall-budget overrun, response-too-short) plus the reasoning_content fallback.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
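The largest-fit rule for the budget picker can be sketched as below. This is not the shipped ``pick_multi_turn_case_for_budget()``: the function name is hypothetical, cases are modeled as plain dicts, and comparing on ``context_tokens_approx`` (rather than ``target_bucket_tokens``) is an assumption.

```python
def pick_largest_fitting_case(cases, prompt_budget_tokens):
    """Return the biggest case whose approximate context fits the budget, else None."""
    fitting = [c for c in cases if c["context_tokens_approx"] <= prompt_budget_tokens]
    return max(fitting, key=lambda c: c["context_tokens_approx"], default=None)


if __name__ == "__main__":
    # Hypothetical fixture: three buckets with contexts just under each target.
    cases = [
        {"target_bucket_tokens": 8192, "context_tokens_approx": 7900},
        {"target_bucket_tokens": 16384, "context_tokens_approx": 16100},
        {"target_bucket_tokens": 32768, "context_tokens_approx": 32000},
    ]
    max_ctx, reply_budget = 32768, 2048
    print(pick_largest_fitting_case(cases, max_ctx - reply_budget))
```

In the usage example the 32K case does not fit once the reply budget is subtracted, so the 16K case is selected.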
…loop
Add an autotune Profile abstraction so different workloads can sweep different axes with different scorers. Two profiles ship:
* ``heuristic`` (default, backward-compatible): preset-agnostic bracket, scores by mean ``decode_tokens_per_sec`` from a luce-bench level1 snapshot. Identical to the prior behavior.
* ``coding-agent-loop``: architecture-aware. Gemma4's bracket is ``max_ctx × fa_window × budget × pflash_mode`` (KV-quant axis omitted because the gemma4 backend hardcodes F16, verified at gemma4_loader.cpp). Qwen3.6 / laguna keep cache_type as an axis since their loader actually respects it. Scoring is composite: pass-rate on the agent_recorded multi-turn fixture first, then ``completion_tokens / wall_seconds`` as a tps proxy (the longctx-area snapshots ship empty ``decode_tokens_per_sec``).
Wire ``--fa-window`` through to the server end-to-end:
* ``DflashRuntime.fa_window`` (0 = full attention, server default)
* ``DFLASH_FA_WINDOW`` emitted by docker_run.py when nonzero
* entrypoint.sh appends ``--fa-window N`` to the server CLI iff ``DFLASH_FA_WINDOW > 0``; unset env still reproduces stock behavior
* ``dflash.fa_window`` round-trips through config.toml
CLI: ``lucebox autotune --sweep --profile coding-agent-loop``. New ``--list-profiles`` flag prints the registered profile table.
Tests: 318/318 green. New coverage:
* Profile registry + ``get_profile`` error path
* gemma bracket excludes the KV-quant axis (regression for the no-op axis bug)
* gemma bracket varies max_ctx × fa_window × budget
* qwen bracket includes tq3_0 + q8_0
* sub-22 GB tiers fall back to base-only (OOM safety)
* ``_pick_winner`` ranks agent-replay results by pass→speed→ctx
* ``fa_window`` is in the sweep allowlist
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When the sweep is invoked directly (e.g. `uv run python -m lucebox autotune --sweep` for development, or any path that bypasses the lucebox.sh wrapper), the LUCEBOX_HOST_* env vars aren't set and ``host_facts.from_env()`` returns a zero-VRAM HostFacts. Every profile bracket then falls through to the <22 GB "base only" branch and the sweep silently degrades to a 1-cell smoke test that overwrites the operator's real config (e.g. dropping max_ctx from 131072 to the DflashRuntime default 16384). Fall back to ``cfg.host`` (populated by an earlier `lucebox check` via the wrapper) when ``from_env()`` yields no signal. Test regresses the original symptom: with LUCEBOX_HOST_* unset, the coding-agent-loop bracket on a 24 GB persisted host must produce a multi-cell sweep, not collapse to one base cell. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ow one-shot forge batches
sse_emitter.cpp: extend find_tool_start() to detect Gemma4's call:<verb>{ format.
Previously find_tool_start only matched <tool_call>, <function=, <tool_code> XML
patterns, so the emitter never entered TOOL_BUFFER mode for Gemma4's plain-text
tool call emissions (call:verb{args}). Now Pattern B scans for call: preceded by
a valid sentinel char and followed by at least one alpha (the verb start), causing
the emitter to buffer from that point and parse_tool_calls() to run at emit_finish.
Result: server now returns stop_reason=tool_use + tool_use content blocks for Gemma4.
step_enforcer.py: allow one-shot batch tool calls where all pending required steps
appear before the terminal tool in the batch. Gemma4 emits calls in a single
response (e.g. [fetch_data, analyze, report]). The runner executes in order so
required steps are satisfied before the terminal executes — the batch is not
premature. This is a local modification to the vendored forge-guardrails 0.7.1.
Effect: forge basic_2step passes (was 0/5, now 1/5 = 20%).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
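A small Python model of the extended detection may help when reasoning about false positives. The commit message does not say which characters count as a "valid sentinel", so this sketch assumes start-of-text or whitespace; the regex and the function body are illustrative, not a port of sse_emitter.cpp.

```python
import re

# Pattern A: the XML-style openers the emitter already recognised.
_XML_STARTS = ("<tool_call>", "<function=", "<tool_code>")

# Pattern B: "call:" not preceded by a non-whitespace character (assumed sentinel
# rule), followed by at least one alphabetic character (the start of the verb).
_CALL_RE = re.compile(r"(?<![^\s])call:[A-Za-z]")


def find_tool_start(text: str) -> int:
    """Return the index where tool-call buffering should begin, or -1."""
    hits = [text.find(marker) for marker in _XML_STARTS]
    match = _CALL_RE.search(text)
    if match:
        hits.append(match.start())
    hits = [h for h in hits if h >= 0]
    return min(hits) if hits else -1
```

The sentinel rule is what keeps ordinary prose such as "recall:" from being buffered as a tool call.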
…dict
Full 26-case agent_recorded nothink benchmark on image 658d016f-cuda12:
- Gemma4: 19.2% (5/26) vs Qwen3.6: 46.2% (12/26); Qwen3.6 wins by 27pp
- Nothink suppression ineffective for Gemma4 (<|channel>thought bypasses prompt)
- 12/26 cases had non-empty reasoning despite --no-think
- 2 cases returned given=refused (model declined to engage)
- Verdict: Qwen3.6-27B is the preferred model for coding/agent tasks on bragi
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…locks
normalize_chat_messages() only extracted text/input_text/output_text from
content arrays, silently dropping tool_use and tool_result blocks. This
caused multi-turn tool-call conversations (Anthropic Messages API format)
to lose all tool call history: the model never saw tool results and looped
infinitely calling the same tool. Manifested as Qwen3.6 forge=0%.
Two cases fixed:
1. Assistant message with tool_use content blocks: look up tool_memory by
ID (same as the OpenAI tool_calls path). Fallback for cross-session
replay: synthesize <tool_call><function=...></tool_call> XML.
2. User message with tool_result content blocks: push each result as a
{"tool", content, tool_use_id} message so the chat template renders
<tool_response> blocks. Skip pushing empty user containers.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
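The two cases can be illustrated with a compact Python sketch. It is not the server's normalize_chat_messages(): the tool_memory lookup is omitted, so assistant tool_use blocks always take the XML-synthesis fallback, and the `<parameter=...>` layout inside `<function=...>` is an assumption about the template's expected format. Function names are hypothetical.

```python
import json

TEXT_KINDS = ("text", "input_text", "output_text")


def _tool_call_xml(name, args):
    # Assumed parameter layout; the commit only shows <tool_call><function=...>.
    params = "".join(
        f"<parameter={k}>\n{v if isinstance(v, str) else json.dumps(v)}\n</parameter>\n"
        for k, v in args.items()
    )
    return f"<tool_call>\n<function={name}>\n{params}</function>\n</tool_call>"


def _result_text(content):
    # Anthropic tool_result content may be a string or a list of text blocks.
    if isinstance(content, list):
        return "".join(b.get("text", "") for b in content if b.get("type") == "text")
    return content or ""


def normalize_blocks(messages):
    out = []
    for msg in messages:
        role, content = msg["role"], msg.get("content")
        if not isinstance(content, list):
            out.append({"role": role, "content": content or ""})
            continue
        texts, results = [], []
        for block in content:
            kind = block.get("type")
            if kind in TEXT_KINDS:
                texts.append(block.get("text", ""))
            elif kind == "tool_use" and role == "assistant":
                texts.append(_tool_call_xml(block["name"], block.get("input", {})))
            elif kind == "tool_result" and role == "user":
                results.append({"role": "tool",
                                "content": _result_text(block.get("content")),
                                "tool_use_id": block.get("tool_use_id")})
        out.extend(results)
        if texts:  # skip empty user containers
            out.append({"role": role, "content": "\n".join(texts)})
    return out
```

A user message that carries only tool_result blocks therefore produces tool messages and no empty user turn, which is the property that stops the model from re-issuing the same call.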
Documents the normalize_chat_messages() bug where tool_use and tool_result Anthropic content blocks were silently dropped. Adds root-cause analysis, fix description, and benchmark results showing Qwen3.6 forge 0%→100% (5/5). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add Gemma4 forge results on image dc20057e: unchanged at 20% (1/5). Documents why fix is neutral for Gemma4 (one-shot batch doesn't round-trip tool_results) but critical for Qwen3.6 (turn-by-turn needs proper context). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comprehensive summary of all autotune results, model comparison, server bug fixes, and configuration recommendations for bragi (RTX 5090 Laptop, 23 GB VRAM):
- Qwen3.6-27B at budget=16, max_ctx=98304, tq3_0 KV is the optimal preset
- Qwen3.6 forge 100% (5/5) vs Gemma4 20% post-fix
- Documents three server fixes in dc20057e-cuda12
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…mary
Full pass-rate sweep on dc20057e-cuda12 (nothink):
- forge 100%, agent 100%, longctx 100%, ds4-eval 77.2%
- code 90%, truthfulqa-mc1 80%, agent_recorded 42.3%
- hellaswag 88%, gsm8k 86%
Update final tuning summary with verified numbers and corrected agent/longctx entries (agent 100% up from 75%, longctx 100% newly verified for Qwen3.6).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Empirical test on bragi: prefix_cache_slots=32 causes -19pp regression on agent_recorded (23.1% vs 42.3% baseline). 5 cases regress, 0 unlock. Update autotune.py comment with measured numbers and doc reference. Smoke test passes 100% — the bug is specific to multi-turn tool convos. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All major tunables swept and validated: budget=16, max_ctx=98304, tq3_0 KV, fa_window=0, prefix_cache_slots=0 (regression confirmed), pflash off. Includes full nothink/think benchmark table and known limitations for prefix cache, pflash, and Gemma4 issues. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…unch Add speculator_dir field to ModelPreset for directory-based safetensors speculators (distinct from GGUF draft_file). When present on disk, the server launch sets DFLASH_DRAFT to that directory so the entrypoint's glob search finds model.safetensors inside it. For laguna-xs.2: speculator_dir="laguna-xs2-speculator" points to ~/.local/share/lucebox/models/draft/laguna-xs2-speculator/ where the 1.2 GB poolside/Laguna-XS.2-speculator.dflash safetensors live. Also adds pytest to the workspace dev deps so `make test` runs clean. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…aguna characterization Update bragi-tuning-complete to use verified final baseline numbers from bragi-rtx5090laptop-qwen36-27b-dc20057e-nothink-2026-05-31 (9 areas, 100% output). Key changes: forge 100% (30/30 not 5/5), hellaswag 93% (clean run not 88% restart-contaminated), agent 75% (stochastic), gsm8k 81%. Add Laguna-XS.2 characterization: 20.3 GB model, 1.2 GB safetensors speculator (+60% decode), 8 GQA KV heads, ~960 MB KV at 32K tq3_0, ~56K safe max context on 23 GB VRAM. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Architecture, VRAM budget, context window feasibility table, performance vs Qwen3.6-27B comparison. Benchmark results TBD pending running sweep. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Laguna-XS.2 bragi baseline complete. forge=0% (model can't emit tool_use), code=20% (FIM format mismatch), gsm8k=93% (+12pp vs Qwen3.6), agent_recorded=50%. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- 48K context causes 10-70x prefill slowdown vs 32K (different kernel path)
- frontier-16k times out at 300s; optimal max_ctx is 32768
- budget=4/16 crash server when using safetensors speculator (null JSON field bug)
- budget=8 is the only safe value; sweep skipped
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Gemma4-31B: 60-layer dense 30.7B model, 20GB Q4_K_M, 1.6GB DFlash draft. Server confirmed running at 32K/tq3_0/budget=8 on bragi (24GB VRAM). Benchmark in progress. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Models ≥20 GB (gemma-4-31b at 21 GB, qwen3.6-moe at 22 GB) leave only ~2-3 GB for KV on 24 GB VRAM; the previous heuristic suggested max_ctx=98304, which would OOM. Now caps at 32K when approx_total_gb ≥ 20.
- runtime_from_host(host, preset="") accepts optional preset name
- _preset_approx_gb() looks up PRESETS.approx_total_gb for size awareness
- CLI passes cfg.model.preset to autotune
- _coding_agent_loop_candidates seeds from preset-aware base
- Tests: add large-model and unknown-preset coverage
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
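The size-aware cap described in this commit can be sketched as follows. This is a minimal illustration, not the repository's code: the table `PRESET_APPROX_GB`, the helper names, and the constants are hypothetical stand-ins, and only the 20 GB threshold, the 32K cap, and the two model sizes come from the commit message.

```python
from typing import Optional

# Hypothetical preset table; the sizes are the approximate figures quoted above.
PRESET_APPROX_GB = {"gemma-4-31b": 21.0, "qwen3.6-moe": 22.0}

LARGE_MODEL_GB = 20.0        # threshold stated in the commit message
LARGE_MODEL_MAX_CTX = 32768  # "caps at 32K"


def preset_approx_gb(preset: str) -> Optional[float]:
    """Return the approximate model size in GB, or None for an unknown preset."""
    return PRESET_APPROX_GB.get(preset)


def cap_max_ctx(suggested_max_ctx: int, preset: str = "") -> int:
    """Clamp the suggested context only when the preset is known to be large."""
    approx = preset_approx_gb(preset)
    if approx is not None and approx >= LARGE_MODEL_GB:
        return min(suggested_max_ctx, LARGE_MODEL_MAX_CTX)
    return suggested_max_ctx


print(cap_max_ctx(98304, "gemma-4-31b"))
print(cap_max_ctx(98304, ""))
```

In this sketch an empty or unknown preset leaves the suggestion untouched, which is one plausible reading of the "unknown-preset coverage" mentioned in the test bullet.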
…026-05-31
Full nothink 32K sweep results on RTX 5090 Laptop MaxQ:
- gsm8k 95% (+14pp vs Qwen3.6), agent_recorded 38.5% (=Qwen3.6)
- code 70% (-20pp), hellaswag 79% (-14pp), truthfulqa 79% (-3pp)
- longctx 33% (-67pp): Gemma4 template expansion causes HTTP 400 at frontier-8k+
Key operational lessons documented:
- DFlash server hang bug: forge Anthropic-format + kill → infinite GPU loop
- Use --max-tokens 512 for agent_recorded (4096 too slow at 22 tok/s effective)
- Effective context limit ~4K real tokens at max_ctx=32768 for this model
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Reconfirmation sweep after Gemma4-31B session left config in non-optimal state. Winner: budget=22, max_ctx=98304, tq3_0 (applied).
New findings:
- budget=32+65K+q8_0 causes GPU compute hang (SM=100%, mem=0-1%), not a silent OOM crash as previously attributed; this is the same DFlash hang bug as Gemma4-31B, now reproduced with Qwen3.6-27B
- budget=32 at 98K context is 35% slower in decode than budget=22 (30.3s vs 22.4s) due to verification overhead with 84K KV cache
- budget=16 and budget=22 are functionally equivalent at 98K (within noise); budget=32 is clearly suboptimal
- Winner is budget=22 vs budget=16 on 05-30; difference is within measurement noise (0.912 vs 0.905 tok/s speed_metric)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comprehensive record of all tuning decisions for bragi (RTX 5090 Laptop, 23 GB VRAM, WSL2) covering sessions 2026-05-30 through 2026-06-01. Documents optimal Qwen3.6-27B config (budget=22, 98K, tq3_0), safe/unsafe parameter combinations, known issues (DFlash hang, prefix cache regression), model matrix, and sweep history. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add comment in _coding_agent_loop_qwen_bracket explaining that budget=32+q8_0 at 65K context is kept in the sweep bracket despite being known to cause a GPU compute hang (SM=100%, mem=0%) on 23 GB cards (observed 2026-06-01). The sweep handles it correctly via 300s timeout + systemd restart which clears the GPU state. Reference: docs/experiments/qwen3.6-27b-coding-agent-loop-sweep-bragi-2026-06-01.md Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…regression) Ran agent_recorded benchmark (26 cases) against Qwen3.6-27B at the winning sweep config (budget=22, max_ctx=98304, tq3_0). Result: 9/26 (34.6%) vs dc20057e baseline 10/26 (38.5%). 7 cases flipped in both directions; 1-case net delta is within noise at n=26 (σ≈9.5pp). No quality regression from the new config. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
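The "within noise" claim can be checked with a quick calculation. The commit does not say how σ≈9.5pp was derived; the sketch below assumes it is the binomial standard error of a pass rate at n=26, which reproduces the quoted figure.

```python
import math


def binomial_stderr(passes: int, n: int) -> float:
    """Standard error of a pass rate, assuming independent Bernoulli cases."""
    p = passes / n
    return math.sqrt(p * (1.0 - p) / n)


baseline_sigma = binomial_stderr(10, 26)  # dc20057e baseline, 10/26
delta = 10 / 26 - 9 / 26                  # one-case net difference

print(f"sigma = {baseline_sigma * 100:.1f}pp, delta = {delta * 100:.1f}pp")
```

Under that assumption a one-case swing is well under one standard error, so a run of this size cannot distinguish the two configurations.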
All tunables (budget, max_ctx, KV quant, prefix cache, pflash, fa_window) have been swept. Documents final status and future-work blockers (prefix cache snapshot path bug, Gemma4-31B think mode not wired). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…s active
Previously n_gen_cap = min(think_ceiling + reply_budget, max_tokens) caused immediate force-close (step=0) for any request where max_tokens < reply_budget (e.g. gsm8k at 2048, agent_recorded at 4096, code at 2048). Benchmarks sized their max_tokens for nothink responses, so thinking was silently disabled.
Fix: n_gen = think_ceiling + min(max_tokens, hard_limit_reply_budget), treating max_tokens as the post-thinking response budget rather than the total token cap. Also clamp hard_limit_remaining to min(max_output, eff_reply_budget) so the force-close boundary correctly reflects the available response window.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
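To make the before/after arithmetic concrete, here is a small sketch of the two formulas from this commit. The helper `thinking_window` is a hypothetical simplification: it assumes the force-close boundary is the total cap minus the tokens reserved for the reply, which the commit message implies but does not spell out. The numeric values are illustrative.

```python
def old_n_gen_cap(think_ceiling: int, reply_budget: int, max_tokens: int) -> int:
    """Pre-fix: max_tokens caps thinking and reply together."""
    return min(think_ceiling + reply_budget, max_tokens)


def new_n_gen(think_ceiling: int, hard_limit_reply_budget: int, max_tokens: int) -> int:
    """Post-fix: max_tokens only bounds the reply that follows thinking."""
    return think_ceiling + min(max_tokens, hard_limit_reply_budget)


def thinking_window(total_cap: int, reserved_reply: int) -> int:
    """Assumed model: tokens left for thinking before the force-close boundary."""
    return max(0, total_cap - reserved_reply)


think_ceiling, reply_budget, max_tokens = 8192, 4096, 2048  # illustrative values

old_window = thinking_window(old_n_gen_cap(think_ceiling, reply_budget, max_tokens),
                             reply_budget)
new_window = thinking_window(new_n_gen(think_ceiling, reply_budget, max_tokens),
                             min(max_tokens, reply_budget))

print(old_window, new_window)
```

With max_tokens below the reply budget, the old cap leaves no room for thinking at all, which matches the step=0 force-close described above; the new formula keeps the full thinking ceiling.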
PR Luce-Org#326 wired soft-close (Level 2 voluntary close) into the Qwen3.5 AR loop, but on qwen3.6-27b the comparator never fired across 1085 steps of a sample trajectory (prob_ratio < 1e-8 every step).
Root cause: the field `BudgetHook::close_token_ids` was used for BOTH
(a) the peek probe id read by `soft_close::should_fire(..., close0)`
(b) the inject sequence written when the hook fires.
For the qwen3.6-27b model card the `thinking_terminator_hint` is the ~16-token English directive "Considering the limited time by the user, I have to give the solution based on the thinking directly now.\n</think>\n\n", so close_token_ids[0] tokenizes to the id for "Considering" (~79939), a mid-sentence content token whose logit sits 19-35 nats below the chosen token at every step. The peek therefore reported a perpetually near-zero prob_ratio and the soft-close dial (min_ratio 0.1..0.9) was empirically inert.
Fix (path α): split probe-vs-inject in `BudgetHook`
- close_token_ids: unchanged role. Full inject sequence written on hard close or when soft-close fires. Multi-token directive for trained-hint sidecars (Qwen3.6); single marker token for bare-marker arches.
- soft_close_probe_ids: NEW. Short sequence (typically one token) used only for the comparator peek. When the operator card has a distinct marker substring inside the hint, server_main tokenizes just that marker and ships it via this field. When empty, `BudgetHook::soft_close_probe_token()` falls back to close_token_ids.front() (legacy behavior; zero churn for sidecars without a separate marker).
server_main detects the marker substring inside the hint and tokenizes it in isolation; on miss it warns and leaves the probe field empty (legacy peek path stays in force).
The AR-loop soft-close lambda in qwen35_backend.cpp now peeks `budget_hook.soft_close_probe_token()` and writes `close_token_ids.front()` on fire; the inject sequence is unchanged downstream. `[soft-trace]` lines now report the probe token id under `close0=...` so trajectory CSVs remain interpretable.
Hard-close path is untouched: it continues to use close_token_ids verbatim, matching the contract that the operator-resolved directive is what's emitted at the budget boundary.
Tests
-----
+ test_soft_close_probe_uses_probe_ids_not_inject_ids: verifies the peek reads probe[0] when set, NOT inject[0]. Builds a logit row where inject[0]'s logit is far below chosen but probe[0]'s logit is close to chosen; asserts soft fires and the WRITTEN token is inject[0] (not probe[0]).
+ test_soft_close_probe_ids_empty_falls_back_to_close_token_ids: guarantees pre-split behavior when the probe field is left empty (no churn for legacy sidecars / unit-test BudgetHook construction).
+ test_soft_close_inject_sequence_unchanged_when_fires: multi-token inject case: on fire we stream inject[0], inject[1], inject[2] verbatim regardless of what's in soft_close_probe_ids.
Also fix a pre-existing OOB in test_soft_close_determinism_when_disabled (vocab=1000 row indexed at 248069). The UB was silently passing in Release builds before, but the adjacent test additions perturbed glibc heap layout enough to crash; widen the row to vocab=250000.
15 soft-close tests pass (12 existing + 3 new). 1985 total assertions; the two remaining failures are pre-existing `test_emitter_content_mode_*`, unrelated to soft-close (PR Luce-Org#329 emitter work).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
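The probe-versus-inject split can be illustrated with a small Python model. This is a sketch under assumptions, not a port of the C++ code: the field and accessor names mirror the commit message, but `should_fire`, `soft_close_step`, the treatment of `min_ratio <= 0` as "disabled", and the toy token ids are all hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Sequence


@dataclass
class BudgetHook:
    close_token_ids: List[int] = field(default_factory=list)       # full inject sequence
    soft_close_probe_ids: List[int] = field(default_factory=list)  # peek-only ids

    def soft_close_probe_token(self) -> Optional[int]:
        """Probe id for the comparator; falls back to the first inject id."""
        if self.soft_close_probe_ids:
            return self.soft_close_probe_ids[0]
        return self.close_token_ids[0] if self.close_token_ids else None


def should_fire(logits: Sequence[float], chosen: int, probe: int, min_ratio: float) -> bool:
    """p(probe) / p(chosen) = exp(logit difference); the softmax normaliser cancels."""
    if min_ratio <= 0.0:  # assumption: a zero ratio means soft-close is disabled
        return False
    return math.exp(logits[probe] - logits[chosen]) >= min_ratio


def soft_close_step(hook: BudgetHook, logits: Sequence[float],
                    chosen: int, min_ratio: float) -> List[int]:
    """Return the tokens to emit: the inject sequence on fire, else the chosen token."""
    probe = hook.soft_close_probe_token()
    if probe is not None and should_fire(logits, chosen, probe, min_ratio):
        return list(hook.close_token_ids)
    return [chosen]


# Toy logit row: token 1 is chosen, token 2 starts the directive, token 5 is the marker.
logits = [0.0] * 10
logits[1], logits[2], logits[5] = 10.0, -15.0, 9.5

split = BudgetHook(close_token_ids=[2, 3, 4], soft_close_probe_ids=[5])
legacy = BudgetHook(close_token_ids=[2, 3, 4])

print(soft_close_step(split, logits, chosen=1, min_ratio=0.5))
print(soft_close_step(legacy, logits, chosen=1, min_ratio=0.5))
```

The split hook peeks the marker but still emits the full directive, while the legacy hook peeks the directive's first token and never gets close to the threshold, which is the failure mode this commit describes.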
9 issues found across 253 files
<file name="docs/experiments/qwen3.6-27b-prefix-cache-regression-bragi-2026-05-31.md">
<violation number="1" location="docs/experiments/qwen3.6-27b-prefix-cache-regression-bragi-2026-05-31.md:49">
P2: Pass/fail table contains contradictory entries for cases 9 and 17 — ranges 7–11 and 14–18 show them as FAIL/FAIL but individual rows and the summary show them as regressions (PASS→FAIL). Readers cannot tell which data is authoritative.</violation>
</file>
<file name="server/src/qwen35/c2_gate.h">
<violation number="1" location="server/src/qwen35/c2_gate.h:28">
P2: The C2 gate uses overflow-prone `int` multiplication (`2 * fa_window_cfg`), which can misroute decode mode for large configured `fa_window` values.</violation>
</file>
<file name="harness/src/harness/clients/pi.py">
<violation number="1" location="harness/src/harness/clients/pi.py:80">
P3: `--tools` is ignored in interactive mode, causing inconsistent behavior and a misleading CLI contract.</violation>
</file>
<file name="harness/clients/README.md">
<violation number="1" location="harness/clients/README.md:118">
P2: Inaccurate documentation: the default sweep runs 5 level1 areas (smoke/code/gsm8k/agent/longctx), not "all 4 stdlib areas". Also "HumanEval" is not a luce-bench area name — it's a dataset inside the `code` area. This will mislead users about which areas to set and what the default covers.</violation>
</file>
<file name="Makefile">
<violation number="1" location="Makefile:73">
P2: `MODELS_DIR` is unquoted in the Docker bind mount, so paths with spaces/special characters break `serve` and can mount the wrong source path.</violation>
<violation number="2" location="Makefile:112">
P1: `clean-models` uses an unquoted, unguarded `rm -rf $(MODELS_DIR)/*`, which can delete unintended files when the path is malformed or overridden unsafely.</violation>
</file>
<file name="harness/src/harness/clients/codex.py">
<violation number="1" location="harness/src/harness/clients/codex.py:64">
P2: `launch()` does not create a user-supplied `work_dir`, so writing `config.toml` can fail with `FileNotFoundError`.</violation>
</file>
<file name="server/src/draft/draft_gguf_loader.cpp">
<violation number="1" location="server/src/draft/draft_gguf_loader.cpp:363">
P1: New strict metadata-vs-shape assertions can reject valid mismatched GGUF drafts (notably Gemma4) before downstream shape-based correction runs.</violation>
</file>
<file name=".github/workflows/ci.yml">
<violation number="1" location=".github/workflows/ci.yml:23">
P2: Lint and typecheck steps using `uv run --frozen --extra dev` will trigger a full re-sync that installs the cu128 torch wheel (~2 GB), defeating the `--no-install-package torch` optimization in `check_uv_workspace.sh` that was explicitly designed to keep this job fast.</violation>
</file>
Note: This PR contains a large number of files. cubic only reviews up to 100 files per PR, so some files may not have been reviewed. cubic prioritizes the most important files to review.
Makefile, line 112
P1: `clean-models` uses an unquoted, unguarded `rm -rf $(MODELS_DIR)/*`, which can delete unintended files when the path is malformed or overridden unsafely.
<file context>
@@ -0,0 +1,112 @@
+.PHONY: clean-models
+clean-models: ## Remove downloaded models from $(MODELS_DIR). Destructive.
+ @echo "WARN: about to rm -rf $(MODELS_DIR)/*"
+ @read -p "Continue? [y/N] " ans && [ "$$ans" = "y" ] && rm -rf $(MODELS_DIR)/*
</file context>
server/src/draft/draft_gguf_loader.cpp, line 363
P1: New strict metadata-vs-shape assertions can reject valid mismatched GGUF drafts (notably Gemma4) before downstream shape-based correction runs.
<file context>
@@ -349,6 +349,63 @@ bool load_draft_gguf(const std::string & path,
+ const int64_t derived_kv_dim = L0.wk->ne[1];
+ const int64_t expected_q_dim = (int64_t)out.n_head * out.head_dim;
+ const int64_t expected_kv_dim = (int64_t)out.n_head_kv * out.head_dim;
+ if (derived_q_dim != expected_q_dim) {
+ char buf[256];
+ std::snprintf(buf, sizeof(buf),
</file context>
docs/experiments/qwen3.6-27b-prefix-cache-regression-bragi-2026-05-31.md, line 49
P2: Pass/fail table contains contradictory entries for cases 9 and 17: ranges 7–11 and 14–18 show them as FAIL/FAIL but individual rows and the summary show them as regressions (PASS→FAIL). Readers cannot tell which data is authoritative.
<file context>
@@ -0,0 +1,84 @@
+| 4 | FAIL | FAIL |
+| 5 | PASS | PASS |
+| 6 | FAIL | FAIL |
+| 7–11 | FAIL | FAIL |
+| 12 | PASS | PASS |
+| 13 | PASS | FAIL (regression) |
</file context>
server/src/qwen35/c2_gate.h, line 28
P2: The C2 gate uses overflow-prone `int` multiplication (`2 * fa_window_cfg`), which can misroute decode mode for large configured `fa_window` values.
<file context>
@@ -0,0 +1,31 @@
+ int kv_committed) {
+ (void)kv_committed;
+ return (fa_window_override == 0)
+ || (fa_window_override <= 2 * fa_window_cfg);
+}
+
</file context>
Suggested change:
-        || (fa_window_override <= 2 * fa_window_cfg);
+        || (static_cast<long long>(fa_window_override) <=
+            2LL * static_cast<long long>(fa_window_cfg));
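The overflow concern can be illustrated with a small simulation. Python integers do not overflow, so the sketch below uses `ctypes.c_int32` to mimic a wrapped 32-bit product; in C++ signed overflow is undefined behaviour, so wraparound is only one possible outcome. The function names and the chosen values are hypothetical.

```python
import ctypes


def gate_int32(fa_window_override: int, fa_window_cfg: int) -> bool:
    """Gate with the product truncated to 32 bits, simulating wraparound."""
    doubled = ctypes.c_int32(2 * fa_window_cfg).value
    return fa_window_override == 0 or fa_window_override <= doubled


def gate_wide(fa_window_override: int, fa_window_cfg: int) -> bool:
    """Gate with the product computed at full width, as in the suggested change."""
    return fa_window_override == 0 or fa_window_override <= 2 * fa_window_cfg


big_cfg = 1_500_000_000  # 2 * big_cfg exceeds INT32_MAX

print(gate_int32(4096, big_cfg), gate_wide(4096, big_cfg))
print(gate_int32(4096, 4096), gate_wide(4096, 4096))
```

For ordinary window sizes both versions agree; they diverge only once the doubled value no longer fits in 32 bits, which is why the review rates this P2 rather than P1.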
harness/clients/README.md, line 118
P2: Inaccurate documentation: the default sweep runs 5 level1 areas (smoke/code/gsm8k/agent/longctx), not "all 4 stdlib areas". Also "HumanEval" is not a luce-bench area name; it's a dataset inside the `code` area. This will mislead users about which areas to set and what the default covers.
<file context>
@@ -102,6 +103,29 @@ OpenAI Chat Completions clients can call llama.cpp directly. Claude Code and
+it would break a real-client launcher above.
+
+```bash
+# Full sweep (default — runs all 4 stdlib areas)
+harness/clients/run_lucebench.sh
+
</file context>
Makefile, line 73
P2: `MODELS_DIR` is unquoted in the Docker bind mount, so paths with spaces/special characters break `serve` and can mount the wrong source path.
<file context>
@@ -0,0 +1,112 @@
+.PHONY: serve
+serve: ## Run the local image, foreground. Models bind-mounted from $(MODELS_DIR).
+ docker run --rm --gpus all -p 8080:8080 \
+ -v $(MODELS_DIR):/opt/lucebox-hub/server/models:ro \
+ --name lucebox-gemma \
+ $(IMAGE) serve
</file context>
Suggested change:
-		-v $(MODELS_DIR):/opt/lucebox-hub/server/models:ro \
+		-v "$(MODELS_DIR)":/opt/lucebox-hub/server/models:ro \
harness/src/harness/clients/codex.py, line 64
P2: `launch()` does not create a user-supplied `work_dir`, so writing `config.toml` can fail with `FileNotFoundError`.
<file context>
@@ -0,0 +1,129 @@
+ """
+ codex_bin = find_bin("codex", env_var="CODEX_BIN",
+ work_dir_hint="clients/codex/npm/bin/codex")
+ home = work_dir or mktempdir("codex")
+ write_config(home, base_url=base_url, model=model,
+ sandbox=sandbox, wire_api=wire_api)
</file context>
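One way to address this finding is to create the directory before anything is written into it. The sketch below is a hypothetical helper, not the harness code: `ensure_home` is an invented name, and `tempfile.mkdtemp` stands in for the project's own `mktempdir`, whose behaviour is not shown in the review.

```python
import os
import tempfile
from typing import Optional


def ensure_home(work_dir: Optional[str] = None) -> str:
    """Return a directory that is guaranteed to exist.

    A user-supplied path is created (with parents) if missing; otherwise a
    fresh temporary directory is used.
    """
    if work_dir:
        os.makedirs(work_dir, exist_ok=True)
        return work_dir
    return tempfile.mkdtemp(prefix="codex-")


# Usage: a nested path that does not exist yet is created on demand.
with tempfile.TemporaryDirectory() as base:
    target = os.path.join(base, "runs", "codex-home")
    home = ensure_home(target)
    created = os.path.isdir(home)

print(created)
```

`exist_ok=True` keeps the call idempotent, so re-launching against the same `work_dir` would not raise.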
.github/workflows/ci.yml, line 23
P2: Lint and typecheck steps using `uv run --frozen --extra dev` will trigger a full re-sync that installs the cu128 torch wheel (~2 GB), defeating the `--no-install-package torch` optimization in `check_uv_workspace.sh` that was explicitly designed to keep this job fast.
<file context>
@@ -10,20 +10,46 @@ jobs:
run: bash scripts/check_uv_workspace.sh
+ - name: Lint Python surfaces touched by lucebox tooling
+ run: uv run --frozen --extra dev ruff check .
+
+ - name: Typecheck lucebox CLI
</file context>
harness/src/harness/clients/pi.py, line 80
P3: `--tools` is ignored in interactive mode, causing inconsistent behavior and a misleading CLI contract.
<file context>
@@ -0,0 +1,130 @@
+ "PI_OFFLINE": "1",
+ }
+ argv: list[str] = [bin_path]
+ if interactive:
+ if extra_args:
+ argv += extra_args
</file context>
Closing in favor of consolidating the probe/inject split fix directly onto PR #326's branch. PR #331 had a fundamentally wrong base — it was opened against The probe/inject split (commit 🤖 Generated with Claude Code |
Summary
Soft-close (PR #326) shipped with an empirically inert configuration on qwen3.6-27b. Root cause: `BudgetHook::close_token_ids` was used for both the soft-close peek probe AND the inject sequence. For qwen3.6-27b, the configured `thinking_terminator_hint` is a 16+ token English directive starting with "Considering the limited time by the user, I have to give the solution based on the thinking directly now.\n</think>\n\n", so the peek was checking the logit of token 79939 ("Considering"), a mid-sentence content token the model rarely promotes. Trajectory data showed `prob_ratio < 1e-8` across 12,888 reasoning steps; the dial was dead at any sampled ratio in {0.1, 0.3, 0.5, 0.7, 0.9}.
Fix
Split the probe ids (short marker sequence) from the inject ids (full directive).
- New `BudgetHook::soft_close_probe_ids` field; `soft_close_probe_token()` accessor with empty-falls-back-to-close_token_ids legacy behavior so models without the split see zero diff.
- `server_main.cpp` now tokenizes the marker substring (`</think>`) separately when the hint contains it; logs both probe and inject id vectors at startup.
- `qwen35_backend.cpp::maybe_soft_close` peeks `probe_ids.front()` instead of `close_token_ids.front()`; the `[soft-trace]` `close0=` field reports the probe id so trajectory CSVs stay interpretable.
- Hard-close path (`maybe_force_close`) untouched: still injects the full directive.
Empirical validation
Re-ran `/tmp/probe_soft_close_trajectory.sh` against an image built from this branch (`lucebox-hub:175c8a72-cuda12`) on sindri (qwen3.6-27b, RTX 3090 Ti).
Phase 2 trajectory (ratio=0, debug logits on): `</think>` (id 248069) reliably becomes argmax-competitive (diff >= log(0.1) = -2.30) at 66-94% of natural reasoning length across 5 diverse prompts. `max_diff` reaches 0.000 (prob_ratio = 1.0) on every prompt, vs prior baseline `max_diff = -9.69` on token 79939. 9.7 nat improvement, restoring the mechanism to its designed regime.
Phase 1 live firing: soft-close fires reliably at ratios 0.1-0.9 with `stop_reason=end_turn` and coherent text outputs across all configurations. Single-sample thinking-token savings are noisy (sampling non-determinism is ±30%); multi-seed sweeps are deferred to a follow-up.
Tests
- `test_server_unit.cpp`: probe-uses-probe-ids-not-inject-ids, probe-ids-empty-falls-back-to-close-token-ids, inject-sequence-unchanged-when-fires.
- `test_soft_close_determinism_when_disabled` (vocab 1000 → 250000): UB-silent until new tests perturbed heap layout.
- `feat/lucebox-docker` tip).
Files changed (+259/-28)
- `server/src/common/model_backend.h`: new `soft_close_probe_ids` + accessor.
- `server/src/qwen35/qwen35_backend.cpp`: peek probe, inject full sequence.
- `server/src/server/http_server.h`: `ServerConfig::think_close_probe_token_ids`.
- `server/src/server/http_server.cpp`: wire probe ids into per-request `BudgetHook`.
- `server/src/server/server_main.cpp`: split-tokenize marker substring; startup logging.
- `server/test/test_server_unit.cpp`: 3 new tests + OOB fix.
Test plan
- `close0` in `[soft-trace]` now reports 248069 (`</think>`), not 79939.
- `</think>` reaches argmax across 5 prompts.
🤖 Generated with Claude Code
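The threshold arithmetic quoted under "Empirical validation" can be reproduced in a few lines. This is only a sketch of the relationship between a `min_ratio` setting and the logit difference it implies, assuming the comparator tests `exp(diff) >= min_ratio`; the helper names are hypothetical.

```python
import math


def log_threshold(min_ratio: float) -> float:
    """Logit difference (probe minus chosen) at which the ratio test starts passing."""
    return math.log(min_ratio)


def prob_ratio(diff: float) -> float:
    """Probability ratio implied by a logit difference, in nats."""
    return math.exp(diff)


# Sampled dial settings from the summary and their equivalent logit gaps.
thresholds = {r: round(log_threshold(r), 2) for r in (0.1, 0.3, 0.5, 0.7, 0.9)}

for ratio, diff in thresholds.items():
    print(f"min_ratio={ratio}: diff >= {diff}")
```

A `max_diff` of 0.000 corresponds to a ratio of exactly 1.0, which clears every setting in the table.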