feat(server): soft-close thinking termination via logit-ratio peek#326
feat(server): soft-close thinking termination via logit-ratio peek#326easel wants to merge 3 commits into
Conversation
…io peek Settle the design for a configurable soft-close dial that lets the AR loop terminate `</think>` early once its close-token logit comes within a configurable probability ratio of the chosen-token logit. Default disabled (zero cost when off); operator opt-in via `--think-soft-close-min-ratio`; per-request override clamps to the server ceiling like other thinking knobs. Key design choices documented: - Reuse the existing per-step CPU logits read (no graph addition). - Compare via `logit_diff >= log(min_ratio)` — no softmax required. - Multi-token close peeks first id only; existing inject machinery drives the rest of the sequence. - Soft wins ties against hard on same-step trigger (rebuttal in §12). - Spec-decode boundary unchanged — pure-AR only in v1. Next steps: codex review (§11 placeholder), implementation, tests.
Add an operator-configurable dial (`--think-soft-close-min-ratio`) that lets the AR loop terminate `</think>` early when its close-token logit comes within a configured probability ratio of the chosen-token logit. Default `0.0` (disabled) is byte-identical to pre-change behaviour. Mechanism (Qwen3.5/3.6 AR loop only in v1): - Comparator runs after sampling, before the existing hard-cap hook, using the logits row that's already on CPU for the sampler — no graph addition, no extra GPU work. - Threshold check uses `logit[close] - logit[chosen] >= log(min_ratio)`, which is mathematically equivalent to a probability-ratio compare but avoids softmax / exp() cost. - Per-request override (`thinking.soft_close_min_ratio`) clamps to `min(requested, server_default)`; ignored entirely when the operator has the dial at 0 (codex review Q5 fix). - Multi-token close peeks first id only; existing inject machinery drives the remaining ids. - New `close_kind="soft"` value in `finish_details`; spec §7 updated. Soft wins ties against hard on the same step (plan §4 + §12). Plumbing: - `BudgetHook::soft_close_min_ratio` (model_backend.h). - `GenerateResult::soft_forced_close`. - `ServerConfig::soft_close_min_ratio` + `--think-soft-close-min-ratio` CLI flag + startup banner line. - `ParsedRequest::per_req_soft_close_min_ratio` parsed from `thinking.soft_close_min_ratio`. - `do_ar_decode` / `do_spec_decode` signatures extended with a `soft_forced_close_out` pointer; existing hard-cap path untouched. Tests (12 new, 17 RUN_TEST invocations adding ~135 assertions): - Comparator math: disabled/strict/aggressive/below-threshold/ chosen-is-close/tiny-ratio edge cases. - State machine: single-token + multi-token inject, soft-preempts-hard, disabled-hard-still-fires, natural-at-boundary, byte-identical determinism when disabled. Spec-decode boundary documented as v1 limitation (out of scope). Gemma4 + Laguna soft-close are follow-ups; lucebox python config and autotune sweep brackets land in the lucebox CLI repo. See docs/experiments/soft-close-thinking-termination-plan.md for the full design (with verbatim codex review + dispositions).
There was a problem hiding this comment.
2 issues found across 9 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="docs/experiments/soft-close-thinking-termination-plan.md">
<violation number="1" location="docs/experiments/soft-close-thinking-termination-plan.md:103">
P3: Plan document contradicts itself on whether `log(min_ratio)` is precomputed once outside the AR loop or computed each step. §3.1's code snippet computes `std::log(budget_hook.soft_close_min_ratio)` inside the loop's if-block (every step the comparator runs), but the text immediately after says it is 'precomputed once outside the loop' and §3.6 repeats 'precomputed once at AR entry'. The actual implementation in `soft_close::should_fire` (model_backend.h:108) also computes `std::log(min_ratio)` on each call rather than caching it. A reader trying to implement from the plan would get contradictory guidance about where to place the `log()` call.</violation>
</file>
<file name="server/src/qwen35/qwen35_backend.cpp">
<violation number="1" location="server/src/qwen35/qwen35_backend.cpp:983">
P1: Soft-close skips the first token of multi-token close sequences because `maybe_force_close` immediately overwrites `close[0]` with `close[1]` in the same step.</violation>
</file>
Reply with feedback, questions, or to request a fix.
Re-trigger cubic
| tok, close0, budget_hook.close_token_ids.size()); | ||
| tok = close0; | ||
| budget_close_started = true; | ||
| close_inject_pos = 1; |
There was a problem hiding this comment.
P1: Soft-close skips the first token of multi-token close sequences because maybe_force_close immediately overwrites close[0] with close[1] in the same step.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35/qwen35_backend.cpp, line 983:
<comment>Soft-close skips the first token of multi-token close sequences because `maybe_force_close` immediately overwrites `close[0]` with `close[1]` in the same step.</comment>
<file context>
@@ -938,6 +943,47 @@ bool Qwen35Backend::do_ar_decode(int committed, int n_gen,
+ tok, close0, budget_hook.close_token_ids.size());
+ tok = close0;
+ budget_close_started = true;
+ close_inject_pos = 1;
+ if (soft_forced_close_out) *soft_forced_close_out = true;
+ };
</file context>
| close_inject_pos = 1; | |
| close_inject_pos = 0; |
| // prob[close] / prob[chosen] = exp(l_close - l_chosen); | ||
| // Compare l_close - l_chosen >= log(min_ratio) — single fma, | ||
| // no exp() needed. | ||
| const float log_ratio = std::log(budget_hook.soft_close_min_ratio); |
There was a problem hiding this comment.
P3: Plan document contradicts itself on whether log(min_ratio) is precomputed once outside the AR loop or computed each step. §3.1's code snippet computes std::log(budget_hook.soft_close_min_ratio) inside the loop's if-block (every step the comparator runs), but the text immediately after says it is 'precomputed once outside the loop' and §3.6 repeats 'precomputed once at AR entry'. The actual implementation in soft_close::should_fire (model_backend.h:108) also computes std::log(min_ratio) on each call rather than caching it. A reader trying to implement from the plan would get contradictory guidance about where to place the log() call.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At docs/experiments/soft-close-thinking-termination-plan.md, line 103:
<comment>Plan document contradicts itself on whether `log(min_ratio)` is precomputed once outside the AR loop or computed each step. §3.1's code snippet computes `std::log(budget_hook.soft_close_min_ratio)` inside the loop's if-block (every step the comparator runs), but the text immediately after says it is 'precomputed once outside the loop' and §3.6 repeats 'precomputed once at AR entry'. The actual implementation in `soft_close::should_fire` (model_backend.h:108) also computes `std::log(min_ratio)` on each call rather than caching it. A reader trying to implement from the plan would get contradictory guidance about where to place the `log()` call.</comment>
<file context>
@@ -0,0 +1,774 @@
+ // prob[close] / prob[chosen] = exp(l_close - l_chosen);
+ // Compare l_close - l_chosen >= log(min_ratio) — single fma,
+ // no exp() needed.
+ const float log_ratio = std::log(budget_hook.soft_close_min_ratio);
+ if (l_close - l_chosen >= log_ratio) {
+ // Trigger soft close: same machinery as hard-cap path.
</file context>
Integrate soft-close thinking termination while preserving the existing empty-visible-output retry path, stall guards, MoE AR dispatch path, and C2 gate tests.
Record PR Luce-Org#326 integration, current PR-head coverage, retained conflict probes, and Luce-Org#321 target-shard IPC feasibility findings.
…ebox-docker
Brings the soft-close logit-ratio peek mechanism onto feat/lucebox-docker
so the cuda12 image can be rebuilt with both the call:<verb>{} parser+
emitter fix (Luce-Org#329) AND the auto-thinking-cap dial available in a single
sweep.
Folded:
- 1552495 docs(experiments): plan soft-close thinking termination
- d799d00 feat(server): soft-close thinking termination via logit-ratio peek
Conflicts resolved:
- server/src/qwen35/qwen35_backend.cpp: do_ar_decode signature kept
HEAD's terse comment + soft-close's new bool *soft_forced_close_out
parameter.
- server/test/test_server_unit.cpp: concatenated HEAD's C2-gate tests
with soft-close's comparator/state-machine tests; merged both
RUN_TEST blocks.
Plumbing added in this merge (not on the source branch):
- DFLASH_THINK_SOFT_CLOSE_MIN_RATIO env var in entrypoint.sh, emitted
to the server CLI as --think-soft-close-min-ratio only when nonzero
(preserves byte-identical-when-disabled invariant).
- DflashRuntime.think_soft_close_min_ratio (float, default 0.0) in
lucebox types/config/docker_run so `lucebox config set
dflash.think_soft_close_min_ratio=0.5` propagates through.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ring PR Luce-Org#326 merge Cherry-pick artifact from resolving the conflict in test_server_unit.cpp during the soft-close merge — `sed -i '4155d'` deleted the closing brace of test_soft_close_natural_at_boundary instead of the leftover conflict-marker line. Compile fails with 'a function-definition is not allowed here before `{` token' at the int main() that follows. Restores the brace; no logic change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an operator-only flag that emits one stderr line per AR step inside the thinking phase recording (committed, chosen_tok, close0_tok, logit[close], logit[chosen], diff, prob_ratio). Designed to capture real close-vs-chosen logit trajectories on qwen3.6 so a sliding-ratio soft-close curve can be fit from data rather than guessed. The fixed-ratio soft-close (PR Luce-Org#326) terminates thinking when logit[close]-logit[chosen] >= log(ratio). A single ratio is the wrong tool for both "step 1 reasoning" and "5K-token reasoning" — what we want is a ratio that slides from strict at the start to permissive at the cap. Curve shape (linear / exponential / piecewise) depends on how the logit gap evolves through thinking, which this flag now exposes empirically. Plumbing: - BudgetHook::debug_thinking_logits (model_backend.h) - qwen35_backend.cpp maybe_soft_close lambda: emits [soft-trace] every step when flag set, regardless of soft_close_min_ratio. Also enables the prefill-last-logits read on the first AR token so step 0 participates. - ServerConfig::debug_thinking_logits + --debug-thinking-logits CLI + startup banner line. - http_server.cpp threads config_.debug_thinking_logits into the per-request BudgetHook. - DFLASH_DEBUG_THINKING_LOGITS env in entrypoint.sh (default 0; forwarded to --debug-thinking-logits when "1"). - lucebox: DflashRuntime.debug_thinking_logits (bool, default False) + config.py setter + docker_run.py env emission. Zero GPU cost (logits already on CPU for sampling); ~1 stderr line per thinking token across in-flight requests when on. Off by default. No behavior change when DFLASH_DEBUG_THINKING_LOGITS=0. test_server_unit: 1973 assertions, 0 failures. lucebox tests: 114/114 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…floor Empirical motivation ==================== Soft-close (PR Luce-Org#326 mainline) was effectively inert on qwen3.6-27b. A trajectory probe across 5 diverse prompts (1085-5771 thinking tokens each) showed `prob_ratio < 1e-8` every step — meaning no sampled ratio in {0.1, 0.3, 0.5, 0.7, 0.9} would ever fire. Root cause: `BudgetHook::close_token_ids` was used for both: (a) the peek-token id read by `soft_close::should_fire(..., close0)` (b) the inject sequence written when the hook fires. For qwen3.6-27b the model card's `thinking_terminator_hint` is a 16+ token English directive starting with "Considering the limited time by the user, ...". So `close_token_ids[0]` tokenized to ~79939 ("Considering") — a mid-sentence content token whose logit sits 19-35 nats below the chosen token at every thinking step. Fix (path α): split probe-vs-inject in BudgetHook ================================================== * `close_token_ids` — unchanged role. Full inject sequence written on hard close or when soft-close fires. * `soft_close_probe_ids` — NEW. Short sequence (typically one token) used only for the comparator peek. server_main detects the close marker substring inside the hint and tokenizes it in isolation; on miss it leaves the probe field empty (legacy fallback peek path in force). `BudgetHook::soft_close_probe_token()` returns the probe when set, else falls back to close_token_ids.front(). Validation: re-probed with image built from this branch. `</think>` (token 248069) reliably becomes argmax-competitive at 66-94% of natural reasoning length across all 5 prompts. `max_diff` reaches 0.000 (`prob_ratio = 1.0`) on every prompt vs prior `max_diff = -9.69` on token 79939. 9.7 nat improvement, restoring the mechanism to its designed regime. False-positive guard: min_thinking_tokens floor ================================================ The peek runs every AR step but the fire decision can be gated by a new `BudgetHook::soft_close_min_tokens` (server CLI: `--think-soft-close-min-tokens N`). When set, suppress fire until `committed_now - committed_at_entry >= soft_close_min_tokens`. Protects against a rare early `</think>` logit spike on prompts where the model briefly considers concluding mid-thought. Default 0 = floor disabled (no behavior change from prior). Empirical 66-94% fire window puts typical operating point at floor=128 for qwen3.6-27b. Per-request override not exposed (server-policy gate). Diagnostic: --debug-thinking-logits ==================================== Adds `BudgetHook::debug_thinking_logits` + server CLI flag. When on, emits one stderr line per AR step recording committed, chosen, probe0, logit[close], logit[chosen], diff, prob_ratio. Used to capture full close-vs-chosen trajectories so a sliding-ratio curve can be designed from data rather than guessed. Zero GPU cost (logits already on CPU for sampling); stderr-heavy, operator-only. Tests ===== 5 new unit tests: - test_soft_close_probe_uses_probe_ids_not_inject_ids - test_soft_close_probe_ids_empty_falls_back_to_close_token_ids - test_soft_close_inject_sequence_unchanged_when_fires - test_soft_close_min_tokens_blocks_early_fire - test_soft_close_min_tokens_default_zero_unchanged_behavior Also fixes a pre-existing OOB write in test_soft_close_determinism_when_disabled (vocab=1000 row indexed at 248069). UB-silent in Release before the new tests perturbed heap layout enough to crash; widened to vocab=250000 in place. test_server_unit: 1780 assertions, 0 failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Follow-up commit
|
…floor Empirical motivation ==================== Soft-close (PR Luce-Org#326 mainline) was effectively inert on qwen3.6-27b. A trajectory probe across 5 diverse prompts (1085-5771 thinking tokens each) showed `prob_ratio < 1e-8` every step — meaning no sampled ratio in {0.1, 0.3, 0.5, 0.7, 0.9} would ever fire. Root cause: `BudgetHook::close_token_ids` was used for both: (a) the peek-token id read by `soft_close::should_fire(..., close0)` (b) the inject sequence written when the hook fires. For qwen3.6-27b the model card's `thinking_terminator_hint` is a 16+ token English directive starting with "Considering the limited time by the user, ...". So `close_token_ids[0]` tokenized to ~79939 ("Considering") — a mid-sentence content token whose logit sits 19-35 nats below the chosen token at every thinking step. Fix (path α): split probe-vs-inject in BudgetHook ================================================== * `close_token_ids` — unchanged role. Full inject sequence written on hard close or when soft-close fires. * `soft_close_probe_ids` — NEW. Short sequence (typically one token) used only for the comparator peek. server_main detects the close marker substring inside the hint and tokenizes it in isolation; on miss it leaves the probe field empty (legacy fallback peek path in force). `BudgetHook::soft_close_probe_token()` returns the probe when set, else falls back to close_token_ids.front(). Validation: re-probed with image built from this branch. `</think>` (token 248069) reliably becomes argmax-competitive at 66-94% of natural reasoning length across all 5 prompts. `max_diff` reaches 0.000 (`prob_ratio = 1.0`) on every prompt vs prior `max_diff = -9.69` on token 79939. 9.7 nat improvement, restoring the mechanism to its designed regime. False-positive guard: min_thinking_tokens floor ================================================ The peek runs every AR step but the fire decision can be gated by a new `BudgetHook::soft_close_min_tokens` (server CLI: `--think-soft-close-min-tokens N`). When set, suppress fire until `committed_now - committed_at_entry >= soft_close_min_tokens`. Protects against a rare early `</think>` logit spike on prompts where the model briefly considers concluding mid-thought. Default 0 = floor disabled (no behavior change from prior). Empirical 66-94% fire window puts typical operating point at floor=128 for qwen3.6-27b. Per-request override not exposed (server-policy gate). Diagnostic: --debug-thinking-logits ==================================== Adds `BudgetHook::debug_thinking_logits` + server CLI flag. When on, emits one stderr line per AR step recording committed, chosen, probe0, logit[close], logit[chosen], diff, prob_ratio. Used to capture full close-vs-chosen trajectories so a sliding-ratio curve can be designed from data rather than guessed. Zero GPU cost (logits already on CPU for sampling); stderr-heavy, operator-only. Tests ===== 5 new unit tests: - test_soft_close_probe_uses_probe_ids_not_inject_ids - test_soft_close_probe_ids_empty_falls_back_to_close_token_ids - test_soft_close_inject_sequence_unchanged_when_fires - test_soft_close_min_tokens_blocks_early_fire - test_soft_close_min_tokens_default_zero_unchanged_behavior Also fixes a pre-existing OOB write in test_soft_close_determinism_when_disabled (vocab=1000 row indexed at 248069). UB-silent in Release before the new tests perturbed heap layout enough to crash; widened to vocab=250000 in place. test_server_unit: 1780 assertions, 0 failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Adds an operator-configurable soft-close dial that lets the AR
decode loop terminate
</think>early when the close-token logit comeswithin a configured probability ratio of the most-likely-token logit.
Default
0.0keeps current behaviour byte-identical; any positivevalue activates the third
close_kind="soft"taxonomy entry thatdocs/specs/thinking-budget.md§7 has reserved since the v2 work.Motivation. Gemma 4 26B decodes at ~30 tok/s through up to 15 488
phase-1 thinking tokens (~8 min wall-clock / case) before the hard-cap
hook fires. Spot-checks of close-token logits in the late phase-1
window show
</think>riding at 10-60 % of the chosen-tokenprobability for thousands of steps — i.e. the model is near ready
to close. A soft-ratio dial in the
[0.05, 0.5]range can reclaim30-50 % of those tokens at no quality loss. Sweep methodology and
recommended dial values land in a follow-up PR (out of scope here).
What's in the PR
docs/experiments/soft-close-thinking-termination-plan.md, with averbatim codex review and per-finding dispositions.
BudgetHook::soft_close_min_ratio+GenerateResult::soft_forced_close.qwen35_backend.cpp::do_ar_decode.--think-soft-close-min-ratio <F>+ startup banner line.thinking.soft_close_min_ratio, clamped to the server ceiling.close_kind="soft"emitted infinish_detailswhen the soft pathfired (precedence: soft > hard > natural on tie — see plan §12).
Mechanism — zero-cost when disabled
The AR loop already materialises the full logits row to CPU each step
for the sampler. The comparator reads two scalars from that buffer
and runs
logit[close0] - logit[chosen] >= log(min_ratio)— no graphmodification, no extra GPU work. When
min_ratio == 0the outerguard short-circuits before any work happens. Generation determinism
is byte-identical to pre-PR with the dial off.
For the math:
prob[i]/prob[j] = exp(logit[i] - logit[j]), socomparing
logit_diff >= log(min_ratio)is identical to comparingprob_ratio >= min_ratiobut skips the softmax. Numerically stablein fp32 for typical LLM logit ranges (codex confirmed §3.4).
Scope (v1)
pattern (full logits already on CPU per step) but get their own PRs
to keep the diff reviewable.
argmax-of-target — soft-peek there would require graph extension.
Spec-decode tails off into AR before the budget edge, so the soft
trigger still fires correctly on the AR tail.
</think>sequences (Laguna)peek only the first id; the existing inject machinery handles the
rest. Codex agreed this is the right engineering trade-off (review
Q3).
Codex review
Sent to the live Gemma 4 26B service via
lucebox codex; verbatimreview and per-finding dispositions are recorded in plan §11. Codex
verdict: PROCEED WITH CHANGES. The single critical finding (Q5,
per-request clamp logic broken when server_default=0) is fixed: the
operator-disabled-server case now silently ignores per-request opt-in
attempts rather than enabling them via the clamp loophole.
Tests
server/test/test_server_unit.cppgains 12 new test functions(~135 new assertions). The comparator is small and inline in
model_backend.h::soft_close::should_fire, called from a unit-testmirror of the AR loop's close-state machine. No GPU required.
Standalone smoke-test of the comparator math (uncommitted) confirmed
135/135 assertions pass before committing.
Test plan
cmake --build server/build --target test_server_unit && ctest -R server_unitluce-bench/tests/test_client_thinking_budget.pypass unchanged (default dial=0 → behaviour byte-identical).--think-soft-close-min-ratio 0.1in a smoke deploy, confirms server banner shows the new line, runs a Qwen3.6 thinking-enabled probe, and verifies afinish_details.close_kind="soft"appears for at least one case.min_ratio ∈ {0.05, 0.1, 0.2, 0.5}on the existing coding-agent-loop probes.Follow-ups (NOT in this PR)
dflash.think_soft_close_min_ratioknob in the lucebox python CLI repo + autotune sweep bracket entry.🤖 Generated with Claude Code