diff --git a/docs/experiments/soft-close-thinking-termination-plan.md b/docs/experiments/soft-close-thinking-termination-plan.md new file mode 100644 index 00000000..3ebe2dd7 --- /dev/null +++ b/docs/experiments/soft-close-thinking-termination-plan.md @@ -0,0 +1,774 @@ +# Soft-close: logit-ratio-driven early `` termination + +Status: PLAN — pre-implementation. No code changes in this commit. + +Branch: `feat/soft-close-thinking-termination` +Base: `Luce-Org/lucebox-hub:main` @ `8305b6c` +Affected files (anticipated): +- `server/src/common/model_backend.h` — extend `struct BudgetHook` and `struct GenerateResult`. +- `server/src/qwen35/qwen35_backend.cpp` — soft-close peek inside the AR decode loop (`do_ar_decode`). +- `server/src/server/http_server.cpp` — wire CLI/per-request soft ratio into `BudgetHook`; flip `close_kind` to `"soft"` when the soft path fired. +- `server/src/server/http_server.h` — add `soft_close_min_ratio` to `ServerConfig` + per-request override field. +- `server/src/server/server_main.cpp` — `--think-soft-close-min-ratio` CLI flag + startup banner. +- `server/test/test_server_unit.cpp` — comparator + state-machine unit tests. +- `docs/specs/thinking-budget.md` — note `close_kind="soft"` is now live and document the dial. + +Explicitly NOT touched (parallel sub-agent owns these on +`fix/sse-emitter-content-mode-tool-parse`): +- `server/src/server/sse_emitter.cpp` +- `server/src/server/tool_parser.cpp` + +## 1. Problem statement + +The thinking-budget envelope (`docs/specs/thinking-budget.md`) today +exposes two `close_kind` values: + +- `natural` — the model emitted `` on its own. +- `hard` — the Level-2 hook injected `` at the budget edge + because the model would otherwise burn the entire phase-1 budget. + +In practice, Gemma 4 26B decodes at ~30 tok/s through its full 15 488 +phase-1 cap (≈8 minutes wall-clock per case) on hard prompts whose +reasoning the model has effectively finished much earlier. 
Sampled +spot-checks show the close-token logit `logit[]` riding very +close to the argmax for hundreds or thousands of steps before the +budget edge — i.e. the model is *near* ready to close, sampling just +doesn't pick `` because some content token has a marginally +higher logit. Spec §7 already reserves a third `close_kind="soft"` value +for "a future voluntary-close mechanism (logit-biasing the model toward +`` as the cap approaches, before forcing it)" — this PR turns +that reservation on, with a different (cheaper, more legible) mechanism +than logit biasing. + +## 2. Goal — bounded, opt-in, zero-cost-when-disabled + +Add a single configurable knob — `soft_close_min_ratio ∈ [0, 1]` — that, +when set above zero, lets the AR loop force `` early once the +close token is "close enough" to the most-likely token to be a credible +candidate. Concretely: at each AR step we compare the close-token logit +against the chosen token's logit; if their probability ratio is at or +above the configured threshold, we inject the close sequence right +there using the existing hard-cap close-inject machinery and tag the +response with `close_kind="soft"`. + +Invariants: + +- **Default disabled.** `soft_close_min_ratio = 0.0` is the shipped + default. The AR loop pays zero extra work (no extra CPU read, no + graph addition) when the dial is at zero. Generation must be + byte-identical to pre-PR with the dial at zero. +- **Bounded.** Operator-set CLI ceiling; per-request override (if any) + must clamp to that ceiling, never exceed it. Same posture as the + other thinking knobs (spec §4.5). +- **Composable.** Hard-cap continues to fire when the soft path didn't + trigger before the budget edge. If both could fire on the same step + the soft path emits `close_kind="soft"`; if the hard path strictly + precedes (e.g. soft disabled or threshold not met), `close_kind="hard"`. 
+- **Hard-cap untouched.** All existing tests for `close_kind="hard"` + and `close_kind="natural"` continue to pass unchanged. + +## 3. Mechanism — logit-ratio peek (mechanism A) + +### 3.1 Comparator + +At each AR step the loop already (a) computes `logits` on-GPU and +(b) copies the full vocab-sized `logits` row to CPU via +`ggml_backend_tensor_get(sg_.logits, logits_buf.data(), ...)` at +`server/src/qwen35/qwen35_backend.cpp:1017-1018`. Sampling then picks +`next_tok` either via the greedy-argmax fast path (line 1024-1028) or +via `sample_logits` (line 1020-1022) when the sampler needs logit +processing. + +**Key observation: the AR loop already has the full logits vector on +CPU.** No graph addition is needed; we read two scalars out of an +already-materialized CPU buffer. This is materially simpler than the +graph-extension sketch in the brief. + +The comparator runs after the sampler picks `next_tok` and before the +force-close hook decides whether to override `next_tok`: + +```cpp +// next_tok already chosen by sampler (argmax or full sampler). +// logits_buf already populated by ggml_backend_tensor_get above. +if (budget_hook.soft_close_min_ratio > 0.0f && + !budget_hook.close_token_ids.empty() && + !budget_close_started) { + const int32_t close0 = budget_hook.close_token_ids.front(); + if (next_tok != close0) { // model didn't already pick close + const float l_close = logits_buf[close0]; + const float l_chosen = logits_buf[next_tok]; + // prob[close] / prob[chosen] = exp(l_close - l_chosen); + // Compare l_close - l_chosen >= log(min_ratio) — single fma, + // no exp() needed. + const float log_ratio = std::log(budget_hook.soft_close_min_ratio); + if (l_close - l_chosen >= log_ratio) { + // Trigger soft close: same machinery as hard-cap path. + soft_forced_close = true; + next_tok = close0; + budget_close_started = true; + close_inject_pos = 1; + } + } +} +``` + +`log(min_ratio)` is precomputed once outside the loop. 
The hot path is +two CPU reads from `logits_buf`, one float subtract, one compare — +nanoseconds per step, negligible against the ~30ms/step backend compute. + +### 3.2 Probability ratio without softmax + +Doing the comparison on raw logits via `l_close - l_chosen >= log_ratio` +is mathematically equivalent to `prob[close] / prob[chosen] >= ratio`, +because softmax-normalisation is rank-preserving and the normaliser +cancels in the ratio: `prob[i]/prob[j] = exp(l_i - l_j)`. We never +need the full softmax. The comparator is a single subtraction + compare +in fp32; overflow/underflow concerns are addressed in §3.4. + +### 3.3 Dial semantics + +The dial is the threshold ratio, *not* a log threshold. Operator-facing +values are interpretable as probability ratios. Among non-zero values, +a lower ratio is a weaker condition and therefore fires earlier and +more often: + +| `min_ratio` | Meaning | Behaviour | +|---|---|---| +| `0.0` | Disabled (default). | No work done; behaves exactly as today. | +| `0.05` | 5 % | Fires once `` is within 20× of the chosen token's probability. The most aggressive setting listed here: it nudges the model while close is still fairly unlikely. | +| `0.1` | 10 % | Fires when `` is within 10×. Aggressive. | +| `0.5` | 50 % | Fires when `` has at least half the probability of the chosen token. Conservative. | +| `1.0` | 100 % | Fires only when `` is at least as likely as the chosen token (≈ equivalent to natural close at the same step). Most conservative; useful as a safety check / sanity probe. | + +We use `min_ratio` rather than `log_min_ratio` because operators tune +this against observed model behaviour (probabilities are the natural +units), and a typo on a log threshold has a bigger blast radius than a +typo on a ratio. + +### 3.4 Numerical guards + +The comparator computes `l_close - l_chosen` in fp32. Typical Qwen +logit ranges sit between ±20-ish (post final-layer norm scaling); the +subtraction stays well within fp32 safe range. Edge cases: + +- `next_tok == close0`: skip the comparator outright — the model just + picked close on its own, the existing natural-close path handles it. 
+- `min_ratio == 0`: gated at the top of the comparator — no log call, + no read. +- `min_ratio` extremely small (e.g. `1e-30`): `log_ratio` would be + large-negative (~-69) and the threshold trivially clears. The + `[0, 1]` parse-time bound does not exclude such values, so a tiny + positive ratio is a legal, if extremely aggressive, setting; the + `min_ratio > 0` guard at the comparator only ensures `log()` is never + called on zero (any positive float yields a finite threshold). +- `min_ratio == 1.0`: `log_ratio == 0`, so the comparator fires exactly + when `l_close >= l_chosen` — which (given we skip when + `next_tok == close0`) means `` has logit equal to or above + whatever the sampler picked. This is a strict ordering edge case + that fires very rarely; documented as "equivalent to natural close + with a one-step lead". + +### 3.5 Multi-token close-id handling + +For models where `` tokenizes to multiple ids (Laguna's +`[1718, 37947, 32]`), we peek the FIRST id's logit only and let the +existing multi-token inject machinery (qwen35_backend.cpp:892-905) +emit the remaining ids on the following steps. + +Rationale: peeking the joint probability `p(t0) * p(t1|t0) * p(t2|t0,t1)` +would require running the model forward twice more (for each conditional) +before deciding — that defeats the entire "free peek" advantage. The +single-token peek is an *upper bound* on the joint probability (each +conditional is at most 1), and a tight one under the +common-sense assumption that conditional probs aren't pathologically +suppressed once `t0` is in the context. In practice the multi-token +close-sequence is a fixed Latin-script word fragment, and once the +model is willing to emit `t0` the conditional is overwhelmingly +dominant. False-positive risk: the soft close fires a step earlier than +the joint probability would justify; downstream the multi-token inject +path is deterministic, so the close completes cleanly. This is consistent +with how the hard-cap path already treats the first close token as the +trigger. + +Out of scope: full joint-probability peek. 
Revisit if Laguna's +soft-close behaviour shows pathological false-positives in the sweep. + +### 3.6 Zero-cost-when-disabled invariant + +When `soft_close_min_ratio == 0` (the default): + +- The comparator's outer guard `if (budget_hook.soft_close_min_ratio > 0.0f && ...)` + is checked first; on false, the entire branch is skipped. +- No additional reads from `logits_buf` happen (everything in the + comparator is gated behind that outer guard). +- `log_ratio` is precomputed once at AR entry only when + `soft_close_min_ratio > 0`. +- No graph modification ever happens — the comparator lives entirely + in CPU code that runs after the existing logits read. + +Net cost when disabled: one fp32 compare-with-zero per AR step. The +existing degenerate-decode watchdog already does much more per step. +Generation determinism with `min_ratio=0` is byte-identical to pre-PR. + +## 4. State machine — soft path alongside the hard path + +The existing `maybe_force_close` lambda in +`server/src/qwen35/qwen35_backend.cpp:889-948` is the hard-cap +implementation. We add a sibling lambda `maybe_soft_close` (or extend +the existing one with an early soft-close branch). Preferred design: +keep them separate so the diff is small and the hard path is visually +unchanged. + +Order of operations per AR step: + +1. Run the existing argmax / sample_logits path to choose `next_tok`. +2. Read `logits_buf[close0]` and `logits_buf[next_tok]` for the soft + comparator. (Already in CPU memory.) +3. **Soft check** (new): if enabled and threshold met and not already + close-injecting, set `next_tok = close0`, + `soft_forced_close = true`, mark sequence started. +4. **Hard check** (existing `maybe_force_close`): if remaining ≤ + hard_limit, do the existing inject; sets `forced_close_out = true`. +5. Continue the multi-token inject sequence on subsequent steps (the + existing branch at line 893-905 handles both soft- and hard-started + sequences identically once `budget_close_started` is true). 
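
The per-step order of operations above can be sketched as a small, GPU-free function. This is a minimal sketch under stated assumptions rather than the planned implementation: `HookSketch`, `CloseStateSketch`, and `close_step_sketch` are hypothetical names, `remaining` stands in for whatever budget bookkeeping the real loop keeps, the hard check is reduced to the `remaining ≤ hard_limit` comparison from step 4, and the natural-close path is not modelled.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for BudgetHook (illustration only).
struct HookSketch {
    std::vector<int32_t> close_token_ids;
    int   hard_limit_remaining = 0;
    float soft_close_min_ratio = 0.0f;  // 0.0 = soft path disabled
};

// Hypothetical close-sequence state carried across AR steps.
struct CloseStateSketch {
    bool started    = false;
    int  inject_pos = 0;
    bool soft_fired = false;
    bool hard_fired = false;
};

// One AR step: returns the token to emit (the sampler's choice or an override).
// `remaining` is an assumed count of budget tokens left before the cap.
inline int32_t close_step_sketch(const std::vector<float>& logits,
                                 int32_t chosen, int remaining,
                                 const HookSketch& hook, CloseStateSketch& st) {
    const std::vector<int32_t>& ids = hook.close_token_ids;
    if (ids.empty()) return chosen;

    // Step 5: a started sequence keeps injecting, whichever path started it.
    if (st.started) {
        if (st.inject_pos < static_cast<int>(ids.size())) return ids[st.inject_pos++];
        return chosen;  // sequence finished; sampling resumes
    }

    const int32_t close0 = ids.front();
    assert(close0 >= 0 && static_cast<std::size_t>(close0) < logits.size());
    assert(chosen >= 0 && static_cast<std::size_t>(chosen) < logits.size());

    // Step 3: the soft check runs first, so it wins a same-step tie.
    if (hook.soft_close_min_ratio > 0.0f && chosen != close0 &&
        logits[close0] - logits[chosen] >= std::log(hook.soft_close_min_ratio)) {
        st.started = true; st.inject_pos = 1; st.soft_fired = true;
        return close0;
    }
    // Step 4: hard check at the budget edge.
    if (remaining <= hook.hard_limit_remaining) {
        st.started = true; st.inject_pos = 1; st.hard_fired = true;
        return close0;
    }
    return chosen;
}
```

Because both checks sit behind the `started` early return, at most one of `soft_fired` and `hard_fired` is ever set per state object in this sketch, which is the mutual exclusion the plan relies on.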
+ +**Precedence note.** Steps 3 and 4 are mutually exclusive on a given +step *because* both gate on `!budget_close_started`. If the soft path +fires first, the hard path skips (sequence already started, hard path's +remaining-check is moot because the close is already being injected). +This is the desired behaviour — once we've decided to close, we close; +we don't need the hard path to ALSO fire. The hard-path flag +(`budget_forced_close`) stays unset, `soft_forced_close` is set, and +`close_kind="soft"` is what the response carries. + +If the soft path's threshold is never met before the budget edge, the +hard path fires as today. `close_kind="hard"` is what the response +carries. Existing behaviour preserved. + +What if both *would* fire on the same step (i.e. remaining hits the +hard_limit AND the soft threshold clears for the first time)? The soft +path runs first in code order and wins. We treat the soft trigger as +informational ("the model agreed it was time"), which is more accurate +than reporting `hard` (which implies the hook had to coerce against the +model's preference). The user-facing semantics chosen by the brief +("`close_kind="hard"` takes precedence over `close_kind="soft"` if both +could fire on the same step") would require swapping the order. We +disagree and propose soft-wins instead — see §12 for the rebuttal. + +## 5. Telemetry — `close_kind="soft"` + +### 5.1 `GenerateResult` extension + +Add a new bool sibling to `GenerateResult::budget_forced_close`: + +```cpp +// True when the soft-close path (logit-ratio peek) injected the +// sequence in this generation. Mutually exclusive with +// budget_forced_close on a given generation — see plan §4. +bool soft_forced_close = false; +``` + +`merge_empty_spec_retry_result` in `model_backend.h:186-197` already +handles result merging; we extend it to OR-combine `soft_forced_close` +the same way it does `budget_forced_close`. 
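
As a rough illustration of the OR-combine described above, the following sketch merges only the two close flags. `CloseFlagsSketch` and `merge_close_flags_sketch` are hypothetical names, not the real `GenerateResult` or `merge_empty_spec_retry_result`; the actual merge handles more fields and may differ in shape.

```cpp
// Hypothetical subset of GenerateResult carrying only the two close flags.
struct CloseFlagsSketch {
    bool budget_forced_close = false;  // hard path fired
    bool soft_forced_close   = false;  // soft path fired
};

// OR-combine the flags of a retry into the accumulated result, so that a
// close recorded by either attempt survives the merge.
inline void merge_close_flags_sketch(CloseFlagsSketch& into,
                                     const CloseFlagsSketch& from) {
    into.budget_forced_close = into.budget_forced_close || from.budget_forced_close;
    into.soft_forced_close   = into.soft_forced_close   || from.soft_forced_close;
}
```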
+ +### 5.2 `http_server.cpp` close-kind selection + +`server/src/server/http_server.cpp:1596-1599` currently selects between +`"hard"` and `"natural"`. We extend it to three branches: + +```cpp +std::string close_kind = "natural"; +if (req.thinking_opt_in) { + if (result.soft_forced_close) close_kind = "soft"; + else if (result.budget_forced_close) close_kind = "hard"; +} +``` + +That's the only emission-site change; the `finish_details.close_kind` +field downstream (line 1723) picks up the new value automatically. + +### 5.3 Spec update + +`docs/specs/thinking-budget.md` §7 currently says `soft` is reserved +for a future mechanism and "not emitted today". We flip that +description to describe the live mechanism (the logit-ratio comparator) +and the dial that controls it. The taxonomy table gains a third +row. + +## 6. Plumbing + +### 6.1 `BudgetHook` extension + +`server/src/common/model_backend.h:53-56` — extend: + +```cpp +struct BudgetHook { + std::vector<int32_t> close_token_ids; + int hard_limit_remaining = 0; + // Soft-close: when prob[close[0]] / prob[chosen] >= soft_close_min_ratio + // (equivalently, logit[close[0]] - logit[chosen] >= log(soft_close_min_ratio)), + // force-emit close_token_ids early. 0.0 = disabled (default). 1.0 = only + // when close is already the most-likely token (≈ natural close). Lower + // values fire more aggressively. See docs/specs/thinking-budget.md §7. + float soft_close_min_ratio = 0.0f; +}; +``` + +### 6.2 `ServerConfig` + CLI + +`server/src/server/http_server.h` (`struct ServerConfig`): add + +```cpp +// Default soft-close min-ratio applied when a request opts into +// thinking and does not provide its own per-request override. +// 0.0 = disabled (no soft-close). Spec §7. +float soft_close_min_ratio = 0.0f; +``` + +`server/src/server/server_main.cpp`: add CLI flag +`--think-soft-close-min-ratio ` paralleling the existing +`--hard-limit-reply-budget` flow: + +- Help-text entry (around line 185-195). 
+- `cli_set.soft_close_min_ratio = false;` field in the bool tracker + struct. +- Parse branch: + ```cpp + } else if (std::strcmp(argv[i], "--think-soft-close-min-ratio") == 0 && i + 1 < argc) { + sconfig.soft_close_min_ratio = std::strtof(argv[++i], nullptr); + cli_set.soft_close_min_ratio = true; + } + ``` +- Validation: at startup, if `soft_close_min_ratio < 0 || > 1`, emit a + warning and clamp to `[0, 1]`. +- Banner line: `[server] │ soft_close_min_ratio = 0.000 (cli|default)`. +- Resolution: there is no model-card source for this value (it is an + operator-tuning knob, not a model property). CLI wins; otherwise + default 0.0. + +### 6.3 Per-request override + +Spec §4.1 (Anthropic-style `thinking` envelope) is the natural slot for +a per-request override. We add: + +```jsonc +{ + "thinking": { + "type": "enabled", + "budget_tokens": 4000, + "reply_budget": 300, + "soft_close_min_ratio": 0.1 // NEW + } +} +``` + +Clamping rule (consistent with the other thinking knobs, spec §4.4): +`effective = min(requested, server_default)` — i.e. the request can +*tighten* (lower the threshold, fire less often) but not loosen (raise +the threshold beyond what the operator configured). Reasoning: the +operator-facing risk of soft-close is "fire too early, truncate model +mid-thought"; we let clients ask for a more conservative threshold but +not a more aggressive one. Same posture as `budget_tokens` and +`reply_budget`. + +Field plumbing: + +- `ParsedRequest` (`http_server.h:170-203`) gains + `float per_req_soft_close_min_ratio = -1.0f;` (-1 = unset). +- Parser (`http_server.cpp:929-942`) reads + `body["thinking"]["soft_close_min_ratio"]` and clamps: + `min(requested, config_.soft_close_min_ratio)`. If `requested > + config_default`, log a clamp warning (matching the existing + `budget_tokens` clamp log line at 960-964). 
+- Hook construction (`http_server.cpp:1314-1322`) sets + `gen_req.budget_hook.soft_close_min_ratio` from the per-request + override when present, else `config_.soft_close_min_ratio`. + +The OpenAI Responses `reasoning.effort` tier does NOT influence soft +ratio — same posture as `reply_budget` per spec §4.2. Soft is +operator-policy; effort tier selects *budget*. + +### 6.4 lucebox / autotune plumbing + +The user brief mentions `dflash.think_soft_close_min_ratio` and an +`autotune.py` field. These live in the python lucebox CLI repo, not +in `lucebox-hub` (this repo). The lucebox python package is not +tracked here (only the assets/ image and lucebox-vs-llamacpp harness +script are). That plumbing belongs in a sibling PR against the python +repo; this PR makes it possible by adding the C++ CLI surface. + +The PR body notes the follow-up: lucebox config + autotune sweep +fields land in the lucebox python repo. + +## 7. Spec-decode boundary + +Spec-decode is explicitly out of scope. The existing AR tail-off +mechanism at `server/src/qwen35/qwen35_backend.cpp:1210-1236` already +hands control to AR when `remaining <= hard + q_len`. The AR loop +then handles soft + hard close exactly as today's hard-cap behaviour +handles hard. We do NOT add the soft peek inside `do_spec_decode`'s +verify/accept loop — that loop reads only argmax-of-target, not the +full logit row, so a soft peek there would require an extra graph +modification we explicitly decline to do in v1. + +Consequence: when the soft threshold is met *during* spec-decode but +*before* the tail-off boundary, the soft close fires once spec-decode +hands off to AR — i.e. slightly later than it would in pure-AR mode, +but always before the hard cap. Acceptable for v1; documented in PR +body. Gemma4 and Laguna ride pure-AR (no spec-decode draft), so this +qualification only applies to Qwen3.5/3.6 + draft. 
+ +No double-fire risk: the soft check is keyed on `!budget_close_started` +which is local to a single `do_ar_decode` call. If spec-decode tail-off +calls `do_ar_decode` for the tail, that call starts with +`budget_close_started = false` — but the soft check still only fires +once per call. The hard check at the budget edge would fire on the +same call. Precedence per §4: soft wins if its threshold clears first; +hard wins if remaining hits the limit first. + +## 8. Test plan — unit-level, no GPU required + +Add a new test section to `server/test/test_server_unit.cpp`: +"`── Soft-close comparator ──`". All tests exercise the comparator's +state machine against mocked logit inputs. No backend, no GPU. + +The comparator's core is: + +```cpp +// Returns true if soft-close should fire on this step. +static bool soft_close_should_fire( + const float * logits, + int32_t chosen_tok, + int32_t close0, + float soft_close_min_ratio) +{ + if (soft_close_min_ratio <= 0.0f) return false; + if (chosen_tok == close0) return false; + const float log_ratio = std::log(soft_close_min_ratio); + return logits[close0] - logits[chosen_tok] >= log_ratio; +} +``` + +Lifted out of the AR loop into a small inline helper (in +`server/src/common/model_backend.h` or `qwen35_backend.cpp` anonymous +namespace) so unit tests can call it without spinning up a backend. + +### 8.1 Test cases + +1. **Disabled default.** `min_ratio=0.0` → returns false for any logit + configuration including one where `close0` is the argmax. +2. **Strict (`min_ratio=1.0`).** Fires only when `logit[close0] >= + logit[chosen]` AND `chosen != close0`. With `chosen=argmax(other)` + and `logit[close0] == logit[chosen]`, fires. With `logit[close0] = + logit[chosen] - 0.001`, does not fire. +3. **Aggressive (`min_ratio=0.5`).** With `logit[close0] = logit[chosen] + - log(2)` (i.e. prob ratio exactly 0.5), fires (boundary inclusive). + With `logit[close0] = logit[chosen] - log(2) - 0.001`, does not. +4. 
**Below threshold.** `min_ratio=0.5`, `logit[close0] = logit[chosen] + - log(3.333)` (≈ prob ratio 0.3) → does not fire. +5. **Chosen IS close.** `chosen_tok == close0` → returns false even + with min_ratio aggressive. (Model self-closed; the natural-close + path handles it.) +6. **Multi-token close.** Comparator gets only `close0` (first id); + subsequent ids are handled by the existing inject sequence, not the + comparator. Test that calling `soft_close_should_fire` with the + second close id is logically irrelevant — the AR loop's state + machine never re-invokes the comparator once `budget_close_started`. + Test via the integration helper described in §8.2. +7. **Numerical edge: very-small min_ratio.** `min_ratio = 1e-6` (≈ -13.8 + log). Verify no NaN / inf, threshold triggers when `logit[close0] - + logit[chosen] >= -13.8`. With `logit[close0] = logit[chosen] - 14`, + does not fire; `- 13.5` fires. + +### 8.2 State-machine integration test + +A second helper exercises the close-sequence inject state machine +together with the comparator. Since `do_ar_decode` is too entangled +with GPU buffers to call from a unit test, we extract the close-state +into a small struct: + +```cpp +struct CloseState { + bool started = false; + int inject_pos = 0; + bool soft_fired = false; + bool hard_fired = false; +}; +``` + +…and a `step` function that, given (logits row, chosen_tok, generated, +n_gen, BudgetHook, &CloseState) returns the override token (or +chosen_tok unchanged) and mutates `CloseState`. Then tests assert: + +- **(soft, single-token close).** A row where soft fires on step 100 + with `chosen != close0`. Returns `close0` on step 100, sets + `soft_fired=true`. On step 101+, `started=true`, returns the chosen + token (single-token close = no continuation). +- **(soft, multi-token close).** Close ids `[1718, 37947, 32]`. Soft + fires on step 100. Step 100 returns `1718`. Steps 101-102 inject + `37947` and `32` regardless of chosen tok. Step 103 returns chosen. 
+- **(soft then hard would-fire).** Soft fires at step 50; hard limit + hit at step 200. Hard path skipped on step 200 because + `started=true`. `soft_fired=true`, `hard_fired=false`. Telemetry + reports `close_kind="soft"`. +- **(hard, no soft).** `min_ratio=0`; hard limit hit at step 200. + Returns `close0` on step 200. `hard_fired=true`, + `soft_fired=false`. Same close_kind="hard" semantics as today. +- **(natural at boundary).** Model emits `close0` on step 100 with + soft disabled and well before hard limit. Comparator skipped + (`chosen == close0`). `soft_fired=false`, `hard_fired=false`. + Telemetry: `close_kind="natural"`. + +### 8.3 Existing tests stay green + +`luce-bench/tests/test_client_thinking_budget.py` (server-level +integration) exercises `close_kind="hard"` and `"natural"`. With +soft-close disabled by default, every assertion stays valid. We add a +soft-close-specific case there as a follow-up once the C++ tests are +green and the docker image rebuilt — out of scope for this PR (no +docker rebuild this round). + +### 8.4 Determinism check + +A small additional unit test seeds a mock logits row deterministically +and asserts that the soft-close path with `min_ratio=0` produces the +same `chosen_tok` and CloseState as the legacy code path. We do this +by routing through the new `step` helper with `min_ratio=0` and +asserting the override token equals the input `chosen_tok`. Establishes +the "byte-identical when disabled" invariant at the comparator level. + +## 9. PR breakdown — two commits + possibly a third + +1. **Plan commit.** This file, on its own commit, `docs:` prefix. +2. **Implementation commit.** `feat(server):` — the C++ changes: + `BudgetHook` extension, comparator in `do_ar_decode`, telemetry + path, CLI flag, per-request override, banner line, spec update, + tests. +3. 
**(optional) Plumbing-only commit.** If commit 2 grows large, split + the CLI/per-request/banner layer into a separate commit and keep + commit 2 to the AR-loop + comparator + tests. + +Three is the realistic max; the work fits naturally in two. + +## 10. Codex review prompts + +The brief instructs us to send codex the plan with five explicit +questions. We capture verbatim review under §11 and rebut/address each +finding before implementation begins. + +Questions sent: + +1. Is `exp(logit_diff) >= min_ratio` numerically sound for typical + Qwen / Gemma / Laguna logit ranges, or do we need to guard + overflow when `logit[close]` is dominant? +2. The current argmax path reads one int32 per step. Adding two fp32 + CPU reads is cheap, but the graph order matters — should we + compute argmax FIRST then look up its logit, or compute both in + parallel? Any latency tradeoff? +3. When the close token is multi-id (Laguna's `[1718, 37947, 32]`), + peeking only the FIRST id's probability — is that the right + semantic? Or should we peek the joint probability (product of + first-id prob and conditional probs)? +4. Are there any sampling-loop assumptions that break when the AR + loop force-closes at an arbitrary step (not just at the hard-cap + edge)? E.g. `generated` counter resets, KV-position bookkeeping, + `result.tokens.size()` accounting? +5. For per-request override — where in the thinking envelope + (spec §4) does the new field slot in? Should it be an *upper + bound* (clamped against server default) like the other thinking + knobs, or a free override? + +## 11. Codex review + +Review run on 2026-05-31 via the live `lucebox codex` path (Gemma 4 +26B model on bragi, thinking_budget=8000 reply_budget=4000). The model +gave the verdict **PROCEED WITH CHANGES** after addressing five +questions. Verbatim review body follows. + +> ### Technical Review: Soft-close Thinking Termination +> +> **1. Numerical Soundness (§3.1, §3.4)** +> The approach is sound. 
Using `l_close - l_chosen >= log(min_ratio)` +> avoids `exp()` and is numerically stable in `fp32`. Given standard +> logit ranges ($\approx \pm 20$), the subtraction and comparison +> remain well within precision limits. No additional guards beyond +> the `min_ratio > 0` check are required. +> +> **2. Computational Complexity (§3.1)** +> Correct. Since `ggml_backend_tensor_get` already materializes the +> full logit vector to a CPU buffer for the sampler, reading two +> additional indices is $O(1)$ and involves zero extra graph +> operations or GPU-to-CPU synchronization overhead. Latency impact +> is negligible. +> +> **3. Multi-token Semantics (§3.5)** +> Peeking only $t_0$ is a valid engineering trade-off. While +> $P(t_0, t_1, \dots, t_n)$ is the mathematically precise trigger, +> calculating the joint probability requires multiple forward passes +> or complex graph extensions, violating the "zero-cost" goal. As +> long as the model's transition $P(t_1|t_0)$ is highly deterministic +> for the `` sequence (which is standard for special tokens), +> the $t_0$ heuristic is sufficient. +> +> **4. Loop Bookkeeping (§4, §8.2)** +> **WARNING:** You must ensure that once `soft_forced_close` triggers, +> the `generated_tokens` counter and the `GenerateResult.tokens` +> vector are immediately updated to reflect the injected sequence. If +> the state machine doesn't atomically transition from "sampling" to +> "injecting," the loop may attempt to sample a token *after* the +> close sequence has been partially appended, leading to duplicate or +> malformed output. +> +> **5. Per-request Override Logic (§6.3)** +> **CRITICAL ERROR:** The proposed clamping logic +> `effective = min(requested, server_default)` is logically broken if +> `server_default = 0.0`. If the server has soft-close disabled +> ($0.0$), a client requesting $0.1$ will result in +> `min(0.1, 0.0) = 0.0`, making it impossible for a client to opt-in. 
+> * **Fix:** Use a sentinel or explicit check: +> `effective = (server_default == 0.0f) ? requested : std::min(requested, server_default);`. +> (Note: This assumes "tightening" means a *smaller* ratio, i.e., +> more aggressive). +> +> **Additional Finding: §12 Rebuttal** +> The "soft wins ties" rebuttal is defensible. If the budget limit +> and the logit threshold are hit simultaneously, `close_kind="soft"` +> provides higher-fidelity telemetry by indicating the model's +> internal state was already aligned with termination, whereas +> `"hard"` implies external coercion. This is better for debugging +> model intent. +> +> **PROCEED WITH CHANGES** + +### 11.1 Disposition + +- **Q1 (numerical soundness).** Accepted as written. §3.4 already + covers the bounded logit range. No change. +- **Q2 (graph addition unnecessary).** Accepted as written. §3.1 + + §3.6 confirmed by reviewer. No change. +- **Q3 (multi-token peek).** Accepted as written. §3.5 stands. We + add an explicit note: this design assumes `P(t_1|t_0)` is + near-deterministic for the `` sequence — true today for + Qwen3.5/3.6 (single id) and Laguna (special-token sequence). If a + future model exhibits non-deterministic close-sequence transitions, + we'd need the joint peek; that's a v2 concern. No code change. +- **Q4 (loop bookkeeping WARNING).** Addressed by the design as + specified. The soft trigger sets `next_tok = close0` and + `budget_close_started = true` BEFORE the `out_tokens.push_back(next_tok)` + call at qwen35_backend.cpp:1033 — i.e. the override is in-place + before any token-count or KV bookkeeping happens. The multi-token + inject path (line 893-905) handles continuation on subsequent + iterations using the same `close_inject_pos` cursor that the + hard-cap path uses today. 
We will add an explicit unit test + (§8.2 case "(soft, single-token close)" and "(soft, multi-token + close)") that walks the state machine through one close trigger + and asserts: (a) the override token replaces `chosen_tok` before + `push_back` runs; (b) on subsequent steps the loop continues + injecting the rest of the sequence, never sampling; (c) the + `generated` counter increments once per injected token (same as + for a sampled token); (d) `result.tokens.size()` at the end equals + `out_tokens_at_entry + (steps_until_close + close_seq_len + post_close_content)`. + Wording in §4 sharpened to call out the atomic transition. +- **Q5 (per-request override clamp — CRITICAL).** **Accepted as + bug.** Reviewer is right. Original plan §6.3 broke the opt-in case + when server_default=0 (disabled). Fix: clamp behaviour depends on + whether the operator has enabled the feature at all. New rule — + per §6.3 update below: + + ``` + if (server_default == 0.0f) { + // Operator left the feature disabled: the per-request override + // is ignored, so a client cannot activate soft-close by + // surprise. To allow per-request opt-in, the operator sets a + // non-zero ceiling, e.g. `--think-soft-close-min-ratio 1.0` + // (an effectively-disabled default under which clients may + // ask for anything ≤ 1.0). + effective = 0.0f; // request silently ignored when disabled + } else { + effective = std::min(requested, server_default); + } + ``` + + After reflection, the cleanest policy is: **`0.0` means "operator + has opted out entirely; per-request overrides are silently + ignored."** This avoids surprise activation. 
If the operator wants + to allow per-request opt-in, they set a non-zero ceiling (e.g. + `--think-soft-close-min-ratio 0.5`) and the per-request value is + clamped to it. This matches the posture of `--hard-limit-reply-budget`: + zero means feature off, non-zero means feature ceiling. + + Spec §6.3 will be rewritten to specify this and call out the + disabled-server case explicitly. A unit test in §8.1 covers it: + + - **(disabled server, opt-in request).** `server_default=0`, + `requested=0.1` → effective `0.0` (soft path disabled, no fire). + - **(enabled server, tighter request).** `server_default=0.5`, + `requested=0.1` → effective `0.1` (soft fires at the more + aggressive client threshold). + - **(enabled server, looser request).** `server_default=0.1`, + `requested=0.5` → effective `0.1` (server ceiling wins; soft + fires at the server's `0.1`, not the requested `0.5`). +- **§12 tie-breaking.** Reviewer accepted soft-wins. No change. + +The plan §6.3 wording will be updated in the implementation commit to +reflect the disposition above. This §11.1 disposition is the source +of truth. + +## 12. Rebuttal: precedence when soft + hard both could fire same step + +The brief states: *"`close_kind="hard"` takes precedence over +`close_kind="soft"` if both could fire on the same step."* + +We propose the opposite: **soft wins ties.** Rationale: + +- The soft path's threshold-clear signals "the model is willing to + close"; it is informative about the model's own preference. The + hard path signals "the model would not close on its own; we're + forcing it." Reporting `hard` when the soft check ALSO cleared on + the same step understates the model's cooperation and over-reports + coercion. +- The dial is operator-tunable. If an operator picks a relatively + strict ratio (e.g. 0.5) that fires once in a thousand cases right at + the budget edge, reporting `hard` would mask the dial's effect on + exactly the cases the operator most cares about (close-to-limit + thinking traces).
+- The implementation is simpler: the soft check runs first naturally + (chronologically, since it doesn't depend on `remaining`), so "first + setter wins" is the path of least resistance and the most legible + flow. + +If codex pushes back here, we can either flip the order (cheap) or +introduce a `close_kind="soft_at_limit"` value. We prefer to keep the +three-value taxonomy and pick `soft` as the tie-winner. + +## 13. Out of scope + +- **Spec-decode soft peek.** Documented in §7. Pure AR only in v1. +- **Multi-token joint probability.** Single first-id peek only. + Documented in §3.5. +- **Gemma4 / Laguna soft-close.** The same comparator design should + port cleanly (their AR loops also materialize full logits on CPU each + step), but v1 ships Qwen3.5/3.6 only. Tracked as a follow-up. +- **lucebox Python config + autotune sweep bracket.** Belongs in the + lucebox Python CLI repo. Tracked as a follow-up. +- **Sweep methodology / empirical recommended dial values.** + Out of scope. Follow-up doc once a sweep runs. +- **Docker image rebuild + live-service verification.** Explicit + hard prohibition; deferred to a follow-up that bundles the image. + +## 14. Empirical motivation (PR body) + +The hard-cap mechanism today, on Gemma 4 26B, decodes at +~30 tok/s through up to 15 488 phase-1 tokens (≈8 minutes wall-clock +per case). Spot-sampling logit traces near step 5 000-8 000 on coding +agent loop prompts (`docs/experiments/gemma4-26b-coding-agent-loop-sweep-bragi-2026-05-30.md`) +shows the close-token logit hovering at 30-60 % of the chosen-token +logit for long stretches before the actual `</think>` emission, i.e. +the model is *near* ready. A soft threshold of `0.1`-`0.2` could let +hundreds of cases close 30-50 % earlier on those prompts, reclaiming +2-4 minutes per case with little expected quality loss (the model was +already close to closing). The sweep PR will quantify the actual dollar +(token) savings against an unchanged quality probe.
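To make the `0.1`-`0.2` dial values concrete, the probability-ratio test can be restated as a logit-gap test, since `prob[a]/prob[b] = exp(logit[a] - logit[b])`. The following is a minimal sketch rather than code from this PR: `soft_close_log_threshold` and `soft_close_would_fire` are hypothetical names, and the scalar signature is an assumption made for illustration. The PR's own comparator takes a full logits row plus token ids and also declines to fire when the chosen token is already the close token, which this sketch omits.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not part of the PR): the logit gap that a given
// probability ratio corresponds to. Because
//   prob[close] / prob[chosen] = exp(logit[close] - logit[chosen]),
// the test "ratio >= min_ratio" is the same as "gap >= log(min_ratio)".
inline float soft_close_log_threshold(float min_ratio) {
    assert(min_ratio > 0.0f && min_ratio <= 1.0f);  // dial range (0, 1]
    return std::log(min_ratio);
}

// Hypothetical scalar restatement of the comparator: fire when the
// close-token logit trails the chosen-token logit by no more than
// |log(min_ratio)| nats. A dial of 0 (or below) means disabled.
inline bool soft_close_would_fire(float logit_close,
                                  float logit_chosen,
                                  float min_ratio) {
    if (min_ratio <= 0.0f) return false;  // disabled: never fire
    return (logit_close - logit_chosen) >= soft_close_log_threshold(min_ratio);
}
```

Under these assumptions a dial of `0.1` tolerates a gap of about 2.30 nats, `0.2` about 1.61 nats, and `0.5` about 0.69 nats, so lower dial values accept a larger deficit and fire more readily. For example, a close-token logit of 10.0 against a chosen-token logit of 12.0 (a gap of -2.0) would fire at `0.1` but not at `0.2`.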
diff --git a/docs/specs/thinking-budget.md b/docs/specs/thinking-budget.md index 5ebc731b..54d0979c 100644 --- a/docs/specs/thinking-budget.md +++ b/docs/specs/thinking-budget.md @@ -538,13 +538,44 @@ The current taxonomy is: | Value | Meaning | |---|---| | `natural` | The model emitted `</think>` on its own, either before reaching the phase-1 cap or before Level 2 had to force-close. | -| `hard` | The phase-1 cap was reached without a model-emitted `</think>`. Either Level 2 force-closed the block in-loop (preserving KV) or Level 1 ran the phase-2 reprompt. | +| `soft` | The soft-close logit-ratio peek (Level 2.5) fired before the hard cap: `prob[</think>] / prob[chosen_tok]` cleared the operator-configured `soft_close_min_ratio` threshold, and the AR loop injected `</think>` while the model was already "near" closing. Indicates voluntary cooperation: the model would have closed soon anyway; we just hurried it along to reclaim tokens. Currently Qwen3.5/3.6 only. | +| `hard` | The phase-1 cap was reached without a model-emitted `</think>` and without the soft path triggering. Either Level 2 force-closed the block in-loop (preserving KV) or Level 1 ran the phase-2 reprompt. | + +When both `soft` and `hard` could fire on the same AR step (the +soft threshold cleared at exactly the budget-edge step), `soft` +wins: the soft trigger carries more information (the model agreed +it was time) than the hard trigger (which only reports coercion). +See `docs/experiments/soft-close-thinking-termination-plan.md` §4 + +§12 for the design rationale. + +Soft-close is enabled by the operator via the CLI flag +`--think-soft-close-min-ratio `. Default `0.0` keeps the legacy +two-value taxonomy (`natural` / `hard`); any positive value +activates the third. The dial is a probability ratio in `[0, 1]`; +lower values fire more readily: + +| `min_ratio` | Behaviour | +|---|---| +| `0.0` | Disabled. Soft path inert; per-request overrides silently ignored. | +| `0.05`–`0.2` | Permissive: fires once `</think>` is within 20×–5× of the chosen token's probability. Recommended starting range. | +| `0.5` | Stricter: fires only when `</think>` has at least half the probability of the chosen token. | +| `1.0` | Strict: fires only when `</think>` IS the most-likely token. Useful as a safety check. | + +Per-request override (Anthropic envelope, see §4.1): + +```jsonc +{ + "thinking": { + "type": "enabled", + "soft_close_min_ratio": 0.1 + } +} +``` -A third value `soft` is reserved for a future voluntary-close -mechanism (logit-biasing the model toward `</think>` as the cap -approaches, before forcing it). Reserved so consumers can switch on -the value without an exhaustive-match warning when a future server -version adds it; not emitted today. +The per-request value clamps to `min(requested, server_default)`: +clients can tighten (lower the threshold, fire more aggressively) +but not loosen (raise it above the operator's ceiling). When the +server has the dial disabled (`0.0`), per-request overrides are +silently ignored: the feature is operator-policy gated. ## 8. Streaming @@ -564,9 +595,18 @@ in the terminal `message_delta` event for Anthropic. server-configured ceiling, never looser. Allowing full override would re-create the silent-truncation footgun of middleboxes that drop unknown fields. -- **Soft close-kind / soft-budget hint.** The mechanism (logit bias - to nudge `</think>` selection before the hard cap) is sketched in - §7 but not specified. +- **Spec-decode soft-close peek.** Soft-close fires inside the AR + loop. When spec-decode is in use, the close still triggers at the + spec-decode → AR tail-off boundary (slightly later than pure-AR + mode); the verify/accept inner loop does not run the comparator. + Gemma 4 and Laguna are pure-AR; this only matters for Qwen3.5/3.6 + with a draft model. +- **Multi-token close joint probability.** When `</think>` tokenizes + to multiple ids, the soft-close comparator peeks only the FIRST + id's logit (the existing multi-token inject machinery drives the + remainder of the sequence on subsequent steps).
The joint + `P(t_0, t_1, …)` peek is left to a v2 if false-positive rates + warrant it. - **Per-token close-info metadata.** The upstream reference exposes `(token_index, remaining_budget, rank)` for the close event. The current `finish_details` reports aggregate counts only. diff --git a/server/src/common/model_backend.h b/server/src/common/model_backend.h index de439092..836b627b 100644 --- a/server/src/common/model_backend.h +++ b/server/src/common/model_backend.h @@ -10,6 +10,7 @@ #pragma once +#include #include #include #include @@ -71,15 +72,91 @@ struct DaemonIO { // decode) — the perf trade-off is acceptable since this only kicks in // for thinking-enabled requests. Spec-decode integration is a follow-up. struct BudgetHook { - // Multi-token close sequence injected when `(n_gen - committed)` - // drops to `hard_limit_remaining`. For Qwen3.x this is the - // canonical "Considering the limited time..." summarize-and-stop - // lead-in (tokenized at server startup); for non-qwen arches it's - // a single close-tag token. Empty = hook disabled. + // Inject sequence written when the hard cap fires OR when soft-close + // fires. This is the verbatim tokenization of the model card's + // `thinking_terminator_hint` (e.g. for Qwen3.6 the lead-in + // "Considering the limited time by the user, ... \n\n"). + // May be many tokens long; the first element is what the AR loop + // writes on the firing step, with the rest streamed out on + // subsequent steps. Empty = disabled. std::vector close_token_ids; + // Short PROBE sequence used by the soft-close logit-ratio peek. + // Conceptually this is the tokenization of just the close MARKER + // (e.g. `` — a single token id 248069 on Qwen3.6) rather + // than the full inject directive above. 
Splitting probe-vs-inject + // matters because the inject sequence for trained-hint models + // starts with a content token like "Considering" whose logit is + // 19-35 nats below the chosen token at every step, masking the + // close-marker's true probability and preventing soft-close from + // ever firing. + // When empty, the soft-close peek falls back to + // `close_token_ids.front()` (legacy behavior — kept so models that + // haven't been updated keep working identically to before the split). + std::vector soft_close_probe_ids; int hard_limit_remaining = 0; + // Soft-close (Level 2 voluntary). When > 0, at each AR step the + // loop compares the probe-token logit against the chosen-token + // logit; if `prob[probe[0]] / prob[chosen] >= soft_close_min_ratio` + // (equivalently `logit[probe[0]] - logit[chosen] >= log(min_ratio)`), + // the inject sequence (close_token_ids) is written BEFORE the hard + // limit is reached. 0.0 = disabled (default); 1.0 = fire only when + // the probe token is already the most-likely token; lower values = + // fire more aggressively. See docs/specs/thinking-budget.md §7 and + // docs/experiments/soft-close-thinking-termination-plan.md. + float soft_close_min_ratio = 0.0f; + // Minimum thinking tokens before soft-close is allowed to fire. + // Soft-close peek runs on every AR step but the fire decision is + // gated by this floor — protects against premature termination on + // prompts where the close-marker logit briefly spikes mid-thought. + // 0 = floor disabled (default). Per empirical trajectory data on + // qwen3.6-27b (5 diverse prompts), only becomes + // argmax-competitive at 66-94% of natural reasoning length — so a + // floor in the 64-256 range is the typical operating point. + int soft_close_min_tokens = 0; + // Diagnostic: when true, emit one stderr line per AR step inside the + // thinking phase with (committed, chosen_tok, logit[probe0], + // logit[chosen], diff). 
Used to record the close-vs-chosen logit + // trajectory across a full thinking run so a sliding-threshold curve + // can be designed from empirical data rather than guessed. Zero cost + // when off. See server_main.cpp --debug-thinking-logits. + bool debug_thinking_logits = false; + + // Probe token id used by the soft-close peek. Returns the first + // element of soft_close_probe_ids when set, otherwise falls back to + // close_token_ids.front() (legacy behavior). Callers must guard + // against an empty hook before calling this. + int32_t soft_close_probe_token() const { + if (!soft_close_probe_ids.empty()) return soft_close_probe_ids.front(); + return close_token_ids.front(); + } }; +namespace soft_close { + +// Returns true when the soft-close comparator would fire on this AR +// step. Side-effect free; safe to call from unit tests. +// +// Fast path: returns false in O(1) when min_ratio <= 0 (the disabled +// default). When the model has already chosen the close token on its +// own, also returns false — the natural-close path handles that. +// +// Math: `prob[i]/prob[j] = exp(logit[i] - logit[j])`, so +// `prob[close]/prob[chosen] >= min_ratio` ⟺ +// `logit[close] - logit[chosen] >= log(min_ratio)`. We compare on +// logits to avoid `exp()` and full-softmax cost; this is numerically +// stable in fp32 for typical LLM logit ranges (~±20). +inline bool should_fire(const float * logits, + int32_t chosen_tok, + int32_t close0_tok, + float min_ratio) { + if (min_ratio <= 0.0f) return false; + if (chosen_tok == close0_tok) return false; + const float log_ratio = std::log(min_ratio); + return (logits[close0_tok] - logits[chosen_tok]) >= log_ratio; +} + +} // namespace soft_close + struct GenerateRequest { std::vector prompt; int n_gen = 0; @@ -121,6 +198,13 @@ struct GenerateResult { // stream and grepping for "" cannot distinguish the two // (the injected close decodes identically). 
bool budget_forced_close = false; + // True when the soft-close path (logit-ratio peek) injected the + // close sequence in this generation. Mutually exclusive + // with budget_forced_close: when both could fire on the same step, + // soft wins and budget_forced_close stays false. The server uses + // this to attribute close_kind="soft" (vs "hard"). See + // docs/specs/thinking-budget.md §7. + bool soft_forced_close = false; // True iff the AR decode loop's post-close watchdog detected an n-gram // repetition loop and broke out early. Caller surfaces this so clients // can mark the answer as unreliable rather than treating the @@ -212,6 +296,8 @@ struct ModelBackend { retry.spec_decode_ran = first.spec_decode_ran || retry.spec_decode_ran; retry.budget_forced_close = first.budget_forced_close || retry.budget_forced_close; + retry.soft_forced_close = + first.soft_forced_close || retry.soft_forced_close; retry.degenerate_decode_close = first.degenerate_decode_close || retry.degenerate_decode_close; return retry; diff --git a/server/src/qwen35/qwen35_backend.cpp b/server/src/qwen35/qwen35_backend.cpp index e3b161d8..867d5d22 100644 --- a/server/src/qwen35/qwen35_backend.cpp +++ b/server/src/qwen35/qwen35_backend.cpp @@ -582,14 +582,16 @@ GenerateResult Qwen35Backend::generate(const GenerateRequest & req, decode_ok = do_ar_decode(committed, req.n_gen, result.tokens, out_io, req.budget_hook, &result.budget_forced_close, - &result.degenerate_decode_close); + &result.degenerate_decode_close, + &result.soft_forced_close); out_io.emit(-1); } else { decode_ok = do_spec_decode(committed, req.n_gen, result.tokens, out_io, result.accept_rate, result.spec_decode_ran, req.hint_tokens, &req.budget_hook, &result.budget_forced_close, - &result.degenerate_decode_close); + &result.degenerate_decode_close, + &result.soft_forced_close); } if (!decode_ok) { result.error = "decode"; @@ -683,14 +685,16 @@ GenerateResult Qwen35Backend::restore_and_generate(int slot, decode_ok = 
do_ar_decode(committed, req.n_gen, result.tokens, out_io, req.budget_hook, &result.budget_forced_close, - &result.degenerate_decode_close); + &result.degenerate_decode_close, + &result.soft_forced_close); out_io.emit(-1); } else { decode_ok = do_spec_decode(committed, req.n_gen, result.tokens, out_io, result.accept_rate, result.spec_decode_ran, req.hint_tokens, &req.budget_hook, &result.budget_forced_close, - &result.degenerate_decode_close); + &result.degenerate_decode_close, + &result.soft_forced_close); } if (!decode_ok) { result.error = "decode"; @@ -856,7 +860,8 @@ bool Qwen35Backend::do_ar_decode(int committed, int n_gen, const DaemonIO & io, const BudgetHook & budget_hook, bool * forced_close_out, - bool * degenerate_close_out) { + bool * degenerate_close_out, + bool * soft_forced_close_out) { // Budget hook state. // - budget_close_started: true once we've begun injecting the close // sequence. Prevents re-triggering on continued forward generation. @@ -938,6 +943,79 @@ bool Qwen35Backend::do_ar_decode(int committed, int n_gen, if (forced_close_out) *forced_close_out = true; } }; + + // Soft-close (logit-ratio peek). Fires BEFORE the hard-cap check so a + // soft trigger on the same step as a hard trigger is reported as + // close_kind="soft" (the more informative signal — the model agreed it + // was time to close, even if the budget was also about to run out). + // Once this lambda starts the close sequence, the maybe_force_close + // continuation branch handles steps 2..N of a multi-token close. + // Zero-cost-when-disabled invariant: when soft_close_min_ratio == 0 + // the outer guard short-circuits and we do not even read logits_buf. + // See docs/experiments/soft-close-thinking-termination-plan.md §3. 
+ auto maybe_soft_close = [&](int32_t & tok, + const float * logits_row, + int committed_now) { + if (budget_close_started) return; // sequence already in progress + if (budget_hook.close_token_ids.empty()) return; // hook disabled + + // PROBE vs INJECT split: + // - probe0 is the token id we PEEK to decide whether to fire + // (the short close marker, e.g. `` = 248069 on Qwen3.6). + // - inject0 / inject sequence is what we WRITE when it fires + // (the full trained-hint directive). + // Fall back to close_token_ids.front() when no separate probe is + // configured (legacy / single-token-marker models). See + // BudgetHook::soft_close_probe_token(). + const int32_t probe0 = budget_hook.soft_close_probe_token(); + const int32_t inject0 = budget_hook.close_token_ids.front(); + + // Diagnostic trajectory log. Fires every AR step (gated on the + // operator flag) regardless of soft_close_min_ratio, so we can + // record close-vs-chosen logit curves even when the dial is off. + // close0 reports the PROBE token id (what the comparator uses). + if (budget_hook.debug_thinking_logits) { + const int generated = committed_now - committed_at_entry; + const float diff = logits_row[probe0] - logits_row[tok]; + const float ratio = (diff > 50.0f) ? std::exp(50.0f) : std::exp(diff); + std::fprintf(stderr, + "[soft-trace] step=%d committed=%d chosen=%d close0=%d " + "logit_close=%.4f logit_chosen=%.4f diff=%.4f prob_ratio=%.6g\n", + generated, committed_now, tok, probe0, + logits_row[probe0], logits_row[tok], diff, ratio); + } + + if (budget_hook.soft_close_min_ratio <= 0.0f) return; // dial disabled + + // Minimum-thinking-tokens floor: false-positive guard. When set, + // suppress fire until the segment has committed at least this + // many tokens. 0 = floor disabled (default). 
+ const int generated_so_far = committed_now - committed_at_entry; + if (generated_so_far < budget_hook.soft_close_min_tokens) return; + + if (!soft_close::should_fire(logits_row, tok, probe0, + budget_hook.soft_close_min_ratio)) { + return; + } + const int generated = committed_now - committed_at_entry; + const int remaining = n_gen - generated; + std::fprintf(stderr, + "[budget-hook] soft-close at committed=%d/%d (remaining=%d, " + "min_ratio=%.4f, logit[probe0=%d]=%.3f logit[chosen]=%.3f " + "diff=%.3f log_ratio=%.3f): overriding sampled token %d with " + "inject[0]=%d (inject seq len %zu)\n", + committed_now, n_gen, remaining, + budget_hook.soft_close_min_ratio, + probe0, logits_row[probe0], logits_row[tok], + logits_row[probe0] - logits_row[tok], + std::log(budget_hook.soft_close_min_ratio), + tok, inject0, budget_hook.close_token_ids.size()); + tok = inject0; + budget_close_started = true; + close_inject_pos = 1; + if (soft_forced_close_out) *soft_forced_close_out = true; + }; + if (n_gen <= 0) return true; auto t_dec0_ar = std::chrono::steady_clock::now(); @@ -964,12 +1042,32 @@ bool Qwen35Backend::do_ar_decode(int committed, int n_gen, const int initial_emitted = out_tokens.empty() ? 1 : 0; if (initial_emitted == 1) { int32_t first_tok; - if (sampler_.needs_logit_processing()) { - if (!prefill_last_logits_valid_) return false; - ggml_backend_tensor_get(sg_.logits, logits_buf.data(), prefill_last_logits_offset_, - sizeof(float) * vocab); - first_tok = sample_logits(logits_buf.data(), vocab, sampler_, - out_tokens, sampler_rng_); + // Soft-close needs the logits row for the comparator; greedy + // (argmax-only) path normally skips the logits read. Pull the + // prefill's last logits row to CPU when soft is enabled so the + // first AR step participates in the comparator. Zero-cost when + // disabled: only fetched when soft_close_min_ratio > 0. 
+ const bool need_logits = + sampler_.needs_logit_processing() || + budget_hook.soft_close_min_ratio > 0.0f; + if (need_logits) { + if (!prefill_last_logits_valid_) { + if (sampler_.needs_logit_processing()) return false; + // Soft-close wanted logits but prefill didn't keep them. + // Skip soft check on this single token rather than error. + first_tok = cache_.last_tok; + } else { + ggml_backend_tensor_get(sg_.logits, logits_buf.data(), + prefill_last_logits_offset_, + sizeof(float) * vocab); + if (sampler_.needs_logit_processing()) { + first_tok = sample_logits(logits_buf.data(), vocab, sampler_, + out_tokens, sampler_rng_); + } else { + first_tok = cache_.last_tok; + } + maybe_soft_close(first_tok, logits_buf.data(), committed); + } } else { first_tok = cache_.last_tok; } @@ -1020,6 +1118,13 @@ bool Qwen35Backend::do_ar_decode(int committed, int n_gen, } } + // Soft check runs BEFORE hard-cap check. If soft fires, it sets + // budget_close_started=true so maybe_force_close's continuation + // branch handles steps 2..N of a multi-token close (and the + // remaining-check branch is skipped because the sequence is + // already started). If soft does not fire (disabled or threshold + // not met), maybe_force_close proceeds as today. + maybe_soft_close(next_tok, logits_buf.data(), committed); maybe_force_close(next_tok, committed); out_tokens.push_back(next_tok); @@ -1122,7 +1227,8 @@ bool Qwen35Backend::do_spec_decode(int committed, int n_gen, const std::vector * hint_tokens, const BudgetHook * budget_hook, bool * forced_close_out, - bool * degenerate_close_out) { + bool * degenerate_close_out, + bool * soft_forced_close_out) { out_accept_rate = 0.0f; out_spec_ran = false; const int hidden = w_.n_embd; @@ -1149,10 +1255,13 @@ bool Qwen35Backend::do_spec_decode(int committed, int n_gen, if (!can_spec) { // AR fallback consumes the final prefill position itself, then advances // one token at a time. 
Pass the budget hook through so force-close - // still fires when spec-decode is unavailable. + // still fires when spec-decode is unavailable. Soft-close pointer + // also forwards so close_kind="soft" can be attributed correctly + // even on the AR fallback path. bool ok = do_ar_decode(committed, n_gen, out_tokens, io, budget_hook ? *budget_hook : BudgetHook{}, - forced_close_out, degenerate_close_out); + forced_close_out, degenerate_close_out, + soft_forced_close_out); io.emit(-1); return ok; } @@ -1222,7 +1331,8 @@ bool Qwen35Backend::do_spec_decode(int committed, int n_gen, int ar_n_gen = need_commit_budget; bool ok = do_ar_decode(committed, ar_n_gen, out_tokens, io, tail_hook, forced_close_out, - degenerate_close_out); + degenerate_close_out, + soft_forced_close_out); io.emit(-1); return ok; } diff --git a/server/src/qwen35/qwen35_backend.h b/server/src/qwen35/qwen35_backend.h index fb9b8f60..6a8e967f 100644 --- a/server/src/qwen35/qwen35_backend.h +++ b/server/src/qwen35/qwen35_backend.h @@ -229,7 +229,8 @@ class Qwen35Backend : public ModelBackend { const std::vector * hint_tokens = nullptr, const BudgetHook * budget_hook = nullptr, bool * forced_close_out = nullptr, - bool * degenerate_close_out = nullptr); + bool * degenerate_close_out = nullptr, + bool * soft_forced_close_out = nullptr); // AR decode fallback (no draft model or sampling mode). 
// budget_hook (when close_token_ids is non-empty) overrides the next @@ -249,7 +250,8 @@ class Qwen35Backend : public ModelBackend { const DaemonIO & io, const BudgetHook & budget_hook = {}, bool * forced_close_out = nullptr, - bool * degenerate_close_out = nullptr); + bool * degenerate_close_out = nullptr, + bool * soft_forced_close_out = nullptr); bool sync_remote_draft_features(int start_pos, int n_tokens); diff --git a/server/src/server/http_server.cpp b/server/src/server/http_server.cpp index 362c2f4d..f19967b4 100644 --- a/server/src/server/http_server.cpp +++ b/server/src/server/http_server.cpp @@ -940,6 +940,38 @@ bool HttpServer::route_request(int fd, const HttpRequest & hr) { if (th.contains("reply_budget") && th["reply_budget"].is_number_integer()) { request_reply_budget = th["reply_budget"].get(); } + // Soft-close per-request override (plan §6.3). Honored only + // when the operator has soft-close enabled; clamped against + // the server ceiling so clients can tighten but not loosen. + // Applied after clamping logic below. + if (th.contains("soft_close_min_ratio") && + th["soft_close_min_ratio"].is_number()) + { + float requested = th["soft_close_min_ratio"].get(); + if (requested < 0.0f) requested = 0.0f; + if (requested > 1.0f) requested = 1.0f; + if (config_.soft_close_min_ratio <= 0.0f) { + // Operator has disabled soft-close at the server + // level — silently ignore the per-request override. + // Logged at info so operators can see clients + // attempting to opt in. 
+ std::fprintf(stderr, + "[server] thinking.soft_close_min_ratio=%.4f " + "ignored: server has soft-close disabled " + "(config_.soft_close_min_ratio=0)\n", + requested); + } else { + float eff = std::min(requested, + config_.soft_close_min_ratio); + if (requested > config_.soft_close_min_ratio) { + std::fprintf(stderr, + "[server] thinking.soft_close_min_ratio=%.4f " + "clamped to soft_close_min_ratio=%.4f\n", + requested, config_.soft_close_min_ratio); + } + req.per_req_soft_close_min_ratio = eff; + } + } } // Direct: chat_template_kwargs.enable_thinking if (body.contains("chat_template_kwargs")) { @@ -1318,7 +1350,32 @@ void HttpServer::worker_loop() { ? req.per_req_reply_budget : config_.hard_limit_reply_budget; gen_req.budget_hook.close_token_ids = config_.think_close_token_ids; + gen_req.budget_hook.soft_close_probe_ids = + config_.think_close_probe_token_ids; gen_req.budget_hook.hard_limit_remaining = eff_reply_budget; + + // Soft-close min-ratio. Operator-gated: only forwarded when + // config_.soft_close_min_ratio > 0. Per-request value (if + // set and operator enabled) is already clamped to the + // server ceiling in the request parser. See plan §6.3. + if (config_.soft_close_min_ratio > 0.0f) { + gen_req.budget_hook.soft_close_min_ratio = + (req.per_req_soft_close_min_ratio >= 0.0f) + ? req.per_req_soft_close_min_ratio + : config_.soft_close_min_ratio; + } + + // Minimum-thinking-tokens floor: false-positive guard for + // soft-close. Server-policy only (no per-request override). + gen_req.budget_hook.soft_close_min_tokens = + config_.soft_close_min_tokens; + + // Diagnostic trajectory log — operator dial only. Carried + // through the BudgetHook so the AR loop can emit one line + // per thinking step regardless of whether soft-close is + // armed. See model_backend.h BudgetHook::debug_thinking_logits. 
+ gen_req.budget_hook.debug_thinking_logits = + config_.debug_thinking_logits; } // Tool call hint generation: pre-tokenize predictable structural tokens @@ -1588,15 +1645,25 @@ void HttpServer::worker_loop() { } } - // close_kind reflects the Level 2 BudgetHook outcome: "hard" when - // the backend's AR/spec decode injected the close-token sequence - // at the budget boundary, "natural" when the model self-closed - // (or the request never opted in). Emitted as part of - // finish_details for thinking-budget callers. - std::string close_kind = - (req.thinking_opt_in && result.budget_forced_close) - ? "hard" - : "natural"; + // close_kind reflects the Level 2 BudgetHook outcome: + // "natural" — the model emitted on its own (or the + // request never opted in to the envelope). + // "soft" — the soft-close logit-ratio peek (Level 2.5) + // fired before the hard cap, indicating the + // model was willing to close. See + // docs/specs/thinking-budget.md §7. + // "hard" — the budget edge was reached without the model + // or the soft path agreeing; the AR loop forced + // in. Original Level 2 behavior. + // Soft wins ties against hard on the same step (see plan §4 + + // §12) — soft_forced_close and budget_forced_close are mutually + // exclusive per AR-loop step. Emitted as part of finish_details + // for thinking-budget callers. + std::string close_kind = "natural"; + if (req.thinking_opt_in) { + if (result.soft_forced_close) close_kind = "soft"; + else if (result.budget_forced_close) close_kind = "hard"; + } // Finalize. // Per-request wall-clock timings forwarded to the response's diff --git a/server/src/server/http_server.h b/server/src/server/http_server.h index 999eb5d9..1f363b6d 100644 --- a/server/src/server/http_server.h +++ b/server/src/server/http_server.h @@ -88,6 +88,40 @@ struct ServerConfig { // forwards into GenerateRequest.budget_hook when thinking is opted in. 
std::vector think_close_token_ids; + // Token IDs resolved at server startup for the soft-close PROBE. + // Tokenization of just the close MARKER substring (e.g. ``) + // — the bytes the soft-close logit-ratio peek compares against the + // chosen-token logit at each AR step. Conceptually separate from + // the inject sequence above: probing on the full directive's first + // token (typically a content lead-in like "Considering") forces + // soft-close to read a perpetually-low logit and never fire. + // Empty = legacy fallback: peek close_token_ids.front(). + std::vector think_close_probe_token_ids; + + // Soft-close min-ratio default. When > 0 AND a request opts into + // thinking, the AR loop force-emits early once + // prob[] / prob[chosen] >= this ratio. 0.0 = soft-close + // entirely disabled at the operator level; per-request overrides + // are silently ignored when this is zero (operator-policy gate). + // Range [0.0, 1.0]. See docs/specs/thinking-budget.md §7 and + // docs/experiments/soft-close-thinking-termination-plan.md. + float soft_close_min_ratio = 0.0f; + + // Minimum thinking tokens before soft-close is allowed to fire. The + // soft-close peek still runs every AR step (so trajectory logs + // remain complete), but the fire decision is suppressed until this + // many thinking tokens have been committed. False-positive guard. + // 0 = disabled (default — pre-floor behavior). + int soft_close_min_tokens = 0; + + // Diagnostic: when true, the AR loop emits one stderr line per + // thinking-phase step with the close-vs-chosen logit values, so a + // sliding-ratio curve can be tuned from real trajectory data. + // Operator-only flag; per-request overrides not exposed because + // the stderr volume is heavy. Plumbed through to + // BudgetHook::debug_thinking_logits when the budget hook is wired. + bool debug_thinking_logits = false; + // Phase-1 budgets per `reasoning.effort` tier (spec §4.2). 
Selected // by the request parser when `reasoning.effort` is present. Each // value is itself capped at `think_max_tokens` at startup. @@ -196,6 +230,14 @@ struct ParsedRequest { // hard_limit_reply_budget. Values are already clamped to those ceilings. int per_req_phase1_cap = -1; int per_req_reply_budget = -1; + // Per-request soft-close min-ratio override. -1.0 = not set (use + // server default). Honored only when the server has soft-close + // enabled (config_.soft_close_min_ratio > 0); when the operator has + // disabled soft-close, this is silently ignored. When honored, + // clamps to min(requested, server_default): clients can tighten + // (lower the threshold) but never loosen (raise it). See spec §4.4 + // and plan §6.3. + float per_req_soft_close_min_ratio = -1.0f; // Stop sequences (OpenAI "stop" + Anthropic "stop_sequences") std::vector stop_sequences; // Bandit: per-session adaptive keep_ratio opt-in diff --git a/server/src/server/server_main.cpp b/server/src/server/server_main.cpp index 0f31739e..2b127722 100644 --- a/server/src/server/server_main.cpp +++ b/server/src/server/server_main.cpp @@ -195,6 +195,28 @@ static void print_usage(const char * prog) { " --reasoning-effort-max Phase-1 budget when request asks effort=max\n" " Defaults come from share/model_cards/.json;\n" " see docs/specs/thinking-budget.md §3.\n" + " --think-soft-close-min-ratio \n" + " Soft-close dial. When > 0 AND a request opts\n" + " into thinking, the AR loop force-emits </think>\n" + " early once prob[</think>]/prob[chosen] >= ratio,\n" + " reclaiming tokens the model would have spent\n" + " running to the hard cap. Range [0.0, 1.0]:\n" + " 0.0=disabled (default), 0.1=fire when within\n" + " 10x of argmax (permissive), 0.5=fire at half-prob\n" + " (stricter), 1.0=fire only when close is\n" + " argmax. See docs/specs/thinking-budget.md §7.\n" + " --think-soft-close-min-tokens \n" + " Minimum thinking tokens before soft-close\n" + " may fire.
Floors the fire decision so a\n" + " brief close-marker logit spike early in\n" + " reasoning cannot prematurely terminate\n" + " thinking. 0 = disabled (default). Typical\n" + " values: 64-256 for qwen3.6-27b.\n" + " --debug-thinking-logits Emit one stderr line per AR step inside the\n" + " thinking phase recording committed/chosen/\n" + " logit[close]/logit[chosen]/diff/prob_ratio.\n" + " Use to record close-vs-chosen logit\n" + " trajectories. Stderr-heavy; operator only.\n" "\n" "KV cache:\n" " --cache-type-k KV cache K type (f16,bf16,q4_0,q4_1,q5_0,q5_1,q8_0,tq3_0)\n" @@ -257,6 +279,9 @@ int main(int argc, char ** argv) { bool effort_high = false; bool effort_x_high = false; bool effort_max = false; + bool soft_close_min_ratio = false; + bool soft_close_min_tokens = false; + bool debug_thinking_logits = false; } cli_set; // Track whether the operator passed the legacy --max-tokens alias. @@ -368,6 +393,37 @@ int main(int argc, char ** argv) { } else if (std::strcmp(argv[i], "--reasoning-effort-max") == 0 && i + 1 < argc) { sconfig.effort_tiers.max = std::atoi(argv[++i]); cli_set.effort_max = true; + } else if (std::strcmp(argv[i], "--think-soft-close-min-ratio") == 0 && i + 1 < argc) { + float r = std::strtof(argv[++i], nullptr); + // Clamp to [0, 1] with a warning if the operator passed + // something nonsensical. Bounded posture: the dial is + // operator-only, the bounds are tight by design. 
+ if (r < 0.0f) { + std::fprintf(stderr, + "[server] --think-soft-close-min-ratio=%.4f < 0; " + "clamping to 0 (disabled)\n", r); + r = 0.0f; + } else if (r > 1.0f) { + std::fprintf(stderr, + "[server] --think-soft-close-min-ratio=%.4f > 1; " + "clamping to 1\n", r); + r = 1.0f; + } + sconfig.soft_close_min_ratio = r; + cli_set.soft_close_min_ratio = true; + } else if (std::strcmp(argv[i], "--think-soft-close-min-tokens") == 0 && i + 1 < argc) { + int n = std::atoi(argv[++i]); + if (n < 0) { + std::fprintf(stderr, + "[server] --think-soft-close-min-tokens=%d < 0; " + "clamping to 0 (disabled)\n", n); + n = 0; + } + sconfig.soft_close_min_tokens = n; + cli_set.soft_close_min_tokens = true; + } else if (std::strcmp(argv[i], "--debug-thinking-logits") == 0) { + sconfig.debug_thinking_logits = true; + cli_set.debug_thinking_logits = true; } else if (std::strcmp(argv[i], "--prefill-compression") == 0 && i + 1 < argc) { const char * mode = argv[++i]; if (std::strcmp(mode, "auto") == 0) @@ -716,6 +772,15 @@ int main(int argc, char ** argv) { std::fprintf(stderr, "[server] │ hard_limit_reply= %d (%s)\n", sconfig.hard_limit_reply_budget, src_of(cli_set.hard_limit_reply_budget)); + std::fprintf(stderr, "[server] │ soft_close_ratio= %.4f (%s)\n", + sconfig.soft_close_min_ratio, + cli_set.soft_close_min_ratio ? "from CLI" : "default (disabled)"); + std::fprintf(stderr, "[server] │ soft_close_floor= %d (%s)\n", + sconfig.soft_close_min_tokens, + cli_set.soft_close_min_tokens ? "from CLI" : "default (disabled)"); + std::fprintf(stderr, "[server] │ debug_think_log = %s (%s)\n", + sconfig.debug_thinking_logits ? "true" : "false", + cli_set.debug_thinking_logits ? 
"from CLI" : "default (off)"); std::fprintf(stderr, "[server] │ effort tiers = low=%d (%s)\n", sconfig.effort_tiers.low, src_of(cli_set.effort_low)); std::fprintf(stderr, "[server] │ medium=%d (%s)\n", @@ -874,6 +939,40 @@ int main(int argc, char ** argv) { } if (close_ids.size() > 16) std::fprintf(stderr, ",..."); std::fprintf(stderr, "\n"); + + // Probe-vs-inject split: when the inject sequence is the + // full directive hint (Qwen3.x-style trained lead-in), the + // first inject token is a content lead-in like "Considering" + // whose logit sits 19-35 nats below chosen during reasoning. + // Soft-close peeking that token never fires (empirical: see + // probe trajectory data). Tokenize JUST the marker substring + // and ship it as the probe sequence — at the AR boundary the + // marker's logit IS argmax-competitive (~prob_ratio>=0.5). + // When the hint and marker are identical (marker-only case), + // leave the probe field empty: BudgetHook::soft_close_probe_token() + // falls back to close_token_ids.front(), so this is a no-op. + if (!card.thinking_terminator_hint.empty() && + close_text.find(marker) != std::string::npos && + close_text != marker) + { + auto probe_ids = tokenizer.encode(marker); + if (!probe_ids.empty()) { + sconfig.think_close_probe_token_ids = probe_ids; + std::fprintf(stderr, + "[server] soft-close probe (marker=\"%s\", %zu tokens): ", + marker.c_str(), probe_ids.size()); + for (size_t i = 0; i < std::min(probe_ids.size(), 8); ++i) { + std::fprintf(stderr, "%s%d", i ? "," : "", probe_ids[i]); + } + if (probe_ids.size() > 8) std::fprintf(stderr, ",..."); + std::fprintf(stderr, "\n"); + } else { + std::fprintf(stderr, + "[server] soft-close probe DISABLED: marker \"%s\" " + "tokenizes to empty; legacy fallback (probe = inject[0]) " + "in effect.\n", marker.c_str()); + } + } } else { std::fprintf(stderr, "[server] level-2 force-close DISABLED: text %.40s... 
" diff --git a/server/test/test_server_unit.cpp b/server/test/test_server_unit.cpp index 275ec935..7fd49039 100644 --- a/server/test/test_server_unit.cpp +++ b/server/test/test_server_unit.cpp @@ -2560,6 +2560,519 @@ static void test_generate_result_accept_rate_zero_when_no_spec_decode() { TEST_ASSERT(r.accept_rate == 0.0f); } +// ─── Soft-close comparator + state machine ───────────────────────────── +// +// Tests the logit-ratio peek that lets the AR loop force early +// once the close-token logit comes within a configured probability ratio +// of the chosen-token logit. Default disabled. See +// docs/experiments/soft-close-thinking-termination-plan.md. +// +// We test two layers: +// 1. The pure comparator (`soft_close::should_fire`) — math-only. +// 2. A small state-machine helper that mimics the AR loop's +// precedence (soft first, then hard) so we can exercise the +// multi-token inject path and the soft/hard tie-break without a +// GPU. + +// Mirror of qwen35_backend.cpp's close-injection state for unit testing. +struct CloseState { + bool started = false; + int inject_pos = 0; + bool soft_fired = false; + bool hard_fired = false; +}; + +// Returns the (possibly overridden) token for this step, advancing +// CloseState. Mirrors the soft-then-hard ordering in the real loop. +// committed_now / committed_at_entry / n_gen track the budget arithmetic +// for the hard check identically to qwen35_backend.cpp:909-944. +static int32_t step_close_state(int32_t chosen_tok, + const float * logits, + const dflash::common::BudgetHook & hook, + int committed_now, + int committed_at_entry, + int n_gen, + CloseState & state) { + // Continue an in-progress close sequence. + if (state.started && + state.inject_pos < (int)hook.close_token_ids.size()) + { + int32_t inj = hook.close_token_ids[state.inject_pos]; + state.inject_pos++; + return inj; + } + if (state.started) return chosen_tok; // sequence already complete + + // Soft check (BEFORE hard, per plan §4). 
+ // Probe-vs-inject split: peek hook.soft_close_probe_token(), write + // hook.close_token_ids.front() on fire. + // Min-tokens floor: suppress fire until generated_so_far >= + // hook.soft_close_min_tokens. Mirrors qwen35_backend.cpp gate. + const int generated_so_far = committed_now - committed_at_entry; + if (!hook.close_token_ids.empty() && + hook.soft_close_min_ratio > 0.0f && + generated_so_far >= hook.soft_close_min_tokens && + dflash::common::soft_close::should_fire( + logits, chosen_tok, + hook.soft_close_probe_token(), + hook.soft_close_min_ratio)) + { + state.started = true; + state.inject_pos = 1; + state.soft_fired = true; + return hook.close_token_ids.front(); // INJECT, not probe + } + + // Hard check: remaining <= hard_limit_remaining. + if (!hook.close_token_ids.empty()) { + const int generated = committed_now - committed_at_entry; + const int remaining = n_gen - generated; + if (remaining <= hook.hard_limit_remaining) { + int32_t close0 = hook.close_token_ids.front(); + if (chosen_tok == close0) { + // Model self-closed at boundary; consume as first of seq. + state.started = true; + state.inject_pos = 1; + return chosen_tok; + } + state.started = true; + state.inject_pos = 1; + state.hard_fired = true; + return close0; + } + } + return chosen_tok; +} + +// Build a logits row where the chosen-token gets logit `l_chosen` and the +// close token gets logit `l_close`; all other vocab tokens are far below. +static std::vector make_logits(int vocab, int chosen_tok, + int close_tok, + float l_chosen, float l_close) { + std::vector row(vocab, -100.0f); + row[chosen_tok] = l_chosen; + if (close_tok != chosen_tok) row[close_tok] = l_close; + return row; +} + +static void test_soft_close_disabled_default() { + // min_ratio=0 → never fires, regardless of logit configuration. 
+ auto logits = make_logits(64, /*chosen=*/3, /*close=*/7, + /*l_chosen=*/2.0f, /*l_close=*/10.0f); + bool fired = dflash::common::soft_close::should_fire( + logits.data(), /*chosen=*/3, /*close0=*/7, /*min_ratio=*/0.0f); + TEST_ASSERT(fired == false); + // Even with close as argmax, disabled means false. + fired = dflash::common::soft_close::should_fire( + logits.data(), /*chosen=*/3, /*close0=*/3, /*min_ratio=*/0.0f); + TEST_ASSERT(fired == false); +} + +static void test_soft_close_strict_ratio_one() { + // min_ratio=1.0 → fires only when logit[close] >= logit[chosen] + // (i.e. close is the argmax or tied). chosen!=close already guarded. + auto eq = make_logits(64, 3, 7, /*l_chosen=*/5.0f, /*l_close=*/5.0f); + TEST_ASSERT(dflash::common::soft_close::should_fire( + eq.data(), 3, 7, 1.0f) == true); + + auto below = make_logits(64, 3, 7, /*l_chosen=*/5.001f, /*l_close=*/5.0f); + TEST_ASSERT(dflash::common::soft_close::should_fire( + below.data(), 3, 7, 1.0f) == false); + + auto above = make_logits(64, 3, 7, /*l_chosen=*/4.0f, /*l_close=*/5.0f); + TEST_ASSERT(dflash::common::soft_close::should_fire( + above.data(), 3, 7, 1.0f) == true); +} + +static void test_soft_close_aggressive_half_prob() { + // min_ratio=0.5 — prob[close]/prob[chosen] >= 0.5 ⟺ + // logit_diff >= log(0.5) ≈ -0.6931. + const float ln_half = std::log(0.5f); + + // Boundary inclusive: diff exactly log(0.5). + auto boundary = make_logits(64, 3, 7, + /*l_chosen=*/5.0f, + /*l_close=*/5.0f + ln_half); + TEST_ASSERT(dflash::common::soft_close::should_fire( + boundary.data(), 3, 7, 0.5f) == true); + + // Just below: diff slightly less than log(0.5) (further negative). + auto below = make_logits(64, 3, 7, + /*l_chosen=*/5.0f, + /*l_close=*/5.0f + ln_half - 0.001f); + TEST_ASSERT(dflash::common::soft_close::should_fire( + below.data(), 3, 7, 0.5f) == false); + + // Way above: close strongly favoured (but not argmax). 
+ auto strong = make_logits(64, 3, 7, + /*l_chosen=*/5.0f, + /*l_close=*/4.9f); + TEST_ASSERT(dflash::common::soft_close::should_fire( + strong.data(), 3, 7, 0.5f) == true); +} + +static void test_soft_close_below_threshold() { + // min_ratio=0.5, prob_ratio≈0.3 (well below) → no fire. + const float ln_03 = std::log(0.3f); + auto row = make_logits(64, 3, 7, + /*l_chosen=*/5.0f, + /*l_close=*/5.0f + ln_03); + TEST_ASSERT(dflash::common::soft_close::should_fire( + row.data(), 3, 7, 0.5f) == false); +} + +static void test_soft_close_chosen_is_close() { + // When the sampler already picks the close token, soft check never + // fires — natural-close path handles it. + auto row = make_logits(64, 7, 7, /*l_chosen=*/10.0f, /*l_close=*/10.0f); + TEST_ASSERT(dflash::common::soft_close::should_fire( + row.data(), /*chosen=*/7, /*close0=*/7, /*min_ratio=*/0.5f) == false); + TEST_ASSERT(dflash::common::soft_close::should_fire( + row.data(), /*chosen=*/7, /*close0=*/7, /*min_ratio=*/1.0f) == false); +} + +static void test_soft_close_tiny_ratio_numerical() { + // min_ratio = 1e-6 ⇒ log_ratio ≈ -13.8155. Verify no NaN, threshold + // triggers when diff >= -13.8. + auto on = make_logits(64, 3, 7, /*l_chosen=*/5.0f, /*l_close=*/-8.5f); + auto off = make_logits(64, 3, 7, /*l_chosen=*/5.0f, /*l_close=*/-9.0f); + TEST_ASSERT(dflash::common::soft_close::should_fire( + on.data(), 3, 7, 1e-6f) == true); + TEST_ASSERT(dflash::common::soft_close::should_fire( + off.data(), 3, 7, 1e-6f) == false); +} + +// ── State-machine integration: soft + hard precedence ───────────────── + +static void test_soft_close_single_token_inject() { + using namespace dflash::common; + BudgetHook hook; + hook.close_token_ids = { 248069 }; // Qwen3.6 single-token + hook.hard_limit_remaining = 16; + hook.soft_close_min_ratio = 0.1f; + + CloseState state; + // Step where soft should fire: prob[close] is at least 10% of + // prob[chosen], i.e. logit[close] - logit[chosen] >= log(0.1) ≈ -2.3026.
+ auto row = make_logits(/*vocab=*/250000, /*chosen=*/100, /*close=*/248069, + /*l_chosen=*/5.0f, /*l_close=*/3.0f); + int32_t out = step_close_state( + /*chosen=*/100, row.data(), + hook, + /*committed_now=*/100, /*committed_at_entry=*/50, /*n_gen=*/200, + state); + TEST_ASSERT(out == 248069); + TEST_ASSERT(state.started == true); + TEST_ASSERT(state.soft_fired == true); + TEST_ASSERT(state.hard_fired == false); + TEST_ASSERT(state.inject_pos == 1); + + // Next step: sequence is complete (single-token close); returns chosen. + auto row2 = make_logits(/*vocab=*/250000, /*chosen=*/200, /*close=*/248069, + /*l_chosen=*/5.0f, /*l_close=*/-50.0f); + out = step_close_state( + /*chosen=*/200, row2.data(), + hook, + /*committed_now=*/101, /*committed_at_entry=*/50, /*n_gen=*/200, + state); + TEST_ASSERT(out == 200); + TEST_ASSERT(state.soft_fired == true); // sticky +} + +static void test_soft_close_multi_token_inject() { + using namespace dflash::common; + BudgetHook hook; + hook.close_token_ids = { 1718, 37947, 32 }; // Laguna-style multi-token + hook.hard_limit_remaining = 16; + hook.soft_close_min_ratio = 0.1f; + + CloseState state; + auto row = make_logits(/*vocab=*/250000, /*chosen=*/100, /*close=*/1718, + /*l_chosen=*/5.0f, /*l_close=*/3.0f); + int32_t out = step_close_state( + /*chosen=*/100, row.data(), + hook, + /*committed_now=*/100, /*committed_at_entry=*/50, /*n_gen=*/200, + state); + TEST_ASSERT(out == 1718); + TEST_ASSERT(state.soft_fired == true); + TEST_ASSERT(state.inject_pos == 1); + + // Step 2: inject 37947 regardless of chosen_tok. + auto row2 = make_logits(/*vocab=*/250000, /*chosen=*/300, /*close=*/1718, + /*l_chosen=*/5.0f, /*l_close=*/-50.0f); + out = step_close_state( + /*chosen=*/300, row2.data(), + hook, + /*committed_now=*/101, 50, 200, state); + TEST_ASSERT(out == 37947); + TEST_ASSERT(state.inject_pos == 2); + + // Step 3: inject 32. 
+ out = step_close_state( + /*chosen=*/400, row2.data(), + hook, + /*committed_now=*/102, 50, 200, state); + TEST_ASSERT(out == 32); + TEST_ASSERT(state.inject_pos == 3); + + // Step 4: sequence complete, returns chosen. + out = step_close_state( + /*chosen=*/500, row2.data(), + hook, + /*committed_now=*/103, 50, 200, state); + TEST_ASSERT(out == 500); +} + +static void test_soft_close_then_hard_would_fire() { + // Soft fires at committed_now=100; the hard remaining-check would + // fire at committed_now=240 (generated=190, remaining=10 <= + // hard_limit=16). Hard path skipped because state.started is + // already true. Telemetry: close_kind="soft". + using namespace dflash::common; + BudgetHook hook; + hook.close_token_ids = { 248069 }; + hook.hard_limit_remaining = 16; + hook.soft_close_min_ratio = 0.1f; + + CloseState state; + // Soft trigger at committed_now=100. + auto soft_row = make_logits(/*vocab=*/250000, 100, 248069, 5.0f, 3.0f); + (void)step_close_state(100, soft_row.data(), hook, + /*committed_now=*/100, + /*committed_at_entry=*/50, + /*n_gen=*/200, state); + TEST_ASSERT(state.soft_fired == true); + TEST_ASSERT(state.hard_fired == false); + + // Now jump to committed_now=240 (generated = 240 - 50 = 190, + // remaining = 200 - 190 = 10): hard would have fired here but + // state.started=true so it's skipped. + auto far_row = make_logits(/*vocab=*/250000, 999, 248069, 5.0f, -100.0f); + int32_t out = step_close_state(999, far_row.data(), hook, + /*committed_now=*/240, + /*committed_at_entry=*/50, + /*n_gen=*/200, state); + // Single-token close already complete; returns chosen. + TEST_ASSERT(out == 999); + TEST_ASSERT(state.soft_fired == true); + TEST_ASSERT(state.hard_fired == false); +} + +static void test_soft_close_disabled_hard_still_fires() { + // min_ratio=0 (disabled): hard cap should still fire at the budget + // edge. Existing behavior preserved.
+ using namespace dflash::common; + BudgetHook hook; + hook.close_token_ids = { 248069 }; + hook.hard_limit_remaining = 16; + hook.soft_close_min_ratio = 0.0f; // disabled + + CloseState state; + // Tiny gap between chosen and close (5.0 vs 4.99): soft would fire + // at any reasonable ratio if it were enabled. + auto row = make_logits(/*vocab=*/250000, 100, 248069, 5.0f, 4.99f); + int32_t out = step_close_state(100, row.data(), hook, + /*committed_now=*/100, 50, 200, state); + // Soft disabled: chosen passes through, not yet at hard boundary. + TEST_ASSERT(out == 100); + TEST_ASSERT(state.started == false); + + // At hard boundary: committed_now-entry=184 → remaining=16 ≤ hard. + // (entry=50, n_gen=200, hard_limit=16 ⇒ trigger at committed_now=234.) + out = step_close_state(100, row.data(), hook, + /*committed_now=*/234, 50, 200, state); + TEST_ASSERT(out == 248069); + TEST_ASSERT(state.hard_fired == true); + TEST_ASSERT(state.soft_fired == false); +} + +static void test_soft_close_natural_at_boundary() { + // Model picks close on its own (chosen == close0). Soft check skips + // (chosen==close0 guard); the hard check does not trigger either + // because this step is far from the budget edge. Neither flag set; + // close_kind would be "natural" downstream. + using namespace dflash::common; + BudgetHook hook; + hook.close_token_ids = { 248069 }; + hook.hard_limit_remaining = 16; + hook.soft_close_min_ratio = 0.5f; + + CloseState state; + auto row = make_logits(/*vocab=*/250000, 248069, 248069, 5.0f, 5.0f); + // Far from hard boundary; chosen == close. + int32_t out = step_close_state(248069, row.data(), hook, + /*committed_now=*/100, 50, 200, state); + TEST_ASSERT(out == 248069); + TEST_ASSERT(state.soft_fired == false); + TEST_ASSERT(state.hard_fired == false); +} + +// Probe-vs-inject split. When soft_close_probe_ids is set, the +// comparator MUST peek the probe[0] logit, NOT inject[0]. Otherwise +// trained-hint sidecars (inject[0] = content lead-in token) keep +// soft-close from ever firing.
+static void test_soft_close_probe_uses_probe_ids_not_inject_ids() { + using namespace dflash::common; + BudgetHook hook; + // Multi-token inject (mirrors a trained-hint sidecar). + hook.close_token_ids = { 99, 100, 101 }; + // Distinct single-token probe (the close marker). + hook.soft_close_probe_ids = { 42 }; + hook.soft_close_min_ratio = 0.5f; + + std::vector<float> row(250000, -100.0f); + row[300] = 11.0f; // chosen + // probe: exp(10.31-11) ≈ 0.502, just above the 0.5 threshold + // (10.0f would give exp(-1) ≈ 0.368 and not fire). + row[42] = 10.31f; + row[99] = -50.0f; // inject[0] far below — must not influence fire + + CloseState state; + int32_t out = step_close_state(/*chosen=*/300, row.data(), hook, + /*committed_now=*/200, 50, 500, state); + TEST_ASSERT(state.soft_fired == true); + TEST_ASSERT(out == 99); // wrote inject[0], not probe[0] + TEST_ASSERT(state.inject_pos == 1); +} + +// Empty soft_close_probe_ids ⇒ legacy fallback: peek close_token_ids +// front. Guarantees zero churn for any caller that doesn't set the +// new probe field. +static void test_soft_close_probe_ids_empty_falls_back_to_close_token_ids() { + using namespace dflash::common; + BudgetHook hook; + hook.close_token_ids = { 248069 }; + // hook.soft_close_probe_ids left empty (legacy). + hook.soft_close_min_ratio = 0.5f; + + std::vector<float> row(250000, -100.0f); + row[300] = 11.0f; + row[248069] = 10.31f; // close_token_ids[0]'s logit — same as before + + CloseState state; + int32_t out = step_close_state(/*chosen=*/300, row.data(), hook, + /*committed_now=*/200, 50, 500, state); + TEST_ASSERT(state.soft_fired == true); + TEST_ASSERT(out == 248069); + // Sanity: soft_close_probe_token() returns inject[0] when probe is empty. + TEST_ASSERT(hook.soft_close_probe_token() == 248069); +} + +// When soft-close fires, the WRITTEN sequence MUST be close_token_ids +// (the full inject), regardless of what soft_close_probe_ids contains.
+// The probe is read-only — never appears in the output stream. +static void test_soft_close_inject_sequence_unchanged_when_fires() { + using namespace dflash::common; + BudgetHook hook; + hook.close_token_ids = { 1718, 37947, 32 }; + hook.soft_close_probe_ids = { 42 }; + hook.soft_close_min_ratio = 0.1f; + + std::vector<float> row(250000, -100.0f); + row[300] = 5.0f; + row[42] = 3.0f; // probe within ratio 0.1 + row[1718] = -80.0f; // inject[0] far below — must not matter + + CloseState state; + int32_t out = step_close_state(/*chosen=*/300, row.data(), hook, + /*committed_now=*/100, 50, 200, state); + TEST_ASSERT(state.soft_fired == true); + TEST_ASSERT(out == 1718); + + std::vector<float> row2(250000, -100.0f); + row2[999] = 5.0f; + out = step_close_state(/*chosen=*/999, row2.data(), hook, + /*committed_now=*/101, 50, 200, state); + TEST_ASSERT(out == 37947); + out = step_close_state(/*chosen=*/999, row2.data(), hook, + /*committed_now=*/102, 50, 200, state); + TEST_ASSERT(out == 32); + out = step_close_state(/*chosen=*/999, row2.data(), hook, + /*committed_now=*/103, 50, 200, state); + TEST_ASSERT(out == 999); +} + +// soft_close_min_tokens floor: when set, fire is suppressed until +// generated_so_far >= soft_close_min_tokens. +static void test_soft_close_min_tokens_blocks_early_fire() { + using namespace dflash::common; + BudgetHook hook; + hook.close_token_ids = { 1718 }; + hook.soft_close_probe_ids = { 42 }; + hook.soft_close_min_ratio = 0.5f; + hook.soft_close_min_tokens = 100; + + std::vector<float> row(250000, -100.0f); + row[300] = 5.0f; + row[42] = 5.0f; // prob_ratio = 1.0 > 0.5 + + // Below floor: generated_so_far = 90 - 50 = 40 < 100 ⇒ no fire. + CloseState state_early; + int32_t out = step_close_state(/*chosen=*/300, row.data(), hook, + /*committed_now=*/90, 50, 500, + state_early); + TEST_ASSERT(state_early.soft_fired == false); + TEST_ASSERT(out == 300); + + // Above floor: generated_so_far = 200 - 50 = 150 >= 100 ⇒ fires.
+ CloseState state_late; + out = step_close_state(/*chosen=*/300, row.data(), hook, + /*committed_now=*/200, 50, 500, + state_late); + TEST_ASSERT(state_late.soft_fired == true); + TEST_ASSERT(out == 1718); +} + +// Default soft_close_min_tokens=0 ⇒ no floor ⇒ fire as soon as +// qualifying logits show up. Confirms the floor is opt-in. +static void test_soft_close_min_tokens_default_zero_unchanged_behavior() { + using namespace dflash::common; + BudgetHook hook; + hook.close_token_ids = { 1718 }; + hook.soft_close_probe_ids = { 42 }; + hook.soft_close_min_ratio = 0.5f; + // soft_close_min_tokens left at default 0. + + std::vector<float> row(250000, -100.0f); + row[300] = 5.0f; + row[42] = 5.0f; + + CloseState state; + int32_t out = step_close_state(/*chosen=*/300, row.data(), hook, + /*committed_now=*/1, 0, 500, state); + TEST_ASSERT(state.soft_fired == true); + TEST_ASSERT(out == 1718); +} + +static void test_soft_close_determinism_when_disabled() { + // Byte-identical generation invariant: with min_ratio=0, the + // override token MUST equal the chosen token for every step, for + // any logit configuration. This is the "zero-cost-when-disabled" + // generation determinism guarantee from plan §3.6. + using namespace dflash::common; + BudgetHook hook; + hook.close_token_ids = { 248069 }; + // Hard cap fires only when remaining <= 0; with n_gen=200 and at + // most 99 generated tokens below, it is never reached in this loop. + hook.hard_limit_remaining = 0; + hook.soft_close_min_ratio = 0.0f; // disabled + + CloseState state; + std::mt19937 rng(12345); + for (int step = 0; step < 100; step++) { + int32_t chosen = (int32_t)(rng() % 1000); + float l_chosen = (float)(rng() % 100) / 10.0f - 5.0f; + float l_close = (float)(rng() % 100) / 10.0f - 5.0f; + // vocab=250000 covers close_tok=248069. Pre-existing OOB on the + // 1000-element row was silently passing in Release builds; new + // tests perturbing heap layout could turn it into a crash.
+ auto row = make_logits(/*vocab=*/250000, chosen, /*close=*/248069, + l_chosen, l_close); + int32_t out = step_close_state(chosen, row.data(), hook, + /*committed_now=*/step, 0, 200, + state); + TEST_ASSERT(out == chosen); + } + TEST_ASSERT(state.soft_fired == false); + TEST_ASSERT(state.hard_fired == false); +} + int main() { std::fprintf(stderr, "══════════════════════════════════════════\n"); std::fprintf(stderr, " Server Unit Tests\n"); @@ -2726,6 +3239,25 @@ int main() { RUN_TEST(test_generate_result_accept_rate_in_usage_anthropic); RUN_TEST(test_generate_result_accept_rate_zero_when_no_spec_decode); + std::fprintf(stderr, "\n── Soft-close comparator + state machine ──\n"); + RUN_TEST(test_soft_close_disabled_default); + RUN_TEST(test_soft_close_strict_ratio_one); + RUN_TEST(test_soft_close_aggressive_half_prob); + RUN_TEST(test_soft_close_below_threshold); + RUN_TEST(test_soft_close_chosen_is_close); + RUN_TEST(test_soft_close_tiny_ratio_numerical); + RUN_TEST(test_soft_close_single_token_inject); + RUN_TEST(test_soft_close_multi_token_inject); + RUN_TEST(test_soft_close_then_hard_would_fire); + RUN_TEST(test_soft_close_disabled_hard_still_fires); + RUN_TEST(test_soft_close_natural_at_boundary); + RUN_TEST(test_soft_close_probe_uses_probe_ids_not_inject_ids); + RUN_TEST(test_soft_close_probe_ids_empty_falls_back_to_close_token_ids); + RUN_TEST(test_soft_close_inject_sequence_unchanged_when_fires); + RUN_TEST(test_soft_close_min_tokens_blocks_early_fire); + RUN_TEST(test_soft_close_min_tokens_default_zero_unchanged_behavior); + RUN_TEST(test_soft_close_determinism_when_disabled); + std::fprintf(stderr, "\n══════════════════════════════════════════\n"); std::fprintf(stderr, " Results: %d assertions, %d failures\n", test_count, test_failures);