Luce-Org · easel · Jun 1, 2026 · Jun 1, 2026 · Jun 3, 2026
diff --git a/docs/experiments/soft-close-thinking-termination-plan.md b/docs/experiments/soft-close-thinking-termination-plan.md
diff --git a/docs/specs/thinking-budget.md b/docs/specs/thinking-budget.md
@@ -538,13 +538,44 @@ The current taxonomy is:
 | Value | Meaning |
 |---|---|
 | `natural` | The model emitted `</think>` on its own, either before reaching the phase-1 cap or before Level 2 had to force-close. |
-| `hard` | The phase-1 cap was reached without a model-emitted `</think>`. Either Level 2 force-closed the block in-loop (preserving KV) or Level 1 ran the phase-2 reprompt. |
+| `soft` | The soft-close logit-ratio peek (Level 2.5) fired before the hard cap — `prob[</think>] / prob[chosen_tok]` cleared the operator-configured `soft_close_min_ratio` threshold, and the AR loop injected `</think>` while the model was already "near" closing. Indicates voluntary cooperation: the model would have closed soon anyway; we just hurried it along to reclaim tokens. Currently Qwen3.5/3.6 only. |
+| `hard` | The phase-1 cap was reached without a model-emitted `</think>` and without the soft path triggering. Either Level 2 force-closed the block in-loop (preserving KV) or Level 1 ran the phase-2 reprompt. |
+
+When both `soft` and `hard` could fire on the same AR step (the
+soft threshold cleared at exactly the budget-edge step), `soft`
+wins — the soft trigger carries more information (the model agreed
+it was time) than the hard trigger (which only reports coercion).
+See `docs/experiments/soft-close-thinking-termination-plan.md` §4 +
+§12 for the design rationale.
+
+Soft-close is enabled by the operator via the CLI flag
+`--think-soft-close-min-ratio <F>`. Default `0.0` keeps the legacy
+two-value taxonomy (`natural` / `hard`); any positive value
+activates the third. The dial is a probability ratio in `[0, 1]`:
+
+| `min_ratio` | Behaviour |
+|---|---|
+| `0.0` | Disabled. Soft path inert; per-request overrides silently ignored. |
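+| `0.05`–`0.2` | Conservative: fires only when `</think>` is within 20× (at `0.05`) to 5× (at `0.2`) of the chosen token's probability. Recommended starting range. |
+| `0.5` | Aggressive: fires when `</think>` has at least half the probability of the chosen token. |
+| `1.0` | Strict: fires only when `</think>` is at least as probable as the chosen token; under greedy decoding that is only an exact tie, since a step where `</think>` is itself chosen goes down the natural-close path. Useful as a safety check. |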
+
+Per-request override (Anthropic envelope, see §4.1):
+
+```jsonc
+{
+  "thinking": {
+    "type": "enabled",
+    "soft_close_min_ratio": 0.1
+  }
+}
+```
 
-A third value `soft` is reserved for a future voluntary-close
-mechanism (logit-biasing the model toward `</think>` as the cap
-approaches, before forcing it). Reserved so consumers can switch on
-the value without an exhaustive-match warning when a future server
-version adds it; not emitted today.
+The per-request value clamps to `min(requested, server_default)` —
+clients can tighten (lower the threshold, fire more aggressively)
+but not loosen (raise it above the operator's ceiling). When the
+server has the dial disabled (`0.0`), per-request overrides are
+silently ignored — the feature is operator-policy gated.
 
 ## 8. Streaming
 
@@ -564,9 +595,18 @@ in the terminal `message_delta` event for Anthropic.
   server-configured ceiling, never looser. Allowing full override
   would re-create the silent-truncation footgun of middleboxes that
   drop unknown fields.
-- **Soft close-kind / soft-budget hint.** The mechanism (logit bias
-  to nudge `</think>` selection before the hard cap) is sketched in
-  §7 but not specified.
+- **Spec-decode soft-close peek.** Soft-close fires inside the AR
+  loop. When spec-decode is in use, the close still triggers at the
+  spec-decode → AR tail-off boundary (slightly later than pure-AR
+  mode); the verify/accept inner loop does not run the comparator.
+  Gemma 4 and Laguna are pure-AR; this only matters for Qwen3.5/3.6
+  with a draft model.
+- **Multi-token close joint probability.** When `</think>` tokenizes
+  to multiple ids, the soft-close comparator peeks only the FIRST
+  id's logit (the existing multi-token inject machinery drives the
+  remainder of the sequence on subsequent steps). The joint
+  `P(t_0, t_1, …)` peek is left to a v2 if false-positive rates
+  warrant it.
 - **Per-token close-info metadata.** The upstream reference exposes
   `(token_index, remaining_budget, rank)` for the close event. The
   current `finish_details` reports aggregate counts only.

diff --git a/server/src/common/model_backend.h b/server/src/common/model_backend.h
@@ -10,6 +10,7 @@
 
 #pragma once
 
+#include <cmath>
 #include <cstdint>
 #include <cstdio>
 #include <functional>
@@ -71,15 +72,91 @@ struct DaemonIO {
 // decode) — the perf trade-off is acceptable since this only kicks in
 // for thinking-enabled requests. Spec-decode integration is a follow-up.
 struct BudgetHook {
-    // Multi-token close sequence injected when `(n_gen - committed)`
-    // drops to `hard_limit_remaining`. For Qwen3.x this is the
-    // canonical "Considering the limited time..." summarize-and-stop
-    // lead-in (tokenized at server startup); for non-qwen arches it's
-    // a single close-tag token. Empty = hook disabled.
+    // Inject sequence written when the hard cap fires OR when soft-close
+    // fires. This is the verbatim tokenization of the model card's
+    // `thinking_terminator_hint` (e.g. for Qwen3.6 the lead-in
+    // "Considering the limited time by the user, ... </think>\n\n").
+    // May be many tokens long; the first element is what the AR loop
+    // writes on the firing step, with the rest streamed out on
+    // subsequent steps. Empty = disabled.
     std::vector<int32_t> close_token_ids;
+    // Short PROBE sequence used by the soft-close logit-ratio peek.
+    // Conceptually this is the tokenization of just the close MARKER
+    // (e.g. `</think>` — a single token id 248069 on Qwen3.6) rather
+    // than the full inject directive above. Splitting probe-vs-inject
+    // matters because the inject sequence for trained-hint models
+    // starts with a content token like "Considering" whose logit is
+    // 19-35 nats below the chosen token at every step, masking the
+    // close-marker's true probability and preventing soft-close from
+    // ever firing.
+    // When empty, the soft-close peek falls back to
+    // `close_token_ids.front()` (legacy behavior — kept so models that
+    // haven't been updated keep working identically to before the split).
+    std::vector<int32_t> soft_close_probe_ids;
     int                  hard_limit_remaining = 0;
+    // Soft-close (Level 2.5 in the spec). When > 0, at each AR step the
+    // loop compares the probe-token logit against the chosen-token
+    // logit; if `prob[probe[0]] / prob[chosen] >= soft_close_min_ratio`
+    // (equivalently `logit[probe[0]] - logit[chosen] >= log(min_ratio)`),
+    // the inject sequence (close_token_ids) is written BEFORE the hard
+    // limit is reached. 0.0 = disabled (default); 1.0 = fire only when
+    // the probe token is at least as likely as the chosen token; lower
+    // values = fire more aggressively. See docs/specs/thinking-budget.md
+    // §7 and docs/experiments/soft-close-thinking-termination-plan.md.
+    float                soft_close_min_ratio = 0.0f;
+    // Minimum thinking tokens before soft-close is allowed to fire.
+    // Soft-close peek runs on every AR step but the fire decision is
+    // gated by this floor — protects against premature termination on
+    // prompts where the close-marker logit briefly spikes mid-thought.
+    // 0 = floor disabled (default). Per empirical trajectory data on
+    // qwen3.6-27b (5 diverse prompts), </think> only becomes
+    // argmax-competitive at 66-94% of natural reasoning length — so a
+    // floor in the 64-256 range is the typical operating point.
+    int                  soft_close_min_tokens = 0;
+    // Diagnostic: when true, emit one stderr line per AR step inside the
+    // thinking phase with (committed, chosen_tok, logit[probe0],
+    // logit[chosen], diff). Used to record the close-vs-chosen logit
+    // trajectory across a full thinking run so a sliding-threshold curve
+    // can be designed from empirical data rather than guessed. Zero cost
+    // when off. See server_main.cpp --debug-thinking-logits.
+    bool                 debug_thinking_logits = false;
+
+    // Probe token id used by the soft-close peek. Returns the first
+    // element of soft_close_probe_ids when set, otherwise falls back to
+    // close_token_ids.front() (legacy behavior). Callers must guard
+    // against an empty hook before calling this.
+    int32_t soft_close_probe_token() const {
+        if (!soft_close_probe_ids.empty()) return soft_close_probe_ids.front();
+        return close_token_ids.front();
+    }
 };
 
+namespace soft_close {
+
+// Returns true when the soft-close comparator would fire on this AR
+// step. Side-effect free; safe to call from unit tests.
+//
+// Fast path: returns false in O(1) when min_ratio <= 0 (the disabled
+// default). When the model has already chosen the close token on its
+// own, also returns false — the natural-close path handles that.
+//
+// Math: `prob[i]/prob[j] = exp(logit[i] - logit[j])`, so
+// `prob[close]/prob[chosen] >= min_ratio` ⟺
+// `logit[close] - logit[chosen] >= log(min_ratio)`. We compare on
+// logits to avoid `exp()` and full-softmax cost; this is numerically
+// stable in fp32 for typical LLM logit ranges (~±20).
+inline bool should_fire(const float * logits,
+                        int32_t       chosen_tok,
+                        int32_t       close0_tok,
+                        float         min_ratio) {
+    if (min_ratio <= 0.0f)          return false;
+    if (chosen_tok == close0_tok)    return false;
+    const float log_ratio = std::log(min_ratio);
+    return (logits[close0_tok] - logits[chosen_tok]) >= log_ratio;
+}
+
+}  // namespace soft_close
+
 struct GenerateRequest {
     std::vector<int32_t>       prompt;
     int                        n_gen       = 0;
@@ -121,6 +198,13 @@ struct GenerateResult {
     // stream and grepping for "</think>" cannot distinguish the two
     // (the injected close decodes identically).
     bool                       budget_forced_close = false;
+    // True when the soft-close path (logit-ratio peek) injected the
+    // </think> close sequence in this generation. Mutually exclusive
+    // with budget_forced_close: when both could fire on the same step,
+    // soft wins and budget_forced_close stays false. The server uses
+    // this to attribute close_kind="soft" (vs "hard"). See
+    // docs/specs/thinking-budget.md §7.
+    bool                       soft_forced_close = false;
     // True iff the AR decode loop's post-close watchdog detected an n-gram
     // repetition loop and broke out early. Caller surfaces this so clients
     // can mark the answer as unreliable rather than treating the
@@ -212,6 +296,8 @@ struct ModelBackend {
         retry.spec_decode_ran = first.spec_decode_ran || retry.spec_decode_ran;
         retry.budget_forced_close =
             first.budget_forced_close || retry.budget_forced_close;
+        retry.soft_forced_close =
+            first.soft_forced_close || retry.soft_forced_close;
         retry.degenerate_decode_close =
             first.degenerate_decode_close || retry.degenerate_decode_close;
         return retry;