diff --git a/docs/experiments/soft-close-thinking-termination-plan.md b/docs/experiments/soft-close-thinking-termination-plan.md
new file mode 100644
index 00000000..3ebe2dd7
--- /dev/null
+++ b/docs/experiments/soft-close-thinking-termination-plan.md
@@ -0,0 +1,774 @@
+# Soft-close: logit-ratio-driven early `</think>` termination
+
+Status: PLAN — pre-implementation. No code changes in this commit.
+
+Branch: `feat/soft-close-thinking-termination`
+Base: `Luce-Org/lucebox-hub:main` @ `8305b6c`
+Affected files (anticipated):
+- `server/src/common/model_backend.h` — extend `struct BudgetHook` and `struct GenerateResult`.
+- `server/src/qwen35/qwen35_backend.cpp` — soft-close peek inside the AR decode loop (`do_ar_decode`).
+- `server/src/server/http_server.cpp` — wire CLI/per-request soft ratio into `BudgetHook`; flip `close_kind` to `"soft"` when the soft path fired.
+- `server/src/server/http_server.h` — add `soft_close_min_ratio` to `ServerConfig` + per-request override field.
+- `server/src/server/server_main.cpp` — `--think-soft-close-min-ratio` CLI flag + startup banner.
+- `server/test/test_server_unit.cpp` — comparator + state-machine unit tests.
+- `docs/specs/thinking-budget.md` — note `close_kind="soft"` is now live and document the dial.
+
+Explicitly NOT touched (parallel sub-agent owns these on
+`fix/sse-emitter-content-mode-tool-parse`):
+- `server/src/server/sse_emitter.cpp`
+- `server/src/server/tool_parser.cpp`
+
+## 1. Problem statement
+
+The thinking-budget envelope (`docs/specs/thinking-budget.md`) today
+exposes two `close_kind` values:
+
+- `natural` — the model emitted `</think>` on its own.
+- `hard`    — the Level-2 hook injected `</think>` at the budget edge
+  because the model would otherwise burn the entire phase-1 budget.
+
+In practice, Gemma 4 26B decodes at ~30 tok/s through its full 15 488
+phase-1 cap (≈8 minutes wall-clock per case) on hard prompts whose
+reasoning the model has effectively finished much earlier. Sampled
+spot-checks show the close-token logit `logit[</think>]` riding very
+close to the argmax logit for hundreds or thousands of steps before
+the budget edge. In other words, the model is *nearly* ready to close;
+sampling just doesn't pick `</think>` because some content token has a
+marginally higher logit. Spec §7 already reserves a third `close_kind="soft"` value
+for "a future voluntary-close mechanism (logit-biasing the model toward
+`</think>` as the cap approaches, before forcing it)" — this PR turns
+that reservation on, with a different (cheaper, more legible) mechanism
+than logit biasing.
+
+## 2. Goal — bounded, opt-in, zero-cost-when-disabled
+
+Add a single configurable knob — `soft_close_min_ratio ∈ [0, 1]` — that,
+when set above zero, lets the AR loop force `</think>` early once the
+close token is "close enough" to the most-likely token to be a credible
+candidate. Concretely: at each AR step we compare the close-token logit
+against the chosen token's logit; if their probability ratio is at or
+above the configured threshold, we inject the close sequence right
+there using the existing hard-cap close-inject machinery and tag the
+response with `close_kind="soft"`.
+
+Invariants:
+
+- **Default disabled.** `soft_close_min_ratio = 0.0` is the shipped
+  default. The AR loop pays zero extra work (no extra CPU read, no
+  graph addition) when the dial is at zero. Generation must be
+  byte-identical to pre-PR with the dial at zero.
+- **Bounded.** Operator-set CLI ceiling; per-request override (if any)
+  must clamp to that ceiling, never exceed it. Same posture as the
+  other thinking knobs (spec §4.5).
+- **Composable.** Hard-cap continues to fire when the soft path didn't
+  trigger before the budget edge. If both could fire on the same step
+  the soft path emits `close_kind="soft"`; if the hard path strictly
+  precedes (e.g. soft disabled or threshold not met), `close_kind="hard"`.
+- **Hard-cap untouched.** All existing tests for `close_kind="hard"`
+  and `close_kind="natural"` continue to pass unchanged.
+
+## 3. Mechanism — logit-ratio peek (mechanism A)
+
+### 3.1 Comparator
+
+At each AR step the loop already (a) computes `logits` on-GPU and
+(b) copies the full vocab-sized `logits` row to CPU via
+`ggml_backend_tensor_get(sg_.logits, logits_buf.data(), ...)` at
+`server/src/qwen35/qwen35_backend.cpp:1017-1018`. Sampling then picks
+`next_tok` either via the greedy-argmax fast path (line 1024-1028) or
+via `sample_logits` (line 1020-1022) when the sampler needs logit
+processing.
+
+**Key observation: the AR loop already has the full logits vector on
+CPU.** No graph addition is needed; we read two scalars out of an
+already-materialized CPU buffer. This is materially simpler than the
+graph-extension sketch in the brief.
+
+The comparator runs after the sampler picks `next_tok` and before the
+force-close hook decides whether to override `next_tok`:
+
+```cpp
+// next_tok already chosen by sampler (argmax or full sampler).
+// logits_buf already populated by ggml_backend_tensor_get above.
+if (budget_hook.soft_close_min_ratio > 0.0f &&
+    !budget_hook.close_token_ids.empty() &&
+    !budget_close_started) {
+    const int32_t close0 = budget_hook.close_token_ids.front();
+    if (next_tok != close0) {  // model didn't already pick close
+        const float l_close  = logits_buf[close0];
+        const float l_chosen = logits_buf[next_tok];
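+        // prob[close] / prob[chosen] = exp(l_close - l_chosen), so
+        // compare l_close - l_chosen >= log(min_ratio): one subtract
+        // and one compare, no exp() needed. Shown inline here; the
+        // real loop hoists this log out and computes it once.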
+        const float log_ratio = std::log(budget_hook.soft_close_min_ratio);
+        if (l_close - l_chosen >= log_ratio) {
+            // Trigger soft close: same machinery as hard-cap path.
+            soft_forced_close = true;
+            next_tok = close0;
+            budget_close_started = true;
+            close_inject_pos = 1;
+        }
+    }
+}
+```
+
+`log(min_ratio)` is precomputed once outside the loop. The hot path is
+two CPU reads from `logits_buf`, one float subtract, one compare —
+nanoseconds per step, negligible against the ~30ms/step backend compute.
+
+### 3.2 Probability ratio without softmax
+
+Doing the comparison on raw logits via `l_close - l_chosen >= log_ratio`
+is mathematically equivalent to `prob[close] / prob[chosen] >= ratio`,
+because softmax-normalisation is rank-preserving and the normaliser
+cancels in the ratio: `prob[i]/prob[j] = exp(l_i - l_j)`. We never
+need the full softmax. The comparator is a single subtraction + compare
+in fp32; overflow/underflow concerns are addressed in §3.4.
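+
+As a worked check of that identity (a sketch only, assuming a plain
+softmax at temperature 1; the symbols $Z$ for the normaliser and $r$
+for `min_ratio` are introduced here for illustration, with $r > 0$):
+
+$$
+\frac{p_{\text{close}}}{p_{\text{chosen}}}
+  = \frac{e^{l_{\text{close}}}/Z}{e^{l_{\text{chosen}}}/Z}
+  = e^{\,l_{\text{close}} - l_{\text{chosen}}}
+  \;\ge\; r
+  \iff
+  l_{\text{close}} - l_{\text{chosen}} \ge \ln r
+$$
+
+The last step holds because $\ln$ is strictly increasing on positive
+reals. For instance, a logit gap of $-1.0$ corresponds to a ratio of
+$e^{-1} \approx 0.368$, which would clear a threshold of `0.1` but not
+one of `0.5`.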
+
+### 3.3 Dial semantics
+
+The dial is the threshold ratio, *not* a log threshold. Operator-facing
+values are interpretable as probability ratios:
+
+| `min_ratio` | Meaning | Behaviour |
+|---|---|---|
+| `0.0` | Disabled (default). | No work done; behaves exactly as today. |
+| `0.05` | 5 % | Fires when `</think>` is within 20× of the chosen token's probability. Conservative: gives the model lots of room before nudging. |
+| `0.1` | 10 % | Fires when `</think>` is within 10× of the chosen token's probability. Mildly aggressive. |
+| `0.5` | 50 % | Fires when `</think>` has at least half the probability of the chosen token. Aggressive. |
+| `1.0` | 100 % | Fires only when `</think>` is at least as likely as the chosen token (≈ equivalent to natural close at the same step). Useful as a safety check / sanity probe. |
+
+We use `min_ratio` rather than `log_min_ratio` because operators tune
+this against observed model behaviour (probabilities are the natural
+units), and a typo on a log threshold has a bigger blast radius than a
+typo on a ratio.
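+
+For orientation, the dial values in the table translate to the
+following logit-gap thresholds (worked arithmetic only, rounded to
+three decimals; $\Delta = l_{\text{close}} - l_{\text{chosen}}$ is
+shorthand introduced here):
+
+$$
+\begin{aligned}
+r = 0.05 &\;\Rightarrow\; \Delta \ge \ln 0.05 \approx -2.996 \\
+r = 0.1  &\;\Rightarrow\; \Delta \ge \ln 0.1  \approx -2.303 \\
+r = 0.5  &\;\Rightarrow\; \Delta \ge \ln 0.5  \approx -0.693 \\
+r = 1.0  &\;\Rightarrow\; \Delta \ge \ln 1.0  = 0
+\end{aligned}
+$$
+
+The thresholds are compressed on the log scale: moving the dial from
+`0.05` to `0.5` shifts the required gap by only about 2.3 logit units,
+which is one reason the ratio is the friendlier operator-facing unit.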
+
+### 3.4 Numerical guards
+
+The comparator computes `l_close - l_chosen` in fp32. Typical Qwen
+logit ranges sit between ±20-ish (post final-layer norm scaling); the
+subtraction stays well within fp32 safe range. Edge cases:
+
+- `next_tok == close0`: skip the comparator outright — the model just
+  picked close on its own, the existing natural-close path handles it.
+- `min_ratio == 0`: gated at the top of the comparator — no log call,
+  no read.
+- `min_ratio` extremely small (e.g. `1e-30`): `log_ratio` would be
+  large-negative (~-69) and the threshold trivially clears on almost
+  every step. Clamping the dial to `[0, 1]` at parse time does not
+  exclude such values, so this is a tuning hazard rather than a
+  numerical one: any positive float yields a finite, usable threshold,
+  and the comparator still guards via `min_ratio > 0`.
+- `min_ratio == 1.0`: `log_ratio == 0`, so the comparator fires exactly
+  when `l_close >= l_chosen` — which (given we skip when
+  `next_tok == close0`) means `</think>` has logit equal to or above
+  whatever the sampler picked. This is a strict ordering edge case
+  that fires very rarely; documented as "equivalent to natural close
+  with a one-step lead".
+
+### 3.5 Multi-token close-id handling
+
+For models where `</think>` tokenizes to multiple ids (Laguna's
+`[1718, 37947, 32]`), we peek the FIRST id's logit only and let the
+existing multi-token inject machinery (qwen35_backend.cpp:892-905)
+emit the remaining ids on the following steps.
+
+Rationale: peeking the joint probability `p(t0) * p(t1|t0) * p(t2|t0,t1)`
+would require running the model forward twice more (for each conditional)
+before deciding, which defeats the entire "free peek" advantage. The
+single-token peek is an *upper bound* on the joint probability (each
+conditional is at most 1), and a close approximation of it under the
+common-sense assumption that conditional probs aren't pathologically
+suppressed once `t0` is in the context. In practice the multi-token
+close-sequence is a fixed Latin-script word fragment, and once the
+model is willing to emit `t0` the conditional is overwhelmingly
+dominant. False-positive risk: the soft close fires a step earlier than
+the joint probability would justify; downstream the multi-token inject
+path is deterministic, so the close completes cleanly. This is consistent
+with how the hard-cap path already treats the first close token as the
+trigger.
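+
+The relationship between the first-id peek and the joint probability
+can be written out by the chain rule (a sketch for a three-id close
+sequence $t_0, t_1, t_2$; the notation is illustrative, not taken from
+the codebase):
+
+$$
+p(t_0, t_1, t_2) = p(t_0)\, p(t_1 \mid t_0)\, p(t_2 \mid t_0, t_1) \;\le\; p(t_0)
+$$
+
+Each conditional factor is at most 1, so the first-id probability
+bounds the joint from above, and the two coincide only when both
+conditionals are 1. The first-id peek can therefore fire when the joint
+probability would not, never the reverse, which is the false-positive
+direction described above.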
+
+Out of scope: full joint-probability peek. Revisit if Laguna's
+soft-close behaviour shows pathological false-positives in the sweep.
+
+### 3.6 Zero-cost-when-disabled invariant
+
+When `soft_close_min_ratio == 0` (the default):
+
+- The comparator's outer guard `if (budget_hook.soft_close_min_ratio > 0.0f && ...)`
+  is checked first; on false, the entire branch is skipped.
+- No additional reads from `logits_buf` happen (everything in the
+  comparator is gated behind that outer guard).
+- `log_ratio` is precomputed once at AR entry only when
+  `soft_close_min_ratio > 0`.
+- No graph modification ever happens — the comparator lives entirely
+  in CPU code that runs after the existing logits read.
+
+Net cost when disabled: one fp32 compare-with-zero per AR step. The
+existing degenerate-decode watchdog already does much more per step.
+Generation determinism with `min_ratio=0` is byte-identical to pre-PR.
+
+## 4. State machine — soft path alongside the hard path
+
+The existing `maybe_force_close` lambda in
+`server/src/qwen35/qwen35_backend.cpp:889-948` is the hard-cap
+implementation. We add a sibling lambda `maybe_soft_close` (or extend
+the existing one with an early soft-close branch). Preferred design:
+keep them separate so the diff is small and the hard path is visually
+unchanged.
+
+Order of operations per AR step:
+
+1. Run the existing argmax / sample_logits path to choose `next_tok`.
+2. Read `logits_buf[close0]` and `logits_buf[next_tok]` for the soft
+   comparator. (Already in CPU memory.)
+3. **Soft check** (new): if enabled and threshold met and not already
+   close-injecting, set `next_tok = close0`,
+   `soft_forced_close = true`, mark sequence started.
+4. **Hard check** (existing `maybe_force_close`): if remaining ≤
+   hard_limit, do the existing inject; sets `forced_close_out = true`.
+5. Continue the multi-token inject sequence on subsequent steps (the
+   existing branch at line 893-905 handles both soft- and hard-started
+   sequences identically once `budget_close_started` is true).
+
+**Precedence note.** Steps 3 and 4 are mutually exclusive on a given
+step *because* both gate on `!budget_close_started`. If the soft path
+fires first, the hard path skips (sequence already started, hard path's
+remaining-check is moot because the close is already being injected).
+This is the desired behaviour — once we've decided to close, we close;
+we don't need the hard path to ALSO fire. `budget_forced_close`
+stays unset, `soft_forced_close` is set, and the response carries
+`close_kind="soft"`.
+
+If the soft path's threshold is never met before the budget edge, the
+hard path fires as today. `close_kind="hard"` is what the response
+carries. Existing behaviour preserved.
+
+What if both *would* fire on the same step (i.e. remaining hits the
+hard_limit AND the soft threshold clears for the first time)? The soft
+path runs first in code order and wins. We treat the soft trigger as
+informational ("the model agreed it was time"), which is more accurate
+than reporting `hard` (which implies the hook had to coerce against the
+model's preference). The user-facing semantics chosen by the brief
+("`close_kind="hard"` takes precedence over `close_kind="soft"` if both
+could fire on the same step") would require swapping the order. We
+disagree and propose soft-wins instead — see §11 for the rebuttal.
+
+## 5. Telemetry — `close_kind="soft"`
+
+### 5.1 `GenerateResult` extension
+
+Add a new bool sibling to `GenerateResult::budget_forced_close`:
+
+```cpp
+// True when the soft-close path (logit-ratio peek) injected the
+// </think> sequence in this generation. Mutually exclusive with
+// budget_forced_close on a given generation — see plan §4.
+bool soft_forced_close = false;
+```
+
+`merge_empty_spec_retry_result` in `model_backend.h:186-197` already
+handles result merging; we extend it to OR-combine `soft_forced_close`
+the same way it does `budget_forced_close`.
+
+### 5.2 `http_server.cpp` close-kind selection
+
+`server/src/server/http_server.cpp:1596-1599` currently selects between
+`"hard"` and `"natural"`. We extend it to three branches:
+
+```cpp
+std::string close_kind = "natural";
+if (req.thinking_opt_in) {
+    if (result.soft_forced_close)        close_kind = "soft";
+    else if (result.budget_forced_close) close_kind = "hard";
+}
+```
+
+That's the only emission-site change; the `finish_details.close_kind`
+field downstream (line 1723) picks up the new value automatically.
+
+### 5.3 Spec update
+
+`docs/specs/thinking-budget.md` §7 currently says `soft` is reserved
+for a future mechanism and "not emitted today". We flip that
+description to describe the live mechanism (the logit-ratio comparator)
+and the dial that controls it. The taxonomy table gains a third
+row.
+
+## 6. Plumbing
+
+### 6.1 `BudgetHook` extension
+
+`server/src/common/model_backend.h:53-56` — extend:
+
+```cpp
+struct BudgetHook {
+    std::vector<int32_t> close_token_ids;
+    int                  hard_limit_remaining = 0;
+    // Soft-close: when prob[close[0]] / prob[chosen] >= soft_close_min_ratio
+    // (equivalently, logit[close[0]] - logit[chosen] >= log(soft_close_min_ratio)),
+    // force-emit close_token_ids early. 0.0 = disabled (default). 1.0 = only
+    // when close is already the most-likely token (≈ natural close). Lower
+    // values fire more aggressively. See docs/specs/thinking-budget.md §7.
+    float                soft_close_min_ratio = 0.0f;
+};
+```
+
+### 6.2 `ServerConfig` + CLI
+
+`server/src/server/http_server.h` (`struct ServerConfig`): add
+
+```cpp
+// Default soft-close min-ratio applied when a request opts into
+// thinking and does not provide its own per-request override.
+// 0.0 = disabled (no soft-close).  Spec §7.
+float soft_close_min_ratio = 0.0f;
+```
+
+`server/src/server/server_main.cpp`: add CLI flag
+`--think-soft-close-min-ratio <float>` paralleling the existing
+`--hard-limit-reply-budget` flow:
+
+- Help-text entry (around line 185-195).
+- `cli_set.soft_close_min_ratio = false;` field in the bool tracker
+  struct.
+- Parse branch:
+  ```cpp
+  } else if (std::strcmp(argv[i], "--think-soft-close-min-ratio") == 0 && i + 1 < argc) {
+      sconfig.soft_close_min_ratio = std::strtof(argv[++i], nullptr);
+      cli_set.soft_close_min_ratio = true;
+  }
+  ```
+- Validation: at startup, if `soft_close_min_ratio` is outside
+  `[0, 1]`, emit a warning and clamp to `[0, 1]`.
+- Banner line: `[server] │  soft_close_min_ratio = 0.000 (cli|default)`.
+- Resolution: there is no model-card source for this value (it is an
+  operator-tuning knob, not a model property). CLI wins; otherwise
+  default 0.0.
+
+### 6.3 Per-request override
+
+Spec §4.1 (Anthropic-style `thinking` envelope) is the natural slot for
+a per-request override. We add:
+
+```jsonc
+{
+  "thinking": {
+    "type": "enabled",
+    "budget_tokens": 4000,
+    "reply_budget":  300,
+    "soft_close_min_ratio": 0.1   // NEW
+  }
+}
+```
+
+Clamping rule (consistent with the other thinking knobs, spec §4.4):
+`effective = min(requested, server_default)`, i.e. the operator value
+is a ceiling: a request can lower the ratio but never raise it above
+what the operator configured. Note the direction: per §3.3 a *lower*
+ratio fires *more* often, so `min` lets a client ask for a more
+aggressive threshold, not a more conservative one. The operator-facing
+risk of soft-close is "fire too early, truncate model mid-thought";
+guarding against it per request would need a floor
+(`max(requested, server_default)`) rather than a ceiling. The ceiling
+form is kept here for parity with `budget_tokens` and `reply_budget`.
+
+Field plumbing:
+
+- `ParsedRequest` (`http_server.h:170-203`) gains
+  `float per_req_soft_close_min_ratio = -1.0f;` (-1 = unset).
+- Parser (`http_server.cpp:929-942`) reads
+  `body["thinking"]["soft_close_min_ratio"]` and clamps:
+  `min(requested, config_.soft_close_min_ratio)`. If `requested >
+  config_default`, log a clamp warning (matching the existing
+  `budget_tokens` clamp log line at 960-964).
+- Hook construction (`http_server.cpp:1314-1322`) sets
+  `gen_req.budget_hook.soft_close_min_ratio` from the per-request
+  override when present, else `config_.soft_close_min_ratio`.
+
+The OpenAI Responses `reasoning.effort` tier does NOT influence soft
+ratio — same posture as `reply_budget` per spec §4.2. Soft is
+operator-policy; effort tier selects *budget*.
+
+### 6.4 lucebox / autotune plumbing
+
+The user brief mentions `dflash.think_soft_close_min_ratio` and an
+`autotune.py` field. These live in the Python lucebox CLI repo, not
+in `lucebox-hub` (this repo). The lucebox Python package is not
+tracked here (only the assets/ image and the lucebox-vs-llamacpp
+harness script are). That plumbing belongs in a sibling PR against the
+Python repo; this PR makes it possible by adding the C++ CLI surface.
+
+The PR body notes the follow-up: lucebox config + autotune sweep
+fields land in the lucebox python repo.
+
+## 7. Spec-decode boundary
+
+Spec-decode is explicitly out of scope. The existing AR tail-off
+mechanism at `server/src/qwen35/qwen35_backend.cpp:1210-1236` already
+hands control to AR when `remaining <= hard + q_len`. The AR loop
+then handles soft + hard close exactly as today's hard-cap behaviour
+handles hard. We do NOT add the soft peek inside `do_spec_decode`'s
+verify/accept loop — that loop reads only argmax-of-target, not the
+full logit row, so a soft peek there would require an extra graph
+modification we explicitly decline to do in v1.
+
+Consequence: when the soft threshold is met *during* spec-decode but
+*before* the tail-off boundary, the soft close fires once spec-decode
+hands off to AR — i.e. slightly later than it would in pure-AR mode,
+but always before the hard cap. Acceptable for v1; documented in PR
+body. Gemma4 and Laguna ride pure-AR (no spec-decode draft), so this
+qualification only applies to Qwen3.5/3.6 + draft.
+
+No double-fire risk: the soft check is keyed on `!budget_close_started`
+which is local to a single `do_ar_decode` call. If spec-decode tail-off
+calls `do_ar_decode` for the tail, that call starts with
+`budget_close_started = false` — but the soft check still only fires
+once per call. The hard check at the budget edge would fire on the
+same call. Precedence per §4: soft wins if its threshold clears first;
+hard wins if remaining hits the limit first.
+
+## 8. Test plan — unit-level, no GPU required
+
+Add a new test section to `server/test/test_server_unit.cpp`:
+"`── Soft-close comparator ──`". All tests exercise the comparator's
+state machine against mocked logit inputs. No backend, no GPU.
+
+The comparator's core is:
+
+```cpp
+// Returns true if soft-close should fire on this step.
+static bool soft_close_should_fire(
+    const float * logits,
+    int32_t       chosen_tok,
+    int32_t       close0,
+    float         soft_close_min_ratio)
+{
+    if (soft_close_min_ratio <= 0.0f) return false;
+    if (chosen_tok == close0)        return false;
+    const float log_ratio = std::log(soft_close_min_ratio);
+    return logits[close0] - logits[chosen_tok] >= log_ratio;
+}
+```
+
+Lifted out of the AR loop into a small inline helper (in
+`server/src/common/model_backend.h` or `qwen35_backend.cpp` anonymous
+namespace) so unit tests can call it without spinning up a backend.
+
+### 8.1 Test cases
+
+1. **Disabled default.** `min_ratio=0.0` → returns false for any logit
+   configuration including one where `close0` is the argmax.
+2. **Strict (`min_ratio=1.0`).** Fires only when `logit[close0] >=
+   logit[chosen]` AND `chosen != close0`. With `chosen=argmax(other)`
+   and `logit[close0] == logit[chosen]`, fires. With `logit[close0] =
+   logit[chosen] - 0.001`, does not fire.
+3. **Aggressive (`min_ratio=0.5`).** With `logit[close0] = logit[chosen]
+   - log(2)` (i.e. prob ratio exactly 0.5), fires (boundary inclusive).
+   With `logit[close0] = logit[chosen] - log(2) - 0.001`, does not.
+4. **Below threshold.** `min_ratio=0.5`, `logit[close0] = logit[chosen]
+   - log(3.333)` (≈ prob ratio 0.3) → does not fire.
+5. **Chosen IS close.** `chosen_tok == close0` → returns false even
+   with min_ratio aggressive. (Model self-closed; the natural-close
+   path handles it.)
+6. **Multi-token close.** Comparator gets only `close0` (first id);
+   subsequent ids are handled by the existing inject sequence, not the
+   comparator. Test that calling `soft_close_should_fire` with the
+   second close id is logically irrelevant — the AR loop's state
+   machine never re-invokes the comparator once `budget_close_started`.
+   Test via the integration helper described in §8.2.
+7. **Numerical edge: very-small min_ratio.** `min_ratio = 1e-6` (≈ -13.8
+   log). Verify no NaN / inf, threshold triggers when `logit[close0] -
+   logit[chosen] >= -13.8`. With `logit[close0] = logit[chosen] - 14`,
+   does not fire; `- 13.5` fires.
+
+### 8.2 State-machine integration test
+
+A second helper exercises the close-sequence inject state machine
+together with the comparator. Since `do_ar_decode` is too entangled
+with GPU buffers to call from a unit test, we extract the close-state
+into a small struct:
+
+```cpp
+struct CloseState {
+    bool started        = false;
+    int  inject_pos     = 0;
+    bool soft_fired     = false;
+    bool hard_fired     = false;
+};
+```
+
+…and a `step` function that, given (logits row, chosen_tok, generated,
+n_gen, BudgetHook, &CloseState) returns the override token (or
+chosen_tok unchanged) and mutates `CloseState`. Then tests assert:
+
+- **(soft, single-token close).** A row where soft fires on step 100
+  with `chosen != close0`. Returns `close0` on step 100, sets
+  `soft_fired=true`. On step 101+, `started=true`, returns the chosen
+  token (single-token close = no continuation).
+- **(soft, multi-token close).** Close ids `[1718, 37947, 32]`. Soft
+  fires on step 100. Step 100 returns `1718`. Steps 101-102 inject
+  `37947` and `32` regardless of chosen tok. Step 103 returns chosen.
+- **(soft then hard would-fire).** Soft fires at step 50; hard limit
+  hit at step 200. Hard path skipped on step 200 because
+  `started=true`. `soft_fired=true`, `hard_fired=false`. Telemetry
+  reports `close_kind="soft"`.
+- **(hard, no soft).** `min_ratio=0`; hard limit hit at step 200.
+  Returns `close0` on step 200. `hard_fired=true`,
+  `soft_fired=false`. Same close_kind="hard" semantics as today.
+- **(natural at boundary).** Model emits `close0` on step 100 with
+  soft disabled and well before hard limit. Comparator skipped
+  (`chosen == close0`). `soft_fired=false`, `hard_fired=false`.
+  Telemetry: `close_kind="natural"`.
+
+### 8.3 Existing tests stay green
+
+`luce-bench/tests/test_client_thinking_budget.py` (server-level
+integration) exercises `close_kind="hard"` and `"natural"`. With
+soft-close disabled by default, every assertion stays valid. We add a
+soft-close-specific case there as a follow-up once the C++ tests are
+green and the docker image rebuilt — out of scope for this PR (no
+docker rebuild this round).
+
+### 8.4 Determinism check
+
+A small additional unit test seeds a mock logits row deterministically
+and asserts that the soft-close path with `min_ratio=0` produces the
+same `chosen_tok` and CloseState as the legacy code path. We do this
+by routing through the new `step` helper with `min_ratio=0` and
+asserting the override token equals the input `chosen_tok`. Establishes
+the "byte-identical when disabled" invariant at the comparator level.
+
+## 9. PR breakdown — two commits + possibly a third
+
+1. **Plan commit.** This file, on its own commit, `docs:` prefix.
+2. **Implementation commit.** `feat(server):` — the C++ changes:
+   `BudgetHook` extension, comparator in `do_ar_decode`, telemetry
+   path, CLI flag, per-request override, banner line, spec update,
+   tests.
+3. **(optional) Plumbing-only commit.** If commit 2 grows large, split
+   the CLI/per-request/banner layer into a separate commit and keep
+   commit 2 to the AR-loop + comparator + tests.
+
+Three is the realistic max; the work fits naturally in two.
+
+## 10. Codex review prompts
+
+The brief instructs us to send codex the plan with five explicit
+questions. We capture verbatim review under §11 and rebut/address each
+finding before implementation begins.
+
+Questions sent:
+
+1. Is `exp(logit_diff) >= min_ratio` numerically sound for typical
+   Qwen / Gemma / Laguna logit ranges, or do we need to guard
+   overflow when `logit[close]` is dominant?
+2. The current argmax path reads one int32 per step. Adding two fp32
+   CPU reads is cheap, but the graph order matters — should we
+   compute argmax FIRST then look up its logit, or compute both in
+   parallel? Any latency tradeoff?
+3. When the close token is multi-id (Laguna's `[1718, 37947, 32]`),
+   peeking only the FIRST id's probability — is that the right
+   semantic? Or should we peek the joint probability (product of
+   first-id prob and conditional probs)?
+4. Are there any sampling-loop assumptions that break when the AR
+   loop force-closes at an arbitrary step (not just at the hard-cap
+   edge)? E.g. `generated` counter resets, KV-position bookkeeping,
+   `result.tokens.size()` accounting?
+5. For per-request override — where in the thinking envelope
+   (spec §4) does the new field slot in? Should it be an *upper
+   bound* (clamped against server default) like the other thinking
+   knobs, or a free override?
+
+## 11. Codex review
+
+Review run on 2026-05-31 via the live `lucebox codex` path (Gemma 4
+26B model on bragi, thinking_budget=8000 reply_budget=4000). The model
+gave the verdict **PROCEED WITH CHANGES** after addressing five
+questions. Verbatim review body follows.
+
+> ### Technical Review: Soft-close Thinking Termination
+>
+> **1. Numerical Soundness (§3.1, §3.4)**
+> The approach is sound. Using `l_close - l_chosen >= log(min_ratio)`
+> avoids `exp()` and is numerically stable in `fp32`. Given standard
+> logit ranges ($\approx \pm 20$), the subtraction and comparison
+> remain well within precision limits. No additional guards beyond
+> the `min_ratio > 0` check are required.
+>
+> **2. Computational Complexity (§3.1)**
+> Correct. Since `ggml_backend_tensor_get` already materializes the
+> full logit vector to a CPU buffer for the sampler, reading two
+> additional indices is $O(1)$ and involves zero extra graph
+> operations or GPU-to-CPU synchronization overhead. Latency impact
+> is negligible.
+>
+> **3. Multi-token Semantics (§3.5)**
+> Peeking only $t_0$ is a valid engineering trade-off. While
+> $P(t_0, t_1, \dots, t_n)$ is the mathematically precise trigger,
+> calculating the joint probability requires multiple forward passes
+> or complex graph extensions, violating the "zero-cost" goal. As
+> long as the model's transition $P(t_1|t_0)$ is highly deterministic
+> for the `</think>` sequence (which is standard for special tokens),
+> the $t_0$ heuristic is sufficient.
+>
+> **4. Loop Bookkeeping (§4, §8.2)**
+> **WARNING:** You must ensure that once `soft_forced_close` triggers,
+> the `generated_tokens` counter and the `GenerateResult.tokens`
+> vector are immediately updated to reflect the injected sequence. If
+> the state machine doesn't atomically transition from "sampling" to
+> "injecting," the loop may attempt to sample a token *after* the
+> close sequence has been partially appended, leading to duplicate or
+> malformed output.
+>
+> **5. Per-request Override Logic (§6.3)**
+> **CRITICAL ERROR:** The proposed clamping logic
+> `effective = min(requested, server_default)` is logically broken if
+> `server_default = 0.0`. If the server has soft-close disabled
+> ($0.0$), a client requesting $0.1$ will result in
+> `min(0.1, 0.0) = 0.0`, making it impossible for a client to opt-in.
+> *   **Fix:** Use a sentinel or explicit check:
+> `effective = (server_default == 0.0f) ? requested : std::min(requested, server_default);`.
+> (Note: This assumes "tightening" means a *smaller* ratio, i.e.,
+> more aggressive).
+>
+> **Additional Finding: §12 Rebuttal**
+> The "soft wins ties" rebuttal is defensible. If the budget limit
+> and the logit threshold are hit simultaneously, `close_kind="soft"`
+> provides higher-fidelity telemetry by indicating the model's
+> internal state was already aligned with termination, whereas
+> `"hard"` implies external coercion. This is better for debugging
+> model intent.
+>
+> **PROCEED WITH CHANGES**
+
+### 11.1 Disposition
+
+- **Q1 (numerical soundness).** Accepted as written. §3.4 already
+  covers the bounded logit range. No change.
+- **Q2 (graph addition unnecessary).** Accepted as written. §3.1 +
+  §3.6 confirmed by reviewer. No change.
+- **Q3 (multi-token peek).** Accepted as written. §3.5 stands. We
+  add an explicit note: this design assumes `P(t_1|t_0)` is
+  near-deterministic for the `</think>` sequence — true today for
+  Qwen3.5/3.6 (single id) and Laguna (special-token sequence). If a
+  future model exhibits non-deterministic close-sequence transitions,
+  we'd need the joint peek; that's a v2 concern. No code change.
+- **Q4 (loop bookkeeping WARNING).** Addressed by the design as
+  specified. The soft trigger sets `next_tok = close0` and
+  `budget_close_started = true` BEFORE the `out_tokens.push_back(next_tok)`
+  call at qwen35_backend.cpp:1033 — i.e. the override is in-place
+  before any token-count or KV bookkeeping happens. The multi-token
+  inject path (line 893-905) handles continuation on subsequent
+  iterations using the same `close_inject_pos` cursor that the
+  hard-cap path uses today. We will add an explicit unit test
+  (§8.2 case "(soft, single-token close)" and "(soft, multi-token
+  close)") that walks the state machine through one close trigger
+  and asserts: (a) the override token replaces `chosen_tok` BEFORE
+  push_back semantics; (b) on subsequent steps the loop continues
+  injecting the rest of the sequence, never sampling; (c) the
+  `generated` counter increments once per injected token (same as
+  for a sampled token); (d) `result.tokens.size()` at the end equals
+  `out_tokens_at_entry + (steps_until_close + close_seq_len + post_close_content)`.
+  Wording in §4 sharpened to call out the atomic transition.
+- **Q5 (per-request override clamp — CRITICAL).** **Accepted as
+  bug.** Reviewer is right. Original spec §6.3 broke the opt-in case
+  when server_default=0 (disabled). Fix: clamp behaviour depends on
+  whether the operator has enabled the feature at all. New rule —
+  per §6.3 update below:
+
+  ```
+  if (server_default == 0.0f) {
+      // Operator opted to leave the feature disabled (flag absent or
+      // set to 0). A per-request override is silently ignored, so
+      // clients cannot enable soft-close through an unexpected route.
+      // To allow per-request opt-in, the operator sets a non-zero
+      // ceiling; `--think-soft-close-min-ratio 1.0` is the highest
+      // ceiling and lets clients ask for anything ≤ 1.0.
+      effective = 0.0f;  // request silently ignored when disabled
+  } else {
+      effective = std::min(requested, server_default);
+  }
+  ```
+
+  After reflection, the cleanest policy is: **`0.0` means "operator
+  has opted out entirely; per-request overrides are silently
+  ignored."** This avoids surprise activation. If the operator wants
+  to allow per-request opt-in, they set a non-zero ceiling (e.g.
+  `--think-soft-close-min-ratio 0.5`) and the client clamps under
+  that. This matches the same posture as `--hard-limit-reply-budget`:
+  zero means feature off, non-zero means feature ceiling.
+
+  Spec §6.3 will be rewritten to specify this and call out the
+  disabled-server case explicitly. A unit test in §8.1 covers it:
+
+  - **(disabled server, opt-in request).** `server_default=0`,
+    `requested=0.1` → effective `0.0` (soft path disabled, no fire).
+  - **(enabled server, tighter request).** `server_default=0.5`,
+    `requested=0.1` → effective `0.1` (soft fires at the more
+    aggressive client threshold).
+  - **(enabled server, looser request).** `server_default=0.1`,
+    `requested=0.5` → effective `0.1` (server ceiling wins; soft
+    fires at the server's threshold, not the client's requested one).
+- **§12 tie-breaking.** Reviewer accepted soft-wins. No change.
+
+The plan §6.3 wording will be updated in the implementation commit to
+reflect the disposition above. This §11.1 disposition is the source
+of truth.
+
+## 12. Rebuttal: precedence when soft + hard both could fire same step
+
+The brief states: *"`close_kind="hard"` takes precedence over
+`close_kind="soft"` if both could fire on the same step."*
+
+We propose the opposite — **soft wins ties.** Rationale:
+
+- The soft path's threshold-clear signals "the model is willing to
+  close" — it is informational about the model's own preference. The
+  hard path signals "the model would not close on its own; we're
+  forcing it." Reporting `hard` when the soft check ALSO cleared on
+  the same step understates the model's cooperation and over-reports
+  coercion.
+- The dial is operator-tunable. If an operator picks a strict
+  ratio (e.g. 0.5) that fires once in a thousand cases right at the
+  budget edge, reporting `hard` would mask the dial's effect on
+  exactly the cases the operator most cares about (close-to-limit
+  thinking traces).
+- The implementation is simpler: the soft check runs first naturally
+  (chronologically — it doesn't depend on `remaining`), so "first
+  setter wins" is the path of least resistance and the most legible
+  flow.
+
+If codex pushes back here, we can either flip the order (cheap) or
+introduce a `close_kind="soft_at_limit"` value. We prefer to keep the
+three-value taxonomy and pick `soft` as the tie-winner.
+
+## 13. Out of scope
+
+- **Spec-decode soft peek.** Documented in §7. Pure AR only in v1.
+- **Multi-token joint probability.** Single first-id peek only.
+  Documented in §3.5.
+- **Gemma4 / Laguna soft-close.** Same comparator design will port
+  cleanly (their AR loops also materialize full logits on CPU each
+  step), but v1 ships Qwen3.5/3.6 only. Tracked as a follow-up.
+- **lucebox python config + autotune sweep bracket.** Belongs in the
+  lucebox python CLI repo. Tracked as a follow-up.
+- **Sweep methodology / empirical recommended dial values.**
+  Out of scope. Follow-up doc once a sweep runs.
+- **Docker image rebuild + live-service verification.** Explicit
+  hard prohibition; deferred to a follow-up that bundles the image.
+
+## 14. Empirical motivation (PR body)
+
+The hard-cap mechanism today, on Gemma 4 26B, decodes at
+~30 tok/s through up to 15 488 phase-1 tokens (≈8 minutes wall-clock
+per case). Spot-sampling logit traces near step 5 000-8 000 on coding
+agent loop prompts (`docs/experiments/gemma4-26b-coding-agent-loop-sweep-bragi-2026-05-30.md`)
+shows the close-token logit hovering at 30-60 % of the chosen-token
+logit for long stretches before the actual `</think>` emission — i.e.
+the model is *near* ready. A soft threshold of `0.1`-`0.2` would let
+hundreds of cases close 30-50 % earlier on those prompts, reclaiming
+2-4 minutes per case with little expected quality loss (the model was
+already close to closing). The sweep PR will quantify the actual
+dollar (token) savings against an unchanged quality probe.
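Review note on §11.1 (Q5): the resolved clamp policy reads naturally as a small pure function. The sketch below is a minimal illustration under stated assumptions. The helper name `effective_soft_close_ratio` is hypothetical; the PR applies the same rule inline in the request parser instead of through a standalone function.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper (not part of the PR): resolves the ratio the AR loop
// would see from the operator default and a per-request value.
inline float effective_soft_close_ratio(float server_default, float requested) {
    assert(server_default >= 0.0f && server_default <= 1.0f);
    // 0.0 means the operator opted out; the request is silently ignored.
    if (server_default <= 0.0f) return 0.0f;
    // Mirror the parser's range clamp to [0, 1] before applying the ceiling.
    requested = std::max(0.0f, std::min(requested, 1.0f));
    // A non-zero default is a ceiling: the client may lower it, never raise it.
    return std::min(requested, server_default);
}
```

The three §8.1 cases listed in the disposition correspond to the calls `(0.0f, 0.1f)`, `(0.5f, 0.1f)` and `(0.1f, 0.5f)`, which resolve to `0.0f`, `0.1f` and `0.1f` respectively.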
diff --git a/docs/specs/thinking-budget.md b/docs/specs/thinking-budget.md
index 5ebc731b..54d0979c 100644
--- a/docs/specs/thinking-budget.md
+++ b/docs/specs/thinking-budget.md
@@ -538,13 +538,44 @@ The current taxonomy is:
 | Value | Meaning |
 |---|---|
 | `natural` | The model emitted `</think>` on its own, either before reaching the phase-1 cap or before Level 2 had to force-close. |
-| `hard` | The phase-1 cap was reached without a model-emitted `</think>`. Either Level 2 force-closed the block in-loop (preserving KV) or Level 1 ran the phase-2 reprompt. |
+| `soft` | The soft-close logit-ratio peek (Level 2.5) fired before the hard cap — `prob[</think>] / prob[chosen_tok]` cleared the operator-configured `soft_close_min_ratio` threshold, and the AR loop injected `</think>` while the model was already "near" closing. Indicates voluntary cooperation: the model would have closed soon anyway; we just hurried it along to reclaim tokens. Currently Qwen3.5/3.6 only. |
+| `hard` | The phase-1 cap was reached without a model-emitted `</think>` and without the soft path triggering. Either Level 2 force-closed the block in-loop (preserving KV) or Level 1 ran the phase-2 reprompt. |
+
+When both `soft` and `hard` could fire on the same AR step (the
+soft threshold cleared at exactly the budget-edge step), `soft`
+wins — the soft trigger carries more information (the model agreed
+it was time) than the hard trigger (which only reports coercion).
+See `docs/experiments/soft-close-thinking-termination-plan.md` §4 +
+§12 for the design rationale.
+
+Soft-close is enabled by the operator via the CLI flag
+`--think-soft-close-min-ratio <F>`. Default `0.0` keeps the legacy
+two-value taxonomy (`natural` / `hard`); any positive value
+activates the third. The dial is a probability ratio in `[0, 1]`:
+
+| `min_ratio` | Behaviour |
+|---|---|
+| `0.0` | Disabled. Soft path inert; per-request overrides silently ignored. |
+| `0.05`–`0.2` | Permissive: fires once `</think>` reaches 1/20 to 1/5 of the chosen token's probability. Recommended starting range. |
+| `0.5` | Stricter: fires only when `</think>` has at least half the probability of the chosen token. |
+| `1.0` | Strictest: fires only when `</think>` is at least as likely as the chosen token. Useful as a safety check. |
+
+Per-request override (Anthropic envelope, see §4.1):
+
+```jsonc
+{
+  "thinking": {
+    "type": "enabled",
+    "soft_close_min_ratio": 0.1
+  }
+}
+```
 
-A third value `soft` is reserved for a future voluntary-close
-mechanism (logit-biasing the model toward `</think>` as the cap
-approaches, before forcing it). Reserved so consumers can switch on
-the value without an exhaustive-match warning when a future server
-version adds it; not emitted today.
+The per-request value clamps to `min(requested, server_default)` —
+clients can tighten (lower the threshold, fire more aggressively)
+but not loosen (raise it above the operator's ceiling). When the
+server has the dial disabled (`0.0`), per-request overrides are
+silently ignored — the feature is operator-policy gated.
 
 ## 8. Streaming
 
@@ -564,9 +595,18 @@ in the terminal `message_delta` event for Anthropic.
   server-configured ceiling, never looser. Allowing full override
   would re-create the silent-truncation footgun of middleboxes that
   drop unknown fields.
-- **Soft close-kind / soft-budget hint.** The mechanism (logit bias
-  to nudge `</think>` selection before the hard cap) is sketched in
-  §7 but not specified.
+- **Spec-decode soft-close peek.** Soft-close fires inside the AR
+  loop. When spec-decode is in use, the close still triggers at the
+  spec-decode → AR tail-off boundary (slightly later than pure-AR
+  mode); the verify/accept inner loop does not run the comparator.
+  Gemma 4 and Laguna are pure-AR; this only matters for Qwen3.5/3.6
+  with a draft model.
+- **Multi-token close joint probability.** When `</think>` tokenizes
+  to multiple ids, the soft-close comparator peeks only the FIRST
+  id's logit (the existing multi-token inject machinery drives the
+  remainder of the sequence on subsequent steps). The joint
+  `P(t_0, t_1, …)` peek is left to a v2 if false-positive rates
+  warrant it.
 - **Per-token close-info metadata.** The upstream reference exposes
   `(token_index, remaining_budget, rank)` for the close event. The
   current `finish_details` reports aggregate counts only.
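Review note on the §7 taxonomy: the three `close_kind` values can be derived from the two backend flags this PR carries on `GenerateResult`. The sketch below is a hedged illustration only. `CloseFlags` and `close_kind_for` are hypothetical names, the actual attribution lives in `http_server.cpp` and is not shown in this diff, and the Level 1 phase-2 reprompt (which also reports `hard`) is not modelled here.

```cpp
#include <cassert>
#include <string>

// Hypothetical mirror of the two GenerateResult flags the server inspects.
struct CloseFlags {
    bool soft_forced_close   = false;
    bool budget_forced_close = false;
};

// Soft is checked first so that a same-step tie is reported as "soft".
inline std::string close_kind_for(const CloseFlags & f) {
    if (f.soft_forced_close)   return "soft";
    if (f.budget_forced_close) return "hard";
    return "natural";
}

// Small self-check of the mapping; callable from any test harness.
inline void check_close_kind_examples() {
    CloseFlags f;
    assert(close_kind_for(f) == "natural");
    f.budget_forced_close = true;
    assert(close_kind_for(f) == "hard");
    f.soft_forced_close = true;   // tie case: soft wins
    assert(close_kind_for(f) == "soft");
}
```

Checking the soft flag first keeps the tie-break in one place, so consumers switching on `close_kind` never need to inspect both flags.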
diff --git a/server/src/common/model_backend.h b/server/src/common/model_backend.h
index de439092..836b627b 100644
--- a/server/src/common/model_backend.h
+++ b/server/src/common/model_backend.h
@@ -10,6 +10,7 @@
 
 #pragma once
 
+#include <cmath>
 #include <cstdint>
 #include <cstdio>
 #include <functional>
@@ -71,15 +72,91 @@ struct DaemonIO {
 // decode) — the perf trade-off is acceptable since this only kicks in
 // for thinking-enabled requests. Spec-decode integration is a follow-up.
 struct BudgetHook {
-    // Multi-token close sequence injected when `(n_gen - committed)`
-    // drops to `hard_limit_remaining`. For Qwen3.x this is the
-    // canonical "Considering the limited time..." summarize-and-stop
-    // lead-in (tokenized at server startup); for non-qwen arches it's
-    // a single close-tag token. Empty = hook disabled.
+    // Inject sequence written when the hard cap fires OR when soft-close
+    // fires. This is the verbatim tokenization of the model card's
+    // `thinking_terminator_hint` (e.g. for Qwen3.6 the lead-in
+    // "Considering the limited time by the user, ... </think>\n\n").
+    // May be many tokens long; the first element is what the AR loop
+    // writes on the firing step, with the rest streamed out on
+    // subsequent steps. Empty = disabled.
     std::vector<int32_t> close_token_ids;
+    // Short PROBE sequence used by the soft-close logit-ratio peek.
+    // Conceptually this is the tokenization of just the close MARKER
+    // (e.g. `</think>` — a single token id 248069 on Qwen3.6) rather
+    // than the full inject directive above. Splitting probe-vs-inject
+    // matters because the inject sequence for trained-hint models
+    // starts with a content token like "Considering" whose logit is
+    // 19-35 nats below the chosen token at every step, masking the
+    // close-marker's true probability and preventing soft-close from
+    // ever firing.
+    // When empty, the soft-close peek falls back to
+    // `close_token_ids.front()` (legacy behavior — kept so models that
+    // haven't been updated keep working identically to before the split).
+    std::vector<int32_t> soft_close_probe_ids;
     int                  hard_limit_remaining = 0;
+    // Soft-close (Level 2.5, voluntary). When > 0, at each AR step the
+    // loop compares the probe-token logit against the chosen-token
+    // logit; if `prob[probe[0]] / prob[chosen] >= soft_close_min_ratio`
+    // (equivalently `logit[probe[0]] - logit[chosen] >= log(min_ratio)`),
+    // the inject sequence (close_token_ids) is written BEFORE the hard
+    // limit is reached. 0.0 = disabled (default); 1.0 = fire only when
+    // the probe token is already the most-likely token; lower values =
+    // fire more aggressively. See docs/specs/thinking-budget.md §7 and
+    // docs/experiments/soft-close-thinking-termination-plan.md.
+    float                soft_close_min_ratio = 0.0f;
+    // Minimum thinking tokens before soft-close is allowed to fire.
+    // Soft-close peek runs on every AR step but the fire decision is
+    // gated by this floor — protects against premature termination on
+    // prompts where the close-marker logit briefly spikes mid-thought.
+    // 0 = floor disabled (default). Per empirical trajectory data on
+    // qwen3.6-27b (5 diverse prompts), </think> only becomes
+    // argmax-competitive at 66-94% of natural reasoning length — so a
+    // floor in the 64-256 range is the typical operating point.
+    int                  soft_close_min_tokens = 0;
+    // Diagnostic: when true, emit one stderr line per AR step inside the
+    // thinking phase with (committed, chosen_tok, logit[probe0],
+    // logit[chosen], diff). Used to record the close-vs-chosen logit
+    // trajectory across a full thinking run so a sliding-threshold curve
+    // can be designed from empirical data rather than guessed. Zero cost
+    // when off. See server_main.cpp --debug-thinking-logits.
+    bool                 debug_thinking_logits = false;
+
+    // Probe token id used by the soft-close peek. Returns the first
+    // element of soft_close_probe_ids when set, otherwise falls back to
+    // close_token_ids.front() (legacy behavior). Callers must guard
+    // against an empty hook before calling this.
+    int32_t soft_close_probe_token() const {
+        if (!soft_close_probe_ids.empty()) return soft_close_probe_ids.front();
+        return close_token_ids.front();
+    }
 };
 
+namespace soft_close {
+
+// Returns true when the soft-close comparator would fire on this AR
+// step. Side-effect free; safe to call from unit tests.
+//
+// Fast path: returns false in O(1) when min_ratio <= 0 (the disabled
+// default). When the model has already chosen the close token on its
+// own, also returns false — the natural-close path handles that.
+//
+// Math: `prob[i]/prob[j] = exp(logit[i] - logit[j])`, so
+// `prob[close]/prob[chosen] >= min_ratio` ⟺
+// `logit[close] - logit[chosen] >= log(min_ratio)`. We compare on
+// logits to avoid `exp()` and full-softmax cost; this is numerically
+// stable in fp32 for typical LLM logit ranges (~±20).
+inline bool should_fire(const float * logits,
+                        int32_t       chosen_tok,
+                        int32_t       close0_tok,
+                        float         min_ratio) {
+    if (min_ratio <= 0.0f)          return false;
+    if (chosen_tok == close0_tok)    return false;
+    const float log_ratio = std::log(min_ratio);
+    return (logits[close0_tok] - logits[chosen_tok]) >= log_ratio;
+}
+
+}  // namespace soft_close
+
 struct GenerateRequest {
     std::vector<int32_t>       prompt;
     int                        n_gen       = 0;
@@ -121,6 +198,13 @@ struct GenerateResult {
     // stream and grepping for "</think>" cannot distinguish the two
     // (the injected close decodes identically).
     bool                       budget_forced_close = false;
+    // True when the soft-close path (logit-ratio peek) injected the
+    // </think> close sequence in this generation. Mutually exclusive
+    // with budget_forced_close: when both could fire on the same step,
+    // soft wins and budget_forced_close stays false. The server uses
+    // this to attribute close_kind="soft" (vs "hard"). See
+    // docs/specs/thinking-budget.md §7.
+    bool                       soft_forced_close = false;
     // True iff the AR decode loop's post-close watchdog detected an n-gram
     // repetition loop and broke out early. Caller surfaces this so clients
     // can mark the answer as unreliable rather than treating the
@@ -212,6 +296,8 @@ struct ModelBackend {
         retry.spec_decode_ran = first.spec_decode_ran || retry.spec_decode_ran;
         retry.budget_forced_close =
             first.budget_forced_close || retry.budget_forced_close;
+        retry.soft_forced_close =
+            first.soft_forced_close || retry.soft_forced_close;
         retry.degenerate_decode_close =
             first.degenerate_decode_close || retry.degenerate_decode_close;
         return retry;
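Review note on `soft_close::should_fire`: a toy logits row makes the dial's direction concrete. The block below is a sketch that restates the comparator from the hunk above (same logic, reproduced so the example compiles on its own) and adds a hypothetical `prob_ratio` helper that is not part of the PR.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Same logic as soft_close::should_fire in model_backend.h.
inline bool should_fire(const float * logits, int32_t chosen_tok,
                        int32_t close0_tok, float min_ratio) {
    if (min_ratio <= 0.0f)        return false;  // dial disabled
    if (chosen_tok == close0_tok) return false;  // natural close handles it
    return (logits[close0_tok] - logits[chosen_tok]) >= std::log(min_ratio);
}

// Hypothetical helper: the probability ratio the dial is expressed in,
// recovered from the logit gap without a full softmax.
inline float prob_ratio(const std::vector<float> & logits,
                        int32_t chosen, int32_t close0) {
    assert(chosen >= 0 && static_cast<std::size_t>(chosen) < logits.size());
    assert(close0 >= 0 && static_cast<std::size_t>(close0) < logits.size());
    return std::exp(logits[close0] - logits[chosen]);
}
```

With logits `{2.0, 0.0, 1.0}`, chosen token 0 and close token 2, the gap is -1, so the probability ratio is about 0.37. A `min_ratio` of `0.2` fires and `0.5` does not, which matches the header comment that lower values fire more aggressively.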
diff --git a/server/src/qwen35/qwen35_backend.cpp b/server/src/qwen35/qwen35_backend.cpp
index e3b161d8..867d5d22 100644
--- a/server/src/qwen35/qwen35_backend.cpp
+++ b/server/src/qwen35/qwen35_backend.cpp
@@ -582,14 +582,16 @@ GenerateResult Qwen35Backend::generate(const GenerateRequest & req,
             decode_ok = do_ar_decode(committed, req.n_gen, result.tokens, out_io,
                                      req.budget_hook,
                                      &result.budget_forced_close,
-                                     &result.degenerate_decode_close);
+                                     &result.degenerate_decode_close,
+                                     &result.soft_forced_close);
             out_io.emit(-1);
         } else {
             decode_ok = do_spec_decode(committed, req.n_gen, result.tokens, out_io,
                                        result.accept_rate, result.spec_decode_ran,
                                        req.hint_tokens, &req.budget_hook,
                                        &result.budget_forced_close,
-                                       &result.degenerate_decode_close);
+                                       &result.degenerate_decode_close,
+                                       &result.soft_forced_close);
         }
         if (!decode_ok) {
             result.error = "decode";
@@ -683,14 +685,16 @@ GenerateResult Qwen35Backend::restore_and_generate(int slot,
             decode_ok = do_ar_decode(committed, req.n_gen, result.tokens, out_io,
                                      req.budget_hook,
                                      &result.budget_forced_close,
-                                     &result.degenerate_decode_close);
+                                     &result.degenerate_decode_close,
+                                     &result.soft_forced_close);
             out_io.emit(-1);
         } else {
             decode_ok = do_spec_decode(committed, req.n_gen, result.tokens, out_io,
                                        result.accept_rate, result.spec_decode_ran,
                                        req.hint_tokens, &req.budget_hook,
                                        &result.budget_forced_close,
-                                       &result.degenerate_decode_close);
+                                       &result.degenerate_decode_close,
+                                       &result.soft_forced_close);
         }
         if (!decode_ok) {
             result.error = "decode";
@@ -856,7 +860,8 @@ bool Qwen35Backend::do_ar_decode(int committed, int n_gen,
                                   const DaemonIO & io,
                                   const BudgetHook & budget_hook,
                                   bool * forced_close_out,
-                                  bool * degenerate_close_out) {
+                                  bool * degenerate_close_out,
+                                  bool * soft_forced_close_out) {
     // Budget hook state.
     //   - budget_close_started: true once we've begun injecting the close
     //     sequence. Prevents re-triggering on continued forward generation.
@@ -938,6 +943,79 @@ bool Qwen35Backend::do_ar_decode(int committed, int n_gen,
             if (forced_close_out) *forced_close_out = true;
         }
     };
+
+    // Soft-close (logit-ratio peek). Fires BEFORE the hard-cap check so a
+    // soft trigger on the same step as a hard trigger is reported as
+    // close_kind="soft" (the more informative signal — the model agreed it
+    // was time to close, even if the budget was also about to run out).
+    // Once this lambda starts the close sequence, the maybe_force_close
+    // continuation branch handles steps 2..N of a multi-token close.
+    // Zero-cost-when-disabled invariant: with soft_close_min_ratio == 0 and
+    // debug_thinking_logits off, the lambda returns before reading logits.
+    // See docs/experiments/soft-close-thinking-termination-plan.md §3.
+    auto maybe_soft_close = [&](int32_t & tok,
+                                const float * logits_row,
+                                int committed_now) {
+        if (budget_close_started) return;                       // sequence already in progress
+        if (budget_hook.close_token_ids.empty()) return;        // hook disabled
+
+        // PROBE vs INJECT split:
+        //   - probe0 is the token id we PEEK to decide whether to fire
+        //     (the short close marker, e.g. `</think>` = 248069 on Qwen3.6).
+        //   - inject0 / inject sequence is what we WRITE when it fires
+        //     (the full trained-hint directive).
+        // Fall back to close_token_ids.front() when no separate probe is
+        // configured (legacy / single-token-marker models). See
+        // BudgetHook::soft_close_probe_token().
+        const int32_t probe0  = budget_hook.soft_close_probe_token();
+        const int32_t inject0 = budget_hook.close_token_ids.front();
+
+        // Diagnostic trajectory log. Fires every AR step (gated on the
+        // operator flag) regardless of soft_close_min_ratio, so we can
+        // record close-vs-chosen logit curves even when the dial is off.
+        // close0 reports the PROBE token id (what the comparator uses).
+        if (budget_hook.debug_thinking_logits) {
+            const int generated = committed_now - committed_at_entry;
+            const float diff = logits_row[probe0] - logits_row[tok];
+            const float ratio = (diff > 50.0f) ? std::exp(50.0f) : std::exp(diff);
+            std::fprintf(stderr,
+                "[soft-trace] step=%d committed=%d chosen=%d close0=%d "
+                "logit_close=%.4f logit_chosen=%.4f diff=%.4f prob_ratio=%.6g\n",
+                generated, committed_now, tok, probe0,
+                logits_row[probe0], logits_row[tok], diff, ratio);
+        }
+
+        if (budget_hook.soft_close_min_ratio <= 0.0f) return;   // dial disabled
+
+        // Minimum-thinking-tokens floor: false-positive guard. When set,
+        // suppress fire until the segment has committed at least this
+        // many tokens. 0 = floor disabled (default).
+        const int generated_so_far = committed_now - committed_at_entry;
+        if (generated_so_far < budget_hook.soft_close_min_tokens) return;
+
+        if (!soft_close::should_fire(logits_row, tok, probe0,
+                                     budget_hook.soft_close_min_ratio)) {
+            return;
+        }
+        const int generated = committed_now - committed_at_entry;
+        const int remaining = n_gen - generated;
+        std::fprintf(stderr,
+            "[budget-hook] soft-close at committed=%d/%d (remaining=%d, "
+            "min_ratio=%.4f, logit[probe0=%d]=%.3f logit[chosen]=%.3f "
+            "diff=%.3f log_ratio=%.3f): overriding sampled token %d with "
+            "inject[0]=%d (inject seq len %zu)\n",
+            committed_now, n_gen, remaining,
+            budget_hook.soft_close_min_ratio,
+            probe0, logits_row[probe0], logits_row[tok],
+            logits_row[probe0] - logits_row[tok],
+            std::log(budget_hook.soft_close_min_ratio),
+            tok, inject0, budget_hook.close_token_ids.size());
+        tok = inject0;
+        budget_close_started = true;
+        close_inject_pos = 1;
+        if (soft_forced_close_out) *soft_forced_close_out = true;
+    };
+
     if (n_gen <= 0) return true;
 
     auto t_dec0_ar = std::chrono::steady_clock::now();
@@ -964,12 +1042,32 @@ bool Qwen35Backend::do_ar_decode(int committed, int n_gen,
     const int initial_emitted = out_tokens.empty() ? 1 : 0;
     if (initial_emitted == 1) {
         int32_t first_tok;
-        if (sampler_.needs_logit_processing()) {
-            if (!prefill_last_logits_valid_) return false;
-            ggml_backend_tensor_get(sg_.logits, logits_buf.data(), prefill_last_logits_offset_,
-                                    sizeof(float) * vocab);
-            first_tok = sample_logits(logits_buf.data(), vocab, sampler_,
-                                      out_tokens, sampler_rng_);
+        // Soft-close needs the logits row for the comparator; greedy
+        // (argmax-only) path normally skips the logits read. Pull the
+        // prefill's last logits row to CPU when soft is enabled so the
+        // first AR step participates in the comparator. Zero-cost when
+        // disabled: only fetched when soft_close_min_ratio > 0.
+        const bool need_logits =
+            sampler_.needs_logit_processing() ||
+            budget_hook.soft_close_min_ratio > 0.0f;
+        if (need_logits) {
+            if (!prefill_last_logits_valid_) {
+                if (sampler_.needs_logit_processing()) return false;
+                // Soft-close wanted logits but prefill didn't keep them.
+                // Skip soft check on this single token rather than error.
+                first_tok = cache_.last_tok;
+            } else {
+                ggml_backend_tensor_get(sg_.logits, logits_buf.data(),
+                                        prefill_last_logits_offset_,
+                                        sizeof(float) * vocab);
+                if (sampler_.needs_logit_processing()) {
+                    first_tok = sample_logits(logits_buf.data(), vocab, sampler_,
+                                              out_tokens, sampler_rng_);
+                } else {
+                    first_tok = cache_.last_tok;
+                }
+                maybe_soft_close(first_tok, logits_buf.data(), committed);
+            }
         } else {
             first_tok = cache_.last_tok;
         }
@@ -1020,6 +1118,13 @@ bool Qwen35Backend::do_ar_decode(int committed, int n_gen,
             }
         }
 
+        // Soft check runs BEFORE hard-cap check. If soft fires, it sets
+        // budget_close_started=true so maybe_force_close's continuation
+        // branch handles steps 2..N of a multi-token close (and the
+        // remaining-check branch is skipped because the sequence is
+        // already started). If soft does not fire (disabled or threshold
+        // not met), maybe_force_close proceeds as today.
+        maybe_soft_close(next_tok, logits_buf.data(), committed);
         maybe_force_close(next_tok, committed);
 
         out_tokens.push_back(next_tok);
@@ -1122,7 +1227,8 @@ bool Qwen35Backend::do_spec_decode(int committed, int n_gen,
                                     const std::vector<int32_t> * hint_tokens,
                                     const BudgetHook * budget_hook,
                                     bool * forced_close_out,
-                                    bool * degenerate_close_out) {
+                                    bool * degenerate_close_out,
+                                    bool * soft_forced_close_out) {
     out_accept_rate = 0.0f;
     out_spec_ran    = false;
     const int hidden = w_.n_embd;
@@ -1149,10 +1255,13 @@ bool Qwen35Backend::do_spec_decode(int committed, int n_gen,
     if (!can_spec) {
         // AR fallback consumes the final prefill position itself, then advances
         // one token at a time. Pass the budget hook through so force-close
-        // still fires when spec-decode is unavailable.
+        // still fires when spec-decode is unavailable. Soft-close pointer
+        // also forwards so close_kind="soft" can be attributed correctly
+        // even on the AR fallback path.
         bool ok = do_ar_decode(committed, n_gen, out_tokens, io,
                                 budget_hook ? *budget_hook : BudgetHook{},
-                                forced_close_out, degenerate_close_out);
+                                forced_close_out, degenerate_close_out,
+                                soft_forced_close_out);
         io.emit(-1);
         return ok;
     }
@@ -1222,7 +1331,8 @@ bool Qwen35Backend::do_spec_decode(int committed, int n_gen,
                 int ar_n_gen = need_commit_budget;
                 bool ok = do_ar_decode(committed, ar_n_gen, out_tokens, io,
                                         tail_hook, forced_close_out,
-                                        degenerate_close_out);
+                                        degenerate_close_out,
+                                        soft_forced_close_out);
                 io.emit(-1);
                 return ok;
             }
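Review note on the close state machine: the interaction between the firing step and the multi-token continuation can be walked through with a stripped-down model of the loop. This is a simplified sketch, not the backend code. `CloseInjector` and `run_loop` are hypothetical names, `started` and `pos` stand in for `budget_close_started` and `close_inject_pos`, and the firing decision is reduced to a fixed step index in place of the logit comparator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the budget-hook state inside do_ar_decode.
struct CloseInjector {
    std::vector<int32_t> close_ids;     // inject sequence
    bool                 started = false;
    std::size_t          pos     = 0;   // next inject index

    // Firing step: override the sampled token with the first inject id.
    void start(int32_t & tok) {
        assert(!close_ids.empty());
        tok     = close_ids.front();
        started = true;
        pos     = 1;
    }
    // Later steps: keep overriding until the sequence is exhausted.
    bool continue_inject(int32_t & tok) {
        if (!started || pos >= close_ids.size()) return false;
        tok = close_ids[pos++];
        return true;
    }
};

// Replays pre-sampled tokens, firing the close at `fire_step`.
inline std::vector<int32_t> run_loop(const std::vector<int32_t> & sampled,
                                     std::size_t fire_step,
                                     const std::vector<int32_t> & close_ids) {
    CloseInjector inj;
    inj.close_ids = close_ids;
    std::vector<int32_t> out;
    for (std::size_t step = 0; step < sampled.size(); ++step) {
        int32_t tok = sampled[step];
        const bool injected = inj.continue_inject(tok);
        if (!injected && !inj.started && step == fire_step) inj.start(tok);
        out.push_back(tok);   // one push per step, injected or sampled
    }
    return out;
}
```

Replaying `{10, 11, 12, 13, 14, 15}` with `fire_step = 2` and close ids `{90, 91, 92}` yields `{10, 11, 90, 91, 92, 15}`: the override lands before the push on the firing step, the next two steps inject without sampling, and the output length is unchanged.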
diff --git a/server/src/qwen35/qwen35_backend.h b/server/src/qwen35/qwen35_backend.h
index fb9b8f60..6a8e967f 100644
--- a/server/src/qwen35/qwen35_backend.h
+++ b/server/src/qwen35/qwen35_backend.h
@@ -229,7 +229,8 @@ class Qwen35Backend : public ModelBackend {
                         const std::vector<int32_t> * hint_tokens = nullptr,
                         const BudgetHook * budget_hook = nullptr,
                         bool * forced_close_out = nullptr,
-                        bool * degenerate_close_out = nullptr);
+                        bool * degenerate_close_out = nullptr,
+                        bool * soft_forced_close_out = nullptr);
 
     // AR decode fallback (no draft model or sampling mode).
     // budget_hook (when close_token_ids is non-empty) overrides the next
@@ -249,7 +250,8 @@ class Qwen35Backend : public ModelBackend {
                       const DaemonIO & io,
                       const BudgetHook & budget_hook = {},
                       bool * forced_close_out = nullptr,
-                      bool * degenerate_close_out = nullptr);
+                      bool * degenerate_close_out = nullptr,
+                      bool * soft_forced_close_out = nullptr);
 
     bool sync_remote_draft_features(int start_pos, int n_tokens);
 
diff --git a/server/src/server/http_server.cpp b/server/src/server/http_server.cpp
index 362c2f4d..f19967b4 100644
--- a/server/src/server/http_server.cpp
+++ b/server/src/server/http_server.cpp
@@ -940,6 +940,38 @@ bool HttpServer::route_request(int fd, const HttpRequest & hr) {
             if (th.contains("reply_budget") && th["reply_budget"].is_number_integer()) {
                 request_reply_budget = th["reply_budget"].get<int>();
             }
+            // Soft-close per-request override (plan §6.3). Honored only
+            // when the operator has soft-close enabled; clamped against
+            // the server ceiling so clients can tighten but not loosen.
+            // The clamped value is applied later, in worker_loop().
+            if (th.contains("soft_close_min_ratio") &&
+                th["soft_close_min_ratio"].is_number())
+            {
+                float requested = th["soft_close_min_ratio"].get<float>();
+                if (requested < 0.0f) requested = 0.0f;
+                if (requested > 1.0f) requested = 1.0f;
+                if (config_.soft_close_min_ratio <= 0.0f) {
+                    // Operator has disabled soft-close at the server
+                    // level — silently ignore the per-request override.
+                    // Logged at info so operators can see clients
+                    // attempting to opt in.
+                    std::fprintf(stderr,
+                        "[server] thinking.soft_close_min_ratio=%.4f "
+                        "ignored: server has soft-close disabled "
+                        "(config_.soft_close_min_ratio=0)\n",
+                        requested);
+                } else {
+                    float eff = std::min(requested,
+                                          config_.soft_close_min_ratio);
+                    if (requested > config_.soft_close_min_ratio) {
+                        std::fprintf(stderr,
+                            "[server] thinking.soft_close_min_ratio=%.4f "
+                            "clamped to soft_close_min_ratio=%.4f\n",
+                            requested, config_.soft_close_min_ratio);
+                    }
+                    req.per_req_soft_close_min_ratio = eff;
+                }
+            }
         }
         // Direct: chat_template_kwargs.enable_thinking
         if (body.contains("chat_template_kwargs")) {
@@ -1318,7 +1350,32 @@ void HttpServer::worker_loop() {
                 ? req.per_req_reply_budget
                 : config_.hard_limit_reply_budget;
             gen_req.budget_hook.close_token_ids = config_.think_close_token_ids;
+            gen_req.budget_hook.soft_close_probe_ids =
+                config_.think_close_probe_token_ids;
             gen_req.budget_hook.hard_limit_remaining = eff_reply_budget;
+
+            // Soft-close min-ratio. Operator-gated: only forwarded when
+            // config_.soft_close_min_ratio > 0. Per-request value (if
+            // set and operator enabled) is already clamped to the
+            // server ceiling in the request parser. See plan §6.3.
+            if (config_.soft_close_min_ratio > 0.0f) {
+                gen_req.budget_hook.soft_close_min_ratio =
+                    (req.per_req_soft_close_min_ratio >= 0.0f)
+                        ? req.per_req_soft_close_min_ratio
+                        : config_.soft_close_min_ratio;
+            }
+
+            // Minimum-thinking-tokens floor: false-positive guard for
+            // soft-close. Server-policy only (no per-request override).
+            gen_req.budget_hook.soft_close_min_tokens =
+                config_.soft_close_min_tokens;
+
+            // Diagnostic trajectory log — operator dial only. Carried
+            // through the BudgetHook so the AR loop can emit one line
+            // per thinking step regardless of whether soft-close is
+            // armed. See model_backend.h BudgetHook::debug_thinking_logits.
+            gen_req.budget_hook.debug_thinking_logits =
+                config_.debug_thinking_logits;
         }
 
         // Tool call hint generation: pre-tokenize predictable structural tokens
@@ -1588,15 +1645,25 @@ void HttpServer::worker_loop() {
             }
         }
 
-        // close_kind reflects the Level 2 BudgetHook outcome: "hard" when
-        // the backend's AR/spec decode injected the close-token sequence
-        // at the budget boundary, "natural" when the model self-closed
-        // (or the request never opted in). Emitted as part of
-        // finish_details for thinking-budget callers.
-        std::string close_kind =
-            (req.thinking_opt_in && result.budget_forced_close)
-                ? "hard"
-                : "natural";
+        // close_kind reflects the Level 2 BudgetHook outcome:
+        //   "natural" — the model emitted </think> on its own (or the
+        //               request never opted in to the envelope).
+        //   "soft"    — the soft-close logit-ratio peek (Level 2.5)
+        //               fired before the hard cap, indicating the
+        //               model was willing to close. See
+        //               docs/specs/thinking-budget.md §7.
+        //   "hard"    — the budget edge was reached without the model
+        //               or the soft path agreeing; the AR loop forced
+        //               </think> in. Original Level 2 behavior.
+        // Soft wins ties against hard on the same step (see plan §4 +
+        // §12) — soft_forced_close and budget_forced_close are mutually
+        // exclusive per AR-loop step. Emitted as part of finish_details
+        // for thinking-budget callers.
+        std::string close_kind = "natural";
+        if (req.thinking_opt_in) {
+            if (result.soft_forced_close)        close_kind = "soft";
+            else if (result.budget_forced_close) close_kind = "hard";
+        }
 
         // Finalize.
         // Per-request wall-clock timings forwarded to the response's
diff --git a/server/src/server/http_server.h b/server/src/server/http_server.h
index 999eb5d9..1f363b6d 100644
--- a/server/src/server/http_server.h
+++ b/server/src/server/http_server.h
@@ -88,6 +88,40 @@ struct ServerConfig {
     // forwards into GenerateRequest.budget_hook when thinking is opted in.
     std::vector<int32_t> think_close_token_ids;
 
+    // Token IDs resolved at server startup for the soft-close PROBE.
+    // Tokenization of just the close MARKER substring (e.g. `</think>`)
+    // — the bytes the soft-close logit-ratio peek compares against the
+    // chosen-token logit at each AR step. Conceptually separate from
+    // the inject sequence above: probing on the full directive's first
+    // token (typically a content lead-in like "Considering") forces
+    // soft-close to read a perpetually-low logit and never fire.
+    // Empty = legacy fallback: peek close_token_ids.front().
+    std::vector<int32_t> think_close_probe_token_ids;
+
+    // Soft-close min-ratio default. When > 0 AND a request opts into
+    // thinking, the AR loop force-emits </think> early once
+    // prob[</think>] / prob[chosen] >= this ratio. 0.0 = soft-close
+    // entirely disabled at the operator level; per-request overrides
+    // are silently ignored when this is zero (operator-policy gate).
+    // Range [0.0, 1.0]. See docs/specs/thinking-budget.md §7 and
+    // docs/experiments/soft-close-thinking-termination-plan.md.
+    float       soft_close_min_ratio = 0.0f;
+
+    // Minimum thinking tokens before soft-close is allowed to fire. The
+    // soft-close peek still runs every AR step (so trajectory logs
+    // remain complete), but the fire decision is suppressed until this
+    // many thinking tokens have been committed. False-positive guard.
+    // 0 = disabled (default — pre-floor behavior).
+    int         soft_close_min_tokens = 0;
+
+    // Diagnostic: when true, the AR loop emits one stderr line per
+    // thinking-phase step with the close-vs-chosen logit values, so a
+    // sliding-ratio curve can be tuned from real trajectory data.
+    // Operator-only flag; per-request overrides not exposed because
+    // the stderr volume is heavy. Plumbed through to
+    // BudgetHook::debug_thinking_logits when the budget hook is wired.
+    bool        debug_thinking_logits = false;
+
     // Phase-1 budgets per `reasoning.effort` tier (spec §4.2). Selected
     // by the request parser when `reasoning.effort` is present. Each
     // value is itself capped at `think_max_tokens` at startup.
@@ -196,6 +230,14 @@ struct ParsedRequest {
     // hard_limit_reply_budget. Values are already clamped to those ceilings.
     int                       per_req_phase1_cap   = -1;
     int                       per_req_reply_budget = -1;
+    // Per-request soft-close min-ratio override. -1.0 = not set (use
+    // server default). Honored only when the server has soft-close
+    // enabled (config_.soft_close_min_ratio > 0); when the operator has
+    // disabled soft-close, this is silently ignored. When honored,
+    // clamps to min(requested, server_default) — clients can tighten
+    // (lower the threshold) but never loosen (raise it). See spec §4.4
+    // and plan §6.3.
+    float                     per_req_soft_close_min_ratio = -1.0f;
     // Stop sequences (OpenAI "stop" + Anthropic "stop_sequences")
     std::vector<std::string>  stop_sequences;
     // Bandit: per-session adaptive keep_ratio opt-in
diff --git a/server/src/server/server_main.cpp b/server/src/server/server_main.cpp
index 0f31739e..2b127722 100644
--- a/server/src/server/server_main.cpp
+++ b/server/src/server/server_main.cpp
@@ -195,6 +195,28 @@ static void print_usage(const char * prog) {
         "  --reasoning-effort-max <N>      Phase-1 budget when request asks effort=max\n"
         "                                  Defaults come from share/model_cards/<name>.json;\n"
         "                                  see docs/specs/thinking-budget.md §3.\n"
+        "  --think-soft-close-min-ratio <F>\n"
+        "                             Soft-close dial. When > 0 AND a request opts\n"
+        "                             into thinking, the AR loop force-emits </think>\n"
+        "                             early once prob[</think>]/prob[chosen] >= ratio,\n"
+        "                             reclaiming tokens the model would have spent\n"
+        "                             running to the hard cap. Range [0.0, 1.0]:\n"
+        "                             0.0=disabled (default), 0.1=fire when within\n"
+        "                             10x of argmax (permissive), 0.5=fire at\n"
+        "                             half-prob (stricter), 1.0=fire only when close is\n"
+        "                             argmax. See docs/specs/thinking-budget.md §7.\n"
+        "  --think-soft-close-min-tokens <N>\n"
+        "                             Minimum thinking tokens before soft-close\n"
+        "                             may fire. Floors the fire decision so a\n"
+        "                             brief close-marker logit spike early in\n"
+        "                             reasoning cannot prematurely terminate\n"
+        "                             thinking. 0 = disabled (default). Typical\n"
+        "                             values: 64-256 for qwen3.6-27b.\n"
+        "  --debug-thinking-logits    Emit one stderr line per AR step inside the\n"
+        "                             thinking phase recording committed/chosen/\n"
+        "                             logit[close]/logit[chosen]/diff/prob_ratio.\n"
+        "                             Use to record close-vs-chosen logit\n"
+        "                             trajectories. Stderr-heavy; operator only.\n"
         "\n"
         "KV cache:\n"
         "  --cache-type-k <type>  KV cache K type (f16,bf16,q4_0,q4_1,q5_0,q5_1,q8_0,tq3_0)\n"
@@ -257,6 +279,9 @@ int main(int argc, char ** argv) {
         bool effort_high             = false;
         bool effort_x_high           = false;
         bool effort_max              = false;
+        bool soft_close_min_ratio    = false;
+        bool soft_close_min_tokens   = false;
+        bool debug_thinking_logits   = false;
     } cli_set;
 
     // Track whether the operator passed the legacy --max-tokens alias.
@@ -368,6 +393,37 @@ int main(int argc, char ** argv) {
         } else if (std::strcmp(argv[i], "--reasoning-effort-max") == 0 && i + 1 < argc) {
             sconfig.effort_tiers.max = std::atoi(argv[++i]);
             cli_set.effort_max = true;
+        } else if (std::strcmp(argv[i], "--think-soft-close-min-ratio") == 0 && i + 1 < argc) {
+            float r = std::strtof(argv[++i], nullptr);
+            // Clamp to [0, 1] with a warning if the operator passed
+            // something nonsensical. Bounded posture: the dial is
+            // operator-only, the bounds are tight by design.
+            if (r < 0.0f) {
+                std::fprintf(stderr,
+                    "[server] --think-soft-close-min-ratio=%.4f < 0; "
+                    "clamping to 0 (disabled)\n", r);
+                r = 0.0f;
+            } else if (r > 1.0f) {
+                std::fprintf(stderr,
+                    "[server] --think-soft-close-min-ratio=%.4f > 1; "
+                    "clamping to 1\n", r);
+                r = 1.0f;
+            }
+            sconfig.soft_close_min_ratio = r;
+            cli_set.soft_close_min_ratio = true;
+        } else if (std::strcmp(argv[i], "--think-soft-close-min-tokens") == 0 && i + 1 < argc) {
+            int n = std::atoi(argv[++i]);
+            if (n < 0) {
+                std::fprintf(stderr,
+                    "[server] --think-soft-close-min-tokens=%d < 0; "
+                    "clamping to 0 (disabled)\n", n);
+                n = 0;
+            }
+            sconfig.soft_close_min_tokens = n;
+            cli_set.soft_close_min_tokens = true;
+        } else if (std::strcmp(argv[i], "--debug-thinking-logits") == 0) {
+            sconfig.debug_thinking_logits = true;
+            cli_set.debug_thinking_logits = true;
         } else if (std::strcmp(argv[i], "--prefill-compression") == 0 && i + 1 < argc) {
             const char * mode = argv[++i];
             if (std::strcmp(mode, "auto") == 0)
@@ -716,6 +772,15 @@ int main(int argc, char ** argv) {
     std::fprintf(stderr, "[server] │  hard_limit_reply= %d (%s)\n",
                  sconfig.hard_limit_reply_budget,
                  src_of(cli_set.hard_limit_reply_budget));
+    std::fprintf(stderr, "[server] │  soft_close_ratio= %.4f (%s)\n",
+                 sconfig.soft_close_min_ratio,
+                 cli_set.soft_close_min_ratio ? "from CLI" : "default (disabled)");
+    std::fprintf(stderr, "[server] │  soft_close_floor= %d (%s)\n",
+                 sconfig.soft_close_min_tokens,
+                 cli_set.soft_close_min_tokens ? "from CLI" : "default (disabled)");
+    std::fprintf(stderr, "[server] │  debug_think_log = %s (%s)\n",
+                 sconfig.debug_thinking_logits ? "true" : "false",
+                 cli_set.debug_thinking_logits ? "from CLI" : "default (off)");
     std::fprintf(stderr, "[server] │  effort tiers    = low=%d (%s)\n",
                  sconfig.effort_tiers.low, src_of(cli_set.effort_low));
     std::fprintf(stderr, "[server] │                    medium=%d (%s)\n",
@@ -874,6 +939,40 @@ int main(int argc, char ** argv) {
             }
             if (close_ids.size() > 16) std::fprintf(stderr, ",...");
             std::fprintf(stderr, "\n");
+
+            // Probe-vs-inject split: when the inject sequence is the
+            // full directive hint (Qwen3.x-style trained lead-in), the
+            // first inject token is a content lead-in like "Considering"
+            // whose logit sits 19-35 nats below chosen during reasoning.
+            // Soft-close peeking that token never fires (empirical: see
+            // probe trajectory data). Tokenize JUST the marker substring
+            // and ship it as the probe sequence — at the AR boundary the
+            // marker's logit IS argmax-competitive (~prob_ratio>=0.5).
+            // When the hint and marker are identical (marker-only case),
+            // leave the probe field empty: BudgetHook::soft_close_probe_token()
+            // falls back to close_token_ids.front(), so this is a no-op.
+            if (!card.thinking_terminator_hint.empty() &&
+                close_text.find(marker) != std::string::npos &&
+                close_text != marker)
+            {
+                auto probe_ids = tokenizer.encode(marker);
+                if (!probe_ids.empty()) {
+                    sconfig.think_close_probe_token_ids = probe_ids;
+                    std::fprintf(stderr,
+                        "[server] soft-close probe (marker=\"%s\", %zu tokens): ",
+                        marker.c_str(), probe_ids.size());
+                    for (size_t i = 0; i < std::min<size_t>(probe_ids.size(), 8); ++i) {
+                        std::fprintf(stderr, "%s%d", i ? "," : "", probe_ids[i]);
+                    }
+                    if (probe_ids.size() > 8) std::fprintf(stderr, ",...");
+                    std::fprintf(stderr, "\n");
+                } else {
+                    std::fprintf(stderr,
+                        "[server] soft-close probe DISABLED: marker \"%s\" "
+                        "tokenizes to empty; legacy fallback (probe = inject[0]) "
+                        "in effect.\n", marker.c_str());
+                }
+            }
         } else {
             std::fprintf(stderr,
                 "[server] level-2 force-close DISABLED: text %.40s... "
diff --git a/server/test/test_server_unit.cpp b/server/test/test_server_unit.cpp
index 275ec935..7fd49039 100644
--- a/server/test/test_server_unit.cpp
+++ b/server/test/test_server_unit.cpp
@@ -2560,6 +2560,519 @@ static void test_generate_result_accept_rate_zero_when_no_spec_decode() {
     TEST_ASSERT(r.accept_rate == 0.0f);
 }
 
+// ─── Soft-close comparator + state machine ─────────────────────────────
+//
+// Tests the logit-ratio peek that lets the AR loop force </think> early
+// once the close-token logit comes within a configured probability ratio
+// of the chosen-token logit. Default disabled. See
+// docs/experiments/soft-close-thinking-termination-plan.md.
+//
+// We test two layers:
+//   1. The pure comparator (`soft_close::should_fire`) — math-only.
+//   2. A small state-machine helper that mimics the AR loop's
+//      precedence (soft first, then hard) so we can exercise the
+//      multi-token inject path and the soft/hard tie-break without a
+//      GPU.
+
+// Mirror of qwen35_backend.cpp's close-injection state for unit testing.
+struct CloseState {
+    bool started     = false;
+    int  inject_pos  = 0;
+    bool soft_fired  = false;
+    bool hard_fired  = false;
+};
+
+// Returns the (possibly overridden) token for this step, advancing
+// CloseState. Mirrors the soft-then-hard ordering in the real loop.
+// committed_now / committed_at_entry / n_gen track the budget arithmetic
+// for the hard check identically to qwen35_backend.cpp:909-944.
+static int32_t step_close_state(int32_t                       chosen_tok,
+                                 const float *                 logits,
+                                 const dflash::common::BudgetHook & hook,
+                                 int                           committed_now,
+                                 int                           committed_at_entry,
+                                 int                           n_gen,
+                                 CloseState &                  state) {
+    // Continue an in-progress close sequence.
+    if (state.started &&
+        state.inject_pos < (int)hook.close_token_ids.size())
+    {
+        int32_t inj = hook.close_token_ids[state.inject_pos];
+        state.inject_pos++;
+        return inj;
+    }
+    if (state.started) return chosen_tok;  // sequence already complete
+
+    // Soft check (BEFORE hard, per plan §4).
+    // Probe-vs-inject split: peek hook.soft_close_probe_token(), write
+    // hook.close_token_ids.front() on fire.
+    // Min-tokens floor: suppress fire until generated_so_far >=
+    // hook.soft_close_min_tokens. Mirrors qwen35_backend.cpp gate.
+    const int generated_so_far = committed_now - committed_at_entry;
+    if (!hook.close_token_ids.empty() &&
+        hook.soft_close_min_ratio > 0.0f &&
+        generated_so_far >= hook.soft_close_min_tokens &&
+        dflash::common::soft_close::should_fire(
+            logits, chosen_tok,
+            hook.soft_close_probe_token(),
+            hook.soft_close_min_ratio))
+    {
+        state.started    = true;
+        state.inject_pos = 1;
+        state.soft_fired = true;
+        return hook.close_token_ids.front();  // INJECT, not probe
+    }
+
+    // Hard check: remaining <= hard_limit_remaining.
+    if (!hook.close_token_ids.empty()) {
+        const int generated = committed_now - committed_at_entry;
+        const int remaining = n_gen - generated;
+        if (remaining <= hook.hard_limit_remaining) {
+            int32_t close0 = hook.close_token_ids.front();
+            if (chosen_tok == close0) {
+                // Model self-closed at boundary; consume as first of seq.
+                state.started    = true;
+                state.inject_pos = 1;
+                return chosen_tok;
+            }
+            state.started    = true;
+            state.inject_pos = 1;
+            state.hard_fired = true;
+            return close0;
+        }
+    }
+    return chosen_tok;
+}
+
+// Build a logits row where the chosen-token gets logit `l_chosen` and the
+// close token gets logit `l_close`; all other vocab tokens are far below.
+static std::vector<float> make_logits(int vocab, int chosen_tok,
+                                        int close_tok,
+                                        float l_chosen, float l_close) {
+    std::vector<float> row(vocab, -100.0f);
+    row[chosen_tok] = l_chosen;
+    if (close_tok != chosen_tok) row[close_tok] = l_close;
+    return row;
+}
+
+static void test_soft_close_disabled_default() {
+    // min_ratio=0 → never fires, regardless of logit configuration.
+    auto logits = make_logits(64, /*chosen=*/3, /*close=*/7,
+                              /*l_chosen=*/2.0f, /*l_close=*/10.0f);
+    bool fired = dflash::common::soft_close::should_fire(
+        logits.data(), /*chosen=*/3, /*close0=*/7, /*min_ratio=*/0.0f);
+    TEST_ASSERT(fired == false);
+    // Even with close as argmax, disabled means false.
+    fired = dflash::common::soft_close::should_fire(
+        logits.data(), /*chosen=*/3, /*close0=*/3, /*min_ratio=*/0.0f);
+    TEST_ASSERT(fired == false);
+}
+
+static void test_soft_close_strict_ratio_one() {
+    // min_ratio=1.0 → fires only when logit[close] >= logit[chosen]
+    // (i.e. close is the argmax or tied). chosen!=close already guarded.
+    auto eq = make_logits(64, 3, 7, /*l_chosen=*/5.0f, /*l_close=*/5.0f);
+    TEST_ASSERT(dflash::common::soft_close::should_fire(
+        eq.data(), 3, 7, 1.0f) == true);
+
+    auto below = make_logits(64, 3, 7, /*l_chosen=*/5.001f, /*l_close=*/5.0f);
+    TEST_ASSERT(dflash::common::soft_close::should_fire(
+        below.data(), 3, 7, 1.0f) == false);
+
+    auto above = make_logits(64, 3, 7, /*l_chosen=*/4.0f, /*l_close=*/5.0f);
+    TEST_ASSERT(dflash::common::soft_close::should_fire(
+        above.data(), 3, 7, 1.0f) == true);
+}
+
+static void test_soft_close_aggressive_half_prob() {
+    // min_ratio=0.5 — prob[close]/prob[chosen] >= 0.5 ⟺
+    //   logit_diff >= log(0.5) ≈ -0.6931.
+    const float ln_half = std::log(0.5f);
+
+    // Boundary inclusive: diff exactly log(0.5).
+    auto boundary = make_logits(64, 3, 7,
+                                 /*l_chosen=*/5.0f,
+                                 /*l_close=*/5.0f + ln_half);
+    TEST_ASSERT(dflash::common::soft_close::should_fire(
+        boundary.data(), 3, 7, 0.5f) == true);
+
+    // Just below: diff slightly less than log(0.5) (further negative).
+    auto below = make_logits(64, 3, 7,
+                              /*l_chosen=*/5.0f,
+                              /*l_close=*/5.0f + ln_half - 0.001f);
+    TEST_ASSERT(dflash::common::soft_close::should_fire(
+        below.data(), 3, 7, 0.5f) == false);
+
+    // Way above: close strongly favoured (but not argmax).
+    auto strong = make_logits(64, 3, 7,
+                               /*l_chosen=*/5.0f,
+                               /*l_close=*/4.9f);
+    TEST_ASSERT(dflash::common::soft_close::should_fire(
+        strong.data(), 3, 7, 0.5f) == true);
+}
+
+static void test_soft_close_below_threshold() {
+    // min_ratio=0.5, prob_ratio≈0.3 (well below) → no fire.
+    const float ln_03 = std::log(0.3f);
+    auto row = make_logits(64, 3, 7,
+                            /*l_chosen=*/5.0f,
+                            /*l_close=*/5.0f + ln_03);
+    TEST_ASSERT(dflash::common::soft_close::should_fire(
+        row.data(), 3, 7, 0.5f) == false);
+}
+
+static void test_soft_close_chosen_is_close() {
+    // When the sampler already picks the close token, soft check never
+    // fires — natural-close path handles it.
+    auto row = make_logits(64, 7, 7, /*l_chosen=*/10.0f, /*l_close=*/10.0f);
+    TEST_ASSERT(dflash::common::soft_close::should_fire(
+        row.data(), /*chosen=*/7, /*close0=*/7, /*min_ratio=*/0.5f) == false);
+    TEST_ASSERT(dflash::common::soft_close::should_fire(
+        row.data(), /*chosen=*/7, /*close0=*/7, /*min_ratio=*/1.0f) == false);
+}
+
+static void test_soft_close_tiny_ratio_numerical() {
+    // min_ratio = 1e-6 ⇒ log_ratio ≈ -13.8155. Verify no NaN, threshold
+    // triggers when diff >= -13.8.
+    auto on   = make_logits(64, 3, 7, /*l_chosen=*/5.0f, /*l_close=*/-8.5f);
+    auto off  = make_logits(64, 3, 7, /*l_chosen=*/5.0f, /*l_close=*/-9.0f);
+    TEST_ASSERT(dflash::common::soft_close::should_fire(
+        on.data(), 3, 7, 1e-6f) == true);
+    TEST_ASSERT(dflash::common::soft_close::should_fire(
+        off.data(), 3, 7, 1e-6f) == false);
+}
+
+// ── State-machine integration: soft + hard precedence ─────────────────
+
+static void test_soft_close_single_token_inject() {
+    using namespace dflash::common;
+    BudgetHook hook;
+    hook.close_token_ids = { 248069 };   // Qwen3.6 single-token </think>
+    hook.hard_limit_remaining = 16;
+    hook.soft_close_min_ratio = 0.1f;
+
+    CloseState state;
+    // Step where soft should fire: prob[close] is at least 10% of
+    // prob[chosen]. diff = 3.0 - 5.0 = -2.0 >= log(0.1) ≈ -2.3026.
+    auto row = make_logits(/*vocab=*/250000, /*chosen=*/100, /*close=*/248069,
+                           /*l_chosen=*/5.0f, /*l_close=*/3.0f);
+    int32_t out = step_close_state(
+        /*chosen=*/100, row.data(),
+        hook,
+        /*committed_now=*/100, /*committed_at_entry=*/50, /*n_gen=*/200,
+        state);
+    TEST_ASSERT(out == 248069);
+    TEST_ASSERT(state.started == true);
+    TEST_ASSERT(state.soft_fired == true);
+    TEST_ASSERT(state.hard_fired == false);
+    TEST_ASSERT(state.inject_pos == 1);
+
+    // Next step: sequence is complete (single-token close); returns chosen.
+    auto row2 = make_logits(/*vocab=*/250000, /*chosen=*/200, /*close=*/248069,
+                            /*l_chosen=*/5.0f, /*l_close=*/-50.0f);
+    out = step_close_state(
+        /*chosen=*/200, row2.data(),
+        hook,
+        /*committed_now=*/101, /*committed_at_entry=*/50, /*n_gen=*/200,
+        state);
+    TEST_ASSERT(out == 200);
+    TEST_ASSERT(state.soft_fired == true);  // sticky
+}
+
+static void test_soft_close_multi_token_inject() {
+    using namespace dflash::common;
+    BudgetHook hook;
+    hook.close_token_ids = { 1718, 37947, 32 };  // Laguna-style multi-token </think>
+    hook.hard_limit_remaining = 16;
+    hook.soft_close_min_ratio = 0.1f;
+
+    CloseState state;
+    auto row = make_logits(/*vocab=*/250000, /*chosen=*/100, /*close=*/1718,
+                           /*l_chosen=*/5.0f, /*l_close=*/3.0f);
+    int32_t out = step_close_state(
+        /*chosen=*/100, row.data(),
+        hook,
+        /*committed_now=*/100, /*committed_at_entry=*/50, /*n_gen=*/200,
+        state);
+    TEST_ASSERT(out == 1718);
+    TEST_ASSERT(state.soft_fired == true);
+    TEST_ASSERT(state.inject_pos == 1);
+
+    // Step 2: inject 37947 regardless of chosen_tok.
+    auto row2 = make_logits(/*vocab=*/250000, /*chosen=*/300, /*close=*/1718,
+                            /*l_chosen=*/5.0f, /*l_close=*/-50.0f);
+    out = step_close_state(
+        /*chosen=*/300, row2.data(),
+        hook,
+        /*committed_now=*/101, 50, 200, state);
+    TEST_ASSERT(out == 37947);
+    TEST_ASSERT(state.inject_pos == 2);
+
+    // Step 3: inject 32.
+    out = step_close_state(
+        /*chosen=*/400, row2.data(),
+        hook,
+        /*committed_now=*/102, 50, 200, state);
+    TEST_ASSERT(out == 32);
+    TEST_ASSERT(state.inject_pos == 3);
+
+    // Step 4: sequence complete, returns chosen.
+    out = step_close_state(
+        /*chosen=*/500, row2.data(),
+        hook,
+        /*committed_now=*/103, 50, 200, state);
+    TEST_ASSERT(out == 500);
+}
+
+static void test_soft_close_then_hard_would_fire() {
+    // Soft fires at committed_now=100; hard remaining-check would fire at
+    // committed_now=240 (generated=190, remaining=10 <= hard_limit=16).
+    // Hard path skipped because state.started is already true.
+    // Telemetry: close_kind="soft".
+    using namespace dflash::common;
+    BudgetHook hook;
+    hook.close_token_ids = { 248069 };
+    hook.hard_limit_remaining = 16;
+    hook.soft_close_min_ratio = 0.1f;
+
+    CloseState state;
+    // Soft trigger at committed_now=100.
+    auto soft_row = make_logits(/*vocab=*/250000, 100, 248069, 5.0f, 3.0f);
+    (void)step_close_state(100, soft_row.data(), hook,
+                            /*committed_now=*/100,
+                            /*committed_at_entry=*/50,
+                            /*n_gen=*/200, state);
+    TEST_ASSERT(state.soft_fired == true);
+    TEST_ASSERT(state.hard_fired == false);
+
+    // Now jump to committed_now=240 (generated=190, remaining=10): hard
+    // would have fired here but state.started=true so it's skipped.
+    auto far_row = make_logits(/*vocab=*/250000, 999, 248069, 5.0f, -100.0f);
+    int32_t out = step_close_state(999, far_row.data(), hook,
+                                    /*committed_now=*/240,
+                                    /*committed_at_entry=*/50,
+                                    /*n_gen=*/200, state);
+    // Single-token close already complete; returns chosen.
+    TEST_ASSERT(out == 999);
+    TEST_ASSERT(state.soft_fired == true);
+    TEST_ASSERT(state.hard_fired == false);
+}
+
+static void test_soft_close_disabled_hard_still_fires() {
+    // min_ratio=0 (disabled): hard cap should still fire at the budget
+    // edge. Existing behavior preserved.
+    using namespace dflash::common;
+    BudgetHook hook;
+    hook.close_token_ids = { 248069 };
+    hook.hard_limit_remaining = 16;
+    hook.soft_close_min_ratio = 0.0f;  // disabled
+
+    CloseState state;
+    // Close logit nearly tied with chosen (5.0 vs 4.99); soft would fire if enabled.
+    auto row = make_logits(/*vocab=*/250000, 100, 248069, 5.0f, 4.99f);
+    int32_t out = step_close_state(100, row.data(), hook,
+                                    /*committed_now=*/100, 50, 200, state);
+    // Soft disabled: chosen passes through, not yet at hard boundary.
+    TEST_ASSERT(out == 100);
+    TEST_ASSERT(state.started == false);
+
+    // At hard boundary: committed_now-entry=184 → remaining=16 ≤ hard.
+    // (entry=50, n_gen=200, hard_limit=16 ⇒ trigger at committed_now=234.)
+    out = step_close_state(100, row.data(), hook,
+                            /*committed_now=*/234, 50, 200, state);
+    TEST_ASSERT(out == 248069);
+    TEST_ASSERT(state.hard_fired == true);
+    TEST_ASSERT(state.soft_fired == false);
+}
+
+static void test_soft_close_natural_at_boundary() {
+    // Model picks close on its own (chosen == close0). Soft check skips
+    // (chosen==close0 guard); hard check does not fire because the step
+    // is far from the budget edge. Neither flag set; close_kind would
+    // be "natural" downstream.
+    using namespace dflash::common;
+    BudgetHook hook;
+    hook.close_token_ids = { 248069 };
+    hook.hard_limit_remaining = 16;
+    hook.soft_close_min_ratio = 0.5f;
+
+    CloseState state;
+    auto row = make_logits(/*vocab=*/250000, 248069, 248069, 5.0f, 5.0f);
+    // Far from hard boundary; chosen == close.
+    int32_t out = step_close_state(248069, row.data(), hook,
+                                    /*committed_now=*/100, 50, 200, state);
+    TEST_ASSERT(out == 248069);
+    TEST_ASSERT(state.soft_fired == false);
+    TEST_ASSERT(state.hard_fired == false);
+}
+
+// Probe-vs-inject split. When soft_close_probe_ids is set, the
+// comparator MUST peek the probe[0] logit, NOT inject[0]. Otherwise
+// trained-hint sidecars (inject[0] = content lead-in token) keep
+// the dial pinned at zero.
+static void test_soft_close_probe_uses_probe_ids_not_inject_ids() {
+    using namespace dflash::common;
+    BudgetHook hook;
+    // Multi-token inject (mirrors a trained-hint sidecar).
+    hook.close_token_ids = { 99, 100, 101 };
+    // Distinct single-token probe (the close marker).
+    hook.soft_close_probe_ids = { 42 };
+    hook.soft_close_min_ratio = 0.5f;
+
+    std::vector<float> row(250000, -100.0f);
+    row[300] = 11.0f;     // chosen
+    // A probe logit of 10.0 would give exp(10-11) ≈ 0.368 < 0.5 (no fire), so use 10.31:
+    row[42]  = 10.31f;    // probe: exp(10.31-11) ≈ 0.502, just above the 0.5 threshold
+    row[99]  = -50.0f;    // inject[0] far below — must not influence fire
+
+    CloseState state;
+    int32_t out = step_close_state(/*chosen=*/300, row.data(), hook,
+                                    /*committed_now=*/200, 50, 500, state);
+    TEST_ASSERT(state.soft_fired == true);
+    TEST_ASSERT(out == 99);   // wrote inject[0], not probe[0]
+    TEST_ASSERT(state.inject_pos == 1);
+}
+
+// Empty soft_close_probe_ids ⇒ legacy fallback: peek close_token_ids
+// front. Guarantees zero churn for any caller that doesn't set the
+// new probe field.
+static void test_soft_close_probe_ids_empty_falls_back_to_close_token_ids() {
+    using namespace dflash::common;
+    BudgetHook hook;
+    hook.close_token_ids = { 248069 };
+    // hook.soft_close_probe_ids left empty (legacy).
+    hook.soft_close_min_ratio = 0.5f;
+
+    std::vector<float> row(250000, -100.0f);
+    row[300]    = 11.0f;
+    row[248069] = 10.31f;  // close_token_ids[0]'s logit — same as before
+
+    CloseState state;
+    int32_t out = step_close_state(/*chosen=*/300, row.data(), hook,
+                                    /*committed_now=*/200, 50, 500, state);
+    TEST_ASSERT(state.soft_fired == true);
+    TEST_ASSERT(out == 248069);
+    // Sanity: soft_close_probe_token() returns inject[0] when probe is empty.
+    TEST_ASSERT(hook.soft_close_probe_token() == 248069);
+}
+
+// When soft-close fires, the WRITTEN sequence MUST be close_token_ids
+// (the full inject), regardless of what soft_close_probe_ids contains.
+// The probe is read-only — never appears in the output stream.
+static void test_soft_close_inject_sequence_unchanged_when_fires() {
+    using namespace dflash::common;
+    BudgetHook hook;
+    hook.close_token_ids = { 1718, 37947, 32 };
+    hook.soft_close_probe_ids = { 42 };
+    hook.soft_close_min_ratio = 0.1f;
+
+    std::vector<float> row(250000, -100.0f);
+    row[300] = 5.0f;
+    row[42]  = 3.0f;        // probe: exp(3-5) ≈ 0.135 ≥ 0.1, so soft-close fires
+    row[1718] = -80.0f;     // inject[0] far below — must not matter
+
+    CloseState state;
+    int32_t out = step_close_state(/*chosen=*/300, row.data(), hook,
+                                    /*committed_now=*/100, 50, 200, state);
+    TEST_ASSERT(state.soft_fired == true);
+    TEST_ASSERT(out == 1718);
+
+    std::vector<float> row2(250000, -100.0f);
+    row2[999] = 5.0f;
+    out = step_close_state(/*chosen=*/999, row2.data(), hook,
+                            /*committed_now=*/101, 50, 200, state);
+    TEST_ASSERT(out == 37947);
+    out = step_close_state(/*chosen=*/999, row2.data(), hook,
+                            /*committed_now=*/102, 50, 200, state);
+    TEST_ASSERT(out == 32);
+    out = step_close_state(/*chosen=*/999, row2.data(), hook,
+                            /*committed_now=*/103, 50, 200, state);
+    TEST_ASSERT(out == 999);
+}
+
+// min_thinking_tokens floor: when set, fire is suppressed until
+// generated_so_far >= soft_close_min_tokens.
+static void test_soft_close_min_tokens_blocks_early_fire() {
+    using namespace dflash::common;
+    BudgetHook hook;
+    hook.close_token_ids = { 1718 };
+    hook.soft_close_probe_ids = { 42 };
+    hook.soft_close_min_ratio = 0.5f;
+    hook.soft_close_min_tokens = 100;
+
+    std::vector<float> row(250000, -100.0f);
+    row[300] = 5.0f;
+    row[42]  = 5.0f;   // prob_ratio = exp(5-5) = 1.0 ≥ 0.5
+
+    // Below floor: generated_so_far = 90 - 50 = 40 < 100 ⇒ no fire.
+    CloseState state_early;
+    int32_t out = step_close_state(/*chosen=*/300, row.data(), hook,
+                                    /*committed_now=*/90, 50, 500,
+                                    state_early);
+    TEST_ASSERT(state_early.soft_fired == false);
+    TEST_ASSERT(out == 300);
+
+    // Above floor: generated_so_far = 200 - 50 = 150 >= 100 ⇒ fires.
+    CloseState state_late;
+    out = step_close_state(/*chosen=*/300, row.data(), hook,
+                            /*committed_now=*/200, 50, 500,
+                            state_late);
+    TEST_ASSERT(state_late.soft_fired == true);
+    TEST_ASSERT(out == 1718);
+}
+
+// Default soft_close_min_tokens=0 ⇒ no floor ⇒ fire as soon as
+// qualifying logits show up. Confirms the floor is opt-in.
+static void test_soft_close_min_tokens_default_zero_unchanged_behavior() {
+    using namespace dflash::common;
+    BudgetHook hook;
+    hook.close_token_ids = { 1718 };
+    hook.soft_close_probe_ids = { 42 };
+    hook.soft_close_min_ratio = 0.5f;
+    // soft_close_min_tokens left at default 0.
+
+    std::vector<float> row(250000, -100.0f);
+    row[300] = 5.0f;
+    row[42]  = 5.0f;
+
+    CloseState state;
+    int32_t out = step_close_state(/*chosen=*/300, row.data(), hook,
+                                    /*committed_now=*/1, 0, 500, state);
+    TEST_ASSERT(state.soft_fired == true);
+    TEST_ASSERT(out == 1718);
+}
+
+static void test_soft_close_determinism_when_disabled() {
+    // Byte-identical generation invariant: with min_ratio=0, the
+    // override token MUST equal the chosen token for every step, for
+    // any logit configuration. This is the "zero-cost-when-disabled"
+    // generation determinism guarantee from plan §3.6.
+    using namespace dflash::common;
+    BudgetHook hook;
+    hook.close_token_ids = { 248069 };
+    hook.hard_limit_remaining = 0;          // disable hard too
+    hook.soft_close_min_ratio = 0.0f;       // disabled
+
+    CloseState state;
+    std::mt19937 rng(12345);
+    for (int step = 0; step < 100; step++) {
+        int32_t chosen = (int32_t)(rng() % 1000);
+        float l_chosen = (float)(rng() % 100) / 10.0f - 5.0f;
+        float l_close  = (float)(rng() % 100) / 10.0f - 5.0f;
+        // vocab=250000 covers close_tok=248069. Pre-existing OOB on the
+        // 1000-element row was silently passing in Release builds; new
+        // tests perturbing heap layout could turn it into a crash.
+        auto row = make_logits(/*vocab=*/250000, chosen, /*close=*/248069,
+                                l_chosen, l_close);
+        int32_t out = step_close_state(chosen, row.data(), hook,
+                                        /*committed_now=*/step, 0, 200,
+                                        state);
+        TEST_ASSERT(out == chosen);
+    }
+    TEST_ASSERT(state.soft_fired == false);
+    TEST_ASSERT(state.hard_fired == false);
+}
+
 int main() {
     std::fprintf(stderr, "══════════════════════════════════════════\n");
     std::fprintf(stderr, " Server Unit Tests\n");
@@ -2726,6 +3239,25 @@ int main() {
     RUN_TEST(test_generate_result_accept_rate_in_usage_anthropic);
     RUN_TEST(test_generate_result_accept_rate_zero_when_no_spec_decode);
 
+    std::fprintf(stderr, "\n── Soft-close comparator + state machine ──\n");
+    RUN_TEST(test_soft_close_disabled_default);
+    RUN_TEST(test_soft_close_strict_ratio_one);
+    RUN_TEST(test_soft_close_aggressive_half_prob);
+    RUN_TEST(test_soft_close_below_threshold);
+    RUN_TEST(test_soft_close_chosen_is_close);
+    RUN_TEST(test_soft_close_tiny_ratio_numerical);
+    RUN_TEST(test_soft_close_single_token_inject);
+    RUN_TEST(test_soft_close_multi_token_inject);
+    RUN_TEST(test_soft_close_then_hard_would_fire);
+    RUN_TEST(test_soft_close_disabled_hard_still_fires);
+    RUN_TEST(test_soft_close_natural_at_boundary);
+    RUN_TEST(test_soft_close_probe_uses_probe_ids_not_inject_ids);
+    RUN_TEST(test_soft_close_probe_ids_empty_falls_back_to_close_token_ids);
+    RUN_TEST(test_soft_close_inject_sequence_unchanged_when_fires);
+    RUN_TEST(test_soft_close_min_tokens_blocks_early_fire);
+    RUN_TEST(test_soft_close_min_tokens_default_zero_unchanged_behavior);
+    RUN_TEST(test_soft_close_determinism_when_disabled);
+
     std::fprintf(stderr, "\n══════════════════════════════════════════\n");
     std::fprintf(stderr, " Results: %d assertions, %d failures\n",
                  test_count, test_failures);