774 changes: 774 additions & 0 deletions docs/experiments/soft-close-thinking-termination-plan.md

Large diffs are not rendered by default.

58 changes: 49 additions & 9 deletions docs/specs/thinking-budget.md
@@ -538,13 +538,44 @@ The current taxonomy is:
| Value | Meaning |
|---|---|
| `natural` | The model emitted `</think>` on its own, either before reaching the phase-1 cap or before Level 2 had to force-close. |
| `hard` | The phase-1 cap was reached without a model-emitted `</think>`. Either Level 2 force-closed the block in-loop (preserving KV) or Level 1 ran the phase-2 reprompt. |
| `soft` | The soft-close logit-ratio peek (Level 2.5) fired before the hard cap — `prob[</think>] / prob[chosen_tok]` cleared the operator-configured `soft_close_min_ratio` threshold, and the AR loop injected `</think>` while the model was already "near" closing. Indicates voluntary cooperation: the model would have closed soon anyway; we just hurried it along to reclaim tokens. Currently Qwen3.5/3.6 only. |
| `hard` | The phase-1 cap was reached without a model-emitted `</think>` and without the soft path triggering. Either Level 2 force-closed the block in-loop (preserving KV) or Level 1 ran the phase-2 reprompt. |

When both `soft` and `hard` could fire on the same AR step (the
soft threshold cleared at exactly the budget-edge step), `soft`
wins — the soft trigger carries more information (the model agreed
it was time) than the hard trigger (which only reports coercion).
See `docs/experiments/soft-close-thinking-termination-plan.md` §4 +
§12 for the design rationale.
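
The precedence rule can be sketched as a small classifier. This is a minimal illustration, not the server's attribution code: `classify_close_kind` is a hypothetical name, and its two parameters stand in for the `soft_forced_close` and `budget_forced_close` result flags.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: maps the two per-generation flags to a
// close_kind string. Soft is tested first, so a step on which both
// triggers qualify is reported as "soft".
inline std::string classify_close_kind(bool soft_forced_close,
                                       bool budget_forced_close) {
    if (soft_forced_close)   return "soft";
    if (budget_forced_close) return "hard";
    return "natural";  // the model emitted </think> on its own
}
```

Checking the soft flag first keeps the tie-break in one place, so consumers never need to inspect both flags themselves.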

Soft-close is enabled by the operator via the CLI flag
`--think-soft-close-min-ratio <F>`. Default `0.0` keeps the legacy
two-value taxonomy (`natural` / `hard`); any positive value
activates the third. The dial is a probability ratio in `[0, 1]`:

| `min_ratio` | Behaviour |
|---|---|
| `0.0` | Disabled. Soft path inert; per-request overrides silently ignored. |
| `0.05`–`0.2` | Most permissive: fires once `</think>` is within 20× (`0.05`) to 5× (`0.2`) of the chosen token's probability. Recommended starting range. |
| `0.5` | Stricter: fires only when `</think>` has at least half the probability of the chosen token. |
| `1.0` | Strictest: fires only when `</think>` is at least as likely as the chosen token. Useful as a safety check. |
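
Because `prob[a] / prob[b] = exp(logit[a] - logit[b])`, each ratio corresponds to a maximum logit deficit that `</think>` may have relative to the chosen token. The sketch below makes that conversion explicit; `max_logit_deficit` is an illustrative helper, not a function in the server.

```cpp
#include <cassert>
#include <cmath>

// Largest logit deficit (in nats) that `</think>` may have relative to
// the chosen token and still satisfy
//   prob[</think>] / prob[chosen] >= min_ratio.
// Illustrative only; min_ratio == 0.0 means "disabled" and is rejected.
inline float max_logit_deficit(float min_ratio) {
    assert(min_ratio > 0.0f && min_ratio <= 1.0f);
    return -std::log(min_ratio);
}
```

For example, `min_ratio = 0.1` tolerates a deficit of about 2.3 nats, while `1.0` tolerates none.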

Per-request override (Anthropic envelope, see §4.1):

```jsonc
{
  "thinking": {
    "type": "enabled",
    "soft_close_min_ratio": 0.1
  }
}
```

A third value `soft` is reserved for a future voluntary-close
mechanism (logit-biasing the model toward `</think>` as the cap
approaches, before forcing it). Reserved so consumers can switch on
the value without an exhaustive-match warning when a future server
version adds it; not emitted today.
The per-request value clamps to `min(requested, server_default)` —
clients can tighten (lower the threshold, fire more aggressively)
but not loosen (raise it above the operator's ceiling). When the
server has the dial disabled (`0.0`), per-request overrides are
silently ignored — the feature is operator-policy gated.
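
A minimal sketch of the clamp, under stated assumptions: `effective_min_ratio` is a hypothetical helper, a negative `requested` value is this sketch's way of encoding "no override sent", and the handling of an explicit `0.0` override simply follows the `min()` formula.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: resolves the ratio used for one request.
//   server_ratio: the operator's --think-soft-close-min-ratio value.
//   requested:    the per-request soft_close_min_ratio, or a negative
//                 number when the client sent none (sketch convention).
inline float effective_min_ratio(float server_ratio, float requested) {
    if (server_ratio <= 0.0f) return 0.0f;          // disabled: override ignored
    if (requested < 0.0f)     return server_ratio;  // no override sent
    // Clients may lower the ratio but never raise it above the ceiling.
    return std::min(requested, server_ratio);
}
```

With a server ceiling of `0.2`, a request for `0.1` is honoured as `0.1`, while a request for `0.5` is clamped to `0.2`.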

## 8. Streaming

@@ -564,9 +595,18 @@ in the terminal `message_delta` event for Anthropic.
server-configured ceiling, never looser. Allowing full override
would re-create the silent-truncation footgun of middleboxes that
drop unknown fields.
- **Soft close-kind / soft-budget hint.** The mechanism (logit bias
to nudge `</think>` selection before the hard cap) is sketched in
§7 but not specified.
- **Spec-decode soft-close peek.** Soft-close fires inside the AR
loop. When spec-decode is in use, the close still triggers at the
spec-decode → AR tail-off boundary (slightly later than pure-AR
mode); the verify/accept inner loop does not run the comparator.
Gemma 4 and Laguna are pure-AR; this only matters for Qwen3.5/3.6
with a draft model.
- **Multi-token close joint probability.** When `</think>` tokenizes
to multiple ids, the soft-close comparator peeks only the FIRST
id's logit (the existing multi-token inject machinery drives the
remainder of the sequence on subsequent steps). The joint
`P(t_0, t_1, …)` peek is left to a v2 if false-positive rates
warrant it.
- **Per-token close-info metadata.** The upstream reference exposes
`(token_index, remaining_budget, rank)` for the close event. The
current `finish_details` reports aggregate counts only.
96 changes: 91 additions & 5 deletions server/src/common/model_backend.h
@@ -10,6 +10,7 @@

#pragma once

#include <cmath>
#include <cstdint>
#include <cstdio>
#include <functional>
@@ -71,15 +72,91 @@ struct DaemonIO {
// decode) — the perf trade-off is acceptable since this only kicks in
// for thinking-enabled requests. Spec-decode integration is a follow-up.
struct BudgetHook {
// Multi-token close sequence injected when `(n_gen - committed)`
// drops to `hard_limit_remaining`. For Qwen3.x this is the
// canonical "Considering the limited time..." summarize-and-stop
// lead-in (tokenized at server startup); for non-qwen arches it's
// a single close-tag token. Empty = hook disabled.
// Inject sequence written when the hard cap fires OR when soft-close
// fires. This is the verbatim tokenization of the model card's
// `thinking_terminator_hint` (e.g. for Qwen3.6 the lead-in
// "Considering the limited time by the user, ... </think>\n\n").
// May be many tokens long; the first element is what the AR loop
// writes on the firing step, with the rest streamed out on
// subsequent steps. Empty = disabled.
std::vector<int32_t> close_token_ids;
// Short PROBE sequence used by the soft-close logit-ratio peek.
// Conceptually this is the tokenization of just the close MARKER
// (e.g. `</think>` — a single token id 248069 on Qwen3.6) rather
// than the full inject directive above. Splitting probe-vs-inject
// matters because the inject sequence for trained-hint models
// starts with a content token like "Considering" whose logit is
// 19-35 nats below the chosen token at every step, masking the
// close-marker's true probability and preventing soft-close from
// ever firing.
// When empty, the soft-close peek falls back to
// `close_token_ids.front()` (legacy behavior — kept so models that
// haven't been updated keep working identically to before the split).
std::vector<int32_t> soft_close_probe_ids;
int hard_limit_remaining = 0;
// Soft-close (Level 2.5, voluntary). When > 0, at each AR step the
// loop compares the probe-token logit against the chosen-token
// logit; if `prob[probe[0]] / prob[chosen] >= soft_close_min_ratio`
// (equivalently `logit[probe[0]] - logit[chosen] >= log(min_ratio)`),
// the inject sequence (close_token_ids) is written BEFORE the hard
// limit is reached. 0.0 = disabled (default); 1.0 = fire only when
// the probe token is already the most-likely token; lower values =
// fire more aggressively. See docs/specs/thinking-budget.md §7 and
// docs/experiments/soft-close-thinking-termination-plan.md.
float soft_close_min_ratio = 0.0f;
// Minimum thinking tokens before soft-close is allowed to fire.
// Soft-close peek runs on every AR step but the fire decision is
// gated by this floor — protects against premature termination on
// prompts where the close-marker logit briefly spikes mid-thought.
// 0 = floor disabled (default). Per empirical trajectory data on
// qwen3.6-27b (5 diverse prompts), </think> only becomes
// argmax-competitive at 66-94% of natural reasoning length — so a
// floor in the 64-256 range is the typical operating point.
int soft_close_min_tokens = 0;
// Diagnostic: when true, emit one stderr line per AR step inside the
// thinking phase with (committed, chosen_tok, logit[probe0],
// logit[chosen], diff). Used to record the close-vs-chosen logit
// trajectory across a full thinking run so a sliding-threshold curve
// can be designed from empirical data rather than guessed. Zero cost
// when off. See server_main.cpp --debug-thinking-logits.
bool debug_thinking_logits = false;

// Probe token id used by the soft-close peek. Returns the first
// element of soft_close_probe_ids when set, otherwise falls back to
// close_token_ids.front() (legacy behavior). Callers must guard
// against an empty hook before calling this.
int32_t soft_close_probe_token() const {
    if (!soft_close_probe_ids.empty()) return soft_close_probe_ids.front();
    return close_token_ids.front();
}
};

namespace soft_close {

// Returns true when the soft-close comparator would fire on this AR
// step. Side-effect free; safe to call from unit tests.
//
// Fast path: returns false in O(1) when min_ratio <= 0 (the disabled
// default). When the model has already chosen the close token on its
// own, also returns false — the natural-close path handles that.
//
// Math: `prob[i]/prob[j] = exp(logit[i] - logit[j])`, so
// `prob[close]/prob[chosen] >= min_ratio` ⟺
// `logit[close] - logit[chosen] >= log(min_ratio)`. We compare on
// logits to avoid `exp()` and full-softmax cost; this is numerically
// stable in fp32 for typical LLM logit ranges (~±20).
inline bool should_fire(const float * logits,
                        int32_t chosen_tok,
                        int32_t close0_tok,
                        float min_ratio) {
    if (min_ratio <= 0.0f) return false;
    if (chosen_tok == close0_tok) return false;
    const float log_ratio = std::log(min_ratio);
    return (logits[close0_tok] - logits[chosen_tok]) >= log_ratio;
}

} // namespace soft_close
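
A hedged sketch of how the comparator might combine with the `soft_close_min_tokens` floor. The AR loop itself is not part of this diff, so `soft_close_allowed` and the `committed >= min_tokens` comparison are assumptions; the comparator is copied into a `sketch` namespace only so the example compiles on its own.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

namespace sketch {

// Copy of the logit-space comparator so this example is self-contained.
inline bool should_fire(const float * logits, int32_t chosen_tok,
                        int32_t close0_tok, float min_ratio) {
    if (min_ratio <= 0.0f) return false;
    if (chosen_tok == close0_tok) return false;
    return (logits[close0_tok] - logits[chosen_tok]) >= std::log(min_ratio);
}

// Hypothetical gate: the peek only counts once `committed` thinking
// tokens have reached the floor. The direction of the comparison is an
// assumption of this sketch.
inline bool soft_close_allowed(const std::vector<float> & logits,
                               int32_t chosen_tok, int32_t probe_tok,
                               float min_ratio, int committed,
                               int min_tokens) {
    if (committed < min_tokens) return false;
    return should_fire(logits.data(), chosen_tok, probe_tok, min_ratio);
}

}  // namespace sketch
```

With logits `{2.0, 0.0, 1.0}`, chosen token 0 and probe token 2, the gap is -1.0 nat: a `min_ratio` of `0.2` (threshold about -1.61) fires once the floor is met, while `0.5` (threshold about -0.69) does not.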

struct GenerateRequest {
std::vector<int32_t> prompt;
int n_gen = 0;
@@ -121,6 +198,13 @@ struct GenerateResult {
// stream and grepping for "</think>" cannot distinguish the two
// (the injected close decodes identically).
bool budget_forced_close = false;
// True when the soft-close path (logit-ratio peek) injected the
// </think> close sequence in this generation. Mutually exclusive
// with budget_forced_close: when both could fire on the same step,
// soft wins and budget_forced_close stays false. The server uses
// this to attribute close_kind="soft" (vs "hard"). See
// docs/specs/thinking-budget.md §7.
bool soft_forced_close = false;
// True iff the AR decode loop's post-close watchdog detected an n-gram
// repetition loop and broke out early. Caller surfaces this so clients
// can mark the answer as unreliable rather than treating the
@@ -212,6 +296,8 @@ struct ModelBackend {
retry.spec_decode_ran = first.spec_decode_ran || retry.spec_decode_ran;
retry.budget_forced_close =
    first.budget_forced_close || retry.budget_forced_close;
retry.soft_forced_close =
    first.soft_forced_close || retry.soft_forced_close;
retry.degenerate_decode_close =
    first.degenerate_decode_close || retry.degenerate_decode_close;
return retry;