
feat(server): soft-close thinking termination via logit-ratio peek#326

Open
easel wants to merge 3 commits into
Luce-Org:mainfrom
easel:feat/soft-close-thinking-termination


@easel
Collaborator

@easel easel commented Jun 1, 2026

Summary

Adds an operator-configurable soft-close dial that lets the AR
decode loop end the thinking phase early (by emitting </think>) when
the close-token probability comes within a configured ratio of the
chosen-token probability.
Default 0.0 keeps current behaviour byte-identical; any positive
value activates the third close_kind="soft" taxonomy entry that
docs/specs/thinking-budget.md §7 has reserved since the v2 work.

Motivation. Gemma 4 26B decodes at ~30 tok/s through up to 15 488
phase-1 thinking tokens (~8 min wall-clock / case) before the hard-cap
hook fires. Spot-checks of close-token logits in the late phase-1
window show </think> riding at 10-60 % of the chosen-token
probability for thousands of steps, i.e. the model is nearly ready
to close. A soft-ratio dial in the [0.05, 0.5] range is expected to
reclaim 30-50 % of those tokens without quality loss; the sweep
methodology and recommended dial values land in a follow-up PR (out
of scope here).

What's in the PR

  1. Plan doc (commit 1) at
    docs/experiments/soft-close-thinking-termination-plan.md, with a
    verbatim codex review and per-finding dispositions.
  2. Implementation (commit 2):
    • BudgetHook::soft_close_min_ratio + GenerateResult::soft_forced_close.
    • Soft-close comparator + state machine in qwen35_backend.cpp::do_ar_decode.
    • CLI flag --think-soft-close-min-ratio <F> + startup banner line.
    • Per-request override via Anthropic envelope's
      thinking.soft_close_min_ratio, clamped to the server ceiling.
    • close_kind="soft" emitted in finish_details when the soft path
      fired (precedence: soft > hard > natural on tie — see plan §12).
    • Spec doc updated.
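
The tie-break precedence above (soft > hard > natural) can be sketched as a small resolver. This is a minimal illustration, not the PR's code: `CloseKind`, `resolve_close_kind`, and `close_kind_name` are hypothetical names, and the `"hard"` and `"natural"` strings are assumed from the taxonomy wording (only `"soft"` is quoted in this description).

```cpp
#include <cassert>
#include <string>

// Hypothetical taxonomy for finish_details.close_kind.
enum class CloseKind { Natural, Hard, Soft };

// Soft wins a same-step tie against hard; natural is the fallback.
inline CloseKind resolve_close_kind(bool soft_fired, bool hard_fired) {
    if (soft_fired) return CloseKind::Soft;
    if (hard_fired) return CloseKind::Hard;
    return CloseKind::Natural;
}

// String form; "hard" and "natural" are assumed spellings.
inline std::string close_kind_name(CloseKind kind) {
    switch (kind) {
        case CloseKind::Soft:    return "soft";
        case CloseKind::Hard:    return "hard";
        case CloseKind::Natural: return "natural";
    }
    assert(false && "unreachable: all enumerators handled");
    return "natural";
}
```

With both triggers set on the same step, `resolve_close_kind(true, true)` yields `CloseKind::Soft`, matching the stated precedence.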

Mechanism — zero-cost when disabled

The AR loop already materialises the full logits row to CPU each step
for the sampler. The comparator reads two scalars from that buffer
and runs logit[close0] - logit[chosen] >= log(min_ratio) — no graph
modification, no extra GPU work. When min_ratio == 0 the outer
guard short-circuits before any work happens. Generation determinism
is byte-identical to pre-PR with the dial off.

For the math: prob[i]/prob[j] = exp(logit[i] - logit[j]), so
comparing logit_diff >= log(min_ratio) is identical to comparing
prob_ratio >= min_ratio but skips the softmax. Numerically stable
in fp32 for typical LLM logit ranges (codex confirmed §3.4).
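
A minimal sketch of the logit-space comparison, with an explicit-softmax reference for cross-checking. It is not the `soft_close::should_fire` from this PR; `soft_close_should_fire` and `softmax_ratio` are hypothetical names, and the logits row is assumed to be a plain `std::vector<float>`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Fires when prob[close] / prob[chosen] >= min_ratio, decided in logit space.
inline bool soft_close_should_fire(const std::vector<float>& logits,
                                   std::size_t chosen_id,
                                   std::size_t close_id,
                                   float min_ratio) {
    if (!(min_ratio > 0.0f)) return false;  // dial off (0, negative, NaN)
    assert(chosen_id < logits.size() && close_id < logits.size());
    // prob[i] / prob[j] = exp(logit[i] - logit[j]); the normaliser cancels.
    return logits[close_id] - logits[chosen_id] >= std::log(min_ratio);
}

// Reference only: the same ratio computed through a full softmax.
inline double softmax_ratio(const std::vector<float>& logits,
                            std::size_t num_id, std::size_t den_id) {
    assert(!logits.empty() && num_id < logits.size() && den_id < logits.size());
    double max_logit = logits[0];
    for (float v : logits) if (v > max_logit) max_logit = v;
    double sum = 0.0;
    for (float v : logits) sum += std::exp(v - max_logit);
    const double p_num = std::exp(logits[num_id] - max_logit) / sum;
    const double p_den = std::exp(logits[den_id] - max_logit) / sum;
    return p_num / p_den;
}
```

For a row `{2.0, 0.0, 1.0}` with chosen id 0 and close id 2, the logit gap is -1, so the probability ratio is exp(-1), roughly 0.37: a dial of 0.3 fires and a dial of 0.5 does not.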

Scope (v1)

  • Qwen3.5/3.6 only. Gemma 4 and Laguna's AR loops follow the same
    pattern (full logits already on CPU per step) but get their own PRs
    to keep the diff reviewable.
  • Pure AR. Spec-decode's verify/accept inner loop reads only the
    argmax-of-target — soft-peek there would require graph extension.
    Spec-decode tails off into AR before the budget edge, so the soft
    trigger still fires correctly on the AR tail.
  • Single-token close peek. Multi-token </think> sequences (Laguna)
    peek only the first id; the existing inject machinery handles the
    rest. Codex agreed this is the right engineering trade-off (review
    Q3).

Codex review

Sent to the live Gemma 4 26B service via lucebox codex; verbatim
review and per-finding dispositions are recorded in plan §11. Codex
verdict: PROCEED WITH CHANGES. The single critical finding (Q5,
per-request clamp logic broken when server_default=0) is fixed: the
operator-disabled-server case now silently ignores per-request opt-in
attempts rather than enabling them via the clamp loophole.
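
The Q5 fix can be pictured as a small resolver for the ratio a single request ends up with. This is a sketch under assumptions: `effective_soft_close_ratio` is a hypothetical name, and treating a negative requested value as 0 is an illustrative choice that the PR text does not specify.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>

// Resolve the ratio used for one request from the server dial and an
// optional per-request value (thinking.soft_close_min_ratio).
inline float effective_soft_close_ratio(float server_ratio,
                                        std::optional<float> requested) {
    assert(server_ratio >= 0.0f);
    // Operator has the dial at 0: per-request opt-in is ignored entirely.
    if (server_ratio == 0.0f) return 0.0f;
    // No override supplied: use the server value.
    if (!requested.has_value()) return server_ratio;
    // min(requested, server_ratio), floored at 0 (assumed handling of
    // negative input).
    return std::clamp(*requested, 0.0f, server_ratio);
}
```

The early return for a disabled server is the branch that closes the clamp loophole: the request value is never consulted when the operator has not opted in.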

Tests

server/test/test_server_unit.cpp gains 12 new test functions
(~135 new assertions). The comparator is small and inline in
model_backend.h::soft_close::should_fire, called from a unit-test
mirror of the AR loop's close-state machine. No GPU required.

Standalone smoke-test of the comparator math (uncommitted) confirmed
135/135 assertions pass before committing.

Test plan

  • Unit tests pass in a fresh build: cmake --build server/build --target test_server_unit && ctest -R server_unit
  • Existing thinking-budget integration tests in luce-bench/tests/test_client_thinking_budget.py pass unchanged (default dial=0 → behaviour byte-identical).
  • Operator sets --think-soft-close-min-ratio 0.1 in a smoke deploy, confirms server banner shows the new line, runs a Qwen3.6 thinking-enabled probe, and verifies a finish_details.close_kind="soft" appears for at least one case.
  • Empirical sweep (follow-up PR) quantifies token savings vs quality across a sweep bracket of min_ratio ∈ {0.05, 0.1, 0.2, 0.5} on the existing coding-agent-loop probes.

Follow-ups (NOT in this PR)

  • Gemma 4 26B soft-close port (same mechanism, separate backend file).
  • Laguna soft-close port (separate backend file).
  • dflash.think_soft_close_min_ratio knob in the lucebox python CLI repo + autotune sweep bracket entry.
  • Empirical sweep + recommended dial values per model.
  • Spec-decode joint-peek (if Laguna multi-token false-positives warrant it).

🤖 Generated with Claude Code

easel added 2 commits May 31, 2026 22:49
…io peek

Settle the design for a configurable soft-close dial that lets the AR
loop terminate `</think>` early once its close-token logit comes within
a configurable probability ratio of the chosen-token logit. Default
disabled (zero cost when off); operator opt-in via
`--think-soft-close-min-ratio`; per-request override clamps to the
server ceiling like other thinking knobs.

Key design choices documented:
- Reuse the existing per-step CPU logits read (no graph addition).
- Compare via `logit_diff >= log(min_ratio)` — no softmax required.
- Multi-token close peeks first id only; existing inject machinery
  drives the rest of the sequence.
- Soft wins ties against hard on same-step trigger (rebuttal in §12).
- Spec-decode boundary unchanged — pure-AR only in v1.

Next steps: codex review (§11 placeholder), implementation, tests.

Add an operator-configurable dial (`--think-soft-close-min-ratio`) that
lets the AR loop terminate `</think>` early when its close-token logit
comes within a configured probability ratio of the chosen-token logit.
Default `0.0` (disabled) is byte-identical to pre-change behaviour.

Mechanism (Qwen3.5/3.6 AR loop only in v1):
- Comparator runs after sampling, before the existing hard-cap hook,
  using the logits row that's already on CPU for the sampler — no
  graph addition, no extra GPU work.
- Threshold check uses `logit[close] - logit[chosen] >= log(min_ratio)`,
  which is mathematically equivalent to a probability-ratio compare
  but avoids softmax / exp() cost.
- Per-request override (`thinking.soft_close_min_ratio`) clamps to
  `min(requested, server_default)`; ignored entirely when the operator
  has the dial at 0 (codex review Q5 fix).
- Multi-token close peeks first id only; existing inject machinery
  drives the remaining ids.
- New `close_kind="soft"` value in `finish_details`; spec §7 updated.
  Soft wins ties against hard on the same step (plan §4 + §12).

Plumbing:
- `BudgetHook::soft_close_min_ratio` (model_backend.h).
- `GenerateResult::soft_forced_close`.
- `ServerConfig::soft_close_min_ratio` + `--think-soft-close-min-ratio`
  CLI flag + startup banner line.
- `ParsedRequest::per_req_soft_close_min_ratio` parsed from
  `thinking.soft_close_min_ratio`.
- `do_ar_decode` / `do_spec_decode` signatures extended with a
  `soft_forced_close_out` pointer; existing hard-cap path untouched.

Tests (12 new, 17 RUN_TEST invocations adding ~135 assertions):
- Comparator math: disabled/strict/aggressive/below-threshold/
  chosen-is-close/tiny-ratio edge cases.
- State machine: single-token + multi-token inject, soft-preempts-hard,
  disabled-hard-still-fires, natural-at-boundary, byte-identical
  determinism when disabled.

Spec-decode boundary documented as v1 limitation (out of scope).
Gemma4 + Laguna soft-close are follow-ups; lucebox python config and
autotune sweep brackets land in the lucebox CLI repo.

See docs/experiments/soft-close-thinking-termination-plan.md for the
full design (with verbatim codex review + dispositions).
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment


2 issues found across 9 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="docs/experiments/soft-close-thinking-termination-plan.md">

<violation number="1" location="docs/experiments/soft-close-thinking-termination-plan.md:103">
P3: Plan document contradicts itself on whether `log(min_ratio)` is precomputed once outside the AR loop or computed each step. §3.1's code snippet computes `std::log(budget_hook.soft_close_min_ratio)` inside the loop's if-block (every step the comparator runs), but the text immediately after says it is 'precomputed once outside the loop' and §3.6 repeats 'precomputed once at AR entry'. The actual implementation in `soft_close::should_fire` (model_backend.h:108) also computes `std::log(min_ratio)` on each call rather than caching it. A reader trying to implement from the plan would get contradictory guidance about where to place the `log()` call.</violation>
</file>

<file name="server/src/qwen35/qwen35_backend.cpp">

<violation number="1" location="server/src/qwen35/qwen35_backend.cpp:983">
P1: Soft-close skips the first token of multi-token close sequences because `maybe_force_close` immediately overwrites `close[0]` with `close[1]` in the same step.</violation>
</file>


tok, close0, budget_hook.close_token_ids.size());
tok = close0;
budget_close_started = true;
close_inject_pos = 1;
Contributor


P1: Soft-close skips the first token of multi-token close sequences because maybe_force_close immediately overwrites close[0] with close[1] in the same step.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35/qwen35_backend.cpp, line 983:

<comment>Soft-close skips the first token of multi-token close sequences because `maybe_force_close` immediately overwrites `close[0]` with `close[1]` in the same step.</comment>

<file context>
@@ -938,6 +943,47 @@ bool Qwen35Backend::do_ar_decode(int committed, int n_gen,
+            tok, close0, budget_hook.close_token_ids.size());
+        tok = close0;
+        budget_close_started = true;
+        close_inject_pos = 1;
+        if (soft_forced_close_out) *soft_forced_close_out = true;
+    };
</file context>
Suggested change
- close_inject_pos = 1;
+ close_inject_pos = 0;

// prob[close] / prob[chosen] = exp(l_close - l_chosen);
// Compare l_close - l_chosen >= log(min_ratio) — single fma,
// no exp() needed.
const float log_ratio = std::log(budget_hook.soft_close_min_ratio);
Contributor


P3: Plan document contradicts itself on whether log(min_ratio) is precomputed once outside the AR loop or computed each step. §3.1's code snippet computes std::log(budget_hook.soft_close_min_ratio) inside the loop's if-block (every step the comparator runs), but the text immediately after says it is 'precomputed once outside the loop' and §3.6 repeats 'precomputed once at AR entry'. The actual implementation in soft_close::should_fire (model_backend.h:108) also computes std::log(min_ratio) on each call rather than caching it. A reader trying to implement from the plan would get contradictory guidance about where to place the log() call.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At docs/experiments/soft-close-thinking-termination-plan.md, line 103:

<comment>Plan document contradicts itself on whether `log(min_ratio)` is precomputed once outside the AR loop or computed each step. §3.1's code snippet computes `std::log(budget_hook.soft_close_min_ratio)` inside the loop's if-block (every step the comparator runs), but the text immediately after says it is 'precomputed once outside the loop' and §3.6 repeats 'precomputed once at AR entry'. The actual implementation in `soft_close::should_fire` (model_backend.h:108) also computes `std::log(min_ratio)` on each call rather than caching it. A reader trying to implement from the plan would get contradictory guidance about where to place the `log()` call.</comment>

<file context>
@@ -0,0 +1,774 @@
+        // prob[close] / prob[chosen] = exp(l_close - l_chosen);
+        // Compare l_close - l_chosen >= log(min_ratio) — single fma,
+        // no exp() needed.
+        const float log_ratio = std::log(budget_hook.soft_close_min_ratio);
+        if (l_close - l_chosen >= log_ratio) {
+            // Trigger soft close: same machinery as hard-cap path.
</file context>
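
One way to make the plan text and the code agree is to cache the log threshold once at AR entry. The sketch below only illustrates that option; `SoftCloseThreshold` is a hypothetical type and is not in `model_backend.h`.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Computes log(min_ratio) once; the per-step check then needs no log().
struct SoftCloseThreshold {
    bool enabled;     // false when the dial is 0
    float log_ratio;  // log(min_ratio), or +inf when disabled

    explicit SoftCloseThreshold(float min_ratio)
        : enabled(min_ratio > 0.0f),
          log_ratio(enabled ? std::log(min_ratio)
                            : std::numeric_limits<float>::infinity()) {
        assert(!(min_ratio < 0.0f));  // negative ratios are a config error
    }

    // Per-step check: one subtract and one compare.
    bool should_fire(float l_close, float l_chosen) const {
        return enabled && (l_close - l_chosen >= log_ratio);
    }
};
```

Constructing the object before the decode loop and calling `should_fire` inside it would match the "precomputed once at AR entry" wording; keeping `std::log` inside the call matches the current code. Either is defensible as long as the plan and the implementation say the same thing.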

easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 1, 2026
Integrate soft-close thinking termination while preserving the existing empty-visible-output retry path, stall guards, MoE AR dispatch path, and C2 gate tests.
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 1, 2026
Record PR Luce-Org#326 integration, current PR-head coverage, retained conflict probes, and Luce-Org#321 target-shard IPC feasibility findings.
easel added a commit to easel/lucebox-hub that referenced this pull request Jun 2, 2026
…ebox-docker

Brings the soft-close logit-ratio peek mechanism onto feat/lucebox-docker
so the cuda12 image can be rebuilt with both the call:<verb>{} parser+
emitter fix (Luce-Org#329) AND the auto-thinking-cap dial available in a single
sweep.

Folded:
- 1552495 docs(experiments): plan soft-close thinking termination
- d799d00 feat(server): soft-close thinking termination via logit-ratio peek

Conflicts resolved:
- server/src/qwen35/qwen35_backend.cpp: do_ar_decode signature kept
  HEAD's terse comment + soft-close's new bool *soft_forced_close_out
  parameter.
- server/test/test_server_unit.cpp: concatenated HEAD's C2-gate tests
  with soft-close's comparator/state-machine tests; merged both
  RUN_TEST blocks.

Plumbing added in this merge (not on the source branch):
- DFLASH_THINK_SOFT_CLOSE_MIN_RATIO env var in entrypoint.sh, emitted
  to the server CLI as --think-soft-close-min-ratio only when nonzero
  (preserves byte-identical-when-disabled invariant).
- DflashRuntime.think_soft_close_min_ratio (float, default 0.0) in
  lucebox types/config/docker_run so `lucebox config set
  dflash.think_soft_close_min_ratio=0.5` propagates through.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
easel added a commit to easel/lucebox-hub that referenced this pull request Jun 2, 2026
…ring PR Luce-Org#326 merge

Cherry-pick artifact from resolving the conflict in test_server_unit.cpp
during the soft-close merge — `sed -i '4155d'` deleted the closing
brace of test_soft_close_natural_at_boundary instead of the leftover
conflict-marker line. Compile fails with 'a function-definition is not
allowed here before `{` token' at the int main() that follows.

Restores the brace; no logic change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
easel added a commit to easel/lucebox-hub that referenced this pull request Jun 2, 2026
Adds an operator-only flag that emits one stderr line per AR step
inside the thinking phase recording (committed, chosen_tok, close0_tok,
logit[close], logit[chosen], diff, prob_ratio). Designed to capture
real close-vs-chosen logit trajectories on qwen3.6 so a sliding-ratio
soft-close curve can be fit from data rather than guessed.

The fixed-ratio soft-close (PR Luce-Org#326) terminates thinking when
logit[close]-logit[chosen] >= log(ratio). A single ratio is the wrong
tool for both "step 1 reasoning" and "5K-token reasoning" — what we
want is a ratio that slides from strict at the start to permissive at
the cap. Curve shape (linear / exponential / piecewise) depends on
how the logit gap evolves through thinking, which this flag now
exposes empirically.
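
As a purely illustrative example of one candidate shape, the sketch below interpolates linearly in log-ratio (geometrically in ratio) from a strict start value to a permissive value at the cap. No curve has been chosen in this commit; `sliding_log_ratio` and all of its parameters are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Returns the log-space threshold for a given thinking step.
// start_ratio applies at step 0 (strict, near 1.0); end_ratio applies at
// cap_steps and beyond (permissive, small).
inline float sliding_log_ratio(int step, int cap_steps,
                               float start_ratio, float end_ratio) {
    assert(cap_steps > 0 && start_ratio > 0.0f && end_ratio > 0.0f);
    const float t =
        std::clamp(static_cast<float>(step) / cap_steps, 0.0f, 1.0f);
    return (1.0f - t) * std::log(start_ratio) + t * std::log(end_ratio);
}
```

The result would be compared against `logit[close] - logit[chosen]` in place of the fixed `log(ratio)`.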

Plumbing:
- BudgetHook::debug_thinking_logits (model_backend.h)
- qwen35_backend.cpp maybe_soft_close lambda: emits [soft-trace]
  every step when flag set, regardless of soft_close_min_ratio.
  Also enables the prefill-last-logits read on the first AR token
  so step 0 participates.
- ServerConfig::debug_thinking_logits + --debug-thinking-logits CLI
  + startup banner line.
- http_server.cpp threads config_.debug_thinking_logits into the
  per-request BudgetHook.
- DFLASH_DEBUG_THINKING_LOGITS env in entrypoint.sh (default 0;
  forwarded to --debug-thinking-logits when "1").
- lucebox: DflashRuntime.debug_thinking_logits (bool, default
  False) + config.py setter + docker_run.py env emission.

Zero GPU cost (logits already on CPU for sampling); ~1 stderr line
per thinking token across in-flight requests when on. Off by default.

No behavior change when DFLASH_DEBUG_THINKING_LOGITS=0.

test_server_unit: 1973 assertions, 0 failures.
lucebox tests: 114/114 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…floor

Empirical motivation
====================

Soft-close (PR Luce-Org#326 mainline) was effectively inert on qwen3.6-27b. A
trajectory probe across 5 diverse prompts (1085-5771 thinking tokens
each) showed `prob_ratio < 1e-8` every step — meaning no sampled ratio
in {0.1, 0.3, 0.5, 0.7, 0.9} would ever fire.

Root cause: `BudgetHook::close_token_ids` was used for both:
  (a) the peek-token id read by `soft_close::should_fire(..., close0)`
  (b) the inject sequence written when the hook fires.

For qwen3.6-27b the model card's `thinking_terminator_hint` is a 16+
token English directive starting with "Considering the limited time by
the user, ...". So `close_token_ids[0]` tokenized to id 79939 ("Considering")
— a mid-sentence content token whose logit sits 19-35 nats below the
chosen token at every thinking step.

Fix (path α): split probe-vs-inject in BudgetHook
==================================================

* `close_token_ids` — unchanged role. Full inject sequence written on
  hard close or when soft-close fires.
* `soft_close_probe_ids` — NEW. Short sequence (typically one token)
  used only for the comparator peek. server_main detects the close
  marker substring inside the hint and tokenizes it in isolation;
  on miss it leaves the probe field empty (legacy fallback peek path
  in force). `BudgetHook::soft_close_probe_token()` returns the probe
  when set, else falls back to close_token_ids.front().
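
A reduced sketch of the probe/inject split described above. `BudgetHookSketch` is a hypothetical stand-in that keeps only the two fields and the accessor named in this message; the token-id type is assumed to be a 32-bit integer.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct BudgetHookSketch {
    std::vector<std::int32_t> close_token_ids;       // full inject sequence
    std::vector<std::int32_t> soft_close_probe_ids;  // peek only; empty = legacy

    // Token whose logit the comparator peeks at each step.
    std::int32_t soft_close_probe_token() const {
        assert(!soft_close_probe_ids.empty() || !close_token_ids.empty());
        return soft_close_probe_ids.empty() ? close_token_ids.front()
                                            : soft_close_probe_ids.front();
    }
};
```

Setting the probe changes only what is peeked; the inject sequence in `close_token_ids` is left as it was.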

Validation: re-probed with image built from this branch. `</think>`
(token 248069) reliably becomes argmax-competitive at 66-94% of natural
reasoning length across all 5 prompts. `max_diff` reaches 0.000
(`prob_ratio = 1.0`) on every prompt vs prior `max_diff = -9.69` on
token 79939. 9.7 nat improvement, restoring the mechanism to its
designed regime.

False-positive guard: min_thinking_tokens floor
================================================

The peek runs every AR step but the fire decision can be gated by a
new `BudgetHook::soft_close_min_tokens` (server CLI:
`--think-soft-close-min-tokens N`). When set, suppress fire until
`committed_now - committed_at_entry >= soft_close_min_tokens`. Protects
against a rare early `</think>` logit spike on prompts where the model
briefly considers concluding mid-thought. Default 0 = floor disabled
(no behavior change from prior).

Empirical 66-94% fire window puts typical operating point at floor=128
for qwen3.6-27b. Per-request override not exposed (server-policy gate).
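
The floor gate reduces to a single comparison on the committed-token counter. A minimal sketch, assuming integer counters; `floor_allows_fire` is a hypothetical name.

```cpp
#include <cassert>

// True when the soft-close fire decision is allowed at this step.
inline bool floor_allows_fire(int committed_now, int committed_at_entry,
                              int soft_close_min_tokens) {
    assert(committed_now >= committed_at_entry);
    if (soft_close_min_tokens <= 0) return true;  // default 0: floor disabled
    return committed_now - committed_at_entry >= soft_close_min_tokens;
}
```

With a floor of 128, a spike at the 100th thinking token is suppressed, while the same spike at the 128th token or later is allowed through to the comparator result.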

Diagnostic: --debug-thinking-logits
====================================

Adds `BudgetHook::debug_thinking_logits` + server CLI flag. When on,
emits one stderr line per AR step recording committed, chosen, probe0,
logit[close], logit[chosen], diff, prob_ratio. Used to capture full
close-vs-chosen trajectories so a sliding-ratio curve can be designed
from data rather than guessed. Zero GPU cost (logits already on CPU
for sampling); stderr-heavy, operator-only.
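
For readers picturing the trace, the sketch below formats one line from the fields listed above. Only the `[soft-trace]` tag and the field list come from the commit messages; the key names, ordering, and number formats here are assumptions, as is the `format_soft_trace` helper itself.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>
#include <string>

// Builds one trace line; the caller would write it to stderr.
inline std::string format_soft_trace(int committed, int chosen_tok,
                                     int probe_tok, float l_close,
                                     float l_chosen) {
    const float diff = l_close - l_chosen;
    char buf[320];
    const int n = std::snprintf(
        buf, sizeof buf,
        "[soft-trace] committed=%d chosen=%d probe0=%d "
        "l_close=%.3f l_chosen=%.3f diff=%.3f prob_ratio=%.3e",
        committed, chosen_tok, probe_tok, l_close, l_chosen, diff,
        std::exp(diff));
    assert(n > 0);
    (void)n;  // silence unused-variable warnings when asserts are disabled
    return std::string(buf);
}
```

Logging `diff` alongside `prob_ratio` keeps the line useful when the ratio underflows to values like 1e-8, where the logit gap is the more readable number.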

Tests
=====

5 new unit tests:
  - test_soft_close_probe_uses_probe_ids_not_inject_ids
  - test_soft_close_probe_ids_empty_falls_back_to_close_token_ids
  - test_soft_close_inject_sequence_unchanged_when_fires
  - test_soft_close_min_tokens_blocks_early_fire
  - test_soft_close_min_tokens_default_zero_unchanged_behavior

Also fixes a pre-existing OOB write in test_soft_close_determinism_when_disabled
(vocab=1000 row indexed at 248069). UB-silent in Release before the
new tests perturbed heap layout enough to crash; widened to vocab=250000
in place.

test_server_unit: 1780 assertions, 0 failures.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@easel
Collaborator Author

easel commented Jun 3, 2026

Follow-up commit f7e8d6f8: probe/inject split + min_thinking_tokens floor

Empirical validation on qwen3.6-27b showed the dial inert at every sampled ratio (prob_ratio < 1e-8 across 12,888 thinking-token steps). Root cause: BudgetHook::close_token_ids was used for both the soft-close peek probe AND the inject sequence — for qwen3.6's trained-hint sidecar that meant peeking the "Considering" lead-in (id 79939) instead of the </think> marker (id 248069).

This commit:

  1. Splits probe-vs-inject in BudgetHook. New soft_close_probe_ids field (empty = legacy fallback). server_main detects the marker substring inside the hint and tokenizes it in isolation.
  2. Adds --think-soft-close-min-tokens N false-positive floor (default 0 = disabled).
  3. Adds --debug-thinking-logits trajectory diagnostic for tuning future curves.

After the fix, re-probe shows </think> reaches argmax (max_diff = 0.000, prob_ratio = 1.0) at 66-94% of natural reasoning length across 5 diverse prompts. 9.7 nat improvement restores the mechanism to its designed regime.

5 new unit tests (test_server_unit: 1780 assertions, 0 failures). PR #331 (which had this fix on a bad base) is closed in favor of consolidation here.

🤖 Generated with Claude Code

easel added a commit to easel/lucebox-hub that referenced this pull request Jun 3, 2026