fix(server): route Qwen3.6/Laguna think-mode reasoning to reasoning_content channel #308
Open
easel wants to merge 4 commits into
…ontent channel
The SseEmitter hard-started in StreamMode::CONTENT and only transitioned to
REASONING when it saw `<think>` in the generated stream. But Qwen3.6 / Laguna
chat templates append `<think>\n` to the prompt suffix when enable_thinking is
honored, so the model emits reasoning tokens directly with no opening tag —
the emitter never transitioned and reasoning text leaked into `content` while
`reasoning_content` stayed empty. ds4-eval pass rate: 14.1% (think) vs 71.7%
(no-think) for Qwen3.6-27B Q4_K_M.
The plumbing was already there: parse_reasoning() supports
started_in_thinking=true (reasoning.h:17-19) but no caller passed it.
Fix:
1. chat_template.h: render_chat_template / render_chat_template_jinja now
return a PromptRenderResult { text, started_in_thinking }. The built-in
QWEN3 and LAGUNA branches set started_in_thinking deterministically when
enable_thinking && add_generation_prompt; GEMMA4 stays false (its
reasoning channel is opened by the model emitting `<|channel>`, which
http_server forwards into the emitter as `<think>`). The Jinja path
suffix-sniffs the rendered prompt for a trailing `<think>` opener and
emits a [WARN] log when sniffing decides true so a template/model-card
mismatch surfaces at runtime.
2. SseEmitter: add `initial_mode = StreamMode::CONTENT` defaulted parameter.
When constructed with REASONING, active_kind_ initializes to "thinking"
so the Anthropic first content_block is `thinking` instead of `text`
(avoids a spurious empty text-block stop+restart on the first reasoning
delta). Deliberately leaves checked_think_prefix_ at its default (false)
so the existing one-time `<think>` strip guard still trips if a
template/model-card mismatch causes the model to emit a redundant opener.
3. http_server.cpp: thread render_result.started_in_thinking through
ParsedRequest into the SseEmitter's initial_mode. Both streaming and
non-streaming paths feed tokens through the same emitter, so the fix
covers both response shapes.
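
The Jinja suffix-sniff from item 1 can be sketched as follows. This is a Python model under stated assumptions, not the C++ in chat_template.h: `PromptRenderResult` mirrors the two fields named above, while `sniff_started_in_thinking`, `wrap_jinja_render`, the whitespace handling, and the log wording are illustrative guesses.

```python
import logging
from dataclasses import dataclass

log = logging.getLogger("chat_template_sketch")  # hypothetical logger name


@dataclass
class PromptRenderResult:
    text: str
    started_in_thinking: bool


def sniff_started_in_thinking(rendered: str) -> bool:
    """True when the rendered prompt ends with an unclosed <think> opener.

    Assumption: whitespace after the opener (such as a trailing newline) is ignored.
    """
    return rendered.rstrip().endswith("<think>")


def wrap_jinja_render(rendered: str) -> PromptRenderResult:
    """Attach the sniffed flag to an already-rendered prompt (hypothetical helper)."""
    started = sniff_started_in_thinking(rendered)
    if started:
        # Models the [WARN] described above, so a template/model-card
        # mismatch is visible at runtime.
        log.warning("prompt ends with <think>; emitter should start in reasoning mode")
    return PromptRenderResult(text=rendered, started_in_thinking=started)
```

A prompt that ends with an already closed `<think>...</think>` pair does not trigger the sniff, because the text then ends with `</think>` rather than `<think>`.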
Tests: add 12 unit tests under test_server_unit (assertion count 1608 →
1637): SseEmitter initial_mode=REASONING routing for OPENAI_CHAT and
ANTHROPIC formats (closed, unclosed, redundant-opener-strip cases) plus
PromptRenderResult.started_in_thinking provenance for QWEN3 / LAGUNA /
GEMMA4 (enable/disable/no-gen-prompt) and the Jinja suffix-sniff
positive/negative cases.
Smoke-tested manually against Qwen3.6-27B Q4_K_M; non-streaming
`/v1/chat/completions` with `thinking:{type:enabled}` now populates
reasoning_content and never leaks `</think>` into content.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add three C++ tests that chain render_chat_template + SseEmitter so the wiring between the renderer's started_in_thinking flag and the emitter's initial_mode is exercised end-to-end, not just at each end. The per-unit tests above each verify their half of the contract, but the original bug was a missing call-site wire — both halves were correct in isolation. Also tighten the Python integration test assertions for enable_thinking and reasoning.effort: require non-empty reasoning_content and no raw <think>/</think> in either channel. The prior 'doesn't crash' assertion would have passed on the broken code. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
easel added a commit to easel/lucebox-hub that referenced this pull request May 29, 2026
…box-docker) Brings the Qwen3.6/Laguna think-mode reasoning fix (route reasoning into reasoning_content channel instead of content) into the lucebox-docker stack.
…tion
Adds a deterministic, scenario-driven integration test that exercises the
full HttpServer request path on a CPU-only CI runner — no GPU, no model
weights, no GGUF download. Designed to catch the regression class from
this PR (renderer→emitter wiring) end-to-end pre-merge.
Components:
- StubModelBackend (server/test/stub_model_backend.*) — a ModelBackend
whose generate() decodes the prompt to text via the real tokenizer,
matches it against a JSON scenario (longest prompt_suffix wins), then
replays scripted tokens through the production req.on_token/io.on_token
callbacks. Streaming behavior comes from the production code path; the
stub just feeds it tokens.
- ScenarioStore (server/test/scenario_store.*) — loads server/test/
scenarios/*.json. Schema:
{ "match": {"prompt_suffix": "..."},
"response": {"tokens": [...], "finish_reason": "stop"} }
Tokens are either plain strings (BPE-encoded by the real tokenizer) or
{text, special:true} objects (looked up via token_to_id, so Qwen3.6's
single-token </think> arrives as the right ID).
- spike_no_gpu_http_server (server/test/) — driver that wires Tokenizer
+ ScenarioStore + StubModelBackend + HttpServer together. Links
dflash_common (CUDA TUs included) but never instantiates a real model;
ggml_cuda_init() is never called, so CUDA_VISIBLE_DEVICES="" is the
supported configuration.
- Tokenizer fixture (server/test/fixtures/qwen3.6-tokenizer.gguf, 11MB
via LFS) — full Qwen3.6 vocab/merges/special tokens stripped from the
27B GGUF. Real BPE round-tripping, deterministic token concordance.
Build script at server/test/scripts/strip_gguf_to_tokenizer.py.
- test_stub_integration.py — pytest module that spawns the driver and
exercises OpenAI + Anthropic, streaming + non-streaming. Asserts on
reasoning_content routing, content channel cleanliness, no <think>
leakage, Anthropic first-content-block-is-thinking. 4 tests, 0.43s.
- CI (.github/workflows/ci.yml) — new "Run CPU integration tests" step
after the venv populate, before the megakernel build. Builds the
spike target alongside the existing ones. Enables LFS in checkout so
the tokenizer fixture lands.
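
The "longest prompt_suffix wins" rule from the ScenarioStore bullet can be sketched in a few lines. This is a minimal Python model assuming the schema shown above; `pick_scenario` is a hypothetical name, and returning `None` when nothing matches is an assumption, since the commit message does not say what the stub does in that case.

```python
from typing import Optional


def pick_scenario(prompt_text: str, scenarios: list[dict]) -> Optional[dict]:
    """Return the scenario whose match.prompt_suffix is the longest suffix of prompt_text."""
    best = None
    best_len = -1
    for scenario in scenarios:
        suffix = scenario["match"]["prompt_suffix"]
        # Only scenarios whose suffix really ends the prompt are candidates;
        # among those, the longest suffix is the most specific match.
        if prompt_text.endswith(suffix) and len(suffix) > best_len:
            best, best_len = scenario, len(suffix)
    return best
```

Preferring the longest suffix lets a generic fallback scenario coexist with a more specific one, for example a think-mode prompt that ends in `<think>` plus a newline.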
Why this matters: the 1656-assertion test_server_unit suite catches
emitter/renderer issues in isolation but cannot fail on a missed
http_server.cpp wire (the original bug). This new step exercises the
exact wire — render → ParsedRequest → SseEmitter → SSE socket — on
every PR.
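
As an illustration of what an assertion at the SSE socket can look like, the sketch below parses an Anthropic-style event stream and checks that thinking deltas arrive before text deltas. It is not the code in test_stub_integration.py; the helper names are invented for the example, and only the event and delta type names come from the Anthropic streaming format.

```python
import json


def parse_sse(raw: str) -> list[dict]:
    """Collect the JSON payload of every `data:` line in an SSE body."""
    events = []
    for line in raw.splitlines():
        if line.startswith("data:"):
            payload = line[len("data:"):].strip()
            # OpenAI-style streams end with a literal [DONE] marker; skip it.
            if payload and payload != "[DONE]":
                events.append(json.loads(payload))
    return events


def delta_types(events: list[dict]) -> list[str]:
    """Delta types of content_block_delta events, in arrival order."""
    return [e["delta"]["type"] for e in events if e.get("type") == "content_block_delta"]


def thinking_precedes_text(events: list[dict]) -> bool:
    """True only if both kinds occur and every thinking delta comes before any text delta."""
    kinds = delta_types(events)
    if "thinking_delta" not in kinds or "text_delta" not in kinds:
        return False
    last_thinking = max(i for i, k in enumerate(kinds) if k == "thinking_delta")
    first_text = kinds.index("text_delta")
    return last_thinking < first_text
```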
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The binary is the permanent stub-driven HTTP server test driver, not throwaway exploration. Rename it to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 30, 2026
Include PR Luce-Org#308 after it became non-draft and record the latest containment, conflict probes, retained worktrees, and validation results.
Problem
For models whose chat template appends `<think>` to the prompt suffix when `enable_thinking` is honored (Qwen3.6, Laguna), the model generates reasoning tokens directly into the output stream with no opening tag. `SseEmitter` was hardcoded to start in `StreamMode::CONTENT` and only transitioned to `REASONING` when it saw a `<think>` literal in the generated stream. Result: reasoning text leaked into `content`, `reasoning_content` stayed empty, and the `</think>` close tag appeared verbatim in the user-visible answer.

`parse_reasoning()` already supported a `started_in_thinking=true` mode (see `reasoning.h`), so the receiving end of the contract existed. The sending end never threaded the flag through. Both halves were correct in isolation; no caller connected them.
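
To make the `started_in_thinking` contract concrete, here is a minimal Python model of a reasoning splitter for a completed generation. It is not the C++ `parse_reasoning()` from `reasoning.h`; the name `split_reasoning`, its return shape, and the whitespace trimming are assumptions made for illustration.

```python
def split_reasoning(text: str, started_in_thinking: bool = False) -> tuple[str, str]:
    """Return (reasoning, content) for a finished generation.

    Hypothetical model of the contract only; not the real parse_reasoning().
    """
    open_tag, close_tag = "<think>", "</think>"
    stripped = text.lstrip()
    if stripped.startswith(open_tag):
        # Explicit opener in the output. This also covers a redundant opener
        # emitted even though the prompt already opened the channel.
        body = stripped[len(open_tag):]
    elif started_in_thinking:
        # The prompt pre-opened the channel, so the output begins
        # mid-reasoning with no opener of its own.
        body = text
    else:
        # No opener and not pre-opened: everything is content.
        return "", text
    reasoning, sep, content = body.partition(close_tag)
    if not sep:
        # Unclosed channel: everything stays in reasoning.
        return reasoning.strip(), ""
    return reasoning.strip(), content.lstrip("\n")
```

With the flag left at its default, an opener-less generation such as `plan</think>answer` is returned entirely as content, close tag included, which is the leak described above.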
Fix
Plumb a `started_in_thinking` bit from the chat-template renderer to the SSE emitter so the emitter starts in the correct stream mode when the prompt itself pre-opens the reasoning channel.

1. `render_chat_template` and `render_chat_template_jinja` now return `PromptRenderResult { text, started_in_thinking }`.
   - QWEN3 and LAGUNA: `started_in_thinking = true` iff `enable_thinking && add_generation_prompt`.
   - GEMMA4: stays `false` (its reasoning channel is opened by the model emitting `<|channel>`, which the server already forwards as `<think>`).
   - Jinja path: suffix-sniffs the rendered prompt for a trailing `<think>` and logs a `[WARN]` when the sniff fires, so template/model-card mismatches surface at runtime.
2. `SseEmitter` takes an `initial_mode` constructor parameter (defaulted to `CONTENT`). When `REASONING`, the Anthropic first `content_block_start` is emitted as `thinking` so SDK clients don't see a spurious empty `text` block. The existing one-time `<think>`-strip guard is preserved so a redundant model-emitted opener (template/card mismatch) is still removed cleanly.
3. `http_server.cpp` threads `render_result.started_in_thinking` through `ParsedRequest` into the emitter's `initial_mode`. Streaming and non-streaming paths share the emitter, so both response shapes are fixed by one wire.
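
The effect of `initial_mode` on routing can be shown with a toy token router. This is a Python sketch, not the C++ `SseEmitter`: the class and function names are hypothetical, and it assumes `<think>` and `</think>` each arrive as a single whole token and that an opener is only recognised on the first token.

```python
from enum import Enum


class StreamMode(Enum):
    CONTENT = "content"
    REASONING = "reasoning"


class EmitterSketch:
    """Toy token router; assumes <think> and </think> each arrive as one token."""

    def __init__(self, initial_mode: StreamMode = StreamMode.CONTENT):
        self.mode = initial_mode
        self.checked_think_prefix = False  # one-time redundant-opener guard
        self.reasoning: list[str] = []
        self.content: list[str] = []

    def feed(self, token: str) -> None:
        if not self.checked_think_prefix:
            self.checked_think_prefix = True
            if token.strip() == "<think>":
                # Opener on the first token: enter (or stay in) reasoning
                # mode and drop the tag itself.
                self.mode = StreamMode.REASONING
                return
        if self.mode is StreamMode.REASONING and token.strip() == "</think>":
            # Close tag ends the reasoning channel and is not emitted.
            self.mode = StreamMode.CONTENT
            return
        target = self.reasoning if self.mode is StreamMode.REASONING else self.content
        target.append(token)


def route(tokens: list[str], initial_mode: StreamMode) -> tuple[str, str]:
    """Feed all tokens and return (reasoning_content, content)."""
    emitter = EmitterSketch(initial_mode)
    for tok in tokens:
        emitter.feed(tok)
    return "".join(emitter.reasoning), "".join(emitter.content)
```

Starting the sketch in `CONTENT` with an opener-less stream reproduces the reported symptom (empty reasoning, a raw `</think>` in content), while starting it in `REASONING` routes the same tokens correctly. A redundant first-token opener is dropped in either mode.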
Tests
C++ unit (`test_server_unit`, runs in CI)

1656 assertions, 0 failures. Coverage:

- QWEN3 / LAGUNA / GEMMA4 renderers set `started_in_thinking` correctly across `enable_thinking` ∈ {on, off} and `add_generation_prompt` ∈ {on, off}; Jinja suffix-sniff positive and negative cases.
- `SseEmitter` with `initial_mode=REASONING`: tokens before `</think>` route to `reasoning_content`; unclosed channel keeps everything in reasoning; redundant `<think>` opener is stripped; Anthropic first block is `thinking`, not `text`.
- Chained `render_chat_template → propagate flag → SseEmitter(initial_mode=…) → emit tokens` end-to-end for QWEN3-on, LAGUNA-on, and QWEN3-off. Mirrors the production wiring at `http_server.cpp`.

CPU-only HTTP integration (`test_stub_integration.py`, runs in CI)

Closes the original "CI can't load a real model" gap that allowed this bug to ship. A new test driver, `spike_no_gpu_http_server`, links against the production `dflash_common` library but instantiates a deterministic `StubModelBackend` instead of a CUDA-backed model. The driver runs with `CUDA_VISIBLE_DEVICES=""`; the real Qwen3.6 tokenizer loads from a stripped GGUF fixture (`server/test/fixtures/qwen3.6-tokenizer.gguf`, 11MB via LFS, vocab/merges/specials only, no tensor weights).

`StubModelBackend::generate()` decodes the request prompt to text via the real tokenizer, matches it against a JSON scenario file (longest `prompt_suffix` wins), then replays the scripted token stream through the production `req.on_token`/`io.on_token` callbacks. Streaming behavior comes from the production code path: the stub feeds the same `SseEmitter` that GPU traffic does.
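
The replay step can be modeled roughly as below. This is a hedged Python sketch rather than the C++ stub: `replay_scenario` and the `on_token(text, is_special)` callback signature are hypothetical, the real stub emits tokenizer IDs rather than text, and defaulting `finish_reason` to `"stop"` is an assumption.

```python
from typing import Callable


def replay_scenario(scenario: dict, on_token: Callable[[str, bool], None]) -> str:
    """Feed each scripted token to on_token(text, is_special); return finish_reason.

    Hypothetical model: text is passed through unchanged instead of being
    converted to token IDs by a tokenizer.
    """
    response = scenario["response"]
    for tok in response["tokens"]:
        if isinstance(tok, str):
            # Plain string entry: ordinary text token.
            on_token(tok, False)
        else:
            # Object entry: {"text": ..., "special": true} marks a special token.
            on_token(tok["text"], bool(tok.get("special", False)))
    return response.get("finish_reason", "stop")
```

Because the stub only calls the callback, everything downstream of it (mode switching, delta framing, SSE output) is whatever the caller wires in.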

The new pytest suite (4 tests, 0.43s) covers:

- OpenAI non-streaming: `reasoning_content` populated, content has no `<think>` leakage.
- Anthropic non-streaming: thinking block precedes text block; no raw tags.
- OpenAI streaming: per-token `reasoning_content` deltas before `</think>`, per-token `content` deltas after.
- Anthropic streaming: first `content_block_start` has `type:"thinking"` (not `text`); `thinking_delta` events precede `text_delta` events.

Scenarios are JSON files in `server/test/scenarios/` and trivially extensible: any deterministic prompt→token-stream pairing can be captured as a file. Synthesizing edge cases (a missing `</think>`, redundant openers, error responses, finish_reason variants) is just authoring more files.
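
Authoring an edge-case scenario might look like the following. This is a sketch under assumptions: the two-key layout follows the scenario schema given in the commit message, but the helper names, the file name, and the use of `"length"` as a finish reason are illustrative and not taken from the PR.

```python
import json
from pathlib import Path


def make_scenario(prompt_suffix: str, tokens: list, finish_reason: str = "stop") -> dict:
    """Build a scenario dict in the {match, response} layout."""
    return {
        "match": {"prompt_suffix": prompt_suffix},
        "response": {"tokens": tokens, "finish_reason": finish_reason},
    }


def write_scenario(directory: Path, name: str, scenario: dict) -> Path:
    """Write the scenario as <directory>/<name>.json and return the path."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{name}.json"
    path.write_text(json.dumps(scenario, indent=2), encoding="utf-8")
    return path


# Edge case: reasoning that never closes (no </think> token is scripted).
# "length" as the finish reason is an assumed value for illustration.
unclosed = make_scenario(
    prompt_suffix="<think>\n",
    tokens=["Let me think", " about this"],
    finish_reason="length",
)
```

Calling `write_scenario(Path("server/test/scenarios"), "unclosed_think", unclosed)` would then drop the file next to the existing scenarios.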

Manual smoke (existing `test_server_integration.py`, run by deploy)

The Python integration test for a live GPU server has had its reasoning assertions tightened: prior versions used "doesn't crash" assertions that would have passed on the broken code. The test now requires non-empty `reasoning_content` and no raw `<think>`/`</think>` in either channel. This is the deploy-time check; the CPU integration test above is the pre-merge check.
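
A tightened assertion of this kind could be written as below. It is a minimal sketch, not the contents of `test_server_integration.py`: the helper name is hypothetical, and it assumes an OpenAI-style non-streaming response in which `reasoning_content` sits next to `content` on the message.

```python
def assert_reasoning_routed(response: dict) -> None:
    """Raise AssertionError unless reasoning landed in reasoning_content only."""
    message = response["choices"][0]["message"]
    reasoning = message.get("reasoning_content") or ""
    content = message.get("content") or ""
    # A "doesn't crash" check would stop here; require real reasoning text.
    assert reasoning.strip(), "reasoning_content is empty"
    # Neither channel may carry a raw think tag.
    for channel_name, channel_text in (("reasoning_content", reasoning), ("content", content)):
        for tag in ("<think>", "</think>"):
            assert tag not in channel_text, f"raw {tag} leaked into {channel_name}"
```

A response shaped like the pre-fix output (empty `reasoning_content`, `</think>` inside `content`) fails the first assertion, which is the property the earlier checks lacked.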
Files
No model-card JSON or schema changes. No production code outside `server/src/server/`.

Out of scope

- `thinking_prompt_pre_opens_think` model-card override field. The renderer is the source of truth; an override is belt-and-suspenders.
- … `(prompt, emitted tokens, finish)` triples into scenario files automatically. Hand-authored scenarios are sufficient for the bug classes we care about today.
- … with reasoning). The infrastructure supports these; each is a new JSON file.