fix(server): route Qwen3.6/Laguna think-mode reasoning to reasoning_content channel by easel · Pull Request #308 · Luce-Org/lucebox-hub

easel · 2026-05-29T15:26:02Z

Problem

For models whose chat template appends <think> to the prompt suffix when
enable_thinking is honored (Qwen3.6, Laguna), the model generates
reasoning tokens directly into the output stream with no opening tag.
SseEmitter was hardcoded to start in StreamMode::CONTENT and only
transitioned to REASONING when it saw a <think> literal in the
generated stream. Result: reasoning text leaked into content,
reasoning_content stayed empty, and the </think> close tag appeared
verbatim in the user-visible answer.

parse_reasoning() already supported a started_in_thinking=true mode
(see reasoning.h) — the receiving end of the contract existed. The
sending end never threaded the flag through. Both halves were correct
in isolation; no caller connected them.

Fix

Plumb a started_in_thinking bit from the chat-template renderer to the
SSE emitter so the emitter starts in the correct stream mode when the
prompt itself pre-opens the reasoning channel.

render_chat_template and render_chat_template_jinja now return
PromptRenderResult { text, started_in_thinking }.
- QWEN3 / LAGUNA builtins set started_in_thinking = true iff
  enable_thinking && add_generation_prompt.
- GEMMA4 always sets false (its reasoning channel is opened by the
  model emitting <|channel>, which the server already forwards as
  <think>).
- Jinja path: suffix-sniffs the rendered prompt for a trailing
  <think> and logs a [WARN] when the sniff fires, so
  template/model-card mismatches surface at runtime.
SseEmitter takes an initial_mode constructor parameter (defaulted
to CONTENT). When REASONING, the Anthropic first
content_block_start is emitted as thinking so SDK clients don't
see a spurious empty text block. The existing one-time
<think>-strip guard is preserved so a redundant model-emitted
opener (template/card mismatch) is still removed cleanly.
http_server.cpp threads render_result.started_in_thinking through
ParsedRequest into the emitter's initial_mode. Streaming and
non-streaming paths share the emitter, so both response shapes are
fixed by one wire.

Tests

C++ unit (`test_server_unit`, runs in CI)

1656 assertions, 0 failures. Coverage:

Renderer end: QWEN3/LAGUNA/GEMMA4 set started_in_thinking correctly
across enable_thinking ∈ {on, off} and add_generation_prompt ∈
{on, off}; Jinja suffix-sniff positive and negative cases.
Emitter end: with initial_mode=REASONING, tokens before </think>
route to reasoning_content; unclosed channel keeps everything in
reasoning; redundant <think> opener is stripped; Anthropic first
block is thinking, not text.
Wire integration: three tests chain render_chat_template → propagate flag → SseEmitter(initial_mode=…) → emit tokens end-to-end
for QWEN3-on, LAGUNA-on, and QWEN3-off. Mirrors the production wiring
at http_server.cpp.

CPU-only HTTP integration (`test_stub_integration.py`, runs in CI)

Closes the original "CI can't load a real model" gap that allowed this
bug to ship. A new test driver — spike_no_gpu_http_server — links
against the production dflash_common library but instantiates a
deterministic StubModelBackend instead of a CUDA-backed model. The
driver runs with CUDA_VISIBLE_DEVICES=""; the real Qwen3.6 tokenizer
loads from a stripped GGUF fixture (server/test/fixtures/ qwen3.6-tokenizer.gguf, 11MB via LFS, vocab/merges/specials only, no
tensor weights).

StubModelBackend::generate() decodes the request prompt to text via
the real tokenizer, matches it against a JSON scenario file (longest
prompt_suffix wins), then replays the scripted token stream through
the production req.on_token / io.on_token callbacks. Streaming
behavior comes from the production code path — the stub feeds the same
SseEmitter that GPU traffic does.

The new pytest suite (4 tests, 0.43s) covers:

Qwen3.6 enable_thinking non-streaming OpenAI: reasoning_content
populated, content has no <think> leakage.
Qwen3.6 enable_thinking non-streaming Anthropic: thinking block
precedes text block; no raw tags.
Qwen3.6 enable_thinking streaming OpenAI: per-token
reasoning_content deltas before </think>, per-token content
deltas after.
Qwen3.6 enable_thinking streaming Anthropic: first
content_block_start has type:"thinking" (not text);
thinking_delta events precede text_delta events.

Scenarios are JSON files in server/test/scenarios/ and trivially
extensible — any deterministic prompt→token-stream pairing can be
captured as a file. Synthesizing edge cases (unclosed </think>,
redundant openers, error responses, finish_reason variants) is just
authoring more files.

Manual smoke (existing `test_server_integration.py`, run by deploy)

The Python integration test for a live GPU server has had its reasoning
assertions tightened: prior versions used "doesn't crash" assertions
that would have passed on the broken code. Now require non-empty
reasoning_content and no raw <think>/</think> in either channel.
This is the deploy-time check; the CPU integration test above is the
pre-merge check.

Files

server/src/server/chat_template.{h,cpp}        +91/-6
server/src/server/http_server.{h,cpp}          +30/-8
server/src/server/sse_emitter.{h,cpp}          +25/-4
server/test/test_server_unit.cpp              +330/-10
server/scripts/test_server_integration.py      +32/-6

# New CPU integration infrastructure
server/test/stub_model_backend.{h,cpp}        +130 new
server/test/scenario_store.{h,cpp}            +220 new
server/test/spike_no_gpu_http_server.cpp      +110 new
server/test/test_stub_integration.py          +220 new
server/test/scenarios/qwen3_enable_thinking_basic.json  new
server/test/fixtures/qwen3.6-tokenizer.gguf   11MB (LFS) new
server/test/scripts/strip_gguf_to_tokenizer.py +90 new
.github/workflows/ci.yml                       +14
.gitattributes                                 +1
server/CMakeLists.txt                          +25

No model-card JSON or schema changes. No production code outside
server/src/server/.

Out of scope

Optional thinking_prompt_pre_opens_think model-card override field.
The renderer is the source of truth; an override is belt-and-suspenders.
A "record mode" that wraps the real GPU backend and captures
(prompt, emitted tokens, finish) triples into scenario files
automatically. Hand-authored scenarios are sufficient for the bug
classes we care about today.
Additional scenarios (Laguna, Gemma4, multi-turn, tool calls inline
with reasoning). The infrastructure supports these — each is a new
JSON file.

…ontent channel The SseEmitter hard-started in StreamMode::CONTENT and only transitioned to REASONING when it saw `<think>` in the generated stream. But Qwen3.6 / Laguna chat templates append `<think>\n` to the prompt suffix when enable_thinking is honored, so the model emits reasoning tokens directly with no opening tag — the emitter never transitioned and reasoning text leaked into `content` while `reasoning_content` stayed empty. ds4-eval pass rate: 14.1% (think) vs 71.7% (no-think) for Qwen3.6-27B Q4_K_M. The plumbing was already there: parse_reasoning() supports started_in_thinking=true (reasoning.h:17-19) but no caller passed it. Fix: 1. chat_template.h: render_chat_template / render_chat_template_jinja now return a PromptRenderResult { text, started_in_thinking }. The built-in QWEN3 and LAGUNA branches set started_in_thinking deterministically when enable_thinking && add_generation_prompt; GEMMA4 stays false (its reasoning channel is opened by the model emitting `<|channel>`, which http_server forwards into the emitter as `<think>`). The Jinja path suffix-sniffs the rendered prompt for a trailing `<think>` opener and emits a [WARN] log when sniffing decides true so a template/model-card mismatch surfaces at runtime. 2. SseEmitter: add `initial_mode = StreamMode::CONTENT` defaulted parameter. When constructed with REASONING, active_kind_ initializes to "thinking" so the Anthropic first content_block is `thinking` instead of `text` (avoids a spurious empty text-block stop+restart on the first reasoning delta). Deliberately leaves checked_think_prefix_ at its default (false) so the existing one-time `<think>` strip guard still trips if a template/model-card mismatch causes the model to emit a redundant opener. 3. http_server.cpp: thread render_result.started_in_thinking through ParsedRequest into the SseEmitter's initial_mode. Both streaming and non-streaming paths feed tokens through the same emitter, so the fix covers both response shapes. Tests: add 12 unit tests under test_server_unit (assertion count 1608 → 1637): SseEmitter initial_mode=REASONING routing for OPENAI_CHAT and ANTHROPIC formats (closed, unclosed, redundant-opener-strip cases) plus PromptRenderResult.started_in_thinking provenance for QWEN3 / LAGUNA / GEMMA4 (enable/disable/no-gen-prompt) and the Jinja suffix-sniff positive/negative cases. Smoke-tested manually against Qwen3.6-27B Q4_K_M; non-streaming `/v1/chat/completions` with `thinking:{type:enabled}` now populates reasoning_content and never leaks `</think>` into content. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add three C++ tests that chain render_chat_template + SseEmitter so the wiring between the renderer's started_in_thinking flag and the emitter's initial_mode is exercised end-to-end, not just at each end. The per-unit tests above each verify their half of the contract, but the original bug was a missing call-site wire — both halves were correct in isolation. Also tighten the Python integration test assertions for enable_thinking and reasoning.effort: require non-empty reasoning_content and no raw <think>/</think> in either channel. The prior 'doesn't crash' assertion would have passed on the broken code. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…box-docker) Brings the Qwen3.6/Laguna think-mode reasoning fix (route reasoning into reasoning_content channel instead of content) into the lucebox-docker stack.

…tion Adds a deterministic, scenario-driven integration test that exercises the full HttpServer request path on a CPU-only CI runner — no GPU, no model weights, no GGUF download. Designed to catch the regression class from this PR (renderer→emitter wiring) end-to-end pre-merge. Components: - StubModelBackend (server/test/stub_model_backend.*) — a ModelBackend whose generate() decodes the prompt to text via the real tokenizer, matches it against a JSON scenario (longest prompt_suffix wins), then replays scripted tokens through the production req.on_token/io.on_token callbacks. Streaming behavior comes from the production code path; the stub just feeds it tokens. - ScenarioStore (server/test/scenario_store.*) — loads server/test/ scenarios/*.json. Schema: { "match": {"prompt_suffix": "..."}, "response": {"tokens": [...], "finish_reason": "stop"} } Tokens are either plain strings (BPE-encoded by the real tokenizer) or {text, special:true} objects (looked up via token_to_id, so Qwen3.6's single-token </think> arrives as the right ID). - spike_no_gpu_http_server (server/test/) — driver that wires Tokenizer + ScenarioStore + StubModelBackend + HttpServer together. Links dflash_common (CUDA TUs included) but never instantiates a real model; ggml_cuda_init() is never called, so CUDA_VISIBLE_DEVICES="" is the supported configuration. - Tokenizer fixture (server/test/fixtures/qwen3.6-tokenizer.gguf, 11MB via LFS) — full Qwen3.6 vocab/merges/special tokens stripped from the 27B GGUF. Real BPE round-tripping, deterministic token concordance. Build script at server/test/scripts/strip_gguf_to_tokenizer.py. - test_stub_integration.py — pytest module that spawns the driver and exercises OpenAI + Anthropic, streaming + non-streaming. Asserts on reasoning_content routing, content channel cleanliness, no <think> leakage, Anthropic first-content-block-is-thinking. 4 tests, 0.43s. - CI (.github/workflows/ci.yml) — new "Run CPU integration tests" step after the venv populate, before the megakernel build. Builds the spike target alongside the existing ones. Enables LFS in checkout so the tokenizer fixture lands. Why this matters: the 1656-assertion test_server_unit suite catches emitter/renderer issues in isolation but cannot fail on a missed http_server.cpp wire (the original bug). This new step exercises the exact wire — render → ParsedRequest → SseEmitter → SSE socket — on every PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The binary is the permanent stub-driven HTTP server test driver, not throwaway exploration. Rename it to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Include PR Luce-Org#308 after it became non-draft and record the latest containment, conflict probes, retained worktrees, and validation results.

cubic-dev-ai

No issues found across 20 files

_{Re-trigger cubic}

easel and others added 2 commits May 29, 2026 11:25

easel and others added 2 commits May 29, 2026 18:40

rename: spike_no_gpu_http_server → replay_http_server

9d4defe

The binary is the permanent stub-driven HTTP server test driver, not throwaway exploration. Rename it to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

easel marked this pull request as ready for review May 30, 2026 10:33

easel pushed a commit to easel/lucebox-hub that referenced this pull request May 30, 2026

docs: refresh auto-integration manifest

04e97f3

Include PR Luce-Org#308 after it became non-draft and record the latest containment, conflict probes, retained worktrees, and validation results.

cubic-dev-ai Bot reviewed May 30, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(server): route Qwen3.6/Laguna think-mode reasoning to reasoning_content channel#308

fix(server): route Qwen3.6/Laguna think-mode reasoning to reasoning_content channel#308
easel wants to merge 4 commits into
Luce-Org:mainfrom
easel:fix/qwen-think-channel

easel commented May 29, 2026 •

edited

Loading

Uh oh!

cubic-dev-ai Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

easel commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Tests

C++ unit (test_server_unit, runs in CI)

CPU-only HTTP integration (test_stub_integration.py, runs in CI)

Manual smoke (existing test_server_integration.py, run by deploy)

Files

Out of scope

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

easel commented May 29, 2026 •

edited

Loading

C++ unit (`test_server_unit`, runs in CI)

CPU-only HTTP integration (`test_stub_integration.py`, runs in CI)

Manual smoke (existing `test_server_integration.py`, run by deploy)