Skip to content

fix(server): route Qwen3.6/Laguna think-mode reasoning to reasoning_content channel#308

Open
easel wants to merge 4 commits into
Luce-Org:mainfrom
easel:fix/qwen-think-channel
Open

fix(server): route Qwen3.6/Laguna think-mode reasoning to reasoning_content channel#308
easel wants to merge 4 commits into
Luce-Org:mainfrom
easel:fix/qwen-think-channel

Conversation

@easel
Copy link
Copy Markdown
Collaborator

@easel easel commented May 29, 2026

Problem

For models whose chat template appends <think> to the prompt suffix when
enable_thinking is honored (Qwen3.6, Laguna), the model generates
reasoning tokens directly into the output stream with no opening tag.
SseEmitter was hardcoded to start in StreamMode::CONTENT and only
transitioned to REASONING when it saw a <think> literal in the
generated stream. Result: reasoning text leaked into content,
reasoning_content stayed empty, and the </think> close tag appeared
verbatim in the user-visible answer.

parse_reasoning() already supported a started_in_thinking=true mode
(see reasoning.h) — the receiving end of the contract existed. The
sending end never threaded the flag through. Both halves were correct
in isolation; no caller connected them.

Fix

Plumb a started_in_thinking bit from the chat-template renderer to the
SSE emitter so the emitter starts in the correct stream mode when the
prompt itself pre-opens the reasoning channel.

  1. render_chat_template and render_chat_template_jinja now return
    PromptRenderResult { text, started_in_thinking }.

    • QWEN3 / LAGUNA builtins set started_in_thinking = true iff
      enable_thinking && add_generation_prompt.
    • GEMMA4 always sets false (its reasoning channel is opened by the
      model emitting <|channel>, which the server already forwards as
      <think>).
    • Jinja path: suffix-sniffs the rendered prompt for a trailing
      <think> and logs a [WARN] when the sniff fires, so
      template/model-card mismatches surface at runtime.
  2. SseEmitter takes an initial_mode constructor parameter (defaulted
    to CONTENT). When REASONING, the Anthropic first
    content_block_start is emitted as thinking so SDK clients don't
    see a spurious empty text block. The existing one-time
    <think>-strip guard is preserved so a redundant model-emitted
    opener (template/card mismatch) is still removed cleanly.

  3. http_server.cpp threads render_result.started_in_thinking through
    ParsedRequest into the emitter's initial_mode. Streaming and
    non-streaming paths share the emitter, so both response shapes are
    fixed by one wire.

Tests

C++ unit (test_server_unit, runs in CI)

1656 assertions, 0 failures. Coverage:

  • Renderer end: QWEN3/LAGUNA/GEMMA4 set started_in_thinking correctly
    across enable_thinking ∈ {on, off} and add_generation_prompt
    {on, off}; Jinja suffix-sniff positive and negative cases.
  • Emitter end: with initial_mode=REASONING, tokens before </think>
    route to reasoning_content; unclosed channel keeps everything in
    reasoning; redundant <think> opener is stripped; Anthropic first
    block is thinking, not text.
  • Wire integration: three tests chain render_chat_template → propagate flag → SseEmitter(initial_mode=…) → emit tokens end-to-end
    for QWEN3-on, LAGUNA-on, and QWEN3-off. Mirrors the production wiring
    at http_server.cpp.

CPU-only HTTP integration (test_stub_integration.py, runs in CI)

Closes the original "CI can't load a real model" gap that allowed this
bug to ship. A new test driver — spike_no_gpu_http_server — links
against the production dflash_common library but instantiates a
deterministic StubModelBackend instead of a CUDA-backed model. The
driver runs with CUDA_VISIBLE_DEVICES=""; the real Qwen3.6 tokenizer
loads from a stripped GGUF fixture (server/test/fixtures/ qwen3.6-tokenizer.gguf, 11MB via LFS, vocab/merges/specials only, no
tensor weights).

StubModelBackend::generate() decodes the request prompt to text via
the real tokenizer, matches it against a JSON scenario file (longest
prompt_suffix wins), then replays the scripted token stream through
the production req.on_token / io.on_token callbacks. Streaming
behavior comes from the production code path — the stub feeds the same
SseEmitter that GPU traffic does.

The new pytest suite (4 tests, 0.43s) covers:

  • Qwen3.6 enable_thinking non-streaming OpenAI: reasoning_content
    populated, content has no <think> leakage.
  • Qwen3.6 enable_thinking non-streaming Anthropic: thinking block
    precedes text block; no raw tags.
  • Qwen3.6 enable_thinking streaming OpenAI: per-token
    reasoning_content deltas before </think>, per-token content
    deltas after.
  • Qwen3.6 enable_thinking streaming Anthropic: first
    content_block_start has type:"thinking" (not text);
    thinking_delta events precede text_delta events.

Scenarios are JSON files in server/test/scenarios/ and trivially
extensible — any deterministic prompt→token-stream pairing can be
captured as a file. Synthesizing edge cases (unclosed </think>,
redundant openers, error responses, finish_reason variants) is just
authoring more files.

Manual smoke (existing test_server_integration.py, run by deploy)

The Python integration test for a live GPU server has had its reasoning
assertions tightened: prior versions used "doesn't crash" assertions
that would have passed on the broken code. Now require non-empty
reasoning_content and no raw <think>/</think> in either channel.
This is the deploy-time check; the CPU integration test above is the
pre-merge check.

Files

server/src/server/chat_template.{h,cpp}        +91/-6
server/src/server/http_server.{h,cpp}          +30/-8
server/src/server/sse_emitter.{h,cpp}          +25/-4
server/test/test_server_unit.cpp              +330/-10
server/scripts/test_server_integration.py      +32/-6

# New CPU integration infrastructure
server/test/stub_model_backend.{h,cpp}        +130 new
server/test/scenario_store.{h,cpp}            +220 new
server/test/spike_no_gpu_http_server.cpp      +110 new
server/test/test_stub_integration.py          +220 new
server/test/scenarios/qwen3_enable_thinking_basic.json  new
server/test/fixtures/qwen3.6-tokenizer.gguf   11MB (LFS) new
server/test/scripts/strip_gguf_to_tokenizer.py +90 new
.github/workflows/ci.yml                       +14
.gitattributes                                 +1
server/CMakeLists.txt                          +25

No model-card JSON or schema changes. No production code outside
server/src/server/.

Out of scope

  • Optional thinking_prompt_pre_opens_think model-card override field.
    The renderer is the source of truth; an override is belt-and-suspenders.
  • A "record mode" that wraps the real GPU backend and captures
    (prompt, emitted tokens, finish) triples into scenario files
    automatically. Hand-authored scenarios are sufficient for the bug
    classes we care about today.
  • Additional scenarios (Laguna, Gemma4, multi-turn, tool calls inline
    with reasoning). The infrastructure supports these — each is a new
    JSON file.

easel and others added 2 commits May 29, 2026 11:25
…ontent channel

The SseEmitter hard-started in StreamMode::CONTENT and only transitioned to
REASONING when it saw `<think>` in the generated stream. But Qwen3.6 / Laguna
chat templates append `<think>\n` to the prompt suffix when enable_thinking is
honored, so the model emits reasoning tokens directly with no opening tag —
the emitter never transitioned and reasoning text leaked into `content` while
`reasoning_content` stayed empty. ds4-eval pass rate: 14.1% (think) vs 71.7%
(no-think) for Qwen3.6-27B Q4_K_M.

The plumbing was already there: parse_reasoning() supports
started_in_thinking=true (reasoning.h:17-19) but no caller passed it.

Fix:

1. chat_template.h: render_chat_template / render_chat_template_jinja now
   return a PromptRenderResult { text, started_in_thinking }. The built-in
   QWEN3 and LAGUNA branches set started_in_thinking deterministically when
   enable_thinking && add_generation_prompt; GEMMA4 stays false (its
   reasoning channel is opened by the model emitting `<|channel>`, which
   http_server forwards into the emitter as `<think>`). The Jinja path
   suffix-sniffs the rendered prompt for a trailing `<think>` opener and
   emits a [WARN] log when sniffing decides true so a template/model-card
   mismatch surfaces at runtime.

2. SseEmitter: add `initial_mode = StreamMode::CONTENT` defaulted parameter.
   When constructed with REASONING, active_kind_ initializes to "thinking"
   so the Anthropic first content_block is `thinking` instead of `text`
   (avoids a spurious empty text-block stop+restart on the first reasoning
   delta). Deliberately leaves checked_think_prefix_ at its default (false)
   so the existing one-time `<think>` strip guard still trips if a
   template/model-card mismatch causes the model to emit a redundant opener.

3. http_server.cpp: thread render_result.started_in_thinking through
   ParsedRequest into the SseEmitter's initial_mode. Both streaming and
   non-streaming paths feed tokens through the same emitter, so the fix
   covers both response shapes.

Tests: add 12 unit tests under test_server_unit (assertion count 1608 →
1637): SseEmitter initial_mode=REASONING routing for OPENAI_CHAT and
ANTHROPIC formats (closed, unclosed, redundant-opener-strip cases) plus
PromptRenderResult.started_in_thinking provenance for QWEN3 / LAGUNA /
GEMMA4 (enable/disable/no-gen-prompt) and the Jinja suffix-sniff
positive/negative cases.

Smoke-tested manually against Qwen3.6-27B Q4_K_M; non-streaming
`/v1/chat/completions` with `thinking:{type:enabled}` now populates
reasoning_content and never leaks `</think>` into content.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add three C++ tests that chain render_chat_template + SseEmitter so the
wiring between the renderer's started_in_thinking flag and the emitter's
initial_mode is exercised end-to-end, not just at each end. The per-unit
tests above each verify their half of the contract, but the original bug
was a missing call-site wire — both halves were correct in isolation.

Also tighten the Python integration test assertions for enable_thinking
and reasoning.effort: require non-empty reasoning_content and no raw
<think>/</think> in either channel. The prior 'doesn't crash' assertion
would have passed on the broken code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
easel added a commit to easel/lucebox-hub that referenced this pull request May 29, 2026
…box-docker)

Brings the Qwen3.6/Laguna think-mode reasoning fix (route reasoning into
reasoning_content channel instead of content) into the lucebox-docker stack.
easel and others added 2 commits May 29, 2026 18:40
…tion

Adds a deterministic, scenario-driven integration test that exercises the
full HttpServer request path on a CPU-only CI runner — no GPU, no model
weights, no GGUF download. Designed to catch the regression class from
this PR (renderer→emitter wiring) end-to-end pre-merge.

Components:

- StubModelBackend (server/test/stub_model_backend.*) — a ModelBackend
  whose generate() decodes the prompt to text via the real tokenizer,
  matches it against a JSON scenario (longest prompt_suffix wins), then
  replays scripted tokens through the production req.on_token/io.on_token
  callbacks. Streaming behavior comes from the production code path; the
  stub just feeds it tokens.

- ScenarioStore (server/test/scenario_store.*) — loads server/test/
  scenarios/*.json. Schema:
    { "match": {"prompt_suffix": "..."},
      "response": {"tokens": [...], "finish_reason": "stop"} }
  Tokens are either plain strings (BPE-encoded by the real tokenizer) or
  {text, special:true} objects (looked up via token_to_id, so Qwen3.6's
  single-token </think> arrives as the right ID).

- spike_no_gpu_http_server (server/test/) — driver that wires Tokenizer
  + ScenarioStore + StubModelBackend + HttpServer together. Links
  dflash_common (CUDA TUs included) but never instantiates a real model;
  ggml_cuda_init() is never called, so CUDA_VISIBLE_DEVICES="" is the
  supported configuration.

- Tokenizer fixture (server/test/fixtures/qwen3.6-tokenizer.gguf, 11MB
  via LFS) — full Qwen3.6 vocab/merges/special tokens stripped from the
  27B GGUF. Real BPE round-tripping, deterministic token concordance.
  Build script at server/test/scripts/strip_gguf_to_tokenizer.py.

- test_stub_integration.py — pytest module that spawns the driver and
  exercises OpenAI + Anthropic, streaming + non-streaming. Asserts on
  reasoning_content routing, content channel cleanliness, no <think>
  leakage, Anthropic first-content-block-is-thinking. 4 tests, 0.43s.

- CI (.github/workflows/ci.yml) — new "Run CPU integration tests" step
  after the venv populate, before the megakernel build. Builds the
  spike target alongside the existing ones. Enables LFS in checkout so
  the tokenizer fixture lands.

Why this matters: the 1656-assertion test_server_unit suite catches
emitter/renderer issues in isolation but cannot fail on a missed
http_server.cpp wire (the original bug). This new step exercises the
exact wire — render → ParsedRequest → SseEmitter → SSE socket — on
every PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The binary is the permanent stub-driven HTTP server test driver, not
throwaway exploration. Rename it to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@easel easel marked this pull request as ready for review May 30, 2026 10:33
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 30, 2026
Include PR Luce-Org#308 after it became non-draft and record the latest containment, conflict probes, retained worktrees, and validation results.
Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 20 files

Re-trigger cubic

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant