spec: avoid all-token outputs during MTP prefill by claude-eric-steiner · Pull Request #149 · TheTom/llama-cpp-turboquant

claude-eric-steiner · 2026-05-20T06:48:37Z

Summary

Declaration: this patch was fully or predominantly AI-generated

Use it or not, as you wish.

This PR addresses issue #147:

split MTP pre-norm hidden-state extraction from normal embedding/logit output mode
stop MTP prompt ingestion from marking every prompt token as a normal output row
keep full Qwen35/Qwen35MoE hidden rows available for MTP, then slice before final output norm / LM head
skip host-visible draft pre-norm copies while prompt sync only needs the draft context state updated
guard unused inp_out_ids graph inputs so MTP/pre-norm graphs that do not consume output IDs do not crash while setting inputs

Rationale

The MTP prompt-sync path needs the target context's pre-final-norm hidden state for prompt tokens, but it does not need normal logits/embeddings for every prompt token.

Before this change, the server-side MTP hidden-state requirement flowed through need_embd() and therefore made prompt batches mark every prompt token as an output row. That enabled normal embedding/output mode, reserved all-token output buffers, and pushed Qwen35/Qwen35MoE graphs through final output norm / LM head work that MTP prompt sync does not consume.

This keeps the inherent MTP draft-context synchronization, but removes the avoidable all-token output/head overhead during prompt ingestion.
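
A minimal sketch of the output-row change described above, assuming a much simplified model of prompt batching. `slot_sketch`, `batch_flags`, `mark_prompt_batch`, and `count_set` are hypothetical names used only for illustration; only `need_embd()` and `need_embd_pre_norm()` are named in this PR, and their real implementations may differ.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-in for a server slot (not the real server_slot).
struct slot_sketch {
    bool wants_embeddings = false; // the request asked for embedding output
    bool uses_mtp         = false; // an MTP draft needs pre-norm hidden states

    // After this PR: reflects only request-level embedding output.
    bool need_embd() const { return wants_embeddings; }
    // After this PR: MTP hidden-state extraction is a separate requirement.
    bool need_embd_pre_norm() const { return uses_mtp; }
};

// Per-token flags for one prompt batch (illustrative layout, an assumption).
struct batch_flags {
    std::vector<bool> output;   // rows that go through final norm / LM head
    std::vector<bool> pre_norm; // rows whose pre-norm hidden state is kept
};

inline batch_flags mark_prompt_batch(const slot_sketch & slot,
                                     std::size_t n_tokens,
                                     bool is_last_chunk) {
    batch_flags f;
    // Normal output rows no longer depend on the MTP requirement.
    f.output.assign(n_tokens, slot.need_embd());
    // Pre-norm rows are tracked independently of normal output rows.
    f.pre_norm.assign(n_tokens, slot.need_embd_pre_norm());
    // The last prompt token still needs logits so the first token can be sampled.
    if (is_last_chunk && n_tokens > 0) {
        f.output[n_tokens - 1] = true;
    }
    return f;
}

// Counts how many rows are flagged.
inline std::size_t count_set(const std::vector<bool> & flags) {
    std::size_t n = 0;
    for (bool b : flags) {
        if (b) {
            ++n;
        }
    }
    return n;
}
```

In this model, an MTP slot ingesting an 8-token final prompt chunk marks one normal output row and eight pre-norm rows, instead of eight of each.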

Implementation Notes

server_slot::need_embd() now reflects only request-level embedding output.
MTP hidden-state extraction is exposed separately through need_embd_pre_norm().
llama_context tracks pre-norm output rows independently from normal logits/embeddings output rows.
Qwen35 and Qwen35MoE keep full pre-norm hidden rows available for MTP, then slice only the actual output rows before final output norm / LM head.
MTP prompt sync disables draft-context pre-norm host copies; draft generation re-enables them.
inp_out_ids is only populated when the graph actually allocated/uses the tensor.
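
The row slicing and the `inp_out_ids` guard from the notes above could be sketched as follows. This is an illustration under stated assumptions, not the PR's code: hidden states are modelled as plain vectors rather than ggml tensors, and `graph_outputs_sketch`, `split_rows`, and `set_out_ids_input` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using row = std::vector<float>;

// Hypothetical result of one graph evaluation (an assumption for illustration).
struct graph_outputs_sketch {
    std::vector<row> pre_norm_rows; // every hidden row, kept for MTP
    std::vector<row> head_rows;     // only the rows fed to final norm / LM head
};

// Keep the full pre-norm rows, then slice the actual output rows
// before any final output norm / LM head work would happen.
inline graph_outputs_sketch split_rows(const std::vector<row> & hidden,
                                       const std::vector<int32_t> & out_ids) {
    graph_outputs_sketch r;
    r.pre_norm_rows = hidden;
    r.head_rows.reserve(out_ids.size());
    for (int32_t id : out_ids) {
        // at() throws std::out_of_range on a bad row index.
        r.head_rows.push_back(hidden.at(static_cast<std::size_t>(id)));
    }
    return r;
}

// Guard: fill the out-ids input only when the graph actually allocated it.
// A null pointer models a graph that never consumes output IDs.
inline bool set_out_ids_input(std::vector<int32_t> * inp_out_ids,
                              const std::vector<int32_t> & out_ids) {
    if (inp_out_ids == nullptr) {
        return false; // nothing to populate, and nothing to crash on
    }
    *inp_out_ids = out_ids;
    return true;
}
```

With four hidden rows and a single output ID, `split_rows` keeps all four rows for MTP and passes one row on to the head, which is the saving the PR targets during prompt ingestion.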

Validation

Built CUDA server image from the local fix candidate:

agent-of-agent/llama-cpp-turboquant:server-cuda-sm86-mtp-prefillfix-outids3-20260519

Build smoke:

docker run --rm --entrypoint /app/llama-server agent-of-agent/llama-cpp-turboquant:server-cuda-sm86-mtp-prefillfix-outids3-20260519 --version
version: 9411 (35db4bd07)
built with GNU 14.2.0 for Linux x86_64

Note: the tested Docker image reported 35db4bd07-dirty during build/version because the second out_ids guard was still an uncommitted local patch at image build time. The PR branch now contains that patch as the second clean commit.

Limited draft-MTP2 smoke on Qwen3.6-35B-A3B UD-IQ4_NL:

1024 tokens: passed, prompt 1275.51 tok/s, output 158.68 tok/s
8192 tokens: passed, prompt 3795.18 tok/s, output 150.60 tok/s

Full speed matrix with no-MTP, draft-MTP2, draft-MTP3, draft-MTP4 all passed at 1k, 8k, 32k, 64k, 128k, and 192k target prompt sizes.

Representative full-matrix results:

1,024 input tokens:
  no-MTP      input 1288.02 tok/s, output 118.10 tok/s, elapsed 1.9s
  draft-MTP2  input 1190.01 tok/s, output 147.63 tok/s, elapsed 1.7s
  draft-MTP3  input 1145.07 tok/s, output 152.45 tok/s, elapsed 1.7s
  draft-MTP4  input 1277.92 tok/s, output 149.66 tok/s, elapsed 1.7s

196,608 input tokens:
  no-MTP      input 1716.97 tok/s, output 25.70 tok/s, elapsed 119.8s
  draft-MTP2  input 1478.25 tok/s, output 64.89 tok/s, elapsed 135.2s
  draft-MTP3  input 1479.37 tok/s, output 69.51 tok/s, elapsed 135.0s
  draft-MTP4  input 1410.82 tok/s, output 72.61 tok/s, elapsed 141.4s
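
To read the long-context rows as ratios, a small sketch like the one below can help. The figures are copied from the 196,608-token rows above; `run_result`, `ratio`, and the constant names are hypothetical and exist only for this illustration.

```cpp
// One row of the speed matrix (hypothetical helper type).
struct run_result {
    double input_tps;  // prompt throughput, tok/s
    double output_tps; // generation throughput, tok/s
    double elapsed_s;  // wall-clock seconds
};

// Ratio of a measurement to its baseline; above 1.0 means faster than baseline.
inline double ratio(double value, double baseline) {
    return value / baseline;
}

// Figures from the 196,608-token rows of the matrix above.
const run_result no_mtp_192k = {1716.97, 25.70, 119.8};
const run_result mtp4_192k   = {1410.82, 72.61, 141.4};

// Output speedup of draft-MTP4 over no-MTP at 196,608 input tokens.
inline double mtp4_output_speedup() {
    return ratio(mtp4_192k.output_tps, no_mtp_192k.output_tps);
}

// Prompt-throughput ratio of draft-MTP4 relative to no-MTP.
inline double mtp4_input_ratio() {
    return ratio(mtp4_192k.input_tps, no_mtp_192k.input_tps);
}
```

With these figures, draft-MTP4 generates output about 2.8x faster than no-MTP at 196,608 input tokens, while its prompt throughput is about 0.82x the no-MTP baseline.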

Code-output matrix with no-MTP, draft-MTP2, and draft-MTP3 also completed successfully with --no-cache-prompt, target prompt sizes up to 192k (196,608 tokens), and max_tokens up to 16k.

git diff --check origin/feature/turboquant-kv-cache...HEAD passes.

Split MTP pre-norm hidden-state extraction from normal embedding/logit output mode so prompt batches no longer mark every token as an output row. Keep Qwen35/Qwen35MoE hidden rows available for MTP while slicing before the final output head, and skip draft pre-norm host copies during prompt sync.

claude-eric-steiner added 2 commits May 20, 2026 08:29

fix: guard unused MTP out_ids inputs

e7a7b93

github-actions bot added the examples, server, and model labels on May 20, 2026

claude-eric-steiner mentioned this pull request May 20, 2026

Misc. bug: current draft MTP implementation very slow input tokens digestion #147

Open

claude-eric-steiner wants to merge 2 commits into TheTom:feature/turboquant-kv-cache from claude-eric-steiner:codex/mtp-prefill-hidden-state-fix
