spec: avoid all-token outputs during MTP prefill #149

Open
claude-eric-steiner wants to merge 2 commits into TheTom:feature/turboquant-kv-cache from claude-eric-steiner:codex/mtp-prefill-hidden-state-fix

claude-eric-steiner commented May 20, 2026

Summary

Declaration: this patch was fully or predominantly AI-generated.

Use it or not, as you wish.

This PR addresses issue #147.

  • split MTP pre-norm hidden-state extraction from normal embedding/logit output mode
  • stop MTP prompt ingestion from marking every prompt token as a normal output row
  • keep full Qwen35/Qwen35MoE hidden rows available for MTP, then slice before final output norm / LM head
  • skip host-visible draft pre-norm copies during prompt sync, which only needs the draft context state updated
  • guard unused inp_out_ids graph inputs so MTP/pre-norm graphs that do not consume output IDs do not crash while setting inputs

Rationale

The MTP prompt-sync path needs the target context's pre-final-norm hidden state for prompt tokens, but it does not need normal logits/embeddings for every prompt token.

Before this change, the server-side MTP hidden-state requirement flowed through need_embd() and therefore made prompt batches mark every prompt token as an output row. That enabled normal embedding/output mode, reserved all-token output buffers, and pushed Qwen35/Qwen35MoE graphs through final output norm / LM head work that MTP prompt sync does not consume.

This keeps the inherent MTP draft-context synchronization, but removes the avoidable all-token output/head overhead during prompt ingestion.
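
To make the difference concrete, here is a minimal C++ sketch of how a prompt batch's output flags might be chosen before and after this change. The names `slot_flags`, `mark_prompt_outputs`, and `count_outputs` are hypothetical; they only mirror the idea of `need_embd()` versus `need_embd_pre_norm()` and are not the server's actual code. The sketch also assumes that, without embedding output, only the last prompt token needs a normal output row.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-slot flags, loosely modelled on the two queries in this PR.
struct slot_flags {
    bool need_embd;          // the request asked for embeddings as output
    bool need_embd_pre_norm; // MTP needs the pre-final-norm hidden state
};

// Returns one flag per prompt token: 1 = normal output row (logits/embeddings).
// Assumption: the last token always gets an output row so it can be sampled.
std::vector<int8_t> mark_prompt_outputs(std::size_t n_tokens,
                                        const slot_flags & f,
                                        bool old_behavior) {
    assert(n_tokens > 0);
    // Old behavior, as described above: the MTP requirement leaked into need_embd.
    const bool all_rows = f.need_embd || (old_behavior && f.need_embd_pre_norm);
    std::vector<int8_t> out(n_tokens, all_rows ? 1 : 0);
    out.back() = 1;
    return out;
}

// Counts how many tokens are marked as normal output rows.
std::size_t count_outputs(const std::vector<int8_t> & flags) {
    std::size_t n = 0;
    for (int8_t v : flags) {
        n += (v != 0) ? 1 : 0;
    }
    return n;
}
```

Under these assumptions, an MTP-only slot with an 8-token prompt batch goes from 8 output rows to 1, while a request that genuinely asks for embeddings still marks every row.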

Implementation Notes

  • server_slot::need_embd() now reflects only request-level embedding output.
  • MTP hidden-state extraction is exposed separately through need_embd_pre_norm().
  • llama_context tracks pre-norm output rows independently from normal logits/embeddings output rows.
  • Qwen35 and Qwen35MoE keep full pre-norm hidden rows available for MTP, then slice only the actual output rows before final output norm / LM head.
  • MTP prompt sync disables draft-context pre-norm host copies; draft generation re-enables them.
  • inp_out_ids is only populated when the graph has actually allocated and uses the tensor.
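
The slicing and guard notes above can be sketched in plain C++. This is an illustration under stated assumptions, not the patch itself: `hidden_rows`, `slice_output_rows`, and `set_out_ids` are hypothetical names, the hidden state is modelled as a row-major float matrix, and a null pointer stands in for a graph that never allocated its output-ID input.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Row-major hidden state, n_rows x n_embd. Hypothetical stand-in for a graph tensor.
struct hidden_rows {
    std::size_t n_embd;
    std::vector<float> data;
    std::size_t n_rows() const { return n_embd == 0 ? 0 : data.size() / n_embd; }
};

// Gathers only the rows listed in out_ids, to feed the final norm / LM head.
// The full pre_norm matrix is left untouched, so an MTP consumer can still
// read every prompt row.
hidden_rows slice_output_rows(const hidden_rows & pre_norm,
                              const std::vector<int32_t> & out_ids) {
    hidden_rows out{pre_norm.n_embd, {}};
    out.data.reserve(out_ids.size() * pre_norm.n_embd);
    for (int32_t id : out_ids) {
        assert(id >= 0 && static_cast<std::size_t>(id) < pre_norm.n_rows());
        const float * row =
            pre_norm.data.data() + static_cast<std::size_t>(id) * pre_norm.n_embd;
        out.data.insert(out.data.end(), row, row + pre_norm.n_embd);
    }
    return out;
}

// Guarded input setter: returns false and does nothing when the destination
// was never allocated, instead of dereferencing it.
bool set_out_ids(std::vector<int32_t> * dst, const std::vector<int32_t> & ids) {
    if (dst == nullptr) {
        return false; // this graph does not consume output IDs
    }
    *dst = ids;
    return true;
}
```

The design point is ordering: the expensive head work runs on the sliced rows only, while the unsliced pre-norm rows remain available to whatever consumes them for MTP.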

Validation

Built CUDA server image from the local fix candidate:

agent-of-agent/llama-cpp-turboquant:server-cuda-sm86-mtp-prefillfix-outids3-20260519

Build smoke test:

docker run --rm --entrypoint /app/llama-server agent-of-agent/llama-cpp-turboquant:server-cuda-sm86-mtp-prefillfix-outids3-20260519 --version
version: 9411 (35db4bd07)
built with GNU 14.2.0 for Linux x86_64

Note: the tested Docker image reported 35db4bd07-dirty during build/version because the second out_ids guard was still an uncommitted local patch at image build time. The PR branch now contains that patch as the second clean commit.

Limited draft-MTP2 smoke test on Qwen3.6-35B-A3B UD-IQ4_NL:

1024 tokens: passed, prompt 1275.51 tok/s, output 158.68 tok/s
8192 tokens: passed, prompt 3795.18 tok/s, output 150.60 tok/s

Full speed matrix with no-MTP, draft-MTP2, draft-MTP3, draft-MTP4 all passed at 1k, 8k, 32k, 64k, 128k, and 192k target prompt sizes.

Representative full-matrix results:

1,024 input tokens:
  no-MTP      input 1288.02 tok/s, output 118.10 tok/s, elapsed 1.9s
  draft-MTP2  input 1190.01 tok/s, output 147.63 tok/s, elapsed 1.7s
  draft-MTP3  input 1145.07 tok/s, output 152.45 tok/s, elapsed 1.7s
  draft-MTP4  input 1277.92 tok/s, output 149.66 tok/s, elapsed 1.7s

196,608 input tokens:
  no-MTP      input 1716.97 tok/s, output 25.70 tok/s, elapsed 119.8s
  draft-MTP2  input 1478.25 tok/s, output 64.89 tok/s, elapsed 135.2s
  draft-MTP3  input 1479.37 tok/s, output 69.51 tok/s, elapsed 135.0s
  draft-MTP4  input 1410.82 tok/s, output 72.61 tok/s, elapsed 141.4s
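
For readers who want the relative numbers, the following small C++ sketch computes output-throughput ratios against the no-MTP baseline from the 196,608-token rows above. The helper names `speedup` and `print_long_context_speedups` are hypothetical, and the figures are copied from this description as single reported runs, not re-measured.

```cpp
#include <cassert>
#include <cstdio>

// Ratio of a draft-MTP throughput to the no-MTP baseline (both in tok/s).
double speedup(double mtp_tok_s, double baseline_tok_s) {
    assert(baseline_tok_s > 0.0);
    return mtp_tok_s / baseline_tok_s;
}

// Prints the output-throughput ratios for the 196,608-token rows listed above.
void print_long_context_speedups() {
    const double baseline = 25.70;                 // no-MTP output tok/s
    const double mtp[]    = {64.89, 69.51, 72.61}; // draft-MTP2, 3, 4 output tok/s
    for (int i = 0; i < 3; ++i) {
        std::printf("draft-MTP%d: %.2fx\n", i + 2, speedup(mtp[i], baseline));
    }
}
```

On these figures the output ratio is roughly 2.5x to 2.8x at 196,608 input tokens, compared with about 1.25x to 1.29x at 1,024 input tokens. Input throughput with MTP remains below the no-MTP baseline in the long-context rows.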

Code-output matrix with no-MTP, draft-MTP2, and draft-MTP3 also completed successfully with --no-cache-prompt, target prompt sizes up to 192k (196,608 tokens), and max_tokens up to 16k.

git diff --check origin/feature/turboquant-kv-cache...HEAD passes.

Split MTP pre-norm hidden-state extraction from normal embedding/logit output mode so prompt batches no longer mark every token as an output row. Keep Qwen35/Qwen35MoE hidden rows available for MTP while slicing before the final output head, and skip draft pre-norm host copies during prompt sync.