spec: avoid all-token outputs during MTP prefill #149
Open
claude-eric-steiner wants to merge 2 commits into
Split MTP pre-norm hidden-state extraction from normal embedding/logit output mode so prompt batches no longer mark every token as an output row. Keep Qwen35/Qwen35MoE hidden rows available for MTP while slicing before the final output head, and skip draft pre-norm host copies during prompt sync.
Summary
Declaration: this patch was fully or predominantly AI-generated
use it or not, as you wish...
This PR addresses this issue: #147
… `inp_out_ids` graph inputs so MTP/pre-norm graphs that do not consume output IDs do not crash while setting inputs
Rationale
The MTP prompt-sync path needs the target context's pre-final-norm hidden state for prompt tokens, but it does not need normal logits/embeddings for every prompt token.
Before this change, the server-side MTP hidden-state requirement flowed through `need_embd()` and therefore made prompt batches mark every prompt token as an output row. That enabled normal embedding/output mode, reserved all-token output buffers, and pushed Qwen35/Qwen35MoE graphs through final output norm / LM head work that MTP prompt sync does not consume.
This keeps the inherent MTP draft-context synchronization, but removes the avoidable all-token output/head overhead during prompt ingestion.
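The change in output-row marking can be pictured with a small model. This is a simplified sketch, not the actual patch: `row_flags`, `count_rows`, `mark_prompt_rows_before`, and `mark_prompt_rows_after` are hypothetical names, the real batch and context structures differ, and the sketch assumes the batch is the final prompt chunk so its last token always needs logits.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-token flags for one prompt batch (illustrative only).
struct row_flags {
    std::vector<bool> output;   // rows that run through final norm + LM head
    std::vector<bool> pre_norm; // rows whose pre-final-norm hidden state is kept
};

// Count how many rows are flagged.
static std::size_t count_rows(const std::vector<bool> & flags) {
    std::size_t n = 0;
    for (bool f : flags) {
        n += f ? 1 : 0;
    }
    return n;
}

// Before (sketch): the MTP requirement is folded into the embedding request,
// so every prompt token becomes a normal output row.
static row_flags mark_prompt_rows_before(std::size_t n_tokens, bool need_embd, bool mtp_enabled) {
    assert(n_tokens > 0);
    const bool all_outputs = need_embd || mtp_enabled;
    row_flags r{std::vector<bool>(n_tokens, all_outputs),
                std::vector<bool>(n_tokens, mtp_enabled)};
    r.output[n_tokens - 1] = true; // assumption: last prompt token needs logits
    return r;
}

// After (sketch): pre-norm rows are tracked on their own, and only a genuine
// embedding request marks every token as an output row.
static row_flags mark_prompt_rows_after(std::size_t n_tokens, bool need_embd, bool mtp_enabled) {
    assert(n_tokens > 0);
    row_flags r{std::vector<bool>(n_tokens, need_embd),
                std::vector<bool>(n_tokens, mtp_enabled)};
    r.output[n_tokens - 1] = true; // assumption: last prompt token needs logits
    return r;
}
```

In this model, an 8-token prompt with MTP enabled and no embedding request has 8 output rows before the change and 1 after, while all 8 pre-norm rows stay available for draft synchronization in both cases.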
Implementation Notes
- `server_slot::need_embd()` now reflects only request-level embedding output.
- … `need_embd_pre_norm()`.
- `llama_context` tracks pre-norm output rows independently from normal logits/embeddings output rows.
- `inp_out_ids` is only populated when the graph actually allocated/uses the tensor.
Validation
Built CUDA server image from the local fix candidate:
Build smoke:
Note: the tested Docker image reported `35db4bd07-dirty` during build/version because the second `out_ids` guard was still an uncommitted local patch at image build time. The PR branch now contains that patch as the second clean commit.
Limited draft-MTP2 smoke on Qwen3.6-35B-A3B UD-IQ4_NL:
Full speed matrix with no-MTP, draft-MTP2, draft-MTP3, draft-MTP4 all passed at 1k, 8k, 32k, 64k, 128k, and 192k target prompt sizes.
Representative full-matrix results:
Code-output matrix with no-MTP, draft-MTP2, and draft-MTP3 also completed successfully with `--no-cache-prompt`, target prompt sizes up to 196k, and `max_tokens` up to 16k.
`git diff --check origin/feature/turboquant-kv-cache...HEAD` passes.
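
For reviewers, the `inp_out_ids` guard mentioned in the implementation notes can be pictured roughly as below. This is a minimal sketch under stated assumptions, not the code in this PR: `out_ids_tensor`, `graph_inputs`, and `set_out_ids` are hypothetical stand-ins, and a null pointer is assumed to mean the graph never allocated the tensor.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the graph input tensor holding output row indices.
struct out_ids_tensor {
    std::vector<int32_t> data;
};

// Hypothetical graph inputs: out_ids stays null when the graph does not use it.
struct graph_inputs {
    out_ids_tensor * out_ids = nullptr;
};

// Populate the output-ID tensor only if the graph allocated it.
// Returns true when the tensor was filled, false when the step was skipped.
static bool set_out_ids(graph_inputs & inp, const std::vector<bool> & is_output) {
    if (inp.out_ids == nullptr) {
        return false; // graph does not consume output IDs: skip instead of crashing
    }
    assert(is_output.size() <= static_cast<std::size_t>(INT32_MAX)); // indices must fit in int32
    inp.out_ids->data.clear();
    for (std::size_t i = 0; i < is_output.size(); ++i) {
        if (is_output[i]) {
            inp.out_ids->data.push_back(static_cast<int32_t>(i));
        }
    }
    return true;
}
```

A graph built only for pre-norm extraction would take the early-return path here, whereas a normal logits graph would receive the indices of its output rows.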