Misc. bug: current draft MTP implementation very slow input tokens digestion #147

@claude-eric-steiner

Name and Version

I just ran some speed tests with MTP.
branch feature/turboquant-kv-cache, 2b61ea2

Operating systems

No response

Which llama.cpp modules do you know to be affected?

No response

Command line

Problem description & steps to reproduce

The output speed gain is real (and, most excitingly, far larger in percentage terms at large context sizes), and the model quality tests I've also run point to no quality degradation.
But for some reason the input token digestion speed is severely reduced, which inflates the wall time.

Just as a reference: I'm on an RTX 3090.

| Input tokens | qwen36_35b_a3b_iq4nl_no_mtp | qwen36_35b_a3b_iq4nl_draftmtp2 | qwen36_35b_a3b_iq4nl_draftmtp3 | qwen36_35b_a3b_iq4nl_draftmtp4 |
| --- | --- | --- | --- | --- |
| 1,024 | input 1461.19 tok/s, output 98.36 tok/s, elapsed 2.0s | input 480.90 tok/s, output 115.37 tok/s, elapsed 3.3s | input 471.46 tok/s, output 126.36 tok/s, elapsed 3.2s | input 455.82 tok/s, output 143.21 tok/s, elapsed 3.2s |
| 8,192 | input 3277.07 tok/s, output 87.38 tok/s, elapsed 4.0s | input 1568.59 tok/s, output 110.49 tok/s, elapsed 6.4s | input 1626.07 tok/s, output 109.00 tok/s, elapsed 6.2s | input 1671.36 tok/s, output 142.32 tok/s, elapsed 5.8s |
| 32,768 | input 2867.22 tok/s, output 61.33 tok/s, elapsed 13.6s | input 1704.29 tok/s, output 100.82 tok/s, elapsed 20.5s | input 1805.61 tok/s, output 119.21 tok/s, elapsed 19.3s | input 1805.14 tok/s, output 128.24 tok/s, elapsed 19.2s |
| 65,536 | input 2398.14 tok/s, output 44.47 tok/s, elapsed 30.3s | input 1463.43 tok/s, output 85.90 tok/s, elapsed 46.4s | input 1602.67 tok/s, output 101.13 tok/s, elapsed 42.2s | input 1600.79 tok/s, output 107.77 tok/s, elapsed 42.2s |
| 131,072 | input 1813.04 tok/s, output 28.48 tok/s, elapsed 77.0s | input 1194.38 tok/s, output 70.47 tok/s, elapsed 111.7s | input 1294.03 tok/s, output 77.20 tok/s, elapsed 103.1s | input 1294.96 tok/s, output 86.81 tok/s, elapsed 102.9s |
| 196,608 | input 1451.00 tok/s, output 20.60 tok/s, elapsed 142.0s | input 1079.42 tok/s, output 57.42 tok/s, elapsed 184.6s | input 1077.91 tok/s, output 64.31 tok/s, elapsed 184.6s | input 1087.32 tok/s, output 69.02 tok/s, elapsed 182.9s |
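
To make the trade-off easier to read, here is a small sketch that turns the no_mtp and draftmtp4 measurements above into ratios. The numbers are copied from the table; the names `ROWS` and `ratios` are made up for this illustration and are not part of any tool.

```python
# Each row: (context tokens,
#            no_mtp input tok/s, draftmtp4 input tok/s,
#            no_mtp output tok/s, draftmtp4 output tok/s)
# Values are copied from the measurements above.
ROWS = [
    (1024,   1461.19,  455.82, 98.36, 143.21),
    (8192,   3277.07, 1671.36, 87.38, 142.32),
    (32768,  2867.22, 1805.14, 61.33, 128.24),
    (65536,  2398.14, 1600.79, 44.47, 107.77),
    (131072, 1813.04, 1294.96, 28.48,  86.81),
    (196608, 1451.00, 1087.32, 20.60,  69.02),
]


def ratios(rows):
    """Return (context, prefill slowdown, decode speedup) for each row."""
    return [
        (n, in_base / in_mtp, out_mtp / out_base)
        for n, in_base, in_mtp, out_base, out_mtp in rows
    ]


for n, slowdown, speedup in ratios(ROWS):
    print(f"{n:>7} tokens: prefill {slowdown:.2f}x slower, "
          f"decode {speedup:.2f}x faster")
```

In this data the prefill penalty is largest at 1,024 tokens and shrinks as the context grows, while the decode gain grows with context and is largest at 196,608 tokens.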

The server currently treats MTP's hidden-state requirement as full embedding/output mode. That causes prompt batches to mark every token as an output row and enables normal embeddings on the target context. As a result, prefill pays for extra output buffer work and unnecessary final-head work that MTP does not consume.

There is also inherent MTP work during prefill: the prompt must be mirrored through the MTP draft context so that draft-time attention/cache state is available. That part cannot be removed. The problem is that the current implementation adds avoidable all-token output/logit/head overhead on top of that inherent sync.
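
As a rough illustration of the avoidable part, the toy model below counts how many prompt rows get flagged as outputs under the two policies. It is plain Python, not llama.cpp code: `output_flags` and `count_output_rows` are hypothetical helpers, the flag list only stands in for `batch.logits`, and the batch size of 2048 is an assumed value, not one taken from the tests above.

```python
def output_flags(n_tokens, is_last_batch, all_outputs):
    """Per-token output flags for one prompt batch (stand-in for batch.logits)."""
    if all_outputs:
        # Behaviour described above: every prompt token is an output row.
        return [True] * n_tokens
    flags = [False] * n_tokens
    if is_last_batch and n_tokens > 0:
        # Only the final prompt token needs logits to sample the first new token.
        flags[-1] = True
    return flags


def count_output_rows(prompt_len, n_batch, all_outputs):
    """Total rows that would reach the output buffers and final head during prefill."""
    rows = 0
    for start in range(0, prompt_len, n_batch):
        n = min(n_batch, prompt_len - start)
        is_last = start + n == prompt_len
        rows += sum(output_flags(n, is_last, all_outputs))
    return rows


for prompt_len in (1024, 8192, 131072):
    every = count_output_rows(prompt_len, 2048, all_outputs=True)
    last_only = count_output_rows(prompt_len, 2048, all_outputs=False)
    print(f"{prompt_len:>7} prompt tokens: {every} output rows vs {last_only}")
```

The model deliberately ignores the inherent MTP sync cost; it only shows that the number of output rows scales with prompt length under the all-token policy and stays constant otherwise.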

The fix is to split hidden-state extraction from normal output extraction:

1. Keep llama_set_embeddings_pre_norm() as the MTP hidden-state tap.
2. Do not make MTP set normal llama_set_embeddings() on the target context.
3. During prompt batching, mark only actual output/logit rows in batch.logits.
4. Allocate and copy pre-norm hidden states by token row, independent of n_outputs.
5. In Qwen35/Qwen35MoE graphs, when pre-norm extraction is enabled, compute the full final hidden state for the prompt, then slice to output rows before final output norm and LM head.
6. In MTP prompt-sync batches with no requested logits, skip the final MTP head and expand only the hidden-state graph needed to update draft-context state.
7. Avoid copying draft-context pre-norm hidden states during prompt sync; the draft context only needs host-visible pre-norm rows during draft generation.
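
Steps 3 to 5 can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not the actual graph code: `rms_norm`, `lm_head` and `forward_tail` are hypothetical names, the norm has no learned weight, and the matrices are tiny lists. The point is only that the pre-norm tap keeps one row per token while the norm and LM head run on the output rows alone.

```python
def rms_norm(row, eps=1e-6):
    """Toy RMS norm over one hidden-state row (no learned weight)."""
    mean_sq = sum(x * x for x in row) / len(row)
    return [x / (mean_sq + eps) ** 0.5 for x in row]


def lm_head(row, w_out):
    """Project one row to vocabulary logits; w_out has one weight row per vocab entry."""
    return [sum(x * w for x, w in zip(row, w_row)) for w_row in w_out]


def forward_tail(hidden, flags, w_out):
    """Split hidden-state extraction from normal output extraction.

    hidden: final-layer hidden states, one row per prompt token
    flags:  per-token output flags (stand-in for batch.logits)
    """
    # Step 4: the pre-norm tap is stored per token row, independent of n_outputs.
    pre_norm_tap = [list(row) for row in hidden]
    # Step 5: slice to output rows before the final norm and the LM head.
    out_rows = [row for row, flag in zip(hidden, flags) if flag]
    logits = [lm_head(rms_norm(row), w_out) for row in out_rows]
    return pre_norm_tap, logits


hidden = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9], [1.0, 1.1, 1.2]]
flags = [False, False, False, True]                # only the last token is an output
w_out = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0],
         [1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]         # toy vocabulary of 5 entries

tap, logits = forward_tail(hidden, flags, w_out)
print(len(tap), "tap rows,", len(logits), "logit rows")
```

With all flags set to False, as in a prompt-sync batch that requests no logits (step 6), the same function returns the tap and an empty logits list, so no head work is done.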

Expected result: MTP prefill should remain slower than no-MTP because it maintains the draft MTP context, but the severe all-token output/head overhead should be removed. Decode speed should remain improved.

This analysis was assisted by GPT 5.5

First Bad Commit

2b61ea2

Relevant log output

Logs
