Misc. bug: current draft MTP implementation very slow input tokens digestion #147

### Name and Version

I just ran some speed tests with MTP.
branch feature/turboquant-kv-cache, https://github.com/TheTom/llama-cpp-turboquant/commit/2b61ea24ef4fef435866301b7c953434e4fcb866



### Operating systems

_No response_

### Which llama.cpp modules do you know to be affected?

_No response_

### Command line

```shell

```

### Problem description & steps to reproduce

The output speed gain is real (and, most excitingly, far larger in percentage terms at large context sizes), and the model quality tests I've also run show no quality degradation.
But for some reason input token digestion (prefill throughput) drops severely, which inflates the wall time.

Just as a reference: I'm on an RTX 3090.

Input tokens | qwen36_35b_a3b_iq4nl_no_mtp | qwen36_35b_a3b_iq4nl_draftmtp2 | qwen36_35b_a3b_iq4nl_draftmtp3 | qwen36_35b_a3b_iq4nl_draftmtp4
-- | -- | -- | -- | --
1,024 | input 1461.19 tok/s, output 98.36 tok/s, elapsed 2.0s | input 480.90 tok/s, output 115.37 tok/s, elapsed 3.3s | input 471.46 tok/s, output 126.36 tok/s, elapsed 3.2s | input 455.82 tok/s, output 143.21 tok/s, elapsed 3.2s
8,192 | input 3277.07 tok/s, output 87.38 tok/s, elapsed 4.0s | input 1568.59 tok/s, output 110.49 tok/s, elapsed 6.4s | input 1626.07 tok/s, output 109.00 tok/s, elapsed 6.2s | input 1671.36 tok/s, output 142.32 tok/s, elapsed 5.8s
32,768 | input 2867.22 tok/s, output 61.33 tok/s, elapsed 13.6s | input 1704.29 tok/s, output 100.82 tok/s, elapsed 20.5s | input 1805.61 tok/s, output 119.21 tok/s, elapsed 19.3s | input 1805.14 tok/s, output 128.24 tok/s, elapsed 19.2s
65,536 | input 2398.14 tok/s, output 44.47 tok/s, elapsed 30.3s | input 1463.43 tok/s, output 85.90 tok/s, elapsed 46.4s | input 1602.67 tok/s, output 101.13 tok/s, elapsed 42.2s | input 1600.79 tok/s, output 107.77 tok/s, elapsed 42.2s
131,072 | input 1813.04 tok/s, output 28.48 tok/s, elapsed 77.0s | input 1194.38 tok/s, output 70.47 tok/s, elapsed 111.7s | input 1294.03 tok/s, output 77.20 tok/s, elapsed 103.1s | input 1294.96 tok/s, output 86.81 tok/s, elapsed 102.9s
196,608 | input 1451.00 tok/s, output 20.60 tok/s, elapsed 142.0s | input 1079.42 tok/s, output 57.42 tok/s, elapsed 184.6s | input 1077.91 tok/s, output 64.31 tok/s, elapsed 184.6s | input 1087.32 tok/s, output 69.02 tok/s, elapsed 182.9s
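
To make the wall-time effect concrete, here is a small sketch that derives prefill time from the table as tokens divided by input throughput. The `Run` struct and helper names are made up for this illustration; the numbers are copied from the 131,072-token row (no_mtp and draftmtp4 columns).

```cpp
#include <cassert>
#include <cstdio>

// One benchmark cell from the table (hypothetical helper type, not llama.cpp code).
struct Run {
    int    n_input;    // prompt length in tokens
    double input_tps;  // reported input speed, tok/s
    double elapsed_s;  // reported wall time, s
};

// Seconds spent digesting the prompt: tokens / throughput.
double prefill_seconds(const Run & r) {
    assert(r.input_tps > 0.0);
    return r.n_input / r.input_tps;
}

// Fraction of the no-MTP input throughput that an MTP run retains.
double prefill_retained(const Run & base, const Run & mtp) {
    assert(base.input_tps > 0.0);
    return mtp.input_tps / base.input_tps;
}

// Values copied from the 131,072-token row of the table.
const Run kNoMtp131k = {131072, 1813.04, 77.0};
const Run kMtp4_131k = {131072, 1294.96, 102.9};

// Prints derived prefill time next to the reported wall time for both runs.
void print_summary() {
    std::printf("no_mtp:    prefill %.1fs of %.1fs elapsed\n",
                prefill_seconds(kNoMtp131k), kNoMtp131k.elapsed_s);
    std::printf("draftmtp4: prefill %.1fs of %.1fs elapsed\n",
                prefill_seconds(kMtp4_131k), kMtp4_131k.elapsed_s);
    std::printf("input throughput retained: %.0f%%\n",
                100.0 * prefill_retained(kNoMtp131k, kMtp4_131k));
}
```

By this arithmetic, prefill accounts for roughly 72 s of the 77.0 s no-MTP run and roughly 101 s of the 102.9 s draftmtp4 run, so at this context size the extra prefill time outweighs the time saved on output.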

The server currently treats MTP's hidden-state requirement as full embedding/output mode. That causes prompt batches to mark every token as an output row and enables normal embeddings on the target context. As a result, prefill pays for extra output buffer work and unnecessary final-head work that MTP does not consume.

There is also inherent MTP work during prefill: the prompt must be mirrored through the MTP draft context so that draft-time attention/cache state is available. That part cannot be removed. The problem is that the current implementation adds avoidable all-token output/logit/head overhead on top of that inherent sync.

The fix is to split hidden-state extraction from normal output extraction:

1. Keep `llama_set_embeddings_pre_norm()` as the MTP hidden-state tap.
2. Do not make MTP set normal `llama_set_embeddings()` on the target context.
3. During prompt batching, mark only actual output/logit rows in `batch.logits`.
4. Allocate and copy pre-norm hidden states by token row, independent of `n_outputs`.
5. In Qwen35/Qwen35MoE graphs, when pre-norm extraction is enabled, compute the full final hidden state for the prompt, then slice to output rows before final output norm and LM head.
6. In MTP prompt-sync batches with no requested logits, skip the final MTP head and expand only the hidden-state graph needed to update draft-context state.
7. Avoid copying draft-context pre-norm hidden states during prompt sync; the draft context only needs host-visible pre-norm rows during draft generation.
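
Step 5 can be pictured with a plain C++ sketch, independent of ggml: keep the full `[n_tokens x n_embd]` pre-norm hidden state for MTP, but gather only the requested rows before the expensive head. The function names and the cost model are illustrative assumptions, not code from the branch; in a ggml graph the analogous operation would presumably be a row gather such as `ggml_get_rows`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Gather the selected token rows from a row-major [n_tokens x n_embd] matrix.
// The full matrix stays available for the MTP hidden-state tap.
std::vector<float> gather_rows(const std::vector<float> & hidden,
                               size_t n_embd,
                               const std::vector<size_t> & rows) {
    assert(n_embd > 0 && hidden.size() % n_embd == 0);
    std::vector<float> out;
    out.reserve(rows.size() * n_embd);
    for (size_t r : rows) {
        assert((r + 1) * n_embd <= hidden.size());
        const auto first = hidden.begin() + static_cast<std::ptrdiff_t>(r * n_embd);
        out.insert(out.end(), first, first + static_cast<std::ptrdiff_t>(n_embd));
    }
    return out;
}

// Multiply-accumulate count of a dense LM head applied to n_rows hidden rows.
// Simplified cost model: ignores the output norm and any quantization effects.
uint64_t lm_head_macs(uint64_t n_rows, uint64_t n_embd, uint64_t n_vocab) {
    return n_rows * n_embd * n_vocab;
}
```

Under this model the head cost scales linearly with the number of rows passed to it, so slicing a prompt batch down to its output rows before the final norm and LM head removes almost all of that work while leaving the per-token hidden states intact.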

Expected result: MTP prefill should remain slower than no-MTP because it maintains the draft MTP context, but the severe all-token output/head overhead should be removed. Decode speed should remain improved.

This analysis was assisted by GPT 5.5

### First Bad Commit

https://github.com/TheTom/llama-cpp-turboquant/commit/2b61ea24ef4fef435866301b7c953434e4fcb866

### Relevant log output

<details>
<summary>Logs</summary>


```console

```
</details>
