About MTP speculative decoding

Hi, thanks for releasing MiMo!
I have a simple question about the architecture:
    **If one MTP layer can already generate a multi-token draft sequence, what are the other MTP layers used for?**
In the paper you mention three MTP layers, but conceptually a single MTP block can already produce multiple future tokens.
So I want to confirm:
- Do the extra MTP layers predict different future positions?
- Or do they contribute to the same draft sequence?
- What benefit do layers 2 and 3 provide compared to using just one MTP layer?
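
To make the question concrete, here is a toy sketch of the two interpretations I have in mind. It is not taken from the MiMo code or paper: `make_toy_layer`, `draft_chained`, and `draft_reused` are hypothetical names, and plain integers stand in for hidden states and token ids.

```python
from typing import Callable, List, Tuple

# A toy "MTP layer": maps (hidden state, last token) to (new hidden state, next token).
# Real layers would be transformer blocks; integers are placeholders for tensors.
Layer = Callable[[int, int], Tuple[int, int]]


def make_toy_layer(offset: int) -> Layer:
    """Build a deterministic stand-in layer; `offset` plays the role of its weights."""
    def layer(hidden: int, token: int) -> Tuple[int, int]:
        new_hidden = hidden + token + offset
        next_token = new_hidden % 10
        return new_hidden, next_token
    return layer


def draft_chained(layers: List[Layer], hidden: int, token: int) -> List[int]:
    """Interpretation A: layer k drafts future position k, fed by the previous layer."""
    draft = []
    for layer in layers:
        hidden, token = layer(hidden, token)
        draft.append(token)
    return draft


def draft_reused(layer: Layer, hidden: int, token: int, steps: int) -> List[int]:
    """Interpretation B: one layer is applied repeatedly to draft every position."""
    return draft_chained([layer] * steps, hidden, token)


if __name__ == "__main__":
    layers = [make_toy_layer(k) for k in (1, 2, 3)]
    print("three distinct layers:", draft_chained(layers, hidden=5, token=2))
    print("one layer reused 3x:  ", draft_reused(layers[0], hidden=5, token=2, steps=3))
```

Both functions return a draft of the same length, so my question is which of these (if either) matches how the three MTP layers are actually used at inference time.
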
Thanks!

about mtp speculative decoding #35
