Hi, thanks for releasing MiMo!
I have a simple question about the architecture:
If one MTP layer can already generate a multi-token draft sequence, what are the other MTP layers used for?
In the paper you mention 3× MTP layers.
But conceptually, a single MTP block already produces multiple future tokens.
So I want to confirm:
Do the extra MTP layers predict different future positions?
Or do they contribute to the same draft sequence?
What benefit do layers 2 and 3 provide compared to using just one MTP layer?
Thanks!