Multi-token prediction #179
Conversation
Looks good, but I have some suggestions to make it behave more like a standard language model with added prediction heads rather than an entirely new thing.
@@ -22,6 +22,10 @@ class LanguageModelLossNames:
    language_model_loss = "language_model_loss"
    z_loss = "z_loss"

    @classmethod
Is this really needed?
Would make more sense as a @staticmethod?
@classmethod is fine (and probably better), I meant having the method at all seems unnecessary...
It's for the same purpose that we have language_model_loss = "language_model_loss" above, no?
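For context, a hypothetical sketch of why a method (rather than another fixed attribute) is needed here: with a variable number of prediction heads, the loss name has to depend on the head index. The method name and format below are assumptions, not the actual code from the PR.

```python
# Hypothetical illustration only; the method added in the PR may differ.
class LanguageModelLossNames:
    language_model_loss = "language_model_loss"
    z_loss = "z_loss"

    @classmethod
    def multi_token_prediction_loss(cls, index: int) -> str:
        # A fixed attribute works for a single loss, but with several prediction
        # heads the name must depend on the head index.
        if index == 0:
            return cls.language_model_loss
        return f"{cls.language_model_loss}_{index}"
```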
losses[self.loss_name].append(language_model_loss)
if self.is_last_head:
    # Last layer should return the loss for backward.
    return language_model_loss
This returns the loss for the most distant predicted token, which doesn't make much sense. How about running the predictions in reversed order so the last one is the next token (index=0)? This would make the return value here more relevant and merge the is_last_head and multi_token_prediction_index > 0 conditions.
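A minimal sketch of the suggested ordering (illustrative only, with made-up names): iterating the heads from the most distant prediction down to distance 0 means the loss returned by the last head processed is always the next-token loss.

```python
# Illustrative only: process heads from the most distant prediction down to
# distance 0, so the loss returned for backward is the standard next-token loss.
def run_heads(shared_hidden_states, heads, labels):
    loss_for_backward = None
    for distance in reversed(range(len(heads))):
        # heads[distance] predicts the token `distance + 1` positions ahead.
        loss_for_backward = heads[distance](shared_hidden_states, labels, distance)
    # distance == 0 was processed last, so "last head processed" and
    # "distance == 0" become the same condition.
    return loss_for_backward
```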
We do want to compute the heads one at a time, but the order in which we do it doesn't matter.
For the conditions: the output weights should be defined in the first layer (per the comment "# The weight should be defined in the first layer in the set."), so I don't think the is_last_head and multi_token_prediction_index > 0 conditions can simply be merged.
I also thought it would be convenient if later we want to support a sequential version of this, where each head uses the output of the previous one as input (this is what DeepSeek-V3 does, for example).
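For reference, a rough sketch of that sequential variant (as in DeepSeek-V3), where each head's transformer layer consumes the previous head's hidden states rather than the shared trunk output. All names here are illustrative, not Fast-LLM classes.

```python
import torch.nn as nn

class SequentialMTPSketch(nn.Module):
    """Illustrative only: each prediction head refines the previous head's hidden states."""

    def __init__(self, trunk: nn.Module, head_layers: nn.ModuleList, lm_heads: nn.ModuleList):
        super().__init__()
        self.trunk = trunk
        self.head_layers = head_layers  # one extra transformer layer per predicted token
        self.lm_heads = lm_heads        # one output projection per predicted token

    def forward(self, embeddings):
        hidden = self.trunk(embeddings)
        logits = []
        for layer, lm_head in zip(self.head_layers, self.lm_heads):
            hidden = layer(hidden)       # head k starts from head k-1's output
            logits.append(lm_head(hidden))
        return logits
```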
I guess you're right. I'll let you decide on the best option here.
return torch.stack((input_, output), dim=0)


class MultiTokenPredictionLanguageModelHead(LanguageModelHead):
Is this really worth a separate class? Seems like we could just add a multi_token_prediction_index/_prediction_distance option to the base class, it would only lead to tiny changes.
Hmm, we could, yes, but there are quite a few changes, especially in _forward_backward, with the shifting/truncating of labels and inputs, and the potential handling of sequence-parallel and cross-entropy-splits (not supported now, but this would need different code). Do you still think we should put this in the base class LanguageModelHead?
Oh, I guess the logic of the base class would just be a special case where multi_token_prediction_index=0. Then indeed I could just add this code to the base class.
I could also remove the MultiTokenPredictionTransformerLayer class by adding a stacked_output parameter to the TransformerLayer class. Wdyt @jlamypoirier @tscholak?
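As a rough illustration of the stacked_output idea (consistent with the torch.stack call quoted above, but not the actual Fast-LLM layer): the layer returns both its untouched input and its output stacked along a new leading dimension, so the following LM head can compute its loss on the head-specific output while passing the shared hidden states along to the next head.

```python
import torch
import torch.nn as nn

class TransformerLayerSketch(nn.Module):
    """Illustrative sketch of a stacked_output option (names and structure assumed)."""

    def __init__(self, block: nn.Module, stacked_output: bool = False):
        super().__init__()
        self.block = block
        self._stacked_output = stacked_output

    def forward(self, input_: torch.Tensor) -> torch.Tensor:
        output = self.block(input_)
        if self._stacked_output:
            # Return both the shared hidden states (for the next prediction head)
            # and this head's output (for the following LM head).
            return torch.stack((input_, output), dim=0)
        return output
```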
I removed both MultiTokenPredictionTransformerLayer and MultiTokenPredictionLanguageModelHead and moved the code to the base classes. Things should be a bit simpler now.
Looks good, thanks @RaymondLi0!
I wonder though if it wouldn't be better to avoid adding the additional transformer layer for the first lm head. If that were the case, then the mtp feature could always be on (with a default of 1 head) and would do the same thing as before. Only with more than 1 head would additional transformer layers be added.
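For concreteness, a rough sketch (assumed names, not the Fast-LLM implementation) of the layer layout as merged, where every prediction head, including the first, is preceded by its own extra transformer layer; dropping that extra layer for the first head is what is being suggested here.

```python
# Sketch of the resulting layer stack for num_layers=2 and prediction_heads=2,
# as I read the get_layers() comprehension discussed later. Names are made up.
def layer_names(num_layers: int, prediction_heads: int) -> list[str]:
    layers = ["embedding"] + [f"trunk_layer_{i}" for i in range(num_layers)]
    for head in range(prediction_heads):
        layers.append(f"head_{head}_transformer_layer")
        layers.append(f"head_{head}_lm_head")
    return layers

print(layer_names(num_layers=2, prediction_heads=2))
# ['embedding', 'trunk_layer_0', 'trunk_layer_1',
#  'head_0_transformer_layer', 'head_0_lm_head',
#  'head_1_transformer_layer', 'head_1_lm_head']
```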
@@ -128,6 +128,7 @@ def _sample(self) -> None:
    # Calculate basic stats.
    documents_per_epoch = document_sizes.numel()
    tokens_per_epoch = document_sizes.sum().item()
    # TODO MTP: Produce more labels to provide labels for the multi-token prediction heads?
I'm not following, could you explain what you mean here?
Surely you're shifting the sequence for the additional heads, are you not?
Yes. I shift the labels and truncate the inputs for the additional heads.
We could produce more labels than inputs to avoid truncating the input. Example with sequence-length=4 and mtp=2:

t1 t2 t3 t4 t5 t6 <- document
i1 i2 i3 i4 -- -- <- inputs
-- l1 l2 l3 l4 -- <- labels

Currently, the labels stop at l4, so the second head only processes [i1, i2, i3]. By adding l5, we can have a label for all the input tokens.
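To make the current behavior concrete, here is a small self-contained sketch (plain PyTorch, not the Fast-LLM code) of the per-head shifting/truncation when the sampler yields sequence_length + 1 tokens.

```python
# Minimal sketch: inputs/labels seen by each prediction head when labels stop
# at sequence_length + 1 tokens, as in the current PR.
import torch

sequence_length, mtp = 4, 2
document = torch.arange(1, sequence_length + 3)  # t1 .. t6
tokens = document[: sequence_length + 1]         # sampler currently yields sequence_length + 1 tokens
inputs, labels = tokens[:-1], tokens[1:]         # i1..i4, l1..l4

for distance in range(mtp):
    # The head at `distance` predicts the token `distance + 1` positions ahead,
    # so its labels are shifted and its inputs truncated accordingly.
    head_inputs = inputs[: sequence_length - distance]
    head_labels = labels[distance:]
    print(distance, head_inputs.tolist(), head_labels.tolist())
# 0 [1, 2, 3, 4] [2, 3, 4, 5]
# 1 [1, 2, 3]    [3, 4, 5]
```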
Should we implement that here? It should be quite easy, we'd just need to replace a bunch of sequence_length+1 with sequence_length+prediction_heads.
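Continuing the sketch above, with sequence_length + prediction_heads tokens sampled, every head would get a label for every input position (illustrative only, not the sampler code).

```python
import torch

sequence_length, mtp = 4, 2
document = torch.arange(1, sequence_length + 3)         # t1 .. t6
tokens = document[: sequence_length + mtp]              # sequence_length + prediction_heads tokens
inputs, labels = tokens[:sequence_length], tokens[1:]   # i1..i4, l1..l5

for distance in range(mtp):
    # No truncation of the inputs: every head sees the full sequence.
    head_labels = labels[distance : distance + sequence_length]
    print(distance, inputs.tolist(), head_labels.tolist())
# 0 [1, 2, 3, 4] [2, 3, 4, 5]
# 1 [1, 2, 3, 4] [3, 4, 5, 6]
```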
Let's time-box this. I do not want us to spend more than a couple of hours on adding 4 tokens to the sequence.
Ok I can have a quick look at this this afternoon, and let's try to merge this PR afterwards
Will merge now, and add this to a new PR, since I'd need to test it a little bit more
layer
for i in range(self._config.prediction_heads)
for layer in [
    TransformerLayer(
Wrong when num_layers=0. (Mostly for debug but we do want to support it.)
The whole MTP thing doesn't make sense for num_layers=0: we would have 0 transformer layers for the first head and 1 for every additional head.
@@ -123,4 +127,6 @@ def forward(
        hidden_states = self._bias_dropout_add(hidden_states, bias, input_)
        if self._debug_mode:
            self._debug_log(None, "MLP residual", kwargs, bias=bias)
        if self._stacked_output:
This doesn't match the meta output, which will break pipeline parallelism (same for the LM head). We need to either update get_meta (really easy) or explicitly prevent pipeline parallelism.
Thanks @RaymondLi0, this looks great.
I see some logic in trying to avoid truncation of the inputs when additional heads are present, but I would time box this at this point. I suggest 2 hours max. I'd like us to quickly move on to running actual experiments that validate the value proposition of MTP before over-investing in implementation details. Thanks.
@@ -151,6 +160,7 @@ def _logits_cross_entropy_forward_backward_split(
        return None, None
    else:
        loss = None
        # TODO MTP: allow a _cross_entropy_splits that is not a divisor of the sequence length
So this ends up being an argument for not truncating the sequence, right? I.e., resolving https://github.com/ServiceNow/Fast-LLM/pull/179/files#r2012178184
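Regarding the TODO itself: a minimal sketch of handling a split count that does not divide the sequence length (torch.tensor_split already allows uneven chunks). This is only an illustration, not how Fast-LLM implements the cross-entropy splits.

```python
import torch

def split_logits(logits: torch.Tensor, cross_entropy_splits: int) -> list[torch.Tensor]:
    # torch.tensor_split allows a number of chunks that does not divide the
    # sequence length: the first chunks are one element longer than the rest.
    return list(torch.tensor_split(logits, cross_entropy_splits, dim=0))

chunks = split_logits(torch.randn(7, 32), cross_entropy_splits=3)
print([chunk.shape[0] for chunk in chunks])  # [3, 2, 2]
```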
Minor suggestion, otherwise LGTM
def get_layers(self) -> list[Layer]:
    if self._config.transformer.num_layers == 0:
        Assert.eq(self._config.prediction_heads, 1)
This could be checked in config validation
For a debug case like num_layers=0 that has no practical applications? I think not.
Thanks for the reviews @jlamypoirier @tscholak! I'll move the additional labels to a new PR to respect the time-boxing.
✨ Description
Add a minimally working version of multi-token prediction:
- Add a num_multi_token_prediction_heads argument, None by default. If set, new layers for multi-token prediction are added, replacing the standard lm_head layer. For each token to predict, we add a MultiTokenPredictionTransformerLayer and a MultiTokenPredictionLanguageModelHead.
- tie_word_embeddings
- num_layers + 1 layers.

In a future PR:
Closes #167
Sanity check

A baseline transformer model reaches the same loss as a model with num_multi_token_prediction_heads=1 and one less layer in the shared trunk. Memory usage is very similar.

(loss and memory-usage screenshots omitted)

Performance

The additional layers seem to negatively impact training throughput.

(throughput screenshot omitted)