Yeyu/hf eagle medusa #664
Conversation
/ok to test cdea9ed

/ok to test f8a1088
**Codecov Report**

✅ All modified and coverable lines are covered by tests.

    @@           Coverage Diff           @@
    ##             main     #664   +/-   ##
    =========================================
      Coverage   74.78%   74.79%
    =========================================
      Files         192      192
      Lines       18814    18810       -4
    =========================================
    - Hits        14070    14068       -2
    + Misses       4744     4742       -2

View the full report in Codecov by Sentry.
    draft_logits_list = [eagle_logits]
    if self.eagle_config.parallel_draft_step > 1:
        # Get additional draft logits from parallel draft heads
        for draft_head in self.eagle_module.parallel_draft_heads:

I think we can optimize this for-loop and run the heads in parallel. This can perhaps be done in a follow-up PR.
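For reference, a minimal self-contained sketch of one way the loop could be batched, assuming each parallel draft head is an `nn.Linear(hidden_size, vocab_size)` applied to the same hidden states; the shapes and names below are toy stand-ins, not taken from the diff:

```python
import torch
from torch import nn

# Toy setup standing in for the real module (all sizes and names are hypothetical).
batch, seq, hidden_size, vocab_size, num_heads = 2, 8, 16, 32, 3
hidden_states = torch.randn(batch, seq, hidden_size)
eagle_logits = torch.randn(batch, seq, vocab_size)
parallel_draft_heads = nn.ModuleList(nn.Linear(hidden_size, vocab_size) for _ in range(num_heads))

# Loop version (as in the diff): one forward per head.
loop_logits = [head(hidden_states) for head in parallel_draft_heads]

# Batched version: stack the head weights once, then a single einsum covers all heads.
weights = torch.stack([head.weight for head in parallel_draft_heads])  # (H, V, D)
biases = torch.stack([head.bias for head in parallel_draft_heads])     # (H, V)
batched_logits = torch.einsum("bsd,hvd->hbsv", hidden_states, weights) + biases[:, None, None, :]

assert all(torch.allclose(a, b, atol=1e-5) for a, b in zip(loop_logits, batched_logits.unbind(0)))
draft_logits_list = [eagle_logits, *batched_logits.unbind(0)]
```

Whether this actually helps depends on the number and size of the heads, so it would be worth profiling before changing the current loop.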
    (
        torch.zeros(b, ttt_step, dtype=loss_mask.dtype, device=loss_mask.device),
        loss_mask[:, 1 + ttt_step :],

    for i in range(self.eagle_config.parallel_draft_step):

Similar to the above, this for-loop seems parallelizable.
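The per-step shift is not visible in the fragment above, so the toy sketch below assumes step `i` zeroes out `ttt_step + i` leading positions and keeps the rest of the shifted mask; under that assumption, the whole stack of per-step masks can be built with one broadcasted comparison instead of a Python loop:

```python
import torch

# Toy shapes; the real values come from the training batch. n stands in for
# parallel_draft_step, and the per-step shift is an assumption, not from the diff.
b, s, ttt_step, n = 2, 10, 1, 3
loss_mask = torch.ones(b, s)

# Loop version under the assumed per-step shift.
loop_masks = [
    torch.cat(
        (torch.zeros(b, ttt_step + i, dtype=loss_mask.dtype), loss_mask[:, 1 + ttt_step + i :]),
        dim=-1,
    )
    for i in range(n)
]

# Vectorized version: every mask is loss_mask[:, 1:] with a per-step prefix zeroed out.
base = loss_mask[:, 1:]                                        # (b, s-1)
pos = torch.arange(s - 1)
keep = pos[None, :] >= (ttt_step + torch.arange(n))[:, None]   # (n, s-1)
masks = base[:, None, :] * keep.to(base.dtype)[None, :, :]     # (b, n, s-1)

assert all(torch.equal(m, masks[:, i]) for i, m in enumerate(loop_masks))
```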
    eagle_input_hidden_states = base_model_hidden_states

    draft_tokens = []
    for _ in range(steps):

The semantics of this `steps` argument seem ambiguous to me. Shall we rename it to `eagle_steps`? Then it would clearly mean we do `eagle_steps` sequential drafting steps plus `num_medusa_heads` parallel drafting steps.
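To make the suggested semantics concrete, here is a toy sketch of the rename; apart from the `eagle_steps` name, every identifier below is a hypothetical stand-in rather than the module's real interface:

```python
def draft(hidden, eagle_step_fn, parallel_draft_heads, eagle_steps: int):
    """Toy drafting loop: eagle_steps sequential EAGLE steps, then one parallel
    step in which every Medusa-style head emits an extra draft token."""
    draft_tokens = []
    for _ in range(eagle_steps):          # sequential drafting
        hidden, logits = eagle_step_fn(hidden)
        draft_tokens.append(logits.argmax(dim=-1))
    for head in parallel_draft_heads:     # parallel drafting
        draft_tokens.append(head(hidden).argmax(dim=-1))
    return draft_tokens
```

With this naming, one drafting call would yield `eagle_steps + len(parallel_draft_heads)` draft tokens.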
h-guo18 left a comment:

Left some comments and questions; the other changes in HF LGTM. I think it would be great to run the PTQ and AR regression tests before merging. Thanks!
## What does this PR do?

New feature.

**Overview:** This PR implements HF parallel draft by combining EAGLE and Medusa. In training, multiple Medusa heads are added and trained together with EAGLE. In inference, the Medusa heads are used to generate draft tokens after all of the EAGLE tokens.

## Usage

Set `parallel_draft_step > 1` in `eagle_config` to enable parallel draft.

```python
# Add a code snippet demonstrating how to use this
```
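For illustration, a minimal usage sketch assuming Model Optimizer's speculative-decoding conversion API (`modelopt.torch.speculative.convert` with the `"eagle"` mode); the config key names and the checkpoint below are assumptions, not taken from this PR:

```python
from transformers import AutoModelForCausalLM

import modelopt.torch.speculative as mtsp

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

# Hypothetical config layout: the part this PR adds is parallel_draft_step > 1
# inside the eagle config; the surrounding key names are assumptions.
config = {
    "eagle_architecture_config": {
        "parallel_draft_step": 4,  # >1 enables the extra Medusa-style parallel draft heads
    },
}
mtsp.convert(model, [("eagle", config)])
```

During inference, the converted model then drafts additional tokens from the Medusa-style heads after the EAGLE tokens, as described in the overview.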
## Testing

## Before your PR is "*Ready for review*"

- **Make sure you read and follow the [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)** and that your commits are signed.
- **Is this change backward compatible?**: Yes/No
- **Did you write any new necessary tests?**: Yes/No
- **Did you add or update any necessary documentation?**: Yes/No
- **Did you update the [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**: Yes/No

## Additional Information

Signed-off-by: Ye Yu <[email protected]>