Yeyu/hf eagle medusa #664
Conversation
/ok to test cdea9ed

/ok to test f8a1088
**Codecov Report**

✅ All modified and coverable lines are covered by tests.

    @@           Coverage Diff           @@
    ##             main     #664   +/-   ##
    =========================================
      Coverage   74.78%   74.79%
    =========================================
      Files         192      192
      Lines       18814    18810       -4
    =========================================
    - Hits        14070    14068       -2
    + Misses       4744     4742       -2

View the full report in Codecov by Sentry.
    draft_logits_list = [eagle_logits]
    if self.eagle_config.parallel_draft_step > 1:
        # Get additional draft logits from parallel draft heads
        for draft_head in self.eagle_module.parallel_draft_heads:

I think we can optimize this for-loop and run the heads in parallel. This can perhaps be done in a follow-up PR.
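For reference, a minimal self-contained sketch of one way the loop could be batched, assuming each parallel draft head is an `nn.Linear(hidden_size, vocab_size)` applied to the same hidden states; the shapes and names below are toy stand-ins, not taken from the diff:

```python
import torch
from torch import nn

# Toy setup standing in for the real module (all sizes and names are hypothetical).
batch, seq, hidden_size, vocab_size, num_heads = 2, 8, 16, 32, 3
hidden_states = torch.randn(batch, seq, hidden_size)
eagle_logits = torch.randn(batch, seq, vocab_size)
parallel_draft_heads = nn.ModuleList(nn.Linear(hidden_size, vocab_size) for _ in range(num_heads))

# Loop version (as in the diff): one forward per head.
loop_logits = [head(hidden_states) for head in parallel_draft_heads]

# Batched version: stack the head weights once, then a single einsum covers all heads.
weights = torch.stack([head.weight for head in parallel_draft_heads])  # (H, V, D)
biases = torch.stack([head.bias for head in parallel_draft_heads])     # (H, V)
batched_logits = torch.einsum("bsd,hvd->hbsv", hidden_states, weights) + biases[:, None, None, :]

assert all(torch.allclose(a, b, atol=1e-5) for a, b in zip(loop_logits, batched_logits.unbind(0)))
draft_logits_list = [eagle_logits, *batched_logits.unbind(0)]
```

Whether this actually helps depends on the number and size of the heads, so it would be worth profiling before changing the current loop.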
    (
        torch.zeros(b, ttt_step, dtype=loss_mask.dtype, device=loss_mask.device),
        loss_mask[:, 1 + ttt_step :],

    for i in range(self.eagle_config.parallel_draft_step):

Similar to the above, this for-loop seems parallelizable.
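The per-step shift is not visible in the fragment above, so the toy sketch below assumes step `i` zeroes out `ttt_step + i` leading positions and keeps the rest of the shifted mask; under that assumption, the whole stack of per-step masks can be built with one broadcasted comparison instead of a Python loop:

```python
import torch

# Toy shapes; the real values come from the training batch. n stands in for
# parallel_draft_step, and the per-step shift is an assumption, not from the diff.
b, s, ttt_step, n = 2, 10, 1, 3
loss_mask = torch.ones(b, s)

# Loop version under the assumed per-step shift.
loop_masks = [
    torch.cat(
        (torch.zeros(b, ttt_step + i, dtype=loss_mask.dtype), loss_mask[:, 1 + ttt_step + i :]),
        dim=-1,
    )
    for i in range(n)
]

# Vectorized version: every mask is loss_mask[:, 1:] with a per-step prefix zeroed out.
base = loss_mask[:, 1:]                                        # (b, s-1)
pos = torch.arange(s - 1)
keep = pos[None, :] >= (ttt_step + torch.arange(n))[:, None]   # (n, s-1)
masks = base[:, None, :] * keep.to(base.dtype)[None, :, :]     # (b, n, s-1)

assert all(torch.equal(m, masks[:, i]) for i, m in enumerate(loop_masks))
```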
    eagle_input_hidden_states = base_model_hidden_states

    draft_tokens = []
    for _ in range(steps):

The semantics of this `steps` argument seem ambiguous to me. Shall we rename it to `eagle_steps`? Then it would clearly mean we do `eagle_steps` sequential drafting steps plus `num_medusa_heads` parallel drafting steps.
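To make the suggested semantics concrete, here is a toy sketch of the rename; apart from the `eagle_steps` name, every identifier below is a hypothetical stand-in rather than the module's real interface:

```python
def draft(hidden, eagle_step_fn, parallel_draft_heads, eagle_steps: int):
    """Toy drafting loop: eagle_steps sequential EAGLE steps, then one parallel
    step in which every Medusa-style head emits an extra draft token."""
    draft_tokens = []
    for _ in range(eagle_steps):          # sequential drafting
        hidden, logits = eagle_step_fn(hidden)
        draft_tokens.append(logits.argmax(dim=-1))
    for head in parallel_draft_heads:     # parallel drafting
        draft_tokens.append(head(hidden).argmax(dim=-1))
    return draft_tokens
```

With this naming, one drafting call would yield `eagle_steps + len(parallel_draft_heads)` draft tokens.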
h-guo18 left a comment:

Left some comments and questions; the other changes in HF LGTM. I think it would be great to run the PTQ and AR regression tests before merging. Thanks!
## What does this PR do?

New feature.

**Overview:** This PR implements HF parallel draft by combining EAGLE and Medusa. In training, multiple Medusa heads are added and trained together with EAGLE. In inference, the Medusa heads are used to generate draft tokens after all of the EAGLE tokens.

## Usage

Set `parallel_draft_step > 1` in `eagle_config` to enable parallel draft.

```python
# Add a code snippet demonstrating how to use this
```
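For illustration, a minimal usage sketch assuming Model Optimizer's speculative-decoding conversion API (`modelopt.torch.speculative.convert` with the `"eagle"` mode); the config key names and the checkpoint below are assumptions, not taken from this PR:

```python
from transformers import AutoModelForCausalLM

import modelopt.torch.speculative as mtsp

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

# Hypothetical config layout: the part this PR adds is parallel_draft_step > 1
# inside the eagle config; the surrounding key names are assumptions.
config = {
    "eagle_architecture_config": {
        "parallel_draft_step": 4,  # >1 enables the extra Medusa-style parallel draft heads
    },
}
mtsp.convert(model, [("eagle", config)])
```

During inference, the converted model then drafts additional tokens from the Medusa-style heads after the EAGLE tokens, as described in the overview.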
## Testing

## Before your PR is "*Ready for review*"

- **Make sure you read and follow the [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)** and that your commits are signed.
- **Is this change backward compatible?**: Yes/No
- **Did you write any new necessary tests?**: Yes/No
- **Did you add or update any necessary documentation?**: Yes/No
- **Did you update the [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**: Yes/No

## Additional Information

Signed-off-by: Ye Yu <[email protected]>