
[Feature] Auto scaling factor tuning for FP8 collective communication #140

Open · wants to merge 15 commits into main
Conversation

@wkcn (Contributor) commented on Dec 7, 2023

Description
This PR adds support for auto scaling factor tuning (#41).
Related example: Azure/MS-AMP-Examples#21

Performance (model: GPT-345M, https://github.com/Azure/MS-AMP-Examples/blob/main/gpt3/pretrain_345m_megatron.sh):

  • MS-AMP without auto scaling:
    validation loss at iteration 5000 | lm loss value: 3.531525E+00 | lm loss PPL: 3.417605E+01 |
    samples per second: 519.524 | TFLOPs: 155.99 |

  • MS-AMP with auto scaling (add the --wgrad-auto-scaling argument):
    validation loss at iteration 5000 | lm loss value: 3.529646E+00 | lm loss PPL: 3.411188E+01 |
    samples per second: 516.702 | TFLOPs: 155.14 |

Major Revisions

  • Add a new variable pre_scale to ScalingMeta
  • Add pre_scale support to Arithmetic.add_to_fp8
  • Implement auto scaling factor tuning in the Megatron FP8DistributedOptimizer (a sketch of the idea follows this list)
  • Add unit tests
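
To illustrate the idea behind pre-scaling, here is a minimal, self-contained sketch. It is not the code from this PR: it emulates the FP8 e4m3 range with clamping instead of a real FP8 dtype, and the class, function names, and backoff/growth policy (`ScalingMeta`, `quantize_fp8`, `add_to_fp8`, `tune`) are hypothetical stand-ins. The intent is only to show how a per-tensor `pre_scale`, tuned dynamically from observed overflow, can keep the sum of many FP8 gradients inside the representable range during a reduction.

```python
# Illustrative sketch only -- not the MS-AMP implementation.
import torch

FP8_E4M3_MAX = 448.0  # max representable magnitude of the e4m3 format


class ScalingMeta:
    """Holds the quantization scale and a pre-scale applied before casting."""

    def __init__(self, world_size: int):
        self.scale = torch.tensor(1.0)
        # Start by spreading the reduction budget evenly across ranks.
        self.pre_scale = torch.tensor(1.0 / world_size)
        self.amax = torch.tensor(0.0)

    def tune(self, overflow: bool, backoff: float = 0.5, growth: float = 1.001):
        """Dynamic tuning in the spirit of loss scaling:
        back off sharply on overflow, otherwise grow slowly."""
        if overflow:
            self.pre_scale *= backoff
        else:
            self.pre_scale *= growth


def quantize_fp8(x: torch.Tensor, meta: ScalingMeta) -> torch.Tensor:
    """Apply pre_scale, record amax, then clamp to the emulated FP8 range."""
    scaled = x * meta.pre_scale
    meta.amax = scaled.abs().max()
    return scaled.clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX)


def add_to_fp8(acc: torch.Tensor, update: torch.Tensor, meta: ScalingMeta) -> torch.Tensor:
    """Accumulate a pre-scaled update into an FP8-range accumulator,
    flagging overflow so pre_scale is retuned for the next step."""
    out = acc + quantize_fp8(update, meta)
    overflow = bool(out.abs().max() >= FP8_E4M3_MAX)
    meta.tune(overflow)
    return out.clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX)
```

A reduction over world_size ranks would call add_to_fp8 once per peer gradient and carry meta.pre_scale into the next iteration, so the scaling factor converges to the largest value that avoids overflow.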

@wkcn marked this pull request as a draft on December 7, 2023
@wkcn marked this pull request as ready for review on December 11, 2023
@wkcn requested reviews from tocean and guoshzhao on December 12, 2023
@wkcn enabled auto-merge (squash) on December 12, 2023
@wkcn changed the title from "Auto scaling factor tuning for FP8 collective communication" to "[Feature] Auto scaling factor tuning for FP8 collective communication" on December 14, 2023
@wkcn mentioned this pull request on December 14, 2023