Feature/load balance add expert replacement feature for MoE model(mixtral) #187
Open
uygnef wants to merge 5 commits into alibaba:main
Conversation
fengyu05 seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. Already signed the CLA but the status is still pending? Let us recheck it.
revert change: router weight use default fp32 weight
I have added a new feature to the Megatron LM repository that introduces a load balance interval for expert replacement in Mixture of Experts (MoE) models. This feature allows for the redistribution of experts across GPUs at user-specified intervals, with the aim of achieving a balanced computational load across the GPUs by maintaining a similar number of tokens processed on each card.
Implementation Details
The load balance interval for expert replacement is controlled by a new command-line argument --load-balance-interval. Users can specify the number of steps after which the redistribution of experts should take place. The system then automatically adjusts the placement of experts to ensure an even workload distribution, improving the overall efficiency of the MoE model training.
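The redistribution step described above can be sketched as a greedy assignment over per-expert token counts. This is an illustrative sketch, not the PR's actual implementation: the function name, the assumption that token counts are gathered at each interval boundary, and the fixed experts-per-GPU layout are all hypothetical.

```python
def rebalance_experts(token_counts, num_gpus):
    """Greedily assign experts to GPUs so per-GPU token load is even.

    token_counts: tokens routed to each expert since the last rebalance.
    Returns a list mapping expert index -> GPU index.
    """
    experts_per_gpu = len(token_counts) // num_gpus
    # Consider experts in order of load, heaviest first.
    order = sorted(range(len(token_counts)), key=lambda e: -token_counts[e])
    loads = [0] * num_gpus                     # running token load per GPU
    slots = [experts_per_gpu] * num_gpus       # remaining expert slots per GPU
    placement = [None] * len(token_counts)
    for e in order:
        # Place each expert on the least-loaded GPU that still has a free slot.
        g = min((g for g in range(num_gpus) if slots[g] > 0),
                key=lambda g: loads[g])
        placement[e] = g
        loads[g] += token_counts[e]
        slots[g] -= 1
    return placement
```

For example, with counts [100, 10, 90, 20] across 2 GPUs, the heavy experts 0 and 2 land on different GPUs and the light ones fill in, giving both GPUs a load of 110 tokens. In the real system the chosen placement would also require migrating the expert weights between ranks, which this sketch omits.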
Benefits
Parallel strategy: tp4pp2ep2 (TP=4, PP=2, EP=2), 16 GPUs, trained from scratch without aux loss
How to Use
To enable the load balance interval for expert replacement, users should pass the --load-balance-interval argument.
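A launch line might look like the following. This is a hypothetical sketch: the script name and the surrounding flags follow the usual Megatron-LM MoE setup and are illustrative, not taken from this PR.

```shell
torchrun --nproc_per_node=8 pretrain_gpt.py \
    --num-experts 8 \
    --expert-model-parallel-size 2 \
    --load-balance-interval 100   # redistribute experts every 100 steps
```

Setting the interval too low would add frequent weight-migration overhead; too high and the load stays skewed between rebalances, so the value is a trade-off against step time.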