[Bugfix] Fix Precision Mismatch in MoE Router of DeepSeek V2/V3 Models and Fused Kernels (BF16 -> FP32) #14027
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Signed-off-by: Daize Dong <[email protected]>
We should run some performance benchmarks to ensure we aren't regressing. cc @tlrmchlsmth
Update deepseek_v2.py and fused_moe.py Signed-off-by: Daize Dong <[email protected]>
Signed-off-by: Daize Dong <[email protected]>
Could you check whether this version can be merged?
Any update on this PR?
Merged the latest branch and resolved conflicts. |
Oh, what I'm asking is whether there will be reviews to check it.
self.gate = ReplicatedLinear(config.hidden_size,
                             config.n_routed_experts,
                             bias=False,
                             params_dtype=torch.float32,
Can we make the linear layer store the weight in bf16 but do the computation in fp32? The weight itself would stay in bf16 and would not need extra memory.
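For a rough sense of scale (not from the PR; the model dimensions below are assumptions based on the public DeepSeek-V2-Lite config and may not be exact), the extra storage from keeping the gate weight in fp32 is small, though the concern grows with larger models:

```python
# Back-of-the-envelope memory cost of storing the router gate weight in fp32
# instead of bf16. Dimensions are assumed DeepSeek-V2-Lite values:
# hidden_size=2048, 64 routed experts, ~26 MoE layers.
hidden_size = 2048
n_routed_experts = 64
num_moe_layers = 26

params_per_gate = hidden_size * n_routed_experts      # 131,072 weights per gate
extra_bytes_per_gate = params_per_gate * (4 - 2)      # fp32 uses 2 extra bytes per weight
total_extra_mib = extra_bytes_per_gate * num_moe_layers / 2**20
print(f"Extra memory for fp32 gate weights: {total_extra_mib:.2f} MiB")  # ~6.5 MiB
```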
I think the best way to do this is to pass a quant_config that converts bf16 weights to fp32 on the fly. Do you think that is a good idea?
And the bias should be initialized in fp32 in the config. I'm running some tests and can post the results later.
Benchmark configurations compared: vllm v0.8.3 vs. vllm v0.8.3 with fp32 router weight and correction bias. [benchmark result tables not captured in this excerpt]
Seems most benchmarks stay relatively stable, but …
Not much idea about the evaluation results. I don't know if lm-eval is the right way to evaluate, as the official IFEval score is 80+.
Description
- Changed `router_logits` in the MoE gate to use `FP32` instead of the default `BF16` to enhance numerical stability (see the sketch below).
- Made the activation functions (`softmax` and `sigmoid`) for `router_logits` operate in `FP32`, improving the precision of exponential calculations.
- Related changes to the fused MoE kernel bindings in `vllm/_custom_ops.py`.
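A minimal sketch of the routing math affected by this change, written in plain PyTorch rather than against vLLM's fused kernels (the function name, shapes, and top-k selection here are illustrative assumptions, not the PR's actual code):

```python
import torch
import torch.nn.functional as F

def route_tokens(hidden_states: torch.Tensor,  # [num_tokens, hidden_size], typically bf16
                 gate_weight: torch.Tensor,    # [n_routed_experts, hidden_size]
                 top_k: int,
                 scoring: str = "softmax"):
    """Toy MoE router: logits and activation computed in fp32."""
    # Router logits in fp32: a bf16 matmul followed by a bf16 exp() can lose
    # precision when expert scores are nearly tied.
    logits = F.linear(hidden_states.float(), gate_weight.float())

    # Activation (softmax or sigmoid) also in fp32, mirroring the fused-kernel change.
    scores = torch.softmax(logits, dim=-1) if scoring == "softmax" else torch.sigmoid(logits)

    # Select the top-k experts per token from the fp32 scores.
    topk_weights, topk_ids = torch.topk(scores, top_k, dim=-1)
    return topk_weights, topk_ids
```

In the PR itself the gate weight is stored in fp32 via `params_dtype=torch.float32` (see the diff above), so no cast of the weight is needed at run time.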
Results
The additional computation cost is negligible as the routing operation is lightweight. However, this adjustment results in a consistent performance improvement, with Winogrande accuracy increasing from 71.27 to 71.43 for the DeepSeek-V2-Lite model.
Before Winogrande Accuracy: 71.27
After Winogrande Accuracy: 71.43
References
- FP32 computation for `router_logits`
- FP32 activation for `softmax` and `sigmoid`, i.e. using `FP32` for activation calculations

Potential Enhancement
Currently, the precision fix for `router_logits` is implemented by explicitly setting the router weights to `FP32`. A potentially better approach would be to retain the router weights in `BF16` and adjust the computation precision dynamically during execution. However, I am not fully familiar with the `quant_config` in `ReplicatedLinear`, so I believe someone with more expertise in this area could refine this further. For now, this implementation effectively ensures stable and correct model behavior.
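To illustrate that direction, here is a minimal sketch that keeps the stored weight in BF16 and upcasts only for the routing matmul. This is an assumption of how it could look: it does not use vLLM's actual `ReplicatedLinear` or `quant_config` machinery, and the class name is made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Bf16StorageFp32ComputeGate(nn.Module):
    """Hypothetical router gate: weight stored in bf16, matmul performed in fp32."""

    def __init__(self, hidden_size: int, n_routed_experts: int):
        super().__init__()
        # Weight stays in bf16, so there is no extra memory versus the original gate.
        # In practice this parameter would be loaded from the checkpoint.
        self.weight = nn.Parameter(
            torch.zeros(n_routed_experts, hidden_size, dtype=torch.bfloat16))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Upcast activations and weight to fp32 only for this small matmul.
        return F.linear(hidden_states.float(), self.weight.float())

# Usage: bf16 activations in, fp32 router logits out, bf16 weight storage.
gate = Bf16StorageFp32ComputeGate(hidden_size=2048, n_routed_experts=64)
logits = gate(torch.randn(4, 2048, dtype=torch.bfloat16))
```

The per-token upcast cost is negligible next to the expert MLPs; the open question raised above is how to express this cleanly through `ReplicatedLinear`'s `quant_config` rather than a one-off module.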