[Bugfix] Fix Precision Mismatch in MoE Router of DeepSeek V2/V3 Models and Fused Kernels (BF16 -> FP32) #14027
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Signed-off-by: Daize Dong <[email protected]>
We should run some performance benchmarks to ensure we aren't regressing. cc @tlrmchlsmth
Update deepseek_v2.py and fused_moe.py Signed-off-by: Daize Dong <[email protected]>
Signed-off-by: Daize Dong <[email protected]>
Could you check whether this version can be merged?
Any update on this PR?
Merged the latest branch and resolved conflicts. |
Oh, what I'm asking is whether there will be reviews to check it.
self.gate = ReplicatedLinear(config.hidden_size,
                             config.n_routed_experts,
                             bias=False,
                             params_dtype=torch.float32,
Can we make the linear layer store the weight in bf16 but do the computation in fp32? The weight itself would stay in bf16 and would not need extra memory.
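For a rough sense of scale (not from the PR; the model dimensions below are assumptions based on the public DeepSeek-V2-Lite config and may not be exact), the extra storage from keeping the gate weight in fp32 is small, though the concern grows with larger models:

```python
# Back-of-the-envelope memory cost of storing the router gate weight in fp32
# instead of bf16. Dimensions are assumed DeepSeek-V2-Lite values:
# hidden_size=2048, 64 routed experts, ~26 MoE layers.
hidden_size = 2048
n_routed_experts = 64
num_moe_layers = 26

params_per_gate = hidden_size * n_routed_experts      # 131,072 weights per gate
extra_bytes_per_gate = params_per_gate * (4 - 2)      # fp32 uses 2 extra bytes per weight
total_extra_mib = extra_bytes_per_gate * num_moe_layers / 2**20
print(f"Extra memory for fp32 gate weights: {total_extra_mib:.2f} MiB")  # ~6.5 MiB
```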
I think the best way to do this is to pass a quant_config that converts bf16 weights to fp32 on the fly. Do you think that is a good idea?
And the bias should be initialized in fp32 in the config. I'm running some tests and can post the results later.
Benchmark configurations compared: vllm v0.8.3 vs. vllm v0.8.3 with fp32 router weight and correction bias. [benchmark result tables not captured in this excerpt]
Seems most benchmarks stay relatively stable, but …
Not much idea about the evaluation results. I don't know if lm-eval is the right way to evaluate, as the official IFEval score is 80+.
Description
- Changed `router_logits` in the MoE gate to use `FP32` instead of the default `BF16` to enhance numerical stability (see the sketch below).
- Made the activation functions (`softmax` and `sigmoid`) for `router_logits` operate in `FP32`, improving the precision of exponential calculations.
- Related changes to the fused MoE kernel bindings in `vllm/_custom_ops.py`.
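A minimal sketch of the routing math affected by this change, written in plain PyTorch rather than against vLLM's fused kernels (the function name, shapes, and top-k selection here are illustrative assumptions, not the PR's actual code):

```python
import torch
import torch.nn.functional as F

def route_tokens(hidden_states: torch.Tensor,  # [num_tokens, hidden_size], typically bf16
                 gate_weight: torch.Tensor,    # [n_routed_experts, hidden_size]
                 top_k: int,
                 scoring: str = "softmax"):
    """Toy MoE router: logits and activation computed in fp32."""
    # Router logits in fp32: a bf16 matmul followed by a bf16 exp() can lose
    # precision when expert scores are nearly tied.
    logits = F.linear(hidden_states.float(), gate_weight.float())

    # Activation (softmax or sigmoid) also in fp32, mirroring the fused-kernel change.
    scores = torch.softmax(logits, dim=-1) if scoring == "softmax" else torch.sigmoid(logits)

    # Select the top-k experts per token from the fp32 scores.
    topk_weights, topk_ids = torch.topk(scores, top_k, dim=-1)
    return topk_weights, topk_ids
```

In the PR itself the gate weight is stored in fp32 via `params_dtype=torch.float32` (see the diff above), so no cast of the weight is needed at run time.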
Results
The additional computation cost is negligible as the routing operation is lightweight. However, this adjustment results in a consistent performance improvement, with Winogrande accuracy increasing from 71.27 to 71.43 for the DeepSeek-V2-Lite model.
Before Winogrande Accuracy: 71.27
After Winogrande Accuracy: 71.43
References
- FP32 computation for `router_logits`
- FP32 activation for `softmax` and `sigmoid`, i.e. using `FP32` for activation calculations

Potential Enhancement
Currently, the precision fix for `router_logits` is implemented by explicitly setting the router weights to `FP32`. A potentially better approach would be to retain the router weights in `BF16` and adjust the computation precision dynamically during execution. However, I am not fully familiar with the `quant_config` in `ReplicatedLinear`, so I believe someone with more expertise in this area could refine this further. For now, this implementation effectively ensures stable and correct model behavior.
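To illustrate that direction, here is a minimal sketch that keeps the stored weight in BF16 and upcasts only for the routing matmul. This is an assumption of how it could look: it does not use vLLM's actual `ReplicatedLinear` or `quant_config` machinery, and the class name is made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Bf16StorageFp32ComputeGate(nn.Module):
    """Hypothetical router gate: weight stored in bf16, matmul performed in fp32."""

    def __init__(self, hidden_size: int, n_routed_experts: int):
        super().__init__()
        # Weight stays in bf16, so there is no extra memory versus the original gate.
        # In practice this parameter would be loaded from the checkpoint.
        self.weight = nn.Parameter(
            torch.zeros(n_routed_experts, hidden_size, dtype=torch.bfloat16))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Upcast activations and weight to fp32 only for this small matmul.
        return F.linear(hidden_states.float(), self.weight.float())

# Usage: bf16 activations in, fp32 router logits out, bf16 weight storage.
gate = Bf16StorageFp32ComputeGate(hidden_size=2048, n_routed_experts=64)
logits = gate(torch.randn(4, 2048, dtype=torch.bfloat16))
```

The per-token upcast cost is negligible next to the expert MLPs; the open question raised above is how to express this cleanly through `ReplicatedLinear`'s `quant_config` rather than a one-off module.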