[Bugfix] Fix Precision Mismatch in MoE Router of DeepSeek V2/V3 Models and Fused Kernels (BF16 -> FP32) #14027
Description

This PR changes the computation of `router_logits` in the MoE gate to use `FP32` instead of the default `BF16` to enhance numerical stability. The fused activation kernels (`softmax` and `sigmoid`) for `router_logits` now operate in `FP32`, improving the precision of the exponential calculations; the corresponding kernel wrappers are in `vllm/_custom_ops.py`.

Results
The additional computation cost is negligible, as the routing operation is lightweight. However, this adjustment yields a consistent accuracy improvement: Winogrande accuracy increases from 71.27 to 71.43 for the DeepSeek-V2-Lite model.

Before Winogrande accuracy: 71.27
After Winogrande accuracy: 71.43
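The change described above can be sketched as follows. This is an illustrative PyTorch snippet, not vLLM's actual routing code; the function and argument names (`route_tokens`, `gate_weight`, `top_k`) are made up for the example. The key point is that the gate matmul and the softmax both run in `FP32` even when the model's hidden states and router weights are `BF16`.

```python
import torch

def route_tokens(hidden: torch.Tensor, gate_weight: torch.Tensor, top_k: int):
    """Pick top-k experts per token, scoring in FP32 for stability.

    `hidden` and `gate_weight` may be BF16; the router logits and the
    softmax over them are computed in FP32 so the exponentials do not
    lose precision (BF16 has only ~8 mantissa bits).
    """
    # Router logits in FP32: upcast both operands before the matmul.
    router_logits = torch.nn.functional.linear(
        hidden.float(), gate_weight.float()
    )
    # Softmax in FP32, then top-k expert selection.
    probs = torch.softmax(router_logits, dim=-1)
    topk_weights, topk_ids = torch.topk(probs, top_k, dim=-1)
    # Renormalize the selected weights so they sum to 1 per token.
    topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
    return topk_weights, topk_ids
```

Because routing touches only a `hidden_size x num_experts` matrix per token, the upcast adds negligible cost relative to the expert FFNs, which is why the accuracy gain comes essentially for free.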
References

- FP32 computation for `router_logits`
- FP32 activation for `softmax` and `sigmoid`

The referenced implementations use `FP32` for the activation calculations.

Potential Enhancement
Currently, the precision fix for `router_logits` is implemented by explicitly setting the router weights to `FP32`. A potentially better approach would be to retain the router weights in `FP16` and adjust the computation precision dynamically during execution. However, I am not fully familiar with the `quant_config` in `ReplicatedLinear`, so I believe someone with more expertise in this area could refine this further. For now, this implementation effectively ensures stable and correct model behavior.