[Performance] Add async exponential while model executing #4501
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces an optimization to overlap the generation of random numbers for sampling with model execution, which is a good performance enhancement. However, I've identified two critical issues. One will cause a crash on startup if an environment variable is not set, due to an incorrect default value type. The other is a correctness bug in the sampling logic that incorrectly handles greedy decoding requests, causing them to be sampled randomly. I have provided suggestions to fix both issues.
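The "incorrect default value type" failure mode mentioned above is a common pitfall with `os.environ.get`. A minimal illustrative sketch (the helper name and default are hypothetical, not from this PR): if the default passed to `os.environ.get` is not a string, parsing it crashes when the variable is unset.

```python
import os

def env_flag(name: str, default: str = "0") -> bool:
    # Hypothetical helper: read an on/off switch from the environment.
    # The default must be a string, matching the type os.environ stores;
    # passing e.g. None or 0 here would make int() crash or behave
    # unexpectedly when the variable is unset.
    return bool(int(os.environ.get(name, default)))
```

With a string default, the switch parses safely whether or not the variable is set.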
```python
if len(generators) != q.shape[0]:
    q.exponential_()
if generators:
```
What happens if both `len(generators) != q.shape[0]` and `generators` is truthy?
If both `len(generators) != q.shape[0]` and `generators` is truthy, we first do `q.exponential_()`, then overwrite each `q[i]` with `q[i].exponential_(generator=generator)`.
Here we simply re-use the same logic as vLLM's `random_sample`. Hope this information is helpful!
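The fill-then-overwrite pattern described above can be sketched in pure Python (stand-in names; the real code operates on a torch tensor `q` with `torch.Generator` objects, not lists and `random.Random`):

```python
import random

def fill_exponential(q, generators):
    """q: list of rows; generators: {row_index: random.Random}."""
    # If not every row has its own seeded generator, fill the whole
    # buffer from the default source first...
    if len(generators) != len(q):
        for i in range(len(q)):
            q[i] = [random.expovariate(1.0) for _ in q[i]]
    # ...then overwrite the rows that do have a per-request generator,
    # so seeded requests remain reproducible.
    for i, gen in generators.items():
        q[i] = [gen.expovariate(1.0) for _ in q[i]]
    return q
```

Rows with a seeded generator give the same draws on every call; the remaining rows are still filled, so no position is left uninitialized.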
Signed-off-by: YuhanBai <[email protected]>
This pull request has conflicts, please resolve those before we can evaluate the pull request.
What this PR does / why we need it?
Add a switch to enable overlapping the exponential distribution operator with model execution (default is OFF, because this feature may not perform well on MoE models, e.g. Qwen3-30B).
Enabling async exponential overlap provides a performance improvement.
Overlapping the exponential operator with model execution also hides the performance drop introduced by the AICPU version of the exponential operator.
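Conceptually, the overlap works by launching noise generation concurrently with the forward pass so the samples are ready when sampling runs. A hedged sketch using a thread to illustrate the idea (the PR itself presumably overlaps via a secondary NPU stream; function names here are illustrative, not from the PR):

```python
from concurrent.futures import ThreadPoolExecutor
import random

def model_step_with_async_exponential(model_fn, batch, noise_len):
    # Kick off exponential-noise generation in the background, run the
    # model, then collect the noise for sampling. By the time the forward
    # pass finishes, the noise is ready (or nearly so), hiding its cost.
    with ThreadPoolExecutor(max_workers=1) as pool:
        noise_future = pool.submit(
            lambda: [random.expovariate(1.0) for _ in range(noise_len)])
        logits = model_fn(batch)   # model executes while noise is drawn
        q = noise_future.result()  # join before sampling uses q
    return logits, q
```

The same structure applies on-device: the key design point is that the exponential draw has no data dependency on the current step's logits, so it can start before the model finishes.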
Does this PR introduce any user-facing change?
Yes. This PR adds a new switch, `VLLM_ASCEND_ENABLE_ASYNC_EXPONENTIAL`, which controls whether the feature is enabled. To enable it, set `export VLLM_ASCEND_ENABLE_ASYNC_EXPONENTIAL=1`.
How was this patch tested?
Test matrix: A2 Qwen2.5-32B / A3 Qwen2.5-32B / A3 Qwen3-32B / A3 Qwen3-30B
In this PR, we tested the feature on the A2 and A3 platforms and on three different models: Qwen2.5-32B, Qwen3-30B, and Qwen3-32B.
A2 Tests
On the A2 platform, we only tested Qwen2.5-32B: enabling async exponential overlap provides about a 2.28% performance improvement.
A3 Tests
On the A3 platform, we tested all three models: Qwen2.5-32B, Qwen3-30B, and Qwen3-32B.
For Qwen2.5-32B and Qwen3-32B, enabling async exponential overlap gives a significant performance improvement:
Qwen2.5-32B
Qwen3-32B
On the A3 platform, enabling async exponential overlap gives Qwen2.5-32B about a 7% performance improvement and Qwen3-32B about a 13.3% performance improvement.
Accuracy-wise, I tested on the Math-500 dataset and compared the ratings: our results show that enabling async exponential overlap does not introduce a significant accuracy drop.
However, on Qwen3-30B, regardless of whether `enable_expert_parallel` is set, enabling async exponential overlap causes a performance drop:
Qwen3-30B, enable_expert_parallel=True
Qwen3-30B, enable_expert_parallel=False
Considering the performance drop on Qwen3-30B, we set this feature to OFF by default; to enable it, set `export VLLM_ASCEND_ENABLE_ASYNC_EXPONENTIAL=1`.