
Conversation


@YuhanBai YuhanBai commented Nov 27, 2025

What this PR does / why we need it?

Add a switch that overlaps the exponential distribution operator with model execution (default OFF, because this feature may not perform well on MoE models such as Qwen3-30B).
Enabling async exponential overlap provides a performance improvement (see the test results below).
Overlapping the exponential operator with model execution also hides the performance drop introduced by the AICPU version of the exponential operator.
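A minimal sketch of the idea, assuming the torch_npu stream API mirrors torch.cuda (the stream handling and function names below are illustrative, not the code added by this PR):

```python
import torch
import torch_npu  # assumed: provides the torch.npu stream API on Ascend

# Illustrative sketch: fill the exponential noise on a side stream so the
# exponential_() kernel overlaps with the model forward pass running on the
# default stream.
exp_stream = torch.npu.Stream()

def start_async_exponential(batch_size: int, vocab_size: int) -> torch.Tensor:
    """Kick off exponential noise generation asynchronously and return the tensor."""
    q = torch.empty(batch_size, vocab_size, device="npu")
    with torch.npu.stream(exp_stream):
        q.exponential_()  # overlaps with model execution on the default stream
    return q

def sample_with_noise(probs: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """Exponential-race sampling (equivalent to Gumbel-max), as in vLLM's random_sample."""
    torch.npu.current_stream().wait_stream(exp_stream)  # ensure the noise is ready
    return probs.div_(q).argmax(dim=-1).view(-1)
```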

Does this PR introduce any user-facing change?

YES. This PR adds a new switch, VLLM_ASCEND_ENABLE_ASYNC_EXPONENTIAL, which controls whether the feature is enabled. To enable it, set export VLLM_ASCEND_ENABLE_ASYNC_EXPONENTIAL=1.
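For reference, a minimal sketch of how such a switch is typically read from the environment (the actual registration in vllm-ascend's envs module may differ):

```python
import os

# Illustrative only: the real vllm-ascend code may parse and register this differently.
VLLM_ASCEND_ENABLE_ASYNC_EXPONENTIAL: bool = (
    os.getenv("VLLM_ASCEND_ENABLE_ASYNC_EXPONENTIAL", "0") == "1"
)
```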

How was this patch tested?

A2 Qwen2.5-32B / A3 Qwen2.5-32B / A3 Qwen3-32B / A3 Qwen3-30B

In this PR, we tested this feature on the A2 and A3 platforms with three models: Qwen2.5-32B, Qwen3-30B, and Qwen3-32B.

A2 Tests

On the A2 platform, we only tested Qwen2.5-32B:

|  | test1 (tok/s) | test2 (tok/s) | test3 (tok/s) | test4 (tok/s) | test5 (tok/s) | AVG (tok/s) |
| --- | --- | --- | --- | --- | --- | --- |
| Disable overlap | 1934.59 | 2161.24 | 2155.04 | 2155.58 | 2162.43 | 2113.7 |
| Enable overlap | 1972.98 | 2205.74 | 2210.58 | 2210.72 | 2209.36 | 2161.8 |

Tests used the following config: tp4dp1, enforce-eager, num_prompts=48, 2k->2k.

On the A2 platform, enabling async exponential overlap provides about a 2.28% performance improvement.

A3 Tests

On the A3 platform, we tested all three models: Qwen2.5-32B, Qwen3-30B, and Qwen3-32B.
For Qwen2.5-32B and Qwen3-32B, enabling async exponential overlap gives a significant performance improvement:

Qwen2.5-32B

|  | test1 (tok/s) | test2 (tok/s) | test3 (tok/s) | test4 (tok/s) | test5 (tok/s) | AVG (tok/s) |
| --- | --- | --- | --- | --- | --- | --- |
| Disable overlap | 1418.36 | 1632.18 | 1648.44 | 1670.25 | 1511.80 | 1576.21 |
| Enable overlap | 1601.71 | 1675.43 | 1740.61 | 1720.02 | 1696.65 | 1686.88 |

Tests used the following config: tp4dp1, enforce-eager, num_prompts=48, 1.5k->2k.

Qwen3-32B

|  | test1 (tok/s) | test2 (tok/s) | test3 (tok/s) | test4 (tok/s) | test5 (tok/s) | AVG (tok/s) |
| --- | --- | --- | --- | --- | --- | --- |
| Disable overlap | 967.51 | 1060.02 | 1053.79 | 1034.72 | 1043.43 | 1033.09 |
| Enable overlap | 1154.43 | 1188.64 | 1175.38 | 1154.65 | 1178.81 | 1170.38 |

Tests used the following config: tp4dp1, enforce-eager, num_prompts=48, 1.5k->2k.

On the A3 platform, enabling async exponential overlap gives Qwen2.5-32B about a 7% performance improvement and Qwen3-32B about a 13.3% performance improvement.

For accuracy, I tested on the Math-500 dataset and compared the scores:

|  | A3 Qwen3-32B tp4dp1 enforce-eager | A3 Qwen2.5-32B tp4dp1 enforce-eager |
| --- | --- | --- |
| Disable overlap | 84.90 | 45.72 |
| Enable overlap | 85.36 | 45.68 |

Tests used temperature=1, seed=1234, output_tokens=10k.

Our results show that enabling async exponential overlap does not introduce a significant accuracy drop.

However, on Qwen3-30B, enabling async exponential overlap causes a performance drop regardless of whether enable_expert_parallel is set:

Qwen3-30B enable_expert_parallel=True

|  | test1 (tok/s) | test2 (tok/s) | test3 (tok/s) | test4 (tok/s) | test5 (tok/s) | AVG (tok/s) |
| --- | --- | --- | --- | --- | --- | --- |
| Disable overlap | 855.55 | 900.15 | 856.83 | 875.90 | 851.41 | 867.97 |
| Enable overlap | 782.77 | 758.10 | 814.84 | 837.34 | 823.42 | 803.29 |

Tests used the following config: tp4dp1, enforce-eager, num_prompts=48, 1.5k->2k.

Qwen3-30B enable_expert_parallel=False

|  | test1 (tok/s) | test2 (tok/s) | test3 (tok/s) | test4 (tok/s) | test5 (tok/s) | AVG (tok/s) |
| --- | --- | --- | --- | --- | --- | --- |
| Disable overlap | 972 | 1025.78 | 1131.05 | 1106.95 | 1122.46 | 1071.78 |
| Enable overlap | 1035.24 | 1037.53 | 1087.57 | 1087.56 | 1069.11 | 1063.40 |

Tests used the following config: tp4dp1, enforce-eager, num_prompts=48, 1.5k->2k.

Considering the performance drop on Qwen3-30B, this feature defaults to OFF. To enable it, set export VLLM_ASCEND_ENABLE_ASYNC_EXPONENTIAL=1.

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by other future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces an optimization to overlap the generation of random numbers for sampling with model execution, which is a good performance enhancement. However, I've identified two critical issues. One will cause a crash on startup if an environment variable is not set, due to an incorrect default value type. The other is a correctness bug in the sampling logic that incorrectly handles greedy decoding requests, causing them to be sampled randomly. I have provided suggestions to fix both issues.

@YuhanBai YuhanBai changed the title from [Performance] Add async exponential while module executing to [Performance] Add async exponential while model executing on Nov 27, 2025
@wangxiyuan wangxiyuan added the ready (read for review) and ready-for-test (start test by label for PR) labels on Nov 27, 2025
Comment on lines +4503 to +4506
if len(generators) != q.shape[0]:
    q.exponential_()
if generators:
Collaborator


what if both len(generators) != q.shape[0] and generators are True?

Author


If both len(generators) != q.shape[0] and generators is non-empty, we first do q.exponential_() on the whole tensor, then overwrite each seeded row q[i] with q[i].exponential_(generator=generator).
This simply reuses the same logic as vLLM's random_sample. Hope this information is helpful!
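
For reference, a minimal sketch of that random_sample-style flow (adapted from vLLM's sampler; names are illustrative):

```python
import torch

def exponential_random_sample(
    probs: torch.Tensor,
    generators: dict[int, torch.Generator],
) -> torch.Tensor:
    """Sample token ids via exponential noise, honoring per-request seeded generators."""
    q = torch.empty_like(probs)
    # Fill the whole batch only when not every row has its own generator.
    if len(generators) != probs.shape[0]:
        q.exponential_()
    # Rows with a seeded generator are overwritten with seeded noise.
    if generators:
        for i, generator in generators.items():
            q[i].exponential_(generator=generator)
    return probs.div_(q).argmax(dim=-1).view(-1)
```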

@YuhanBai YuhanBai force-pushed the add_async_exponential branch from 9b96294 to 78e6c91 on December 2, 2025 01:57
@YuhanBai YuhanBai force-pushed the add_async_exponential branch from 78e6c91 to 9fafd98 on December 2, 2025 06:34

github-actions bot commented Dec 2, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.
