[Performance] Add async exponential while model executing #4501
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces an optimization to overlap the generation of random numbers for sampling with model execution, which is a good performance enhancement. However, I've identified two critical issues. One will cause a crash on startup if an environment variable is not set, due to an incorrect default value type. The other is a correctness bug in the sampling logic that incorrectly handles greedy decoding requests, causing them to be sampled randomly. I have provided suggestions to fix both issues.
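The "incorrect default value type" failure mode mentioned above is a common pitfall with `os.environ.get`. A minimal illustrative sketch (the helper name and default are hypothetical, not from this PR): if the default passed to `os.environ.get` is not a string, parsing it crashes when the variable is unset.

```python
import os

def env_flag(name: str, default: str = "0") -> bool:
    # Hypothetical helper: read an on/off switch from the environment.
    # The default must be a string, matching the type os.environ stores;
    # passing e.g. None or 0 here would make int() crash or behave
    # unexpectedly when the variable is unset.
    return bool(int(os.environ.get(name, default)))
```

With a string default, the switch parses safely whether or not the variable is set.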
```python
if len(generators) != q.shape[0]:
    q.exponential_()
if generators:
```
What happens if both `len(generators) != q.shape[0]` and `generators` is truthy?
If both `len(generators) != q.shape[0]` and `generators` is truthy, we first do `q.exponential_()`, then overwrite each `q[i]` with `q[i].exponential_(generator=generator)`.
Here we simply re-use the same logic as vLLM's `random_sample`. Hope this information is helpful!
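The fill-then-overwrite pattern described above can be sketched in pure Python (stand-in names; the real code operates on a torch tensor `q` with `torch.Generator` objects, not lists and `random.Random`):

```python
import random

def fill_exponential(q, generators):
    """q: list of rows; generators: {row_index: random.Random}."""
    # If not every row has its own seeded generator, fill the whole
    # buffer from the default source first...
    if len(generators) != len(q):
        for i in range(len(q)):
            q[i] = [random.expovariate(1.0) for _ in q[i]]
    # ...then overwrite the rows that do have a per-request generator,
    # so seeded requests remain reproducible.
    for i, gen in generators.items():
        q[i] = [gen.expovariate(1.0) for _ in q[i]]
    return q
```

Rows with a seeded generator give the same draws on every call; the remaining rows are still filled, so no position is left uninitialized.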
Signed-off-by: YuhanBai <[email protected]>
This pull request has conflicts, please resolve those before we can evaluate the pull request.
What this PR does / why we need it?
Add a switch to enable overlapping the exponential distribution operator with model execution (default is OFF, because this feature may not perform well on MoE models, e.g. Qwen3-30B).
Enabling async exponential overlap provides a performance improvement.
Overlapping the exponential operator with model execution also hides the performance drop introduced by the AICPU version of the exponential operator.
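Conceptually, the overlap works by launching noise generation concurrently with the forward pass so the samples are ready when sampling runs. A hedged sketch using a thread to illustrate the idea (the PR itself presumably overlaps via a secondary NPU stream; function names here are illustrative, not from the PR):

```python
from concurrent.futures import ThreadPoolExecutor
import random

def model_step_with_async_exponential(model_fn, batch, noise_len):
    # Kick off exponential-noise generation in the background, run the
    # model, then collect the noise for sampling. By the time the forward
    # pass finishes, the noise is ready (or nearly so), hiding its cost.
    with ThreadPoolExecutor(max_workers=1) as pool:
        noise_future = pool.submit(
            lambda: [random.expovariate(1.0) for _ in range(noise_len)])
        logits = model_fn(batch)   # model executes while noise is drawn
        q = noise_future.result()  # join before sampling uses q
    return logits, q
```

The same structure applies on-device: the key design point is that the exponential draw has no data dependency on the current step's logits, so it can start before the model finishes.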
Does this PR introduce any user-facing change?
Yes. This PR adds a new switch, `VLLM_ASCEND_ENABLE_ASYNC_EXPONENTIAL`, which controls whether the feature is enabled. To enable it, set `export VLLM_ASCEND_ENABLE_ASYNC_EXPONENTIAL=1`.
How was this patch tested?
Test matrix: A2 Qwen2.5-32B / A3 Qwen2.5-32B / A3 Qwen3-32B / A3 Qwen3-30B
In this PR, we tested the feature on the A2 and A3 platforms and on three different models: Qwen2.5-32B, Qwen3-30B, and Qwen3-32B.
A2 Tests
On the A2 platform, we only tested Qwen2.5-32B: enabling async exponential overlap provides about a 2.28% performance improvement.
A3 Tests
On the A3 platform, we tested all three models: Qwen2.5-32B, Qwen3-30B, and Qwen3-32B.
For Qwen2.5-32B and Qwen3-32B, enabling async exponential overlap gives a significant performance improvement:
Qwen2.5-32B
Qwen3-32B
On the A3 platform, enabling async exponential overlap gives Qwen2.5-32B about a 7% performance improvement and Qwen3-32B about a 13.3% performance improvement.
Accuracy-wise, I tested on the Math-500 dataset and compared the ratings: our results show that enabling async exponential overlap does not introduce a significant accuracy drop.
However, on Qwen3-30B, regardless of whether `enable_expert_parallel` is set, enabling async exponential overlap causes a performance drop:
Qwen3-30B, enable_expert_parallel=True
Qwen3-30B, enable_expert_parallel=False
Considering the performance drop on Qwen3-30B, we set this feature to OFF by default; to enable it, set `export VLLM_ASCEND_ENABLE_ASYNC_EXPONENTIAL=1`.