
Conversation

@Gongdayao commented Nov 27, 2025

What this PR does / why we need it?

Does this PR introduce any user-facing change?

How was this patch tested?

@gemini-code-assist bot left a comment


Code Review

This pull request adds a new tutorial for running the DeepSeek-R1-w8a8 model. The tutorial is comprehensive, covering setup, deployment, and evaluation. My review focuses on ensuring the correctness of the provided shell commands, as users will likely copy and paste them. I've identified several issues, including typos in environment variables, incorrect command-line arguments, and inconsistencies in port numbers, which would prevent the commands from running successfully. I've provided specific suggestions to correct these errors.

export HCCL_BUFFSIZE=200
export VLLM_ASCEND_ENABLE_MLAPO=1
export VLLM_RPC_TIMEOUT=3600000
export VLLM_EXCUTE_MODEL_TIMEOUT_SECONDS=3600000

high

There appears to be a typo in the environment variable name. VLLM_EXCUTE_MODEL_TIMEOUT_SECONDS should likely be VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS. An incorrect variable name will prevent this setting from being applied, which could lead to unexpected timeouts.

Suggested change
export VLLM_EXCUTE_MODEL_TIMEOUT_SECONDS=3600000
export VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=3600000

--no-enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--speculative-config '{"num_speculative_tokens":1,"method":"deepseek_mtp"}' \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}:' \

high

There is a trailing colon in the value for the --compilation-config argument. This will cause a JSON parsing error when the command is executed. Please remove the colon.

Suggested change
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}:' \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \

--host 0.0.0.0 \
--port 8004 \
--data-parallel-size 4 \
--tensor-parallel-size 2 \

high

The --tensor-parallel-size argument is specified twice with conflicting values (2 on this line, and 4 on line 179). This is likely an error. Based on other configurations in this document, the correct value is 4. Please remove this redundant line to avoid confusion and ensure the correct configuration is used.
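For illustration, a minimal sketch of the deduplicated flags, assuming the surrounding command is the `vllm serve` invocation from this tutorial and that tensor parallelism should be 4, matching the other configurations:

```shell
# Sketch only: keep a single --tensor-parallel-size entry (value 4 assumed).
vllm serve path/DeepSeek-R1-w8a8 \
  --host 0.0.0.0 \
  --port 8004 \
  --data-parallel-size 4 \
  --tensor-parallel-size 4
  # ...remaining flags (--no-enable-prefix-caching, --gpu-memory-utilization, etc.) unchanged
```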

--no-enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--speculative-config '{"num_speculative_tokens":1,"method":"deepseek_mtp"}' \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}:' \

high

There is a trailing colon in the value for the --compilation-config argument. This will cause a JSON parsing error when the command is executed. Please remove the colon.

Suggested change
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}:' \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \

lm_eval \
--model local-completions \
--model_args model=path/DeepSeek-R1-w8a8,base_url=http://127.0.0.1:8000/v1/completions,tokenized_requests=False,trust_remote_code=True \

high

The base_url in the lm_eval command uses port 8000, which is inconsistent with the ports 8011 (for A3 series) and 8004 (for A2 series) defined in the deployment scripts above. This will cause connection errors. Please use a placeholder like <port> to remind the user to fill in the correct port, similar to the curl example.

Suggested change
--model_args model=path/DeepSeek-R1-w8a8,base_url=http://127.0.0.1:8000/v1/completions,tokenized_requests=False,trust_remote_code=True \
--model_args model=path/DeepSeek-R1-w8a8,base_url=http://127.0.0.1:<port>/v1/completions,tokenized_requests=False,trust_remote_code=True \
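As a quick sanity check, the OpenAI-compatible endpoint can be probed with curl before launching lm_eval (sketch; `<port>` stands for the port actually used in the deployment script):

```shell
# Replace <port> with 8011 (A3 series) or 8004 (A2 series) as configured above.
curl http://127.0.0.1:<port>/v1/models
```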

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Fill in the PR description when writing the commit message to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
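For reference, a minimal sketch of running those checks locally, assuming the usual pre-commit plus pytest workflow from the contributing guide (the exact commands and test paths may differ):

```shell
# Assumed workflow; follow the project's Contributing and Testing docs if they differ.
pip install pre-commit pytest
pre-commit run --all-files   # linting
pytest tests/ut              # unit tests (test path is an assumption)
```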

@github-actions bot added the documentation label Nov 27, 2025
2. Install the package `custom-ops` to make the kernels available.

wget https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/a2/CANN-custom_ops-sfa-linux.aarch64.run

DeepSeek-R1 doesn't need this; it's specific to DeepSeek-V3.2.

@@ -0,0 +1,327 @@
# DeepSeek-R1-w8a8

The title should be DeepSeek-R1, and the content should not be limited to DeepSeek-R1-W8A8; covering DeepSeek-R1 as well would be better.

2. Install the package `custom-ops` to make the kernels available.

wget https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/a3/CANN-custom_ops-sfa-linux.aarch64.run

DeepSeek-R1 doesn't need this; it's specific to DeepSeek-V3.2.

--gpu-memory-utilization 0.92 \
--speculative-config '{"num_speculative_tokens":1,"method":"deepseek_mtp"}' \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
--additional-config '{"ascend_scheduler_config":{"enabled":false},"torchair_graph_config":{"enabled":false}}'

The Ascend scheduler is about to be dropped from main; refer to #4498. Also, there is no need to pass --additional-config at all if you are only setting "enabled": false.
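To illustrate, a sketch of how the tail of the command could look once the flag is dropped (assuming the rest of the invocation stays as written, and that `"enabled": false` is already the default per the comment above):

```shell
# Sketch only: tail of the serve command with the --additional-config line removed.
  --gpu-memory-utilization 0.92 \
  --speculative-config '{"num_speculative_tokens":1,"method":"deepseek_mtp"}' \
  --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'
```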

--gpu-memory-utilization 0.92 \
--speculative-config '{"num_speculative_tokens":1,"method":"deepseek_mtp"}' \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
--additional-config '{"ascend_scheduler_config":{"enabled":false},"torchair_graph_config":{"enabled":false}}'

Same as above.

--gpu-memory-utilization 0.94 \
--speculative-config '{"num_speculative_tokens":1,"method":"deepseek_mtp"}' \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
--additional-config '{"ascend_scheduler_config":{"enabled":false},"torchair_graph_config":{"enabled":false}}'

Same as above.

drslark and others added 17 commits November 29, 2025 15:37
### What this PR does / why we need it?

Adapted Qwen3-Next eager mode to `v0.11.2`.

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: drslark <[email protected]>
Signed-off-by: Gongdayao <[email protected]>
…sible device count error (#4457)

### What this PR does / why we need it?
Fix the ray startup failure: local_world_size cannot be less than the visible device count. For details, see issue #4456.

The fix is ported from the corresponding vLLM change, PR:
[#28873](vllm-project/vllm#28873)

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: leo-pony <[email protected]>
Signed-off-by: Gongdayao <[email protected]>
This PR introduces the `EXEC_NPU_CMD` macro, serving as an adapter layer
to simplify the invocation of `aclnn` operators on Ascend NPUs.

**Key Changes:**
* **Adapter Layer:** Added `EXEC_NPU_CMD` macro and related dependencies
to standardize `aclnn` calls.
* **Operator Support:** Integrated `grouped_matmul_swiglu_quant` as a
reference implementation to demonstrate the usage of the new macro.

---

- vLLM version: v0.11.2

---------

Signed-off-by: SlightwindSec <[email protected]>
Signed-off-by: Gongdayao <[email protected]>
### What this PR does / why we need it?
Add eagle proposer ut

- vLLM version: v0.11.2

Signed-off-by: GDzhu01 <[email protected]>
Signed-off-by: Gongdayao <[email protected]>
### What this PR does / why we need it?
Upgrade CANN to 8.3.RC2

### Does this PR introduce _any_ user-facing change?
Yes, the docker image will use CANN 8.3.RC2

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: MrZ20 <[email protected]>
Signed-off-by: Gongdayao <[email protected]>
…c weight (#4036)

### What this PR does / why we need it?

When using the LLM Compressor quantization tool from the vLLM community to generate quantized weights, the vLLM Ascend engine needs to be adapted to support the compressed-tensors quantization format.

1. Add AscendCompressedTensorsConfig to replace CompressedTensorsConfig in vllm.
2. Support CompressedTensorsW8A8 static weight.
   - weight: per-channel, int8, symmetric; activation: per-tensor, int8, symmetric.
3. Support CompressedTensorsW8A8Dynamic weight.
   - weight: per-channel, int8, symmetric; activation: per-token, int8, symmetric, dynamic.
4. Modify override_quantization_method in AscendQuantConfig.

Co-authored-by: taoqun110 [email protected]
Co-authored-by: chenxi-hh [email protected]

- vLLM version: v0.11.2

---------

Signed-off-by: LHXuuu <[email protected]>
Signed-off-by: chenxi-hh <[email protected]>
Signed-off-by: chenxi-hh <[email protected]>
Co-authored-by: chenxi-hh <[email protected]>
Co-authored-by: chenxi-hh <[email protected]>
Signed-off-by: Gongdayao <[email protected]>
…VisionAttention (#4349)

### What this PR does / why we need it?

- [x] Patch `Qwen2_5_VisionAttention` with
`AscendQwen2_5_VisionAttention`.
- [x] Replace `AscendQwen2_5_VisionTransformer` with
`Qwen2_5_VisionTransformer` in vllm.
- [x] Move padding logic (q/k/v and cos/sin) before FA to `forward()` of
`Qwen2_5_VisionAttention`.
- [x] Convert `cu_seqlens` in `Qwen2_5_VisionAttention` from cumulative
form to intervals and move it to cpu (compatible with npu FA).
- [x] Remove Qwen2.5-VL modeling files.
- [x] Remove Qwen2.5-VL (without padding) modeling files.
- [x] Remove related UT.
- [x] Make `set_forward_context` pluggable when getting MM embedding.
Find more details at vllm-project/vllm#29388.
- [x] Simplify padding logic for FA.
- [x] Add patch for vllm-project/vllm#28798.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- [x] Functional test (eager mode)
- [x] Functional test (graph mode)
- [x] Benchmark

- vLLM version: v0.11.2

---------

Signed-off-by: shen-shanshan <[email protected]>
Signed-off-by: Gongdayao <[email protected]>
### What this PR does / why we need it?
Add readme for PD separation

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By ci

- vLLM version: v0.11.0
- vLLM main:
vllm-project/vllm@2918c1b

---------

Signed-off-by: wangxiaoteng <[email protected]>
Signed-off-by: liziyu <[email protected]>
Co-authored-by: liziyu <[email protected]>
Signed-off-by: Gongdayao <[email protected]>
### What this PR does / why we need it?
Delete equals sign in doc
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
ut

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: herizhen <[email protected]>
Co-authored-by: herizhen <[email protected]>
Signed-off-by: Gongdayao <[email protected]>
### What this PR does / why we need it?

This PR introduces support for adding custom CANN `aclnn` ops to
`vllm-ascend`, allowing users to define and use their own custom
operators.

Key changes include:
- Building and installing custom ops into the `vllm-ascend`-specified
directory
- Binding the `aclnn` op interface to the `torch.ops._C_ascend` module
- Enabling invocation of these ops within `vllm-ascend`

This PR includes a sample custom op:
`aclnnGroupedMatmulSwigluQuantWeightNzTensorList`, which is adapted from
the CANN operator
[`aclnnGroupedMatmulSwigluQuantWeightNZ`](https://www.hiascend.com/document/detail/zh/canncommercial/83RC1/API/aolapi/context/aclnnGroupedMatmulSwigluQuantWeightNZ.md).
Its input parameters `weight` and `weight_scale` now accept
`list[torch.Tensor]` (i.e., `at::TensorList`).

### Does this PR introduce _any_ user-facing change?

No.

- vLLM version: v0.11.2

---------

Signed-off-by: QianChenxi <[email protected]>
Signed-off-by: Gongdayao <[email protected]>
…4438)

### What this PR does / why we need it?
1. In short, we renamed the existing MooncakeStoreConnector to AscendStoreConnector and extracted the storage engine interaction logic into a new Backend class. Associated RFC: #4329
2. Fixed the issue where the number of input parameters for the connector was incorrect, introduced in vLLM 0.11.2.
### Does this PR introduce _any_ user-facing change?
change MooncakeStoreConnector to AscendStoreConnector
### How was this patch tested?

- vLLM version: v0.11.2

---------

Signed-off-by: fems14 <[email protected]>
Signed-off-by: Gongdayao <[email protected]>
### What this PR does / why we need it?
Qwen3-Next: support the Triton chunk_gated_delta_rule ops.

### co-owners
@OsirisDuan

- vLLM version: v0.11.2

Signed-off-by: shiyuan680 <[email protected]>
Signed-off-by: Gongdayao <[email protected]>
The Triton package URL is not correct. This PR fixes it.

Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: Gongdayao <[email protected]>
Fix the hang when the model runs _npu_flash_attention in _forward_prefill_no_cache; it was caused by a wrong attention mask dtype.
### How was this patch tested?
Yes, tested on Qwen2.5-VL and Qwen2.5-Omni

- vLLM version: v0.11.0
- vLLM main:
vllm-project/vllm@2918c1b

Signed-off-by: Ting FU <[email protected]>
Signed-off-by: Gongdayao <[email protected]>
@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.
