
Conversation

@Gongdayao commented Nov 27, 2025

What this PR does / why we need it?

Does this PR introduce any user-facing change?

How was this patch tested?

@gemini-code-assist bot left a comment


Code Review

This pull request adds a new tutorial for running the DeepSeek-R1-w8a8 model. The tutorial is comprehensive, covering setup, deployment, and evaluation. My review focuses on ensuring the correctness of the provided shell commands, as users will likely copy and paste them. I've identified several issues, including typos in environment variables, incorrect command-line arguments, and inconsistencies in port numbers, which would prevent the commands from running successfully. I've provided specific suggestions to correct these errors.

export HCCL_BUFFSIZE=200
export VLLM_ASCEND_ENABLE_MLAPO=1
export VLLM_RPC_TIMEOUT=3600000
export VLLM_EXCUTE_MODEL_TIMEOUT_SECONDS=3600000

high

There appears to be a typo in the environment variable name. VLLM_EXCUTE_MODEL_TIMEOUT_SECONDS should likely be VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS. An incorrect variable name will prevent this setting from being applied, which could lead to unexpected timeouts.

Suggested change
export VLLM_EXCUTE_MODEL_TIMEOUT_SECONDS=3600000
export VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=3600000

--no-enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--speculative-config '{"num_speculative_tokens":1,"method":"deepseek_mtp"}' \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}:' \

high

There is a trailing colon in the value for the --compilation-config argument. This will cause a JSON parsing error when the command is executed. Please remove the colon.

Suggested change
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}:' \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \

--host 0.0.0.0 \
--port 8004 \
--data-parallel-size 4 \
--tensor-parallel-size 2 \

high

The --tensor-parallel-size argument is specified twice with conflicting values (2 on this line, and 4 on line 179). This is likely an error. Based on other configurations in this document, the correct value is 4. Please remove this redundant line to avoid confusion and ensure the correct configuration is used.
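For illustration, a minimal sketch of the deduplicated flags, assuming the surrounding command is the `vllm serve` invocation from this tutorial and that tensor parallelism should be 4, matching the other configurations:

```shell
# Sketch only: keep a single --tensor-parallel-size entry (value 4 assumed).
vllm serve path/DeepSeek-R1-w8a8 \
  --host 0.0.0.0 \
  --port 8004 \
  --data-parallel-size 4 \
  --tensor-parallel-size 4
  # ...remaining flags (--no-enable-prefix-caching, --gpu-memory-utilization, etc.) unchanged
```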

--no-enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--speculative-config '{"num_speculative_tokens":1,"method":"deepseek_mtp"}' \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}:' \

high

There is a trailing colon in the value for the --compilation-config argument. This will cause a JSON parsing error when the command is executed. Please remove the colon.

Suggested change
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}:' \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \

lm_eval \
--model local-completions \
--model_args model=path/DeepSeek-R1-w8a8,base_url=http://127.0.0.1:8000/v1/completions,tokenized_requests=False,trust_remote_code=True \

high

The base_url in the lm_eval command uses port 8000, which is inconsistent with the ports 8011 (for A3 series) and 8004 (for A2 series) defined in the deployment scripts above. This will cause connection errors. Please use a placeholder like <port> to remind the user to fill in the correct port, similar to the curl example.

Suggested change
--model_args model=path/DeepSeek-R1-w8a8,base_url=http://127.0.0.1:8000/v1/completions,tokenized_requests=False,trust_remote_code=True \
--model_args model=path/DeepSeek-R1-w8a8,base_url=http://127.0.0.1:<port>/v1/completions,tokenized_requests=False,trust_remote_code=True \
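As a quick sanity check, the OpenAI-compatible endpoint can be probed with curl before launching lm_eval (sketch; `<port>` stands for the port actually used in the deployment script):

```shell
# Replace <port> with 8011 (A3 series) or 8004 (A2 series) as configured above.
curl http://127.0.0.1:<port>/v1/models
```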

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Fill in the PR description when writing the commit message to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
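For reference, a minimal sketch of running those checks locally, assuming the usual pre-commit plus pytest workflow from the contributing guide (the exact commands and test paths may differ):

```shell
# Assumed workflow; follow the project's Contributing and Testing docs if they differ.
pip install pre-commit pytest
pre-commit run --all-files   # linting
pytest tests/ut              # unit tests (test path is an assumption)
```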

@github-actions bot added the documentation label Nov 27, 2025
2. Install the package `custom-ops` to make the kernels available.

wget https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/a2/CANN-custom_ops-sfa-linux.aarch64.run

DeepSeek-R1 doesn't need this; it's specific to DeepSeek-V3.2.

@@ -0,0 +1,327 @@
# DeepSeek-R1-w8a8

The title should be DeepSeek-R1, and the content should not be limited to DeepSeek-R1-W8A8; covering DeepSeek-R1 as well would be better.

2. Install the package `custom-ops` to make the kernels available.

wget https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/a3/CANN-custom_ops-sfa-linux.aarch64.run

DeepSeek-R1 doesn't need this; it's specific to DeepSeek-V3.2.

--gpu-memory-utilization 0.92 \
--speculative-config '{"num_speculative_tokens":1,"method":"deepseek_mtp"}' \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
--additional-config '{"ascend_scheduler_config":{"enabled":false},"torchair_graph_config":{"enabled":false}}'

The Ascend scheduler is about to be dropped from main; refer to #4498. Also, there is no need to pass --additional-config at all if you are only setting "enabled": false.
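To illustrate, a sketch of how the tail of the command could look once the flag is dropped (assuming the rest of the invocation stays as written, and that `"enabled": false` is already the default per the comment above):

```shell
# Sketch only: tail of the serve command with the --additional-config line removed.
  --gpu-memory-utilization 0.92 \
  --speculative-config '{"num_speculative_tokens":1,"method":"deepseek_mtp"}' \
  --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'
```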

--gpu-memory-utilization 0.92 \
--speculative-config '{"num_speculative_tokens":1,"method":"deepseek_mtp"}' \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
--additional-config '{"ascend_scheduler_config":{"enabled":false},"torchair_graph_config":{"enabled":false}}'

Same as above.

--gpu-memory-utilization 0.94 \
--speculative-config '{"num_speculative_tokens":1,"method":"deepseek_mtp"}' \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
--additional-config '{"ascend_scheduler_config":{"enabled":false},"torchair_graph_config":{"enabled":false}}'

Same as above.

drslark and others added 17 commits November 29, 2025 15:37
### What this PR does / why we need it?

Adapted Qwen3-Next eager mode to `v0.11.2`.

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: drslark <[email protected]>
Signed-off-by: Gongdayao <[email protected]>
…sible device count error (#4457)

### What this PR does / why we need it?
Fix the ray startup failure: local_world_size cannot be less than the visible device count. For details, see issue #4456.

The fix is ported from the corresponding vLLM change, PR:
[#28873](vllm-project/vllm#28873)

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: leo-pony <[email protected]>
Signed-off-by: Gongdayao <[email protected]>
This PR introduces the `EXEC_NPU_CMD` macro, serving as an adapter layer
to simplify the invocation of `aclnn` operators on Ascend NPUs.

**Key Changes:**
* **Adapter Layer:** Added `EXEC_NPU_CMD` macro and related dependencies
to standardize `aclnn` calls.
* **Operator Support:** Integrated `grouped_matmul_swiglu_quant` as a
reference implementation to demonstrate the usage of the new macro.

---

- vLLM version: v0.11.2

---------

Signed-off-by: SlightwindSec <[email protected]>
Signed-off-by: Gongdayao <[email protected]>
### What this PR does / why we need it?
Add eagle proposer ut

- vLLM version: v0.11.2

Signed-off-by: GDzhu01 <[email protected]>
Signed-off-by: Gongdayao <[email protected]>
### What this PR does / why we need it?
Upgrade CANN to 8.3.RC2

### Does this PR introduce _any_ user-facing change?
Yes, the docker image will use CANN 8.3.RC2

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: MrZ20 <[email protected]>
Signed-off-by: Gongdayao <[email protected]>
…c weight (#4036)

### What this PR does / why we need it?

When using the LLM Compressor quantization tool from the vLLM community to generate quantized weights, the vLLM Ascend engine needs to be adapted to support the compressed-tensors quantization format.

1. Add AscendCompressedTensorsConfig to replace CompressedTensorsConfig in vllm.
2. Support CompressedTensorsW8A8 static weight.
   - weight: per-channel, int8, symmetric; activation: per-tensor, int8, symmetric.
3. Support CompressedTensorsW8A8Dynamic weight.
   - weight: per-channel, int8, symmetric; activation: per-token, int8, symmetric, dynamic.
4. Modify override_quantization_method in AscendQuantConfig.

Co-authored-by: taoqun110 [email protected]
Co-authored-by: chenxi-hh [email protected]

- vLLM version: v0.11.2

---------

Signed-off-by: LHXuuu <[email protected]>
Signed-off-by: chenxi-hh <[email protected]>
Signed-off-by: chenxi-hh <[email protected]>
Co-authored-by: chenxi-hh <[email protected]>
Co-authored-by: chenxi-hh <[email protected]>
Signed-off-by: Gongdayao <[email protected]>
…VisionAttention (#4349)

### What this PR does / why we need it?

- [x] Patch `Qwen2_5_VisionAttention` with
`AscendQwen2_5_VisionAttention`.
- [x] Replace `AscendQwen2_5_VisionTransformer` with
`Qwen2_5_VisionTransformer` in vllm.
- [x] Move padding logic (q/k/v and cos/sin) before FA to `forward()` of
`Qwen2_5_VisionAttention`.
- [x] Convert `cu_seqlens` in `Qwen2_5_VisionAttention` from cumulative
form to intervals and move it to cpu (compatible with npu FA).
- [x] Remove Qwen2.5-VL modeling files.
- [x] Remove Qwen2.5-VL (without padding) modeling files.
- [x] Remove related UT.
- [x] Make `set_forward_context` pluggable when getting MM embedding.
Find more details at vllm-project/vllm#29388.
- [x] Simplify padding logic for FA.
- [x] Add patch for vllm-project/vllm#28798.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- [x] Functional test (eager mode)
- [x] Functional test (graph mode)
- [x] Benchmark

- vLLM version: v0.11.2

---------

Signed-off-by: shen-shanshan <[email protected]>
Signed-off-by: Gongdayao <[email protected]>
### What this PR does / why we need it?
Add readme for PD separation

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By ci

- vLLM version: v0.11.0
- vLLM main:
vllm-project/vllm@2918c1b

---------

Signed-off-by: wangxiaoteng <[email protected]>
Signed-off-by: liziyu <[email protected]>
Co-authored-by: liziyu <[email protected]>
Signed-off-by: Gongdayao <[email protected]>
### What this PR does / why we need it?
Delete equals sign in doc
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
ut

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: herizhen <[email protected]>
Co-authored-by: herizhen <[email protected]>
Signed-off-by: Gongdayao <[email protected]>
### What this PR does / why we need it?

This PR introduces support for adding custom CANN `aclnn` ops to
`vllm-ascend`, allowing users to define and use their own custom
operators.

Key changes include:
- Building and installing custom ops into the `vllm-ascend`-specified
directory
- Binding the `aclnn` op interface to the `torch.ops._C_ascend` module
- Enabling invocation of these ops within `vllm-ascend`

This PR includes a sample custom op:
`aclnnGroupedMatmulSwigluQuantWeightNzTensorList`, which is adapted from
the CANN operator
[`aclnnGroupedMatmulSwigluQuantWeightNZ`](https://www.hiascend.com/document/detail/zh/canncommercial/83RC1/API/aolapi/context/aclnnGroupedMatmulSwigluQuantWeightNZ.md).
Its input parameters `weight` and `weight_scale` now accept
`list[torch.Tensor]` (i.e., `at::TensorList`).

### Does this PR introduce _any_ user-facing change?

No.

- vLLM version: v0.11.2

---------

Signed-off-by: QianChenxi <[email protected]>
Signed-off-by: Gongdayao <[email protected]>
…4438)

### What this PR does / why we need it?
1. In short, we renamed the existing MooncakeStoreConnector to AscendStoreConnector and extracted the storage engine interaction logic into a new Backend class. Associated RFC: #4329
2. Fixed the issue where the number of input parameters for the connector was incorrect, introduced in vLLM 0.11.2.
### Does this PR introduce _any_ user-facing change?
change MooncakeStoreConnector to AscendStoreConnector
### How was this patch tested?

- vLLM version: v0.11.2

---------

Signed-off-by: fems14 <[email protected]>
Signed-off-by: Gongdayao <[email protected]>
### What this PR does / why we need it?
Qwen3-Next: support the Triton chunk_gated_delta_rule ops.

### co-owners
@OsirisDuan

- vLLM version: v0.11.2

Signed-off-by: shiyuan680 <[email protected]>
Signed-off-by: Gongdayao <[email protected]>
The Triton package URL is not correct. This PR fixes it.

Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: Gongdayao <[email protected]>
Fix the hang when the model runs _npu_flash_attention in _forward_prefill_no_cache; it was caused by a wrong attention mask dtype.
### How was this patch tested?
Yes, tested on Qwen2.5-VL and Qwen2.5-Omni

- vLLM version: v0.11.0
- vLLM main:
vllm-project/vllm@2918c1b

Signed-off-by: Ting FU <[email protected]>
Signed-off-by: Gongdayao <[email protected]>
@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.
