add deepseek-r1-w8a8 tutorial. #4504
Conversation
Code Review
This pull request adds a new tutorial for running the DeepSeek-R1-w8a8 model. The tutorial is comprehensive, covering setup, deployment, and evaluation. My review focuses on ensuring the correctness of the provided shell commands, as users will likely copy and paste them. I've identified several issues, including typos in environment variables, incorrect command-line arguments, and inconsistencies in port numbers, which would prevent the commands from running successfully. I've provided specific suggestions to correct these errors.
    export HCCL_BUFFSIZE=200
    export VLLM_ASCEND_ENABLE_MLAPO=1
    export VLLM_RPC_TIMEOUT=3600000
    export VLLM_EXCUTE_MODEL_TIMEOUT_SECONDS=3600000
There appears to be a typo in the environment variable name. VLLM_EXCUTE_MODEL_TIMEOUT_SECONDS should likely be VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS. An incorrect variable name will prevent this setting from being applied, which could lead to unexpected timeouts.
Suggested change:

    -export VLLM_EXCUTE_MODEL_TIMEOUT_SECONDS=3600000
    +export VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=3600000
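To confirm which `VLLM_*` variables are actually exported in the shell before launching the server, a minimal check (assuming nothing beyond a POSIX shell and `grep`) is:

```shell
# List every exported variable whose name starts with VLLM_;
# a typo such as VLLM_EXCUTE_... will show up here verbatim.
env | grep '^VLLM_'
```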
    --no-enable-prefix-caching \
    --gpu-memory-utilization 0.92 \
    --speculative-config '{"num_speculative_tokens":1,"method":"deepseek_mtp"}' \
    --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}:' \
There is a trailing colon in the value for the --compilation-config argument. This will cause a JSON parsing error when the command is executed. Please remove the colon.
Suggested change:

    ---compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}:' \
    +--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
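As a quick sanity check, the JSON passed to flags such as `--compilation-config` can be validated before launching the server; this sketch uses only the Python standard library:

```shell
# Raises a JSONDecodeError if the value is malformed,
# e.g. because of the stray trailing colon flagged above.
python3 -c 'import json, sys; json.loads(sys.argv[1])' '{"cudagraph_mode":"FULL_DECODE_ONLY"}'
```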
    --host 0.0.0.0 \
    --port 8004 \
    --data-parallel-size 4 \
    --tensor-parallel-size 2 \
The --tensor-parallel-size argument is specified twice with conflicting values (2 on this line, and 4 on line 179). This is likely an error. Based on other configurations in this document, the correct value is 4. Please remove this redundant line to avoid confusion and ensure the correct configuration is used.
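After removing the duplicate, the argument block would read roughly as follows (a sketch assembled from the hunk above; the model path and the `vllm serve` invocation itself are placeholders taken from the rest of the tutorial, and `--tensor-parallel-size 4` is the value the reviewer suggests):

```shell
# Sketch only: path/DeepSeek-R1-w8a8 is a placeholder model path.
vllm serve path/DeepSeek-R1-w8a8 \
  --host 0.0.0.0 \
  --port 8004 \
  --data-parallel-size 4 \
  --tensor-parallel-size 4
```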
    --no-enable-prefix-caching \
    --gpu-memory-utilization 0.92 \
    --speculative-config '{"num_speculative_tokens":1,"method":"deepseek_mtp"}' \
    --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}:' \
There is a trailing colon in the value for the --compilation-config argument. This will cause a JSON parsing error when the command is executed. Please remove the colon.
Suggested change:

    ---compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}:' \
    +--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
```shell
lm_eval \
  --model local-completions \
  --model_args model=path/DeepSeek-R1-w8a8,base_url=http://127.0.0.1:8000/v1/completions,tokenized_requests=False,trust_remote_code=True \
```
The base_url in the lm_eval command uses port 8000, which is inconsistent with the ports 8011 (for A3 series) and 8004 (for A2 series) defined in the deployment scripts above. This will cause connection errors. Please use a placeholder like <port> to remind the user to fill in the correct port, similar to the curl example.
Suggested change:

    ---model_args model=path/DeepSeek-R1-w8a8,base_url=http://127.0.0.1:8000/v1/completions,tokenized_requests=False,trust_remote_code=True \
    +--model_args model=path/DeepSeek-R1-w8a8,base_url=http://127.0.0.1:<port>/v1/completions,tokenized_requests=False,trust_remote_code=True \
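Before running the evaluation, it can also help to confirm that the server is actually listening on the chosen port. A minimal check (assuming the A3 deployment on port 8011; substitute 8004 for the A2 script) is:

```shell
# Lists the models served by the OpenAI-compatible endpoint;
# a connection error here means the port does not match the deployment script.
curl -s http://127.0.0.1:8011/v1/models
```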
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
- If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
2. Install the package `custom-ops` to make the kernels available.

   ```shell
   wget https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/a2/CANN-custom_ops-sfa-linux.aarch64.run
   ```
DeepSeek-R1 doesn't need this; it is specific to DeepSeek-V3.2.
    @@ -0,0 +1,327 @@
    # DeepSeek-R1-w8a8
The title should be DeepSeek-R1, and the content should not cover only DeepSeek-R1-W8A8; adding DeepSeek-R1 as well would be better.
2. Install the package `custom-ops` to make the kernels available.

   ```shell
   wget https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/a3/CANN-custom_ops-sfa-linux.aarch64.run
   ```
DeepSeek-R1 doesn't need this; it is specific to DeepSeek-V3.2.
    --gpu-memory-utilization 0.92 \
    --speculative-config '{"num_speculative_tokens":1,"method":"deepseek_mtp"}' \
    --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
    --additional-config '{"ascend_scheduler_config":{"enabled":false},"torchair_graph_config":{"enabled":false}}'
The ascend scheduler is about to be dropped in main; refer to #4498. There is also no need to pass --additional-config at all if you are setting "enabled":false.
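If disabled is indeed the default for both settings, the fragment can simply drop the --additional-config line (a sketch based on the hunk above, nothing else changed):

```shell
  --gpu-memory-utilization 0.92 \
  --speculative-config '{"num_speculative_tokens":1,"method":"deepseek_mtp"}' \
  --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'
```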
    --gpu-memory-utilization 0.92 \
    --speculative-config '{"num_speculative_tokens":1,"method":"deepseek_mtp"}' \
    --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
    --additional-config '{"ascend_scheduler_config":{"enabled":false},"torchair_graph_config":{"enabled":false}}'
Same as above.
    --gpu-memory-utilization 0.94 \
    --speculative-config '{"num_speculative_tokens":1,"method":"deepseek_mtp"}' \
    --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
    --additional-config '{"ascend_scheduler_config":{"enabled":false},"torchair_graph_config":{"enabled":false}}'
Same as above.
Commits:
- Adapted Qwen3-Next eager mode to `v0.11.2`. (vLLM v0.11.2; Signed-off-by: drslark, Gongdayao)
- Fix the Ray startup failure where `local_world_size` cannot be less than the visible device count (#4457); details in issue #4456. The fix is copied from vLLM PR vllm-project/vllm#28873. (vLLM v0.11.2; Signed-off-by: leo-pony, Gongdayao)
- Introduce the `EXEC_NPU_CMD` macro as an adapter layer that standardizes `aclnn` operator calls on Ascend NPUs, with `grouped_matmul_swiglu_quant` integrated as a reference implementation of the new macro. (vLLM v0.11.2; Signed-off-by: SlightwindSec, Gongdayao)
- Add eagle proposer unit tests. (vLLM v0.11.2; Signed-off-by: GDzhu01, Gongdayao)
- Upgrade CANN to 8.3.RC2; the Docker image now uses 8.3.RC2. (vLLM v0.11.2; Signed-off-by: MrZ20, Gongdayao)
- Support compressed-tensors quantized weights produced by the vLLM community's LLM Compressor tool (#4036): add `AscendCompressedTensorsConfig` to replace vLLM's `CompressedTensorsConfig`, support `CompressedTensorsW8A8` static weights (per-channel int8 symmetric weights, per-tensor int8 symmetric activations) and `CompressedTensorsW8A8Dynamic` weights (per-channel int8 symmetric weights, per-token dynamic int8 symmetric activations), and modify `override_quantization_method` in `AscendQuantConfig`. (vLLM v0.11.2; Signed-off-by: LHXuuu, chenxi-hh, Gongdayao; Co-authored-by: taoqun110, chenxi-hh)
- Patch `Qwen2_5_VisionAttention` with `AscendQwen2_5_VisionAttention` (#4349): replace `AscendQwen2_5_VisionTransformer` with vLLM's `Qwen2_5_VisionTransformer`, move the q/k/v and cos/sin padding ahead of FA into `forward()`, convert `cu_seqlens` from cumulative form to intervals on CPU (compatible with NPU FA), remove the Qwen2.5-VL modeling files and related UT, make `set_forward_context` pluggable when getting MM embeddings (see vllm-project/vllm#29388), simplify the FA padding logic, and add a patch for vllm-project/vllm#28798. Tested functionally in eager and graph modes plus benchmarks. (vLLM v0.11.2; Signed-off-by: shen-shanshan, Gongdayao)
- Add a README for PD separation. (vLLM v0.11.0; Signed-off-by: wangxiaoteng, liziyu, Gongdayao)
- Delete a stray equals sign in the docs. (vLLM v0.11.2; Signed-off-by: herizhen, Gongdayao)
- Support adding custom CANN `aclnn` ops to `vllm-ascend`: build and install custom ops into the `vllm-ascend`-specified directory, bind the `aclnn` op interface to the `torch.ops._C_ascend` module, and enable invoking these ops within `vllm-ascend`. Includes a sample custom op, `aclnnGroupedMatmulSwigluQuantWeightNzTensorList`, adapted from the CANN operator `aclnnGroupedMatmulSwigluQuantWeightNZ`, whose `weight` and `weight_scale` inputs now accept `list[torch.Tensor]` (i.e. `at::TensorList`). (vLLM v0.11.2; Signed-off-by: QianChenxi, Gongdayao)
- Rename `MooncakeStoreConnector` to `AscendStoreConnector` and extract the storage-engine interaction logic into a new Backend class (#4438; associated RFC: #4329); also fix the incorrect number of connector input parameters introduced in vLLM 0.11.2. (vLLM v0.11.2; Signed-off-by: fems14, Gongdayao)
- Qwen3-Next: support the Triton `chunk_gated_delta_rule` ops (co-owner: @OsirisDuan). (vLLM v0.11.2; Signed-off-by: shiyuan680, Gongdayao)
- Fix an incorrect Triton package URL. (Signed-off-by: wangxiyuan, Gongdayao)
- Fix a hang when the model runs `_npu_flash_attention` in `_forward_prefill_no_cache`, caused by a wrong attention-mask dtype; tested on Qwen2.5-VL and Qwen2.5-Omni. (vLLM v0.11.0; Signed-off-by: Ting FU, Gongdayao)
- …e in index.md (Signed-off-by: Gongdayao)
- Signed-off-by: Gongdayao
- Signed-off-by: Gongdayao
This pull request has conflicts; please resolve them before we can evaluate the pull request.
What this PR does / why we need it?
Does this PR introduce any user-facing change?
How was this patch tested?