From a8404d80302af68be31b91879d4197a4571216ea Mon Sep 17 00:00:00 2001
From: ming1212 <2717180080@qq.com>
Date: Thu, 27 Nov 2025 22:33:54 +0800
Subject: [PATCH 1/3] =?UTF-8?q?=E5=A2=9E=E5=8A=A0Qwen3-Next=E6=A8=A1?=
 =?UTF-8?q?=E5=9E=8B=E7=9A=84readme=E6=96=87=E4=BB=B6?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 docs/source/tutorials/Qwen3-Next.md | 207 ++++++++++++++++++++++++++++
 1 file changed, 207 insertions(+)
 create mode 100644 docs/source/tutorials/Qwen3-Next.md

diff --git a/docs/source/tutorials/Qwen3-Next.md b/docs/source/tutorials/Qwen3-Next.md
new file mode 100644
index 00000000000..370e329680b
--- /dev/null
+++ b/docs/source/tutorials/Qwen3-Next.md
@@ -0,0 +1,207 @@
# Qwen3-Next

## Introduction

The Qwen3-Next model is a sparse MoE (Mixture of Experts) model with high sparsity. Compared to the MoE architecture of Qwen3, it introduces key improvements such as a hybrid attention mechanism and a multi-token prediction mechanism, which improve training and inference efficiency with long contexts and at large total parameter scales.

This document presents the core verification steps for the model, including supported features, environment preparation, and accuracy and performance evaluation. Qwen3-Next currently relies on Triton Ascend, which is still in an experimental phase: stability and accuracy behavior may change in subsequent versions, and performance will be continuously optimized.

The `Qwen3-Next` model is first supported in `vllm-ascend:v0.10.2rc1`.

## Supported Features

Refer to [supported features](../user_guide/support_matrix/supported_models.md) to get the model's supported feature matrix.

Refer to [feature guide](../user_guide/feature_guide/index.md) to get the feature's configuration.


## Weight Preparation



 Download Link for the `Qwen3-Next-80B-A3B-Instruct` Model Weights: [Download model weight](https://modelers.cn/models/Modelers_Park/Qwen3-Next-80B-A3B-Instruct/tree/main)




## Deployment
### Run docker container

```{code-block} bash
 :substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--shm-size=1g \
--name vllm-ascend-qwen3 \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
The Qwen3 Next is using [Triton Ascend](https://gitee.com/ascend/triton-ascend) which is currently experimental. In future versions, there may be behavioral changes related to stability, accuracy, and performance improvement.

### Install Triton Ascend

:::::{tab-set}
::::{tab-item} Linux (AArch64)

[Triton Ascend](https://gitee.com/ascend/triton-ascend) is required to run Qwen3-Next. Please follow the instructions below to install it and its dependencies.
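
The prebuilt Triton Ascend wheel used in the steps below targets Python 3.11 on AArch64 (note the `cp311` and `aarch64` tags in the wheel file name). As a quick, optional sanity check (a sketch that assumes `python3` is the interpreter the wheel will be installed into), you can confirm the interpreter version and machine architecture first:

```bash
# Should print a 3.11.x version and "aarch64" to match the wheel installed below.
python3 -c "import sys, platform; print(sys.version.split()[0], platform.machine())"
```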

Install the Ascend BiSheng toolkit:

```bash
wget https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/Ascend-BiSheng-toolkit_aarch64.run
chmod a+x Ascend-BiSheng-toolkit_aarch64.run
./Ascend-BiSheng-toolkit_aarch64.run --install
source /usr/local/Ascend/8.3.RC1/bisheng_toolkit/set_env.sh
```

Install Triton Ascend:

```bash
wget https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/triton_ascend-3.2.0.dev20250914-cp311-cp311-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl
pip install triton_ascend-3.2.0.dev20250914-cp311-cp311-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl
```

::::

::::{tab-item} Linux (x86_64)

Coming soon ...

::::
:::::


### Inference



:::::{tab-set}
::::{tab-item} Online Inference

Run the following command to start the vLLM server on multi-NPU:

On an Atlas A2 with 64 GB of memory per NPU card, `--tensor-parallel-size` should be at least 4; with 32 GB per card, it should be at least 8.

```bash
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct --tensor-parallel-size 4 --max-model-len 4096 --gpu-memory-utilization 0.7 --enforce-eager
```

Once your server is started, you can query the model with input prompts.

```bash
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "Qwen/Qwen3-Next-80B-A3B-Instruct",
    "messages": [
        {"role": "user", "content": "Who are you?"}
    ],
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "max_tokens": 32
}'
```

::::

::::{tab-item} Offline Inference

Run the following script to execute offline inference on multi-NPU:

```python
import gc
import torch

from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
                                             destroy_model_parallel)

def clean_up():
    # Tear down the distributed environment and release NPU memory.
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.npu.empty_cache()

if __name__ == '__main__':
    prompts = [
        "Who are you?",
    ]
    sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_tokens=32)
    # Use at least tensor_parallel_size=4 on 64 GB NPU cards (8 on 32 GB cards).
    llm = LLM(model="Qwen/Qwen3-Next-80B-A3B-Instruct",
              tensor_parallel_size=4,
              enforce_eager=True,
              distributed_executor_backend="mp",
              gpu_memory_utilization=0.7,
              max_model_len=4096)

    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

    del llm
    clean_up()
```

If the script runs successfully, you will see output similar to the following:

```bash
Prompt: 'Who are you?', Generated text: ' What do you know about me?\n\nHello! I am Qwen, a large-scale language model independently developed by the Tongyi Lab under Alibaba Group. I am'
```

::::
:::::



## Accuracy Evaluation


### Using AISBench

Refer to [Using AISBench](../developer_guide/evaluation/using_ais_bench.md) for details.



## Performance

### Using AISBench

Refer to [Using AISBench for performance evaluation](../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation) for details.

### Using vLLM Benchmark

Run a performance evaluation of `Qwen3-Next` as an example.

Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.

There are three `vllm bench` subcommands:
- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.

Take the `serve` as an example. Run the code as follows.

```shell
export VLLM_USE_MODELSCOPE=true
vllm bench serve --model vllm-ascend/Qwen3-Next-80B-A3B-Instruct --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
```

After about several minutes, you can get the performance evaluation result.



From 3dbc8c9b5f563a9c39ea27049fd986cdb08f7a1f Mon Sep 17 00:00:00 2001
From: ming1212 <2717180080@qq.com>
Date: Mon, 1 Dec 2025 11:32:52 +0800
Subject: [PATCH 2/3] =?UTF-8?q?=E4=BF=AE=E5=A4=8DQwen3-Next=E6=96=87?=
 =?UTF-8?q?=E4=BB=B6=E7=9B=B8=E5=85=B3=E7=9A=84=E6=A0=BC=E5=BC=8F=E9=97=AE?=
 =?UTF-8?q?=E9=A2=98?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 docs/source/tutorials/Qwen3-Next.md           | 15 +-
 docs/source/tutorials/index.md                |  2 +-
 docs/source/tutorials/multi_npu_qwen3_next.md | 157 ------------------
 3 files changed, 2 insertions(+), 172 deletions(-)
 delete mode 100644 docs/source/tutorials/multi_npu_qwen3_next.md

diff --git a/docs/source/tutorials/Qwen3-Next.md b/docs/source/tutorials/Qwen3-Next.md
index 370e329680b..7c07be45963 100644
--- a/docs/source/tutorials/Qwen3-Next.md
+++ b/docs/source/tutorials/Qwen3-Next.md
@@ -14,16 +14,10 @@ Refer to [supported features](../user_guide/support_matrix/supported_models.md)

Refer to [feature guide](../user_guide/feature_guide/index.md) to get the feature's configuration.

-
## Weight Preparation

-
-
 Download Link for the `Qwen3-Next-80B-A3B-Instruct` Model Weights: [Download model weight](https://modelers.cn/models/Modelers_Park/Qwen3-Next-80B-A3B-Instruct/tree/main)

-
-
-
## Deployment
### Run docker container

@@ -49,6 +43,7 @@ docker run --rm \
-p 8000:8000 \
-it $IMAGE bash
```
+
The Qwen3 Next is using [Triton Ascend](https://gitee.com/ascend/triton-ascend) which is currently experimental. In future versions, there may be behavioral changes related to stability, accuracy, and performance improvement.

### Install Triton Ascend

@@ -83,11 +78,8 @@ Coming soon ...

::::
:::::

-
### Inference

-
-
:::::{tab-set}
::::{tab-item} Online Inference

@@ -166,8 +158,6 @@ Prompt: 'Who are you?', Generated text: ' What do you know about me?\n\nHello! I

::::
:::::

-
-
## Accuracy Evaluation

@@ -202,6 +192,3 @@ vllm bench serve --model vllm-ascend/Qwen3-Next-80B-A3B-Instruct --dataset-name
```

After about several minutes, you can get the performance evaluation result.
-
-
-
diff --git a/docs/source/tutorials/index.md b/docs/source/tutorials/index.md
index 321ec22d9cc..db971e9a6ec 100644
--- a/docs/source/tutorials/index.md
+++ b/docs/source/tutorials/index.md
@@ -10,7 +10,7 @@ single_npu_qwen3_embedding
single_npu_qwen3_quantization
single_npu_qwen3_w4a4
single_node_pd_disaggregation_llmdatadist
-multi_npu_qwen3_next
+Qwen3-Next
multi_npu
multi_npu_moge
multi_npu_qwen3_moe
diff --git a/docs/source/tutorials/multi_npu_qwen3_next.md b/docs/source/tutorials/multi_npu_qwen3_next.md
deleted file mode 100644
index 637fb4a61ca..00000000000
--- a/docs/source/tutorials/multi_npu_qwen3_next.md
+++ /dev/null
@@ -1,157 +0,0 @@
-# Multi-NPU (Qwen3-Next)
-
-```{note}
-The Qwen3 Next is using [Triton Ascend](https://gitee.com/ascend/triton-ascend) which is currently experimental. In future versions, there may be behavioral changes related to stability, accuracy, and performance improvement.
-``` - -## Run vllm-ascend on Multi-NPU with Qwen3 Next - -Run docker container: - -```{code-block} bash - :substitutions: -# Update the vllm-ascend image -export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version| -docker run --rm \ ---shm-size=1g \ ---name vllm-ascend-qwen3 \ ---device /dev/davinci0 \ ---device /dev/davinci1 \ ---device /dev/davinci2 \ ---device /dev/davinci3 \ ---device /dev/davinci_manager \ ---device /dev/devmm_svm \ ---device /dev/hisi_hdc \ --v /usr/local/dcmi:/usr/local/dcmi \ --v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \ --v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \ --v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \ --v /etc/ascend_install.info:/etc/ascend_install.info \ --v /root/.cache:/root/.cache \ --p 8000:8000 \ --it $IMAGE bash -``` - -Set up environment variables: - -```bash -# Load model from ModelScope to speed up download -export VLLM_USE_MODELSCOPE=True -``` - -### Install Triton Ascend - -:::::{tab-set} -::::{tab-item} Linux (AArch64) - -The [Triton Ascend](https://gitee.com/ascend/triton-ascend) is required when you run Qwen3 Next, please follow the instructions below to install it and its dependency. - -Install the Ascend BiSheng toolkit: - -```bash -wget https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/Ascend-BiSheng-toolkit_aarch64.run -chmod a+x Ascend-BiSheng-toolkit_aarch64.run -./Ascend-BiSheng-toolkit_aarch64.run --install -source /usr/local/Ascend/8.3.RC1/bisheng_toolkit/set_env.sh -``` - -Install Triton Ascend: - -```bash -wget https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/triton_ascend-3.2.0.dev20250914-cp311-cp311-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl -pip install triton_ascend-3.2.0.dev20250914-cp311-cp311-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl -``` - -:::: - -::::{tab-item} Linux (x86_64) - -Coming soon ... - -:::: -::::: - -### Inference on Multi-NPU - -Please make sure you have already executed the command: - -```bash -source /usr/local/Ascend/8.3.RC1/bisheng_toolkit/set_env.sh -``` - -:::::{tab-set} -::::{tab-item} Online Inference - -Run the following script to start the vLLM server on multi-NPU: - -For an Atlas A2 with 64 GB of NPU card memory, tensor-parallel-size should be at least 4, and for 32 GB of memory, tensor-parallel-size should be at least 8. - -```bash -vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct --tensor-parallel-size 4 --max-model-len 4096 --gpu-memory-utilization 0.7 --enforce-eager -``` - -Once your server is started, you can query the model with input prompts. 
- -```bash -curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{ - "model": "Qwen/Qwen3-Next-80B-A3B-Instruct", - "messages": [ - {"role": "user", "content": "Who are you?"} - ], - "temperature": 0.6, - "top_p": 0.95, - "top_k": 20, - "max_tokens": 32 -}' -``` - -:::: - -::::{tab-item} Offline Inference - -Run the following script to execute offline inference on multi-NPU: - -```python -import gc -import torch - -from vllm import LLM, SamplingParams -from vllm.distributed.parallel_state import (destroy_distributed_environment, - destroy_model_parallel) - -def clean_up(): - destroy_model_parallel() - destroy_distributed_environment() - gc.collect() - torch.npu.empty_cache() - -if __name__ == '__main__': - prompts = [ - "Who are you?", - ] - sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_tokens=32) - llm = LLM(model="Qwen/Qwen3-Next-80B-A3B-Instruct", - tensor_parallel_size=4, - enforce_eager=True, - distributed_executor_backend="mp", - gpu_memory_utilization=0.7, - max_model_len=4096) - - outputs = llm.generate(prompts, sampling_params) - for output in outputs: - prompt = output.prompt - generated_text = output.outputs[0].text - print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") - - del llm - clean_up() -``` - -If you run this script successfully, you can see the info shown below: - -```bash -Prompt: 'Who are you?', Generated text: ' What do you know about me?\n\nHello! I am Qwen, a large-scale language model independently developed by the Tongyi Lab under Alibaba Group. I am' -``` - -:::: -::::: From 3027dd08f6f91e9e2e1bbcc291e873b2cb90abe9 Mon Sep 17 00:00:00 2001 From: ming1212 <2717180080@qq.com> Date: Mon, 1 Dec 2025 14:20:28 +0800 Subject: [PATCH 3/3] =?UTF-8?q?=E4=BF=AE=E6=94=B9=E6=96=87=E4=BB=B6?= =?UTF-8?q?=E4=B8=AD=E7=9A=84=E5=B0=8F=E9=94=99=E8=AF=AF?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: ming1212 <2717180080@qq.com> --- docs/source/tutorials/Qwen3-Next.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/tutorials/Qwen3-Next.md b/docs/source/tutorials/Qwen3-Next.md index 7c07be45963..ac4128844b6 100644 --- a/docs/source/tutorials/Qwen3-Next.md +++ b/docs/source/tutorials/Qwen3-Next.md @@ -188,7 +188,7 @@ Take the `serve` as an example. Run the code as follows. ```shell export VLLM_USE_MODELSCOPE=true -vllm bench serve --model vllm-ascend/Qwen3-Next-80B-A3B-Instruct --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./ +vllm bench serve --model Qwen/Qwen3-Next-80B-A3B-Instruct --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./ ``` After about several minutes, you can get the performance evaluation result.