docs/source/deployment-guide/quick-start-recipe-for-qwen3-next-on-trtllm.md
+10 −10 (10 additions, 10 deletions)
````diff
@@ -1,10 +1,10 @@
-# Quick Start Recipe for Qwen3 Next on TensorRT-LLM
+# Quick Start Recipe for Qwen3 Next on TensorRTLLM
 
 ## Introduction
 
-This deployment guide provides step-by-step instructions for running the Qwen3-Next model using TensorRT-LLM, optimized for NVIDIA GPUs. It covers the complete setup required; from accessing model weights and preparing the software environment to configuring TensorRT-LLM parameters, launching the server, and validating inference output.
+This deployment guide provides step-by-step instructions for running the Qwen3-Next model using TensorRTLLM, optimized for NVIDIA GPUs. It covers the complete setup required; from accessing model weights and preparing the software environment to configuring TensorRTLLM parameters, launching the server, and validating inference output.
 
-The guide is intended for developers and practitioners seeking high-throughput or low-latency inference using NVIDIA’s accelerated stack—starting with the PyTorch container from NGC, then installing TensorRT-LLM for model serving.
+The guide is intended for developers and practitioners seeking high-throughput or low-latency inference using NVIDIA’s accelerated stack—starting with the PyTorch container from NGC, then installing TensorRTLLM for model serving.
 
 ## Prerequisites
 
````
````diff
@@ -22,7 +22,7 @@ The guide is intended for developers and practitioners seeking high-throughput o
 
 ### Run Docker Container
 
-Run the docker container using the TensorRT-LLM NVIDIA NGC image.
+Run the docker container using the TensorRTLLM NVIDIA NGC image.
 
 ```shell
 docker run --rm -it \
````
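The hunk above shows only the first line of the `docker run` command. For orientation, a minimal sketch of what a complete invocation could look like is shown below; the image tag, mount paths, and extra flags are assumptions and are not part of this diff.

```shell
# Illustrative sketch only: the image tag and mounts are assumptions, not from this diff.
docker run --rm -it \
  --gpus all \
  --ipc=host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  nvcr.io/nvidia/tensorrt-llm/release:<tag> \
  /bin/bash
```

The `-p 8000:8000` mapping matches the note in the next hunk that port `8000` is exposed so the LLM API endpoint is reachable from the host.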
````diff
@@ -42,11 +42,11 @@ Note:
 * The command also maps port `8000` from the container to your host so you can access the LLM API endpoint from your host
 * See the <https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags> for all the available containers. The containers published in the main branch weekly have `rcN` suffix, while the monthly release with QA tests has no `rcN` suffix. Use the `rc` release to get the latest model and feature support.
 
-If you want to use latest main branch, you can choose to build from source to install TensorRT-LLM, the steps refer to <https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html>.
+If you want to use latest main branch, you can choose to build from source to install TensorRTLLM, the steps refer to <https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html>.
 
 ### Creating the TRT-LLM Server config
 
-We create a YAML configuration file `/tmp/config.yml` for the TensorRT-LLM Server and populate it with the following recommended performance settings. Note that we should set kv_cache_reuse to false.
+We create a YAML configuration file `/tmp/config.yml` for the TensorRTLLM Server and populate it with the following recommended performance settings. Note that we should set kv_cache_reuse to false.
 
 ```shell
 EXTRA_LLM_API_FILE=/tmp/config.yml
````
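The hunk cuts off right after the `EXTRA_LLM_API_FILE` assignment. As a rough sketch, the config file could be created with a heredoc along the following lines; the exact option names (notably the KV-cache reuse setting the guide says must be disabled) are assumptions and should be checked against the TensorRT-LLM LLM-API reference.

```shell
# Illustrative sketch only: option names are assumptions, not taken from this diff.
EXTRA_LLM_API_FILE=/tmp/config.yml

cat <<EOF > ${EXTRA_LLM_API_FILE}
# Disable KV-cache block reuse, as recommended in the guide.
kv_cache_config:
  enable_block_reuse: false
EOF
```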
````diff
@@ -105,7 +105,7 @@ These options are used directly on the command line when you start the `trtllm-s
 
 #### `--backend pytorch`
 
-* **Description:** Tells TensorRT-LLM to use the **pytorch** backend.
+* **Description:** Tells TensorRTLLM to use the **pytorch** backend.
 
 #### `--max_batch_size`
 
````
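Taken together, the flags described in this section end up on a single `trtllm-serve` command line. A minimal sketch is shown below; the model ID, batch size, host, and port are placeholders, and the exact set of flags should be taken from the full guide rather than from this excerpt.

```shell
# Illustrative sketch only: model ID and values are placeholders, not from this diff.
trtllm-serve Qwen/Qwen3-Next-80B-A3B-Instruct \
  --backend pytorch \
  --max_batch_size 16 \
  --trust_remote_code \
  --extra_llm_api_options ${EXTRA_LLM_API_FILE} \
  --host 0.0.0.0 \
  --port 8000
```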
````diff
@@ -121,7 +121,7 @@ These options are used directly on the command line when you start the `trtllm-s
 
 #### `--trust_remote_code`
 
-* **Description:** Allows TensorRT-LLM to download models and tokenizers from Hugging Face. This flag is passed directly to the Hugging Face API.
+* **Description:** Allows TensorRTLLM to download models and tokenizers from Hugging Face. This flag is passed directly to the Hugging Face API.
 
 
 #### Extra LLM API Options (YAML Configuration)
````
````diff
@@ -159,7 +159,7 @@ See the [`TorchLlmArgs` class](https://nvidia.github.io/TensorRT-LLM/llm-api/ref
 
 ### Basic Test
 
-Start a new terminal on the host to test the TensorRT-LLM server you just launched.
+Start a new terminal on the host to test the TensorRTLLM server you just launched.
 
 You can query the health/readiness of the server using:
 
````
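The health query itself is not visible in this excerpt. A plausible check, assuming the server is listening on port 8000 as mapped earlier, is:

```shell
# Assumes the server listens on localhost:8000; a 200 status indicates it is ready.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/health
```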
````diff
@@ -205,7 +205,7 @@ Here is an example response:
 
 ## Benchmarking Performance
 
-To benchmark the performance of your TensorRT-LLM server you can leverage the built-in `benchmark_serving.py` script. To do this first creating a wrapper `bench.sh` script.
+To benchmark the performance of your TensorRTLLM server you can leverage the built-in `benchmark_serving.py` script. To do this first creating a wrapper `bench.sh` script.
````
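The rest of this hunk is not included in the excerpt. As a rough illustration of the kind of wrapper the guide describes, `bench.sh` could look like the following; the script path, model ID, and argument names are assumptions and should be checked against the full guide.

```shell
#!/bin/bash
# Illustrative sketch only: script path, model ID, and benchmark arguments are assumptions.
concurrency=${1:-1}

python /app/tensorrt_llm/tensorrt_llm/serve/scripts/benchmark_serving.py \
  --model Qwen/Qwen3-Next-80B-A3B-Instruct \
  --host localhost \
  --port 8000 \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 512 \
  --max-concurrency "${concurrency}"
```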