diff --git a/docs/source/commands/trtllm-serve/run-benchmark-with-trtllm-serve.md b/docs/source/commands/trtllm-serve/run-benchmark-with-trtllm-serve.md index a869a631a8f..76713213708 100644 --- a/docs/source/commands/trtllm-serve/run-benchmark-with-trtllm-serve.md +++ b/docs/source/commands/trtllm-serve/run-benchmark-with-trtllm-serve.md @@ -5,7 +5,7 @@ A complete reference for the API is available in the [OpenAI API Reference](http This step-by-step tutorial covers the following topics for running online serving benchmarking with Llama 3.1 70B and Qwen2.5-VL-7B for multimodal models: * Methodology Introduction - * Launch the OpenAI-Compatibale Server with NGC container + * Launch the OpenAI-Compatible Server with NGC container * Run the performance benchmark * Using `extra_llm_api_options` * Multimodal Serving and Benchmarking diff --git a/docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md b/docs/source/deployment-guide/deployment-guide-for-deepseek-r1-on-trtllm.md similarity index 85% rename from docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md rename to docs/source/deployment-guide/deployment-guide-for-deepseek-r1-on-trtllm.md index 87394c8cdd3..8b0b89ec885 100644 --- a/docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md +++ b/docs/source/deployment-guide/deployment-guide-for-deepseek-r1-on-trtllm.md @@ -1,4 +1,4 @@ -# Quick Start Recipe for DeepSeek R1 on TensorRT LLM - Blackwell & Hopper Hardware +# Deployment Guide for DeepSeek R1 on TensorRT LLM - Blackwell & Hopper Hardware ## Introduction @@ -47,7 +47,7 @@ docker run --rm -it \ -p 8000:8000 \ -v ~/.cache:/root/.cache:rw \ --name tensorrt_llm \ -nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc6 \ +nvcr.io/nvidia/tensorrt-llm/release:x.y.z \ /bin/bash ``` @@ -60,108 +60,102 @@ Note: If you want to use latest main branch, you can choose to build from source to install TensorRT LLM, the steps refer to [https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html](https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html) -### Creating the TensorRT LLM Server config +### Recommended Performance Settings -We create a YAML configuration file /tmp/config.yml for the TensorRT LLM Server and populate it with the following recommended performance settings. +We maintain YAML configuration files with recommended performance settings in the [`examples/configs`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/configs) directory. These config files are present in the TensorRT LLM container at the path `/app/tensorrt_llm/examples/configs`. You can use these out-of-the-box, or adjust them to your specific use case. ```shell -EXTRA_LLM_API_FILE=/tmp/config.yml - -cat << EOF > ${EXTRA_LLM_API_FILE} -enable_attention_dp: true -cuda_graph_config: - enable_padding: true - max_batch_size: 128 -kv_cache_config: - dtype: fp8 -stream_interval: 10 -speculative_config: - decoding_type: MTP - num_nextn_predict_layers: 1 -EOF +TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment +EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/deepseek-r1-throughput.yaml +``` + +Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdown below. 
+ +````{admonition} Show code +:class: dropdown + +```{literalinclude} ../../../examples/configs/deepseek-r1-throughput.yaml +--- +language: shell +prepend: | + EXTRA_LLM_API_FILE=/tmp/config.yml + + cat << EOF > ${EXTRA_LLM_API_FILE} +append: EOF +--- ``` +```` -For FP8 model, we need extra `moe_config`: +To use the `DeepGEMM` MOE backend on B200/GB200, use this config instead: ```shell -EXTRA_LLM_API_FILE=/tmp/config.yml - -cat << EOF > ${EXTRA_LLM_API_FILE} -enable_attention_dp: true -cuda_graph_config: - enable_padding: true - max_batch_size: 128 -kv_cache_config: - dtype: fp8 -stream_interval: 10 -speculative_config: - decoding_type: MTP - num_nextn_predict_layers: 1 -moe_config: - backend: DEEPGEMM - max_num_tokens: 3200 -EOF +TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment +EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/deepseek-r1-deepgemm.yaml ``` +Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdown below. + +````{admonition} Show code +:class: dropdown + +```{literalinclude} ../../../examples/configs/deepseek-r1-deepgemm.yaml +--- +language: shell +prepend: | + EXTRA_LLM_API_FILE=/tmp/config.yml + + cat << EOF > ${EXTRA_LLM_API_FILE} +append: EOF +--- +``` +```` + ### Launch the TensorRT LLM Server -Below is an example command to launch the TensorRT LLM server with the DeepSeek-R1 model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “Configs and Parameters” section. +Below is an example command to launch the TensorRT LLM server with the DeepSeek-R1 model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “LLM API Options (YAML Configuration)” section. ```shell -trtllm-serve deepseek-ai/DeepSeek-R1-0528 \ - --host 0.0.0.0 \ - --port 8000 \ - --max_batch_size 1024 \ - --max_num_tokens 3200 \ - --max_seq_len 2048 \ - --kv_cache_free_gpu_memory_fraction 0.8 \ - --tp_size 8 \ - --ep_size 8 \ - --trust_remote_code \ - --extra_llm_api_options ${EXTRA_LLM_API_FILE} +trtllm-serve deepseek-ai/DeepSeek-R1-0528 --host 0.0.0.0 --port 8000 --extra_llm_api_options ${EXTRA_LLM_API_FILE} ``` After the server is set up, the client can now send prompt requests to the server and receive results. -### Configs and Parameters +### LLM API Options (YAML Configuration) + + + +These options provide control over TensorRT LLM's behavior and are set within the YAML file passed to the `trtllm-serve` command via the `--extra_llm_api_options` argument. -These options are used directly on the command line when you start the `trtllm-serve` process. -#### `--tp_size` +#### `tensor_parallel_size` * **Description:** Sets the **tensor-parallel size**. This should typically match the number of GPUs you intend to use for a single model instance. -#### `--ep_size` +#### `moe_expert_parallel_size` -* **Description:** Sets the **expert-parallel size** for Mixture-of-Experts (MoE) models. Like `tp_size`, this should generally match the number of GPUs you're using. This setting has no effect on non-MoE models. +* **Description:** Sets the **expert-parallel size** for Mixture-of-Experts (MoE) models. Like `tensor_parallel_size`, this should generally match the number of GPUs you're using. This setting has no effect on non-MoE models. 
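+
+For reference, these parallelism settings are plain top-level keys in the YAML file; the shipped DeepSeek-R1 configs set them for an 8-GPU node:
+
+```yaml
+tensor_parallel_size: 8
+moe_expert_parallel_size: 8
+```
+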
-#### `--kv_cache_free_gpu_memory_fraction` +#### `kv_cache_free_gpu_memory_fraction` * **Description:** A value between `0.0` and `1.0` that specifies the fraction of free GPU memory to reserve for the KV cache after the model is loaded. Since memory usage can fluctuate, this buffer helps prevent out-of-memory (OOM) errors. * **Recommendation:** If you experience OOM errors, try reducing this value to `0.7` or lower. - -#### `--max_batch_size` +#### `max_batch_size` * **Description:** The maximum number of user requests that can be grouped into a single batch for processing. -#### `--max_num_tokens` +#### `max_num_tokens` * **Description:** The maximum total number of tokens (across all requests) allowed inside a single scheduled batch. -#### `--max_seq_len` +#### `max_seq_len` * **Description:** The maximum possible sequence length for a single request, including both input and generated output tokens. -#### `--trust_remote_code` +#### `trust_remote_code`  **Description:** Allows TensorRT LLM to download models and tokenizers from Hugging Face. This flag is passed directly to the Hugging Face API. - -#### Extra LLM API Options (YAML Configuration) - -These options provide finer control over performance and are set within a YAML file passed to the `trtllm-serve` command via the `--extra_llm_api_options` argument. - #### `kv_cache_config` * **Description**: A section for configuring the Key-Value (KV) cache. diff --git a/docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md b/docs/source/deployment-guide/deployment-guide-for-gpt-oss-on-trtllm.md similarity index 85% rename from docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md rename to docs/source/deployment-guide/deployment-guide-for-gpt-oss-on-trtllm.md index 17e16583092..7c8c5511276 100644 --- a/docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md +++ b/docs/source/deployment-guide/deployment-guide-for-gpt-oss-on-trtllm.md @@ -1,4 +1,4 @@ -# Quick Start Recipe for GPT-OSS on TensorRT-LLM - Blackwell Hardware +# Deployment Guide for GPT-OSS on TensorRT-LLM - Blackwell Hardware ## Introduction @@ -43,7 +43,7 @@ docker run --rm -it \ -p 8000:8000 \ -v ~/.cache:/root/.cache:rw \ --name tensorrt_llm \ -nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc6 \ +nvcr.io/nvidia/tensorrt-llm/release:x.y.z \ /bin/bash ``` @@ -56,105 +56,103 @@ Note: If you want to use latest main branch, you can choose to build from source to install TensorRT-LLM, the steps refer to . -### Creating the TensorRT LLM Server config +### Recommended Performance Settings -We create a YAML configuration file `/tmp/config.yml` for the TensorRT-LLM Server and populate it with the following recommended performance settings. +We maintain YAML configuration files with recommended performance settings in the [`examples/configs`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/configs) directory. These config files are present in the TensorRT LLM container at the path `/app/tensorrt_llm/examples/configs`. You can use these out-of-the-box, or adjust them to your specific use case. 
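+
+If you want to see what ships with the container before picking a config, you can list and inspect the files directly (the path below assumes the container layout described above; adjust it if you mounted the repo elsewhere):
+
+```shell
+ls /app/tensorrt_llm/examples/configs/
+cat /app/tensorrt_llm/examples/configs/gpt-oss-120b-latency.yaml
+```
+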
-For low-latency with `TRTLLM` MOE backend: +For low-latency use cases: ```shell -EXTRA_LLM_API_FILE=/tmp/config.yml - -cat << EOF > ${EXTRA_LLM_API_FILE} -enable_attention_dp: false -cuda_graph_config: - enable_padding: true - max_batch_size: 720 -moe_config: - backend: TRTLLM -stream_interval: 20 -num_postprocess_workers: 4 -EOF +TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment +EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/gpt-oss-120b-latency.yaml +``` + +Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdown below. + +````{admonition} Show code +:class: dropdown + +```{literalinclude} ../../../examples/configs/gpt-oss-120b-latency.yaml +--- +language: shell +prepend: | + EXTRA_LLM_API_FILE=/tmp/config.yml + + cat << EOF > ${EXTRA_LLM_API_FILE} +append: EOF +--- ``` +```` -For max-throughput with `CUTLASS` MOE backend: +For max-throughput use cases: ```shell -EXTRA_LLM_API_FILE=/tmp/config.yml - -cat << EOF > ${EXTRA_LLM_API_FILE} -enable_attention_dp: true -cuda_graph_config: - enable_padding: true - max_batch_size: 720 -moe_config: - backend: CUTLASS -stream_interval: 20 -num_postprocess_workers: 4 -attention_dp_config: - enable_balance: true - batching_wait_iters: 50 - timeout_iters: 1 -EOF +TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment +EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/gpt-oss-120b-throughput.yaml +``` + +Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdown below. + +````{admonition} Show code +:class: dropdown + +```{literalinclude} ../../../examples/configs/gpt-oss-120b-throughput.yaml +--- +language: shell +prepend: | + EXTRA_LLM_API_FILE=/tmp/config.yml + + cat << EOF > ${EXTRA_LLM_API_FILE} +append: EOF +--- ``` +```` ### Launch the TensorRT LLM Server -Below is an example command to launch the TensorRT LLM server with the GPT-OSS model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “Configs and Parameters” section. +Below is an example command to launch the TensorRT LLM server with the GPT-OSS model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “LLM API Options (YAML Configuration)” section. ```shell -trtllm-serve openai/gpt-oss-120b \ - --host 0.0.0.0 \ - --port 8000 \ - --max_batch_size 720 \ - --max_num_tokens 16384 \ - --kv_cache_free_gpu_memory_fraction 0.9 \ - --tp_size 8 \ - --ep_size 8 \ - --trust_remote_code \ - --extra_llm_api_options ${EXTRA_LLM_API_FILE} +trtllm-serve openai/gpt-oss-120b --host 0.0.0.0 --port 8000 --extra_llm_api_options ${EXTRA_LLM_API_FILE} ``` After the server is set up, the client can now send prompt requests to the server and receive results. -### Configs and Parameters +### LLM API Options (YAML Configuration) -These options are used directly on the command line when you start the `trtllm-serve` process. + -#### `--tp_size` +These options provide control over TensorRT LLM's behavior and are set within the YAML file passed to the `trtllm-serve` command via the `--extra_llm_api_options` argument. + +#### `tensor_parallel_size` * **Description:** Sets the **tensor-parallel size**. This should typically match the number of GPUs you intend to use for a single model instance. 
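+
+The shipped gpt-oss configs assume an 8-GPU node (`tensor_parallel_size: 8`). If you are deploying on a different GPU count, a minimal sketch of the edit to your copy of the config is to change this key together with `moe_expert_parallel_size` (described next). The values below are illustrative; whether a smaller GPU count fits depends on available GPU memory:
+
+```yaml
+tensor_parallel_size: 4
+moe_expert_parallel_size: 4
+```
+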
-#### `--ep_size` +#### `moe_expert_parallel_size` -* **Description:** Sets the **expert-parallel size** for Mixture-of-Experts (MoE) models. Like `tp_size`, this should generally match the number of GPUs you're using. This setting has no effect on non-MoE models. +* **Description:** Sets the **expert-parallel size** for Mixture-of-Experts (MoE) models. Like `tensor_parallel_size`, this should generally match the number of GPUs you're using. This setting has no effect on non-MoE models. -#### `--kv_cache_free_gpu_memory_fraction` +#### `kv_cache_free_gpu_memory_fraction` * **Description:** A value between `0.0` and `1.0` that specifies the fraction of free GPU memory to reserve for the KV cache after the model is loaded. Since memory usage can fluctuate, this buffer helps prevent out-of-memory (OOM) errors. * **Recommendation:** If you experience OOM errors, try reducing this value to `0.7` or lower. -#### `--max_batch_size` +#### `max_batch_size` * **Description:** The maximum number of user requests that can be grouped into a single batch for processing. The actual max batch size that can be achieved depends on total sequence length (input + output). -#### `--max_num_tokens` +#### `max_num_tokens` * **Description:** The maximum total number of tokens (across all requests) allowed inside a single scheduled batch. -#### `--max_seq_len` +#### `max_seq_len` -* **Description:** The maximum possible sequence length for a single request, including both input and generated output tokens. We won't specifically set it. It will be inferred from model config. +* **Description:** The maximum possible sequence length for a single request, including both input and generated output tokens. If not set, it will be inferred from model config. -#### `--trust_remote_code` +#### `trust_remote_code` * **Description:** Allows TensorRT-LLM to download models and tokenizers from Hugging Face. This flag is passed directly to the Hugging Face API. - -#### Extra LLM API Options (YAML Configuration) - -These options provide finer control over performance and are set within a YAML file passed to the `trtllm-serve` command via the `--extra_llm_api_options` argument. - #### `cuda_graph_config` * **Description**: A section for configuring CUDA graphs to optimize performance. 
diff --git a/docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md b/docs/source/deployment-guide/deployment-guide-for-llama3.3-70b-on-trtllm.md similarity index 88% rename from docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md rename to docs/source/deployment-guide/deployment-guide-for-llama3.3-70b-on-trtllm.md index 011920a9f07..6c16d1c2ca5 100644 --- a/docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md +++ b/docs/source/deployment-guide/deployment-guide-for-llama3.3-70b-on-trtllm.md @@ -1,4 +1,4 @@ -# Quick Start Recipe for Llama3.3 70B on TensorRT LLM - Blackwell & Hopper Hardware +# Deployment Guide for Llama3.3 70B on TensorRT LLM - Blackwell & Hopper Hardware ## Introduction @@ -39,7 +39,7 @@ docker run --rm -it \ -p 8000:8000 \ -v ~/.cache:/root/.cache:rw \ --name tensorrt_llm \ -nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc6 \ +nvcr.io/nvidia/tensorrt-llm/release:x.y.z \ /bin/bash ``` @@ -52,81 +52,78 @@ Note: If you want to use latest main branch, you can choose to build from source to install TensorRT LLM, the steps refer to [https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html](https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html) -### Creating the TensorRT LLM Server config +### Recommended Performance Settings -We create a YAML configuration file /tmp/config.yml for the TensorRT LLM Server and populate it with the following recommended performance settings. +We maintain YAML configuration files with recommended performance settings in the [`examples/configs`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/configs) directory. These config files are present in the TensorRT LLM container at the path `/app/tensorrt_llm/examples/configs`. You can use these out-of-the-box, or adjust them to your specific use case. ```shell -EXTRA_LLM_API_FILE=/tmp/config.yml - -cat << EOF > ${EXTRA_LLM_API_FILE} -enable_attention_dp: false -cuda_graph_config: - enable_padding: true - max_batch_size: 1024 -kv_cache_config: - dtype: fp8 -EOF +TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment +EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/llama-3.3-70b.yaml +``` + +Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdown below. + +````{admonition} Show code +:class: dropdown + +```{literalinclude} ../../../examples/configs/llama-3.3-70b.yaml +--- +language: shell +prepend: | + EXTRA_LLM_API_FILE=/tmp/config.yml + + cat << EOF > ${EXTRA_LLM_API_FILE} +append: EOF +--- ``` +```` ### Launch the TensorRT LLM Server -Below is an example command to launch the TensorRT LLM server with the Llama-3.3-70B-Instruct-FP8 model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “Configs and Parameters” section. +Below is an example command to launch the TensorRT LLM server with the Llama-3.3-70B-Instruct-FP8 model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “LLM API Options (YAML Configuration)” section. 
```shell -trtllm-serve nvidia/Llama-3.3-70B-Instruct-FP8 \ - --host 0.0.0.0 \ - --port 8000 \ - --max_batch_size 1024 \ - --max_num_tokens 2048 \ - --max_seq_len 2048 \ - --kv_cache_free_gpu_memory_fraction 0.9 \ - --tp_size 1 \ - --ep_size 1 \ - --trust_remote_code \ - --extra_llm_api_options ${EXTRA_LLM_API_FILE} +trtllm-serve nvidia/Llama-3.3-70B-Instruct-FP8 --host 0.0.0.0 --port 8000 --extra_llm_api_options ${EXTRA_LLM_API_FILE} ``` After the server is set up, the client can now send prompt requests to the server and receive results. -### Configs and Parameters +### LLM API Options (YAML Configuration) -These options are used directly on the command line when you start the `trtllm-serve` process. -#### `--tp_size` + + +These options provide control over TensorRT LLM's behavior and are set within the YAML file passed to the `trtllm-serve` command via the `--extra_llm_api_options` argument. + +#### `tensor_parallel_size`  **Description:** Sets the **tensor-parallel size**. This should typically match the number of GPUs you intend to use for a single model instance. -#### `--ep_size` +#### `moe_expert_parallel_size` - **Description:** Sets the **expert-parallel size** for Mixture-of-Experts (MoE) models. Like `tp_size`, this should generally match the number of GPUs you're using. This setting has no effect on non-MoE models. + **Description:** Sets the **expert-parallel size** for Mixture-of-Experts (MoE) models. Like `tensor_parallel_size`, this should generally match the number of GPUs you're using. This setting has no effect on non-MoE models. -#### `--kv_cache_free_gpu_memory_fraction` +#### `kv_cache_free_gpu_memory_fraction`  **Description:** A value between 0.0 and 1.0 that specifies the fraction of free GPU memory to reserve for the KV cache after the model is loaded. Since memory usage can fluctuate, this buffer helps prevent out-of-memory (OOM) errors.  **Recommendation:** If you experience OOM errors, try reducing this value to **0.8** or lower. -#### `--max_batch_size` +#### `max_batch_size`  **Description:** The maximum number of user requests that can be grouped into a single batch for processing. -#### `--max_num_tokens` +#### `max_num_tokens`  **Description:** The maximum total number of tokens (across all requests) allowed inside a single scheduled batch. -#### `--max_seq_len` +#### `max_seq_len`  **Description:** The maximum possible sequence length for a single request, including both input and generated output tokens. -#### `--trust_remote_code` +#### `trust_remote_code`  **Description:** Allows TensorRT LLM to download models and tokenizers from Hugging Face. This flag is passed directly to the Hugging Face API. - -#### Extra LLM API Options (YAML Configuration) - -These options provide finer control over performance and are set within a YAML file passed to the trtllm-serve command via the \--extra\_llm\_api\_options argument. - #### `kv_cache_config`  **Description**: A section for configuring the Key-Value (KV) cache. 
diff --git a/docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md b/docs/source/deployment-guide/deployment-guide-for-llama4-scout-on-trtllm.md similarity index 87% rename from docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md rename to docs/source/deployment-guide/deployment-guide-for-llama4-scout-on-trtllm.md index 0ea925e4718..9fb6b6165af 100644 --- a/docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md +++ b/docs/source/deployment-guide/deployment-guide-for-llama4-scout-on-trtllm.md @@ -1,4 +1,4 @@ -# Quick Start Recipe for Llama4 Scout 17B on TensorRT LLM - Blackwell & Hopper Hardware +# Deployment Guide for Llama4 Scout 17B on TensorRT LLM - Blackwell & Hopper Hardware ## Introduction @@ -38,7 +38,7 @@ docker run --rm -it \ -p 8000:8000 \ -v ~/.cache:/root/.cache:rw \ --name tensorrt_llm \ -nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc6 \ +nvcr.io/nvidia/tensorrt-llm/release:x.y.z \ /bin/bash ``` @@ -51,81 +51,77 @@ Note: If you want to use latest main branch, you can choose to build from source to install TensorRT LLM, the steps refer to [https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html](https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html) -### Creating the TensorRT LLM Server config +### Recommended Performance Settings -We create a YAML configuration file /tmp/config.yml for the TensorRT LLM Server and populate it with the following recommended performance settings. +We maintain YAML configuration files with recommended performance settings in the [`examples/configs`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/configs) directory. These config files are present in the TensorRT LLM container at the path `/app/tensorrt_llm/examples/configs`. You can use these out-of-the-box, or adjust them to your specific use case. ```shell -EXTRA_LLM_API_FILE=/tmp/config.yml - -cat << EOF > ${EXTRA_LLM_API_FILE} -enable_attention_dp: false -cuda_graph_config: - enable_padding: true - max_batch_size: 1024 -kv_cache_config: - dtype: fp8 -EOF +TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment +EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/llama-4-scout.yaml +``` + +Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdown below. + +````{admonition} Show code +:class: dropdown + +```{literalinclude} ../../../examples/configs/llama-4-scout.yaml +--- +language: shell +prepend: | + EXTRA_LLM_API_FILE=/tmp/config.yml + + cat << EOF > ${EXTRA_LLM_API_FILE} +append: EOF +--- ``` +```` ### Launch the TensorRT LLM Server -Below is an example command to launch the TensorRT LLM server with the Llama-4-Scout-17B-16E-Instruct-FP8 model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “Configs and Parameters” section. +Below is an example command to launch the TensorRT LLM server with the Llama-4-Scout-17B-16E-Instruct-FP8 model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “LLM API Options (YAML Configuration)” section. 
```shell -trtllm-serve nvidia/Llama-4-Scout-17B-16E-Instruct-FP8 \ - --host 0.0.0.0 \ - --port 8000 \ - --max_batch_size 1024 \ - --max_num_tokens 2048 \ - --max_seq_len 2048 \ - --kv_cache_free_gpu_memory_fraction 0.9 \ - --tp_size 1 \ - --ep_size 1 \ - --trust_remote_code \ - --extra_llm_api_options ${EXTRA_LLM_API_FILE} +trtllm-serve nvidia/Llama-4-Scout-17B-16E-Instruct-FP8 --host 0.0.0.0 --port 8000 --extra_llm_api_options ${EXTRA_LLM_API_FILE} ``` After the server is set up, the client can now send prompt requests to the server and receive results. -### Configs and Parameters +### LLM API Options (YAML Configuration) + + -These options are used directly on the command line when you start the `trtllm-serve` process. +These options provide control over TensorRT LLM's behavior and are set within the YAML file passed to the `trtllm-serve` command via the `--extra_llm_api_options` argument. -#### `--tp_size` +#### `tensor_parallel_size` * **Description:** Sets the **tensor-parallel size**. This should typically match the number of GPUs you intend to use for a single model instance. -#### `--ep_size` +#### `moe_expert_parallel_size` -* **Description:** Sets the **expert-parallel size** for Mixture-of-Experts (MoE) models. Like `tp_size`, this should generally match the number of GPUs you're using. This setting has no effect on non-MoE models. +* **Description:** Sets the **expert-parallel size** for Mixture-of-Experts (MoE) models. Like `tensor_parallel_size`, this should generally match the number of GPUs you're using. This setting has no effect on non-MoE models. -#### `--kv_cache_free_gpu_memory_fraction` +#### `kv_cache_free_gpu_memory_fraction` * **Description:** A value between `0.0` and `1.0` that specifies the fraction of free GPU memory to reserve for the KV cache after the model is loaded. Since memory usage can fluctuate, this buffer helps prevent out-of-memory (OOM) errors. * **Recommendation:** If you experience OOM errors, try reducing this value to `0.7` or lower. -#### `--max_batch_size` +#### `max_batch_size` * **Description:** The maximum number of user requests that can be grouped into a single batch for processing. -#### `--max_num_tokens` +#### `max_num_tokens` * **Description:** The maximum total number of tokens (across all requests) allowed inside a single scheduled batch. -#### `--max_seq_len` +#### `max_seq_len` * **Description:** The maximum possible sequence length for a single request, including both input and generated output tokens. -#### `--trust_remote_code` +#### `trust_remote_code`  **Description:** Allows TensorRT LLM to download models and tokenizers from Hugging Face. This flag is passed directly to the Hugging Face API. - -#### Extra LLM API Options (YAML Configuration) - -These options provide finer control over performance and are set within a YAML file passed to the `trtllm-serve` command via the `--extra_llm_api_options` argument. - #### `kv_cache_config` * **Description**: A section for configuring the Key-Value (KV) cache. 
diff --git a/docs/source/deployment-guide/quick-start-recipe-for-qwen3-next-on-trtllm.md b/docs/source/deployment-guide/deployment-guide-for-qwen3-next-on-trtllm.md similarity index 85% rename from docs/source/deployment-guide/quick-start-recipe-for-qwen3-next-on-trtllm.md rename to docs/source/deployment-guide/deployment-guide-for-qwen3-next-on-trtllm.md index ce192b9f5fe..246fc74a567 100644 --- a/docs/source/deployment-guide/quick-start-recipe-for-qwen3-next-on-trtllm.md +++ b/docs/source/deployment-guide/deployment-guide-for-qwen3-next-on-trtllm.md @@ -1,4 +1,4 @@ -# Quick Start Recipe for Qwen3 Next on TensorRT LLM - Blackwell & Hopper Hardware +# Deployment Guide for Qwen3 Next on TensorRT LLM - Blackwell & Hopper Hardware ## Introduction @@ -29,27 +29,31 @@ make -C docker release_build IMAGE_TAG=qwen3-next-local make -C docker release_run IMAGE_NAME=tensorrt_llm IMAGE_TAG=qwen3-next-local LOCAL_USER=1 ``` -### Creating the TensorRT LLM Server config +### Recommended Performance Settings -We create a YAML configuration file `/tmp/config.yml` for the TensorRT LLM Server with the following content: +We maintain YAML configuration files with recommended performance settings in the [`examples/configs`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/configs) directory. These config files are present in the TensorRT LLM container at the path `/app/tensorrt_llm/examples/configs`. You can use these out-of-the-box, or adjust them to your specific use case. ```shell -EXTRA_LLM_API_FILE=/tmp/config.yml - -cat << EOF > ${EXTRA_LLM_API_FILE} -enable_attention_dp: false -cuda_graph_config: - enable_padding: true - max_batch_size: 720 -moe_config: - backend: TRTLLM -stream_interval: 20 -num_postprocess_workers: 4 -kv_cache_config: - enable_block_reuse: false - free_gpu_memory_fraction: 0.6 -EOF +TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment +EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/qwen3-next.yaml +``` + +Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdown below. + +````{admonition} Show code +:class: dropdown + +```{literalinclude} ../../../examples/configs/qwen3-next.yaml +--- +language: shell +prepend: | + EXTRA_LLM_API_FILE=/tmp/config.yml + + cat << EOF > ${EXTRA_LLM_API_FILE} +append: EOF +--- ``` +```` ### Launch the TensorRT LLM Server @@ -57,59 +61,47 @@ EOF Below is an example command to launch the TensorRT LLM server with the Qwen3-Next model from within the container. ```shell -trtllm-serve Qwen/Qwen3-Next-80B-A3B-Thinking \ - --host 0.0.0.0 \ - --port 8000 \ - --max_batch_size 16 \ - --max_num_tokens 4096 \ - --tp_size 4 \ - --pp_size 1 \ - --ep_size 4 \ - --trust_remote_code \ - --extra_llm_api_options ${EXTRA_LLM_API_FILE} +trtllm-serve Qwen/Qwen3-Next-80B-A3B-Thinking --host 0.0.0.0 --port 8000 --extra_llm_api_options ${EXTRA_LLM_API_FILE} ``` After the server is set up, the client can now send prompt requests to the server and receive results. -### Configs and Parameters +### LLM API Options (YAML Configuration) + + -These options are used directly on the command line when you start the `trtllm-serve` process. +These options provide control over TensorRT LLM's behavior and are set within the YAML file passed to the `trtllm-serve` command via the `--extra_llm_api_options` argument. -#### `--tp_size` +#### `tensor_parallel_size` * **Description:** Sets the **tensor-parallel size**. 
This should typically match the number of GPUs you intend to use for a single model instance. -#### `--ep_size` +#### `moe_expert_parallel_size` -* **Description:** Sets the **expert-parallel size** for Mixture-of-Experts (MoE) models. Like `tp_size`, this should generally match the number of GPUs you're using. This setting has no effect on non-MoE models. +* **Description:** Sets the **expert-parallel size** for Mixture-of-Experts (MoE) models. Like `tensor_parallel_size`, this should generally match the number of GPUs you're using. This setting has no effect on non-MoE models. -#### `--kv_cache_free_gpu_memory_fraction` +#### `kv_cache_config.free_gpu_memory_fraction` * **Description:** A value between `0.0` and `1.0` that specifies the fraction of free GPU memory to reserve for the KV cache after the model is loaded. Since memory usage can fluctuate, this buffer helps prevent out-of-memory (OOM) errors. * **Recommendation:** If you experience OOM errors, try reducing this value to `0.7` or lower. -#### `--max_batch_size` +#### `max_batch_size` * **Description:** The maximum number of user requests that can be grouped into a single batch for processing. The actual max batch size that can be achieved depends on total sequence length (input + output). -#### `--max_num_tokens` +#### `max_num_tokens` * **Description:** The maximum total number of tokens (across all requests) allowed inside a single scheduled batch. -#### `--max_seq_len` +#### `max_seq_len` * **Description:** The maximum possible sequence length for a single request, including both input and generated output tokens. We won't specifically set it. It will be inferred from model config. -#### `--trust_remote_code` +#### `trust_remote_code` * **Description:** Allows TensorRT LLM to download models and tokenizers from Hugging Face. This flag is passed directly to the Hugging Face API. - -#### Extra LLM API Options (YAML Configuration) - -These options provide finer control over performance and are set within a YAML file passed to the `trtllm-serve` command via the `--extra_llm_api_options` argument. - #### `cuda_graph_config` * **Description**: A section for configuring CUDA graphs to optimize performance. diff --git a/docs/source/deployment-guide/index.rst b/docs/source/deployment-guide/index.rst index 2327de50005..a5a085d6e2f 100644 --- a/docs/source/deployment-guide/index.rst +++ b/docs/source/deployment-guide/index.rst @@ -1,13 +1,94 @@ Model Recipes ================ +Quick Start for Popular Models +------------------------------- + +The table below contains ``trtllm-serve`` commands that can be used to easily deploy popular models including DeepSeek-R1, gpt-oss, Llama 4, Qwen3, and more. + +We maintain LLM API configuration files for these models containing recommended performance settings in the `examples/configs `_ directory. The TensorRT LLM Docker container makes the config files available at ``/app/tensorrt_llm/examples/configs``, but you can customize this as needed: + +.. code-block:: bash + + export TRTLLM_DIR="/app/tensorrt_llm" # path to the TensorRT LLM repo in your local environment + +.. note:: + + The configs here are specifically optimized for a target ISL/OSL (Input/Output Sequence Length) of 1024/1024. If your traffic pattern is different, you may benefit from additional tuning. In the future, we plan to provide more configs for a wider range of traffic patterns. + +This table is designed to provide a straightforward starting point; for detailed model-specific deployment guides, check out the guides below. 
+ +.. list-table:: + :header-rows: 1 + :widths: 20 15 15 20 30 + + * - Model Name + - GPU + - Inference Scenario + - Config + - Command + * - `DeepSeek-R1 `_ + - H100, H200 + - Max Throughput + - `deepseek-r1-throughput.yaml `_ + - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/deepseek-r1-throughput.yaml`` + * - `DeepSeek-R1 `_ + - B200, GB200 + - Max Throughput + - `deepseek-r1-deepgemm.yaml `_ + - ``trtllm-serve deepseek-ai/DeepSeek-R1-0528 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/deepseek-r1-deepgemm.yaml`` + * - `DeepSeek-R1 (NVFP4) `_ + - B200, GB200 + - Max Throughput + - `deepseek-r1-throughput.yaml `_ + - ``trtllm-serve nvidia/DeepSeek-R1-FP4 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/deepseek-r1-throughput.yaml`` + * - `DeepSeek-R1 (NVFP4) `_ + - B200, GB200 + - Min Latency + - `deepseek-r1-latency.yaml `_ + - ``trtllm-serve nvidia/DeepSeek-R1-FP4-v2 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/deepseek-r1-latency.yaml`` + * - `gpt-oss-120b `_ + - Any + - Max Throughput + - `gpt-oss-120b-throughput.yaml `_ + - ``trtllm-serve openai/gpt-oss-120b --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/gpt-oss-120b-throughput.yaml`` + * - `gpt-oss-120b `_ + - Any + - Min Latency + - `gpt-oss-120b-latency.yaml `_ + - ``trtllm-serve openai/gpt-oss-120b --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/gpt-oss-120b-latency.yaml`` + * - `Qwen3-Next-80B-A3B-Thinking `_ + - Any + - Max Throughput + - `qwen3-next.yaml `_ + - ``trtllm-serve Qwen/Qwen3-Next-80B-A3B-Thinking --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/qwen3-next.yaml`` + * - Qwen3 family (e.g. `Qwen3-30B-A3B `_) + - Any + - Max Throughput + - `qwen3.yaml `_ + - ``trtllm-serve Qwen/Qwen3-30B-A3B --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/qwen3.yaml`` (swap to another Qwen3 model name as needed) + * - `Llama-3.3-70B (FP8) `_ + - Any + - Max Throughput + - `llama-3.3-70b.yaml `_ + - ``trtllm-serve nvidia/Llama-3.3-70B-Instruct-FP8 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/llama-3.3-70b.yaml`` + * - `Llama 4 Scout (FP8) `_ + - Any + - Max Throughput + - `llama-4-scout.yaml `_ + - ``trtllm-serve nvidia/Llama-4-Scout-17B-16E-Instruct-FP8 --extra_llm_api_options ${TRTLLM_DIR}/examples/configs/llama-4-scout.yaml`` + +Model-Specific Deployment Guides +--------------------------------- + +The deployment guides below provide more detailed instructions for serving specific models with TensorRT LLM. + .. 
toctree:: :maxdepth: 1 - :caption: Model Recipes - :name: Model Recipes - - quick-start-recipe-for-deepseek-r1-on-trtllm.md - quick-start-recipe-for-llama3.3-70b-on-trtllm.md - quick-start-recipe-for-llama4-scout-on-trtllm.md - quick-start-recipe-for-gpt-oss-on-trtllm.md - quick-start-recipe-for-qwen3-next-on-trtllm.md + :name: Deployment Guides + + deployment-guide-for-deepseek-r1-on-trtllm.md + deployment-guide-for-llama3.3-70b-on-trtllm.md + deployment-guide-for-llama4-scout-on-trtllm.md + deployment-guide-for-gpt-oss-on-trtllm.md + deployment-guide-for-qwen3-next-on-trtllm.md diff --git a/docs/source/helper.py b/docs/source/helper.py index 8e343c78eb4..675bd697e9f 100644 --- a/docs/source/helper.py +++ b/docs/source/helper.py @@ -346,24 +346,25 @@ def generate_llmapi(): def update_version(): - version_path = os.path.abspath( - os.path.join(os.path.dirname(__file__), - "../../tensorrt_llm/version.py")) + """Replace the placeholder container version in all docs source files.""" + version_path = (Path(__file__).parent.parent.parent / "tensorrt_llm" / + "version.py").resolve() spec = importlib.util.spec_from_file_location("version_module", version_path) version_module = importlib.util.module_from_spec(spec) spec.loader.exec_module(version_module) version = version_module.__version__ - file_list = [ - "docs/source/quick-start-guide.md", - "docs/source/commands/trtllm-serve/run-benchmark-with-trtllm-serve.md" - ] - for file in file_list: - file_path = os.path.abspath( - os.path.join(os.path.dirname(__file__), "../../" + file)) + + docs_source_dir = Path(__file__).parent.resolve() + md_files = list(docs_source_dir.rglob("*.md")) + + for file_path in md_files: with open(file_path, "r") as f: content = f.read() - content = content.replace("x.y.z", version) + content = content.replace( + "nvcr.io/nvidia/tensorrt-llm/release:x.y.z", + f"nvcr.io/nvidia/tensorrt-llm/release:{version}", + ) with open(file_path, "w") as f: f.write(content) diff --git a/docs/source/quick-start-guide.md b/docs/source/quick-start-guide.md index 9fd9bb0914d..4d70b2eba84 100644 --- a/docs/source/quick-start-guide.md +++ b/docs/source/quick-start-guide.md @@ -5,7 +5,9 @@ This is the starting point to try out TensorRT LLM. Specifically, this Quick Start Guide enables you to quickly get set up and send HTTP requests using TensorRT LLM. -## Launch Docker on a node with NVIDIA GPUs deployed +## Launch Docker Container + +The [TensorRT LLM container](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags) maintained by NVIDIA contains all of the required dependencies pre-installed. You can start the container on a machine with NVIDIA GPUs via: ```bash docker run --rm -it --ipc host --gpus all --ulimit memlock=-1 --ulimit stack=67108864 -p 8000:8000 nvcr.io/nvidia/tensorrt-llm/release:x.y.z @@ -13,7 +15,7 @@ docker run --rm -it --ipc host --gpus all --ulimit memlock=-1 --ulimit stack=671 (deploy-with-trtllm-serve)= -## Deploy online serving with trtllm-serve +## Deploy Online Serving with trtllm-serve You can use the `trtllm-serve` command to start an OpenAI compatible server to interact with a model. To start the server, you can run a command like the following example inside a Docker container: @@ -23,7 +25,7 @@ trtllm-serve "TinyLlama/TinyLlama-1.1B-Chat-v1.0" ``` ```{note} -If you are running trtllm-server inside a Docker container, you have two options for sending API requests: +If you are running `trtllm-serve` inside a Docker container, you have two options for sending API requests: 1. 
Expose a port (e.g., 8000) to allow external access to the server from outside the container. 2. Open a new terminal and use the following command to directly attach to the running container: ```bash @@ -77,7 +79,11 @@ _Example Output_ For detailed examples and command syntax, refer to the [trtllm-serve](commands/trtllm-serve/trtllm-serve.rst) section. -## Run Offline inference with LLM API +```{note} +Pre-configured settings for deploying popular models with `trtllm-serve` can be found in our [deployment guides](deployment-guide/index.rst). +``` + +## Run Offline Inference with LLM API The LLM API is a Python API designed to facilitate setup and inference with TensorRT LLM directly within Python. It enables model optimization by simply specifying a HuggingFace repository name or a model checkpoint. The LLM API streamlines the process by managing model loading, optimization, and inference, all through a single `LLM` instance. Here is a simple example to show how to use the LLM API with TinyLlama. @@ -100,6 +106,7 @@ In this Quick Start Guide, you have: To continue your journey with TensorRT LLM, explore these resources: - **[Installation Guide](installation/index.rst)** - Detailed installation instructions for different platforms +- **[Model-Specific Deployment Guides](deployment-guide/index.rst)** - Instructions for serving specific models with TensorRT LLM - **[Deployment Guide](examples/llm_api_examples)** - Comprehensive examples for deploying LLM inference in various scenarios - **[Model Support](models/supported-models.md)** - Check which models are supported and how to add new ones - **CLI Reference** - Explore TensorRT LLM command-line tools: diff --git a/examples/configs/README.md b/examples/configs/README.md new file mode 100644 index 00000000000..b9a47281d20 --- /dev/null +++ b/examples/configs/README.md @@ -0,0 +1,5 @@ +# Recommended LLM API Configuration Settings + +This directory contains recommended [LLM API](https://nvidia.github.io/TensorRT-LLM/llm-api/) performance settings for popular models. They can be used out-of-the-box with `trtllm-serve` via the `--extra_llm_api_options` CLI flag, or you can adjust them to your specific use case. + +For model-specific deployment guides, please refer to the [official documentation](https://nvidia.github.io/TensorRT-LLM/deployment-guide/index.html). 
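+
+For example, to serve gpt-oss-120b with the recommended max-throughput settings (run from this directory, or substitute the full path to the config file):
+
+```shell
+trtllm-serve openai/gpt-oss-120b --extra_llm_api_options ./gpt-oss-120b-throughput.yaml
+```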
diff --git a/examples/configs/deepseek-r1-deepgemm.yaml b/examples/configs/deepseek-r1-deepgemm.yaml new file mode 100644 index 00000000000..bc12f12b452 --- /dev/null +++ b/examples/configs/deepseek-r1-deepgemm.yaml @@ -0,0 +1,19 @@ +max_batch_size: 1024 +max_num_tokens: 3200 +kv_cache_free_gpu_memory_fraction: 0.8 +tensor_parallel_size: 8 +moe_expert_parallel_size: 8 +trust_remote_code: true +enable_attention_dp: true +cuda_graph_config: + enable_padding: true + max_batch_size: 128 +kv_cache_config: + dtype: fp8 +stream_interval: 10 +speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 +moe_config: + backend: DEEPGEMM + max_num_tokens: 3200 diff --git a/examples/configs/deepseek-r1-latency.yaml b/examples/configs/deepseek-r1-latency.yaml new file mode 100644 index 00000000000..80aaedc8b50 --- /dev/null +++ b/examples/configs/deepseek-r1-latency.yaml @@ -0,0 +1,14 @@ +max_batch_size: 4 +tensor_parallel_size: 8 +moe_expert_parallel_size: 2 +max_num_tokens: 32768 +trust_remote_code: true +kv_cache_free_gpu_memory_fraction: 0.75 +moe_backend: TRTLLM +use_cuda_graph: true +speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + use_relaxed_acceptance_for_thinking: true + relaxed_topk: 10 + relaxed_delta: 0.6 diff --git a/examples/configs/deepseek-r1-throughput.yaml b/examples/configs/deepseek-r1-throughput.yaml new file mode 100644 index 00000000000..4e59d9acb24 --- /dev/null +++ b/examples/configs/deepseek-r1-throughput.yaml @@ -0,0 +1,16 @@ +max_batch_size: 1024 +max_num_tokens: 3200 +kv_cache_free_gpu_memory_fraction: 0.8 +tensor_parallel_size: 8 +moe_expert_parallel_size: 8 +trust_remote_code: true +enable_attention_dp: true +cuda_graph_config: + enable_padding: true + max_batch_size: 128 +kv_cache_config: + dtype: fp8 +stream_interval: 10 +speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 diff --git a/examples/configs/gpt-oss-120b-latency.yaml b/examples/configs/gpt-oss-120b-latency.yaml new file mode 100644 index 00000000000..371ac06874a --- /dev/null +++ b/examples/configs/gpt-oss-120b-latency.yaml @@ -0,0 +1,14 @@ +max_batch_size: 720 +max_num_tokens: 16384 +kv_cache_free_gpu_memory_fraction: 0.9 +tensor_parallel_size: 8 +moe_expert_parallel_size: 8 +trust_remote_code: true +enable_attention_dp: false +cuda_graph_config: + enable_padding: true + max_batch_size: 720 +moe_config: + backend: TRTLLM +stream_interval: 20 +num_postprocess_workers: 4 diff --git a/examples/configs/gpt-oss-120b-throughput.yaml b/examples/configs/gpt-oss-120b-throughput.yaml new file mode 100644 index 00000000000..5cbe64f46b4 --- /dev/null +++ b/examples/configs/gpt-oss-120b-throughput.yaml @@ -0,0 +1,18 @@ +max_batch_size: 720 +max_num_tokens: 16384 +kv_cache_free_gpu_memory_fraction: 0.9 +tensor_parallel_size: 8 +moe_expert_parallel_size: 8 +trust_remote_code: true +enable_attention_dp: true +cuda_graph_config: + enable_padding: true + max_batch_size: 720 +moe_config: + backend: TRTLLM +stream_interval: 20 +num_postprocess_workers: 4 +attention_dp_config: + enable_balance: true + batching_wait_iters: 50 + timeout_iters: 1 diff --git a/examples/configs/llama-3.3-70b.yaml b/examples/configs/llama-3.3-70b.yaml new file mode 100644 index 00000000000..8887bd2955f --- /dev/null +++ b/examples/configs/llama-3.3-70b.yaml @@ -0,0 +1,12 @@ +max_batch_size: 1024 +max_num_tokens: 2048 +kv_cache_free_gpu_memory_fraction: 0.9 +tensor_parallel_size: 1 +moe_expert_parallel_size: 1 +trust_remote_code: true +enable_attention_dp: false +cuda_graph_config: + enable_padding: 
true + max_batch_size: 1024 +kv_cache_config: + dtype: fp8 diff --git a/examples/configs/llama-4-scout.yaml b/examples/configs/llama-4-scout.yaml new file mode 100644 index 00000000000..8887bd2955f --- /dev/null +++ b/examples/configs/llama-4-scout.yaml @@ -0,0 +1,12 @@ +max_batch_size: 1024 +max_num_tokens: 2048 +kv_cache_free_gpu_memory_fraction: 0.9 +tensor_parallel_size: 1 +moe_expert_parallel_size: 1 +trust_remote_code: true +enable_attention_dp: false +cuda_graph_config: + enable_padding: true + max_batch_size: 1024 +kv_cache_config: + dtype: fp8 diff --git a/examples/configs/qwen3-disagg-prefill.yaml b/examples/configs/qwen3-disagg-prefill.yaml new file mode 100644 index 00000000000..93de3e7cf5d --- /dev/null +++ b/examples/configs/qwen3-disagg-prefill.yaml @@ -0,0 +1,8 @@ +max_batch_size: 161 +max_num_tokens: 1160 +kv_cache_free_gpu_memory_fraction: 0.8 +tensor_parallel_size: 1 +moe_expert_parallel_size: 1 +trust_remote_code: true +print_iter_log: true +enable_attention_dp: true diff --git a/examples/configs/qwen3-next.yaml b/examples/configs/qwen3-next.yaml new file mode 100644 index 00000000000..b78921a6c27 --- /dev/null +++ b/examples/configs/qwen3-next.yaml @@ -0,0 +1,16 @@ +max_batch_size: 16 +max_num_tokens: 4096 +tensor_parallel_size: 4 +moe_expert_parallel_size: 4 +trust_remote_code: true +enable_attention_dp: false +cuda_graph_config: + enable_padding: true + max_batch_size: 720 +moe_config: + backend: TRTLLM +stream_interval: 20 +num_postprocess_workers: 4 +kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 diff --git a/examples/configs/qwen3.yaml b/examples/configs/qwen3.yaml new file mode 100644 index 00000000000..c47d4904c33 --- /dev/null +++ b/examples/configs/qwen3.yaml @@ -0,0 +1,20 @@ +max_batch_size: 161 +max_num_tokens: 1160 +kv_cache_free_gpu_memory_fraction: 0.8 +tensor_parallel_size: 1 +moe_expert_parallel_size: 1 +cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 64 + - 128 + - 256 + - 384 +print_iter_log: true +enable_attention_dp: true diff --git a/examples/models/core/qwen/README.md b/examples/models/core/qwen/README.md index cd216a15e29..1326523d4b0 100644 --- a/examples/models/core/qwen/README.md +++ b/examples/models/core/qwen/README.md @@ -740,40 +740,22 @@ python3 benchmarks/cpp/prepare_dataset.py \ ``` ### Serving + +#### Recommended Performance Settings + +We maintain YAML configuration files with recommended performance settings in the [`examples/configs`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/configs) directory. These config files are present in the TensorRT LLM container at the path `/app/tensorrt_llm/examples/configs`. You can use these out-of-the-box, or adjust them to your specific use case. + +```shell +TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment +EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/qwen3.yaml +``` + #### trtllm-serve To serve the model using `trtllm-serve`: ```bash -cat >./extra-llm-api-config.yml <./ctx-extra-llm-api-config.yml < output_ctx & +trtllm-serve Qwen3-30B-A3B/ --port 8001 --extra_llm_api_options ${EXTRA_LLM_API_FILE} &> output_ctx & ``` And you can launch two generation servers on port 8002 and 8003 with: ```bash export TRTLLM_USE_UCX_KVCACHE=1 - -cat >./gen-extra-llm-api-config.yml < output_gen_${port} & \ +trtllm-serve Qwen3-30B-A3B/ --port ${port} --extra_llm_api_options ${EXTRA_LLM_API_FILE} &> output_gen_${port} & \ done ```
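+
+Before wiring the context and generation servers into a disaggregated deployment, you can optionally confirm that each one is ready. This sketch assumes the standard `trtllm-serve` health endpoint and the ports used in the commands above:
+
+```bash
+# check each server once; re-run until all of them report ready
+for port in 8001 8002 8003; do
+  curl -sf http://localhost:${port}/health > /dev/null && echo "server on port ${port} is ready"
+done
+```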