
Commit 78b72d3

amd post edits
1 parent cf07250 commit 78b72d3

2 files changed (+28, -25 lines)

README.md (+1)
@@ -14,6 +14,7 @@ To add a new blogpost, please refer to `_posts/2023-06-20-vllm.md` as an example
 - Push your edits to this repo to save your changes.
 
 To publish:
+- `JEKYLL_ENV=production bundle exec jekyll build` to compile the blogpost fresh.
 - After you finish writing, copy the whole content of `_site/` to `vllm-project.github.io` and push to the github repo.
 - Note that there is a `CNAME` file in the `vllm-project.github.io` that is not included in `_site/`. Please do not delete it.
 
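A minimal sketch of the publish flow described above, assuming the blog source and `vllm-project.github.io` are sibling checkouts; the directory layout and commit message are illustrative, not part of the repo instructions.

```bash
# Assumed layout: blog source and vllm-project.github.io are sibling directories.
JEKYLL_ENV=production bundle exec jekyll build
# Copy the compiled site; plain cp leaves the existing CNAME file in place.
cp -r _site/* ../vllm-project.github.io/
cd ../vllm-project.github.io
git add -A && git commit -m "Publish blog update" && git push
```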

_posts/2024-10-23-vllm-serving-amd.md (+27, -25)
@@ -1,10 +1,10 @@
 ---
 layout: post
 title: "Serving LLMs on AMD MI300X: Best Practices"
-author: "Embedded LLM and Hot Aisles Inc."
+author: "Guest Post by Embedded LLM and Hot Aisles Inc."
 ---
 
-**TL;DR:** vLLM unlocks incredible performance on the AMD MI300X, achieving 1.5x higher throughput and 1.7x faster time-to-first-token (TTFT) than Text Generation Inference (TGI) for Llama 3.1 405B. It also achieves 1.8x higher throughput and 5.1x faster TTFT than TGI for Llama 3.1 70B. This guide explores key vLLM settings to maximize efficiency, showing you how to leverage the power of open-source LLM inference on AMD. If you just want to see the optimal parameters, jump to the [Quick Start Guide](#quick-start-guide).
+**TL;DR:** vLLM unlocks incredible performance on the AMD MI300X, achieving 1.5x higher throughput and 1.7x faster time-to-first-token (TTFT) than Text Generation Inference (TGI) for Llama 3.1 405B. It also achieves 1.8x higher throughput and 5.1x faster TTFT than TGI for Llama 3.1 70B. This guide explores 8 key vLLM settings to maximize efficiency, showing you how to leverage the power of open-source LLM inference on AMD. If you just want to see the optimal parameters, jump to the [Quick Start Guide](#quick-start-guide).
 
 
 <p align="center">
@@ -35,7 +35,7 @@ ROCm, AMD's answer to CUDA, might be less familiar to some, but it's rapidly mat
 
 ### vLLM v.s. TGI
 
-vLLM unlocks incredible performance on the AMD MI300X, achieving 1.5x higher throughput and 1.7x faster time-to-first-token (TTFT) than Text Generation Inference (TGI) for Llama 3.1 405B. It also achieves 1.8x higher throughput and 5.1x faster TTFT than TGI for Llama 3.1 70B.
+vLLM unlocks incredible performance on the AMD MI300X, achieving 1.5x higher throughput and 1.7x faster time-to-first-token (TTFT) than Text Generation Inference (TGI) for Llama 3.1 405B. It also achieves 1.8x higher throughput and 5.1x faster TTFT than TGI for Llama 3.1 70B.
 
 On Llama 3.1 405B, vLLM demonstrates significantly better performance compared to TGI in both time to first token (TTFT) and throughput across various query-per-second (QPS) scenarios. For TTFT, vLLM achieves approximately 3.8x faster response times on average compared to TGI at 16 QPS in the optimized configuration. Throughput-wise, vLLM consistently outperforms TGI, with the highest throughput of 5.76 requests/second on the ShareGPT dataset at 1000 QPS in the optimized setup, compared to TGI's 3.55 requests/second.
 
@@ -58,23 +58,23 @@ vLLM vs. TGI performance for Llama 3.1 405B on 8 x MI300X (FP16, QPS 16, 32, 100
 
 We've been extensively testing various vLLM settings to identify optimal configurations for MI300X. Here's what we've learned:
 
-- **Chunked Prefill**: The rule of thumb is to disable it in most cases for better performance.
+- **Chunked Prefill**: The rule of thumb is to disable it for now on MI300X in most cases for better performance.
 - **Multi-Step Scheduling**: Significant gains in GPU utilization and overall performance can be achieved with multi-step scheduling. Set the `--num-scheduler-steps` to a value between 10 and 15 to optimize GPU utilization and performance.
-- **Prefix Caching**: Combining prefix caching with chunked prefill can enhance performance in specific scenarios. However, if user requests have a low prefix caching hit rate, it might be advisable to disable both chunked prefill and prefix caching as a general rule of thumb.
-- **Graph Capture**: When working with models that support long context lengths, set the `--max-seq-len-to-capture` to 16384. However, be aware that increasing this value doesn't always guarantee performance improvements and may sometimes lead to degradation due to suboptimal bucket sizes.
+- **Prefix Caching**: Combining prefix caching with chunked prefill can enhance performance in specific scenarios. However, if user requests have a low prefix caching hit rate, it might be advisable to disable both chunked prefill and prefix caching.
+- **Graph Capture**: When working with models that support long context lengths, set the `--max-seq-len-to-capture` to 16384. However, be aware that increasing this value doesn't always guarantee performance improvements and may sometimes lead to degradation due to suboptimal bucket sizes.
 - **AMD-Specific Optimizations**: Disabling NUMA balancing and tuning `NCCL_MIN_NCHANNELS` can yield further performance improvements.
 - **KV Cache Data Type**: For optimal performance, use the default KV cache data type, which automatically matches the model's data type.
 - **Tensor Parallelism**: For throughput optimization, use the minimum tensor parallelism (TP) that accommodates the model weights and context, and run multiple vLLM instances. For latency optimization, set TP equal to the number of GPUs in a node.
 - **Maximum Number of Sequences**: To optimize performance, increase `--max-num-seqs` to 512 or higher, based on your GPU's memory and compute resources. This can significantly improve resource utilization and throughput, especially for models handling shorter inputs and outputs.
-- **Use CK Flash Attention**: the CK Flash Attention implementation is a lot faster than triton implementation
+- **Use CK Flash Attention**: the CK Flash Attention implementation is a lot faster than the Triton implementation.
 
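Taken together, the checklist above maps onto a single server invocation plus two host-level tweaks. The sketch below is editorial and illustrative rather than a configuration from the post: the model, port, tensor-parallel size, and the `NCCL_MIN_NCHANNELS` value are assumptions to be tuned per workload.

```bash
# Illustrative MI300X launch combining the recommendations above
# (values are assumptions, not benchmarked settings).

# AMD-specific host tweaks: disable NUMA balancing and pin NCCL channels.
sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
export NCCL_MIN_NCHANNELS=112

# Prefer the CK Flash Attention kernel over the Triton one.
export VLLM_USE_TRITON_FLASH_ATTN=0

# Multi-step scheduling, a larger graph-capture window, more parallel sequences,
# and the smallest TP that fits the weights plus context; chunked prefill,
# prefix caching, and the KV cache dtype are left at their defaults.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --num-scheduler-steps 10 \
  --max-seq-len-to-capture 16384 \
  --max-num-seqs 512 \
  --tensor-parallel-size 4 \
  --port 8000
```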

-#### Detailed Analysis and Experiments
+#### Detailed Analysis and Experiments
 
 ##### Case 1: Chunked Prefill
 
 Chunked prefill is an experimental feature in vLLM that allows large prefill requests to be divided into smaller chunks batched together with decode requests. This improves system efficiency by overlapping compute-bound prefill requests with memory-bound decode requests. You can enable it by setting `--enable_chunked_prefill=True` in the LLM constructor or using the `--enable-chunked-prefill` command line option.
 
-Based on the experiment we ran, we found that there’s a slight improvement with tuning the chunked prefill values over disabling the chunked prefill feature. However, if you’re not sure whether to enable chunked prefill or not, simply start off by disabling it and you should generally expect better performance than with the default settings. This is specific to MI300X GPUs.
+Based on the experiment we ran, we found that there’s a slight improvement with tuning the chunked prefill values over disabling the chunked prefill feature. However, if you’re not sure whether to enable chunked prefill or not, simply start off by disabling it and you should generally expect better performance than with the default settings. This is specific to MI300X GPUs.
 
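As a concrete starting point for the advice above, the two configurations being compared look roughly like this. A minimal sketch only: flag syntax is per vLLM ~0.6.x (check your version), and the model, port, and tuned token budget are placeholders.

```bash
# Option A: start with chunked prefill disabled, the suggested default on MI300X.
vllm serve meta-llama/Llama-3.1-70B-Instruct --enable-chunked-prefill=False --port 8000

# Option B: keep chunked prefill on but tune the per-step token budget instead of
# relying on the 512-token default.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --enable-chunked-prefill=True \
  --max-num-batched-tokens 4096 \
  --port 8000
```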

 <p align="center">
@@ -116,10 +116,10 @@ Chunked Prefill and prefix caching are optimization techniques in vLLM that impr
 
 By default, vLLM will automatically _enable the chunked prefill feature if a model has a context length of more than 32k tokens_. The maximum number of tokens to be chunked for prefill is set to 512 by default.
 
-Before we dive deep into the graph, we’ll first try to clear off some of the used jargon. **_Fresh Run_** refers to the situation where the prefix caching memory is not populated at all. **_2nd Run_** refers to rerunning the benchmark script again after the _Fresh Run_. In general, when rerunning the ShareGPT benchmark dataset on the _2nd Run_, we get around a _50%_ prefix caching hit-rate.
+Before we dive deep into the graph, we’ll first try to explain the terminology used in the experiment. **_Fresh Run_** refers to the situation where the prefix caching memory is not populated at all. **_2nd Run_** refers to rerunning the benchmark script again after the _Fresh Run_. In general, when rerunning the ShareGPT benchmark dataset on the _2nd Run_, we get around a _50%_ prefix caching hit-rate.
 
 Looking at the graphs below, we can make three observations about this experiment.
-1. Based on the comparison of Bar 2 (red) with the baseline (blue), there is a huge gain in performance.
+1. Based on the comparison of Bar 2 (red) with the baseline (blue), there is a huge gain in performance.
 2. Based on the comparison of Bar 3 (yellow), Bar 5 (orange) and Bar 6 (teal) with the baseline, the chunked prefill performance depends on the user request input prompt length distribution.
 3. In our experiments we found that the prefix caching hit rates of Bar 3 (yellow) and Bar 4 (green) are around _0.9%_ and _50%_. Based on the comparison of Bar 3 (yellow) and Bar 4 (green) with the baseline and Bar 2 (red), this tells us that if the user requests do not have high prefix caching hit rate, disabling both chunked prefill and prefix caching might be considered a good rule of thumb.
 
@@ -138,17 +138,17 @@ Looking at the graphs below, we can make three observations about this experimen
 
 ##### Case 4: Max sequence length to capture
 
-The `--max-seq-len-to-capture` argument in vLLM controls the maximum sequence length that can be handled by CUDA graphs, which optimize performance by capturing and replaying GPU operations. If a sequence exceeds this length, the system reverts to "eager mode," executing operations one by one, which can be less efficient. This applies to both regular and encoder-decoder models.
+The `--max-seq-len-to-capture` argument in vLLM controls the maximum sequence length that can be handled by CUDA/HIP graphs, which optimize performance by capturing and replaying GPU operations. If a sequence exceeds this length, the system reverts to eager mode, executing operations one by one, which can be less efficient. This applies to both regular and encoder-decoder models.
 
-Our benchmarks reveal an interesting trend: increasing `--max-seq-len-to-capture` doesn't always improve performance and can sometimes even degrade it. This might be due to how vLLM creates "buckets" for different sequence lengths.
+Our benchmarks reveal an interesting trend: increasing `--max-seq-len-to-capture` doesn't always improve performance and can sometimes even degrade it. This might be due to how vLLM creates buckets for different sequence lengths.
 
 Here's why:
 - **Bucketing**: vLLM uses buckets to group sequences of similar lengths, optimizing graph capture for each bucket.
 - **Optimal Buckets**: Initially, the buckets are finely grained (e.g., [4, 8, 12,..., 2048, 4096]), allowing for efficient graph capture for various sequence lengths.
 - **Coarser Buckets**: Increasing `--max-seq-len-to-capture` can lead to coarser buckets (e.g., [4, 8, 12, 2048, 8192]).
-- **Performance Impact**: When input sequences fall into these larger, less precise buckets, the captured CUDA graphs may not be optimal, potentially leading to reduced performance.
+- **Performance Impact**: When input sequences fall into these larger, less precise buckets, the captured CUDA/HIP graphs may not be optimal, potentially leading to reduced performance.
 
-Therefore, while capturing longer sequences with CUDA graphs seems beneficial, it's crucial to consider the potential impact on bucketing and overall performance. Finding the optimal `--max-seq-len-to-capture` value may require experimentation to balance graph capture efficiency with appropriate bucket sizes for your specific workload.
+Therefore, while capturing longer sequences with CUDA/HIP graphs seems beneficial, it's crucial to consider the potential impact on bucketing and overall performance. Finding the optimal `--max-seq-len-to-capture` value may require experimentation to balance graph capture efficiency with appropriate bucket sizes for your specific workload.
 
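In practice this experimentation is a simple A/B sweep; the sketch below is illustrative, with placeholder model name and capture values.

```bash
# Launch with each capture window in turn (one at a time) and benchmark your own
# workload; the larger value helps long-context models but can coarsen the
# graph buckets described above.
vllm serve meta-llama/Llama-3.1-70B-Instruct --max-seq-len-to-capture 8192
vllm serve meta-llama/Llama-3.1-70B-Instruct --max-seq-len-to-capture 16384
```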

 
 <p align="center">
@@ -199,9 +199,9 @@ Even though the gains might be small, fine-tuning these environment variables ca
 
 ##### Case 6: KVCache Type Auto/FP8
 
-By default, vLLM will automatically allocate a KV Cache type that matches the model’s data type. However, vLLM also supports native FP8 on MI300X which we can exploit to reduce the memory requirement of KVCache and thereby increasing the deployable context length of the model.
+By default, vLLM will automatically allocate a KV Cache type that matches the model’s data type. However, vLLM also supports native FP8 on MI300X, which we can exploit to reduce the memory requirement of KVCache and thereby increase the deployable context length of the model.
 
-We experiment by using Auto KVCache type and KV Cache type FP8 and compare it to the default baseline. We can see from the figure below that using Auto KVCache type (red) achieves a higher request per second rate than using KV Cache type set to FP8 (yellow). Theoretically, this might be due to a quantization overhead in Llama-3.1-70B-Instruct (bfloat16) model, but since the cost of the overhead seems to be small, it could still be a good tradeoff in some cases to obtain a huge reduction in the KVCache requirements.
+We experiment by using Auto KVCache type and KV Cache type FP8 and compare them to the default baseline. We can see from the figure below that using Auto KVCache type (red) achieves a higher request per second rate than using KV Cache type set to FP8 (yellow). Theoretically, this might be due to a quantization overhead in the `Llama-3.1-70B-Instruct (bfloat16)` model, but since the cost of the overhead seems to be small, it could still be a good tradeoff in some cases to obtain a huge reduction in the KVCache requirements.
 
 
 
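The two settings compared here correspond to the `--kv-cache-dtype` flag. A minimal sketch, with placeholder model and port; FP8 KV cache availability depends on the model and vLLM build.

```bash
# Auto: the KV cache dtype follows the model dtype (bfloat16 for this model).
vllm serve meta-llama/Llama-3.1-70B-Instruct --kv-cache-dtype auto --port 8000

# FP8: roughly halves KV cache memory, freeing room for longer context, at the
# cost of a small quantization overhead on a bfloat16 model.
vllm serve meta-llama/Llama-3.1-70B-Instruct --kv-cache-dtype fp8 --port 8000
```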

@@ -220,9 +220,9 @@ We experiment by using Auto KVCache type and KV Cache type FP8 and compare it to
 
 ##### Case 7: Performance Difference between TP 4 and TP 8
 
-Tensor parallelism is a technique for distributing the computational load of large models. It works by splitting individual tensors across multiple devices, allowing for parallel processing of specific operations or layers. This approach reduces the memory footprint of the model and enables scaling across multiple GPUs.
+Tensor parallelism is a technique for distributing the computational load of large models. It works by splitting individual tensors across multiple devices, allowing for parallel processing of specific operations or layers. This approach reduces the memory footprint of the model and enables scaling across multiple GPUs.
 
-While increasing the tensor parallelism degree can improve performance by providing more compute resources, the gains aren't always linear. This is because communication overhead increases as more devices are involved, and the workload on each individual GPU decreases. Given the substantial processing power of the MI300X, smaller workloads per GPU can actually lead to underutilization, further hindering performance scaling.
+While increasing the tensor parallelism degree can improve performance by providing more compute resources, the gains aren't always linear. This is because communication overhead increases as more devices are involved, and the workload on each individual GPU decreases. Given the substantial processing power of the MI300X, smaller workloads per GPU can actually lead to underutilization, further hindering performance scaling.
 
 Therefore, when optimizing for throughput, we recommend launching multiple instances of vLLM instead of aggressively increasing tensor parallelism. This approach tends to yield more linear performance improvements. However, if minimizing latency is the priority, increasing the tensor parallelism degree may be the more effective strategy.
 
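To make the throughput recommendation concrete, one way to trade TP degree for instance count on an 8-GPU node is sketched below. The device-visibility variable, ports, and model are assumptions, and a load balancer in front of the two ports is omitted.

```bash
# Throughput-oriented: two TP=4 instances, each pinned to half of the GPUs
# (HIP_VISIBLE_DEVICES is the ROCm device filter; adjust to your environment).
HIP_VISIBLE_DEVICES=0,1,2,3 vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 --port 8000 &
HIP_VISIBLE_DEVICES=4,5,6,7 vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 --port 8001 &

# Latency-oriented alternative: a single instance spanning all eight GPUs.
# vllm serve meta-llama/Llama-3.1-70B-Instruct --tensor-parallel-size 8 --port 8000
```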
@@ -240,7 +240,7 @@ Therefore, when optimizing for throughput, we recommend launching multiple insta
 
 ##### Case 8: Effect of Maximum Number of (Parallel) Sequences
 
-The `--max-num-seqs` argument specifies the maximum number of sequences that can be processed per iteration. This parameter controls the number of concurrent requests in a batch, impacting memory usage and performance. In the ShareGPT benchmark, due to the shorter input and output length of the samples, the Llama-3.1-70B-Instruct hosted on MI300X can process a large number of requests per iteration. In our experiment, the `--max-num-seqs` is still a limiting factor, even if `--max-num-seqs` is set at 1024.
+The `--max-num-seqs` argument specifies the maximum number of sequences that can be processed per iteration. This parameter controls the number of concurrent requests in a batch, impacting memory usage and performance. In the ShareGPT benchmark, due to the shorter input and output lengths of the samples, the `Llama-3.1-70B-Instruct` model hosted on MI300X can process a large number of requests per iteration. In our experiment, `--max-num-seqs` is still a limiting factor, even when it is set to 1024.
 
 <p align="center">
 <picture>
@@ -274,9 +274,9 @@ VLLM_USE_TRITON_FLASH_ATTN=0 vllm serve meta-llama/Llama-3.1-70B-Instruct --host
 ```
 
 For quick setup, we have compiled the Docker Image of vLLM 0.6.2 (commit: _cb3b2b9ba4a95c413a879e30e2b8674187519a93_) to Github Container Registry.
-To get download the image:
+To download the image:
 ```bash
-# v0.6.2 post
+# v0.6.2 post
 docker pull ghcr.io/embeddedllm/vllm-rocm:cb3b2b9
 # P.S. We also have compiled the image for v0.6.3.post1 at commit 717a5f8
 docker pull ghcr.io/embeddedllm/vllm-rocm:v0.6.3.post1-717a5f8
@@ -304,12 +304,14 @@ VLLM_USE_TRITON_FLASH_ATTN=0 vllm serve meta-llama/Llama-3.1-70B-Instruct --host
 ```
 
 ### Conclusion
-This guide has explored the power of vLLM for serving large language models on AMD MI300X GPUs. By meticulously tuning key settings like chunked prefill, multi-step scheduling, and CUDA graph capture, we've demonstrated how to achieve substantial performance gains over standard configurations and alternative serving solutions. vLLM unlocks significantly higher throughput and faster response times, making it an ideal choice for deploying LLMs on AMD hardware.
+This guide has explored the power of vLLM for serving large language models on AMD MI300X GPUs. By meticulously tuning key settings like chunked prefill, multi-step scheduling, and CUDA graph capture, we've demonstrated how to achieve substantial performance gains over standard configurations and alternative serving solutions. vLLM unlocks significantly higher throughput and faster response times, making it an ideal choice for deploying LLMs on AMD hardware.
 
-However, it's important to acknowledge that our exploration has focused primarily on general chatbot usage with short inputs and outputs. Further investigation is needed to optimize vLLM for specific use cases like summarization or long-form content generation. Additionally, a deeper dive into the performance differences between Triton and CK attention kernels could yield further insights.
+However, it's important to acknowledge that our exploration has focused primarily on general chatbot usage with short inputs and outputs. Further investigation is needed to optimize vLLM for specific use cases like summarization or long-form content generation. Additionally, a deeper dive into the performance differences between Triton and CK attention kernels could yield further insights.
+
+We also want to acknowledge [this wonderful blogpost](https://shisa.ai/blog/posts/tuning-vllm-mi300x/) by Leonard Lin on how to further optimize vLLM for MI300X, including hipBLAS vs hipBLASLt, CK Flash Attention vs Triton Flash Attention, Tensor Parallelism vs Pipeline Parallelism, etc.
 
 ### Acknowledgements
-This blog post is drafted by the team at [Embedded LLM](https://embeddedllm.com/) and Thank you to [Hot Aisles Inc.](https://hotaisle.xyz/) for sponsoring MI300X for benchmarking vLLM.
+This blog post is drafted by the team at [Embedded LLM](https://embeddedllm.com/). Thank you to [Hot Aisles Inc.](https://hotaisle.xyz/) for sponsoring the MI300X used for benchmarking vLLM.
 
 ### Appendix
