Replies: 1 comment
Hi @farouk09, I see exactly what is happening here. The short answer is that a 1 million token context window requires a massive amount of VRAM which is way more than a single A6000 can handle. The error you are seeing is vLLM doing its safety check and realizing that even with the empty space on your card, it cannot physically allocate enough memory pages to support that full context length. Regarding your math, the estimate for the model weights was a little bit low. Since you are running in To be precise, you can actually calculate exactly how much memory a single token takes up using the standard KV cache formula. For this specific Qwen model architecture which uses Grouped Query Attention, the calculation looks like this: When you plug in the specific numbers for Qwen2.5-14B, it comes out to roughly 0.2 MB per token. If you multiply that by 1 million tokens, you end up needing nearly 200 GB of VRAM just for the history, which explains why your single card is crashing. To get this running on your current hardware, you need to manually lower the expectations by adding this argument to your command --max-model-len 65536 |
Hi everyone! 👋
I'm trying to run the model Qwen/Qwen2.5-14B-Instruct-1M on an NVIDIA RTX A6000 (49.1 GB VRAM) using vllm serve with the --dtype auto option. However, I'm getting the following error:
ValueError: The model's max seq len (1010000) is larger than the maximum number of tokens that can be stored in KV cache (73792). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
From my understanding:
So I have two questions:
More broadly, I'd really appreciate it if someone could explain how to estimate the total VRAM usage of vllm serve, including weights, KV cache, context window, etc. Thanks in advance!