Replies: 1 comment
Hi @farouk09, I see exactly what is happening here. The short answer is that a 1 million token context window requires a massive amount of VRAM which is way more than a single A6000 can handle. The error you are seeing is vLLM doing its safety check and realizing that even with the empty space on your card, it cannot physically allocate enough memory pages to support that full context length. Regarding your math, the estimate for the model weights was a little bit low. Since you are running in To be precise, you can actually calculate exactly how much memory a single token takes up using the standard KV cache formula. For this specific Qwen model architecture which uses Grouped Query Attention, the calculation looks like this: When you plug in the specific numbers for Qwen2.5-14B, it comes out to roughly 0.2 MB per token. If you multiply that by 1 million tokens, you end up needing nearly 200 GB of VRAM just for the history, which explains why your single card is crashing. To get this running on your current hardware, you need to manually lower the expectations by adding this argument to your command --max-model-len 65536 |
Hi everyone! 👋
I'm trying to run the model Qwen/Qwen2.5-14B-Instruct-1M on an NVIDIA RTX A6000 (49.1 GB VRAM) using vllm serve with the --dtype auto option. However, I'm getting the following error:
ValueError: The model's max seq len (1010000) is larger than the maximum number of tokens that can be stored in KV cache (73792). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
From my understanding:
So I have two questions:
More broadly, I'd really appreciate it if someone could explain how to estimate the total VRAM usage of vllm serve, including weights, KV cache, context window, etc. Thanks in advance!