
[Bugfix] Make memory profiler account for speculative draft model weights #14067

Open · wants to merge 1 commit into base: main
Conversation

@benchislett (Contributor) commented Feb 28, 2025

The memory instability of speculative decoding has been a well-known issue in vLLM for a long time. A major contributor is that the weights of the draft model are not included in memory profiling, so vLLM goes out of memory whenever the draft model weights cannot fit in the headroom left outside of gpu-memory-utilization.

This PR adds the proposer's weight memory to the scorer worker's memory-usage statistic before the memory profiler runs, so the KV cache is sized around both models' weights.
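A minimal sketch of the idea (not the actual diff; the names scorer_worker, proposer_worker, model_memory_usage, and determine_num_available_blocks are used illustratively here and may not match the real code):

def account_for_draft_weights(scorer_worker, proposer_worker):
    # Memory already occupied by the proposer's (draft model's) weights.
    draft_weight_bytes = proposer_worker.model_memory_usage

    # Credit the draft weights to the scorer worker's weight statistic so that
    # profiling sizes the KV cache around both models' weights.
    scorer_worker.model_runner.model_memory_usage += draft_weight_bytes

    # Run the usual memory profiling; the resulting KV-cache block count now
    # leaves room for the draft model as well.
    return scorer_worker.determine_num_available_blocks()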

I feel strongly that this behaviour is the correct one, but it will definitely mess with existing deployment configurations if merged. I am open to discussion on how to best handle this issue.

Example usage on 1x4090 with Qwen 2.5 Coder 3B and draft model Qwen 2.5 Coder 1.5B (see the bottom line of each block):

vllm serve "Qwen/Qwen2.5-Coder-3B-Instruct" --gpu-memory-utilization 0.9 --speculative-model "Qwen/Qwen2.5-Coder-1.5B-Instruct" --num-speculative-tokens 4 --max-model-len 4096 --max-num-seqs 8 --tensor-parallel-size 1

Before

INFO 02-28 16:46:16 [model_runner.py:1110] Starting to load model Qwen/Qwen2.5-Coder-3B-Instruct...
INFO 02-28 16:46:16 [weight_utils.py:254] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  2.63it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00,  4.30it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00,  3.93it/s]

INFO 02-28 16:46:17 [model_runner.py:1117] Loading model weights took 5.7915 GB and 0.729359 seconds
INFO 02-28 16:46:17 [model_runner.py:1110] Starting to load model Qwen/Qwen2.5-Coder-1.5B-Instruct...
INFO 02-28 16:46:17 [weight_utils.py:254] Using model weights format ['*.safetensors']
INFO 02-28 16:46:17 [weight_utils.py:304] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.20it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.19it/s]

INFO 02-28 16:46:17 [model_runner.py:1117] Loading model weights took 2.8787 GB and 0.454114 seconds
INFO 02-28 16:46:17 [spec_decode_worker.py:380] [Speculative Decoding] Use batch expansion for scoring proposals.
INFO 02-28 16:46:18 [worker.py:267] Memory profiling takes 0.35 seconds
INFO 02-28 16:46:18 [worker.py:267] the current vLLM instance can use total_gpu_memory (23.55GiB) x gpu_memory_utilization (0.90) = 21.19GiB
INFO 02-28 16:46:18 [worker.py:267] model weights take 5.79GiB; non_torch_memory takes 0.10GiB; PyTorch activation peak memory takes 0.32GiB; the rest of the memory reserved for KV Cache is 14.98GiB.

An OOM occurs shortly after:

ERROR 02-28 16:46:20 [engine.py:409]   File "/home/benchislett/Repos/centml_vllm_fork/vllm/vllm/worker/worker_base.py", line 158, in initialize_cache
ERROR 02-28 16:46:20 [engine.py:409]     self.worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)
ERROR 02-28 16:46:20 [engine.py:409]   File "/home/benchislett/Repos/centml_vllm_fork/vllm/vllm/worker/worker.py", line 307, in initialize_cache
ERROR 02-28 16:46:20 [engine.py:409]     self._init_cache_engine()
ERROR 02-28 16:46:20 [engine.py:409]   File "/home/benchislett/Repos/centml_vllm_fork/vllm/vllm/worker/worker.py", line 313, in _init_cache_engine
ERROR 02-28 16:46:20 [engine.py:409]     CacheEngine(self.cache_config, self.model_config,
ERROR 02-28 16:46:20 [engine.py:409]   File "/home/benchislett/Repos/centml_vllm_fork/vllm/vllm/worker/cache_engine.py", line 69, in __init__
ERROR 02-28 16:46:20 [engine.py:409]     self.gpu_cache = self._allocate_kv_cache(
ERROR 02-28 16:46:20 [engine.py:409]                      ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-28 16:46:20 [engine.py:409]   File "/home/benchislett/Repos/centml_vllm_fork/vllm/vllm/worker/cache_engine.py", line 103, in _allocate_kv_cache
ERROR 02-28 16:46:20 [engine.py:409]     layer_kv_cache = torch.zeros(alloc_shape,
ERROR 02-28 16:46:20 [engine.py:409]                      ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-28 16:46:20 [engine.py:409] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 240.00 MiB. GPU 0 has a total capacity of 23.55 GiB of which 120.69 MiB is free. Including non-PyTorch memory, this process has 22.53 GiB memory in use. Of the allocated memory 21.81 GiB is allocated by PyTorch, with 2.00 MiB allocated in private pools (e.g., CUDA Graphs), and 231.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

After

INFO 02-28 16:40:27 [model_runner.py:1110] Starting to load model Qwen/Qwen2.5-Coder-3B-Instruct...
INFO 02-28 16:40:27 [weight_utils.py:254] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  2.67it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00,  4.38it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00,  4.00it/s]

INFO 02-28 16:40:27 [model_runner.py:1117] Loading model weights took 5.7915 GB and 0.794862 seconds
INFO 02-28 16:40:27 [model_runner.py:1110] Starting to load model Qwen/Qwen2.5-Coder-1.5B-Instruct...
INFO 02-28 16:40:28 [weight_utils.py:254] Using model weights format ['*.safetensors']
INFO 02-28 16:40:28 [weight_utils.py:304] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.19it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.19it/s]

INFO 02-28 16:40:28 [model_runner.py:1117] Loading model weights took 2.8787 GB and 0.495663 seconds
INFO 02-28 16:40:28 [spec_decode_worker.py:380] [Speculative Decoding] Use batch expansion for scoring proposals.
INFO 02-28 16:40:29 [worker.py:267] Memory profiling takes 0.40 seconds
INFO 02-28 16:40:29 [worker.py:267] the current vLLM instance can use total_gpu_memory (23.55GiB) x gpu_memory_utilization (0.90) = 21.19GiB
INFO 02-28 16:40:29 [worker.py:267] model weights take 8.67GiB; non_torch_memory takes 0.10GiB; PyTorch activation peak memory takes 0.32GiB; the rest of the memory reserved for KV Cache is 12.10GiB.

No OOM error.
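For illustration, the arithmetic behind the logged numbers (GiB values taken from the logs above; the "before" total is a rough estimate, not a measurement):

# Illustrative arithmetic using the GiB values from the logs above.
total_gpu = 23.55
budget = total_gpu * 0.90                # 21.19 GiB usable by vLLM

target_weights = 5.79                    # Qwen2.5-Coder-3B
draft_weights = 2.88                     # Qwen2.5-Coder-1.5B
non_torch = 0.10
activations = 0.32

# Before: the profiler sees only the target weights, so the KV cache fills
# the rest of the budget and the draft weights land outside it.
kv_before = budget - target_weights - non_torch - activations       # ~14.98 GiB
approx_use_before = budget + draft_weights                          # ~24.07 GiB > 23.55 GiB -> OOM

# After: the draft weights are counted up front, the KV cache shrinks,
# and everything stays inside the 90% budget.
kv_after = budget - (target_weights + draft_weights) - non_torch - activations  # ~12.10 GiB

print(round(kv_before, 2), round(approx_use_before, 2), round(kv_after, 2))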


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀
