Checklist
1. I have searched related issues but cannot get the expected help.
2. The bug has not been fixed in the latest version.
3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
5. Please use English, otherwise it will be closed.
Describe the bug
I'm trying to run Llama-405B in FP8 on 4 nodes of 4xA40, with ~44GB of usable VRAM per GPU. Theoretically speaking, this should be plenty of VRAM for the FP8 405B model, since I only need about 405GB of VRAM in total, plus or minus ~50GB, during inference. However, I keep running into OOM errors after all the safetensors checkpoints have been loaded, which does not make much sense to me.
I've tried reducing the cache size, context length, etc., but that does not seem to have much of an effect on the OOM.
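As a rough sanity check, assuming the FP8 weights are sharded evenly across all 16 GPUs with tensor parallelism: 405B parameters at 1 byte each is ~405GB of weights, i.e. roughly 25GB per GPU, against ~44GB of usable VRAM per A40, which should still leave close to 20GB per GPU for the KV cache pool, activations, and NCCL/CUDA overhead.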
Reproduction
On each of the nodes, I run the following command:
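The exact command did not survive into this report. As a representative sketch only, not the actual invocation, a 4-node, TP=16 launch with the sglang 0.3.x CLI would look roughly like the following; the model path, head-node address, and port are placeholders, and flag names may differ in other sglang versions.

# Representative sketch, not the exact command from this report.
# Run one copy per node, changing only --node-rank (0-3).
# HEAD_NODE_IP:50000 and the model path are placeholders.
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 \
  --tp 16 \
  --nnodes 4 \
  --node-rank 0 \
  --nccl-init-addr HEAD_NODE_IP:50000

The "cache" and "context length" knobs mentioned above correspond roughly to --mem-fraction-static and --context-length in this CLI.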
Environment
I have 4 nodes with 4xA40 each (16 GPUs total) for distributed inference. They're linked by a 25GbE backbone. All of the nodes have the same environment; one of them is detailed below:
Python: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA A40
GPU 0,1,2,3 Compute Capability: 8.6
CUDA_HOME: None
PyTorch: 2.4.0+cu121
sglang: 0.3.1
flashinfer: 0.1.6+cu121torch2.4
triton: 3.0.0
transformers: 4.44.2
requests: 2.32.3
tqdm: 4.66.5
numpy: 1.26.4
aiohttp: 3.10.5
fastapi: 0.114.2
hf_transfer: 0.1.8
huggingface_hub: 0.24.7
interegular: 0.3.3
packaging: 24.1
PIL: 10.4.0
psutil: 6.0.0
pydantic: 2.9.1
uvicorn: 0.30.6
uvloop: 0.20.0
zmq: 26.2.0
vllm: 0.5.5
multipart: 0.0.9
openai: 1.45.1
anthropic: 0.34.2
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV4 SYS SYS 0,2,4,6,8,10 0 N/A
GPU1 NV4 X SYS SYS 0,2,4,6,8,10 0 N/A
GPU2 SYS SYS X NV4 1,3,5,7,9,11 1 N/A
GPU3 SYS SYS NV4 X 1,3,5,7,9,11 1 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
ulimit soft: 1024
I have the same issue with DeepSeek-V2 on 4xL40S. I have 192GB of VRAM in total, and it crashes with OOM during loading. I can see that it only uses one GPU. ExLlamaV2 can utilize the GPUs much better than vLLM; I think vLLM is buggy.
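If only one GPU is being used, the tensor-parallel degree may simply not have been requested; sglang defaults to a single GPU. A minimal single-node sketch that shards across all four L40S (the model path is a placeholder, not the commenter's actual command) would be:

# Minimal sketch; model path is a placeholder.
# --tp 4 asks sglang to shard the weights across all four GPUs.
python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V2 --tp 4 --trust-remote-code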