
[Bug] Llama 405B FP8 causes OOM on 16xA40 #1439

Open
sumukshashidhar opened this issue Sep 16, 2024 · 2 comments

sumukshashidhar commented Sep 16, 2024

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

I'm trying to run Llama-405B in FP8 on 4 nodes of 4xA40, each GPU with ~44GB of usable VRAM. Theoretically speaking, this is plenty of VRAM for 405B in FP8, given that I only need about 405GB of VRAM total, +/- 50GB, during inference. However, I keep running into OOM errors after all safetensors checkpoints have been loaded, which does not seem to make much sense to me.

I've tried reducing the KV cache size, the context length, etc., but that does not seem to have much of an effect on the OOM.
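To make the arithmetic above explicit, here is the rough per-GPU budget I have in mind (a back-of-the-envelope sketch in Python; the usable-VRAM and headroom figures are estimates, not measured values):

# Rough per-GPU memory budget for Llama-405B FP8 across 16 x A40.
# All figures are estimates, not measured values.
num_gpus = 16
usable_vram_gb = 44        # approximate usable VRAM per A40
weights_gb = 405           # ~405B parameters at ~1 byte/param in FP8

total_vram_gb = num_gpus * usable_vram_gb                  # 704 GB across the cluster
weights_per_gpu_gb = weights_gb / num_gpus                 # ~25.3 GB of weights per GPU
headroom_per_gpu_gb = usable_vram_gb - weights_per_gpu_gb  # ~18.7 GB per GPU

print(f"total VRAM:       {total_vram_gb} GB")
print(f"weights per GPU:  {weights_per_gpu_gb:.1f} GB")
print(f"headroom per GPU: {headroom_per_gpu_gb:.1f} GB for KV cache, activations, and overhead")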

Reproduction

On each of the nodes, I run the following command:

GLOO_SOCKET_IFNAME=eno12399np0 python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tp 16 --nccl-init-addr 172.22.224.17:20000 --nnodes 4 --node-rank 3 --disable-cuda-graph --kv-cache-dtype fp8_e5m2 --chunked-prefill-size 1024 --mem-fraction-static 0.9 --disable-disk-cache

Environment

I have 4 nodes of 4xA40 (16 GPUs total) for distributed inference, linked by a 25GbE backbone. All of them have the same environment; one of them is detailed below:

Python: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA A40
GPU 0,1,2,3 Compute Capability: 8.6
CUDA_HOME: None
PyTorch: 2.4.0+cu121
sglang: 0.3.1
flashinfer: 0.1.6+cu121torch2.4
triton: 3.0.0
transformers: 4.44.2
requests: 2.32.3
tqdm: 4.66.5
numpy: 1.26.4
aiohttp: 3.10.5
fastapi: 0.114.2
hf_transfer: 0.1.8
huggingface_hub: 0.24.7
interegular: 0.3.3
packaging: 24.1
PIL: 10.4.0
psutil: 6.0.0
pydantic: 2.9.1
uvicorn: 0.30.6
uvloop: 0.20.0
zmq: 26.2.0
vllm: 0.5.5
multipart: 0.0.9
openai: 1.45.1
anthropic: 0.34.2
NVIDIA Topology: 
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV4     SYS     SYS     0,2,4,6,8,10    0               N/A
GPU1    NV4      X      SYS     SYS     0,2,4,6,8,10    0               N/A
GPU2    SYS     SYS      X      NV4     1,3,5,7,9,11    1               N/A
GPU3    SYS     SYS     NV4      X      1,3,5,7,9,11    1               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

ulimit soft: 1024

merrymercy commented Sep 22, 2024

  1. Could you share the full log?
  2. Reduce --mem-fraction-static (you set it to 0.9) to prevent OOM instead of increasing it; the default value is 0.8 in this case. See the example command below.
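For example, keeping every other flag from the original command and only lowering the static memory fraction (0.7 here is just an illustrative starting point to tune from, with --node-rank adjusted per node as before):

GLOO_SOCKET_IFNAME=eno12399np0 python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tp 16 --nccl-init-addr 172.22.224.17:20000 --nnodes 4 --node-rank 3 --disable-cuda-graph --kv-cache-dtype fp8_e5m2 --chunked-prefill-size 1024 --mem-fraction-static 0.7 --disable-disk-cache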

cyberluke commented:

I have the same issue with DeepSeek-V2 on 4xL40S. I have 192GB of VRAM total and it crashes with OOM during model load. I can see that it only uses one GPU. ExLlamaV2 can utilize the GPUs much better than vLLM; I think vLLM is buggy.
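One way to double-check whether only one GPU is actually being filled during load is to watch per-device memory from a separate process (a minimal PyTorch sketch; nvidia-smi shows the same information):

# Print current memory usage for every visible GPU.
# Run this from a separate shell while the server is loading the model.
import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    used_gb = (total - free) / 2**30
    print(f"GPU {i}: {used_gb:.1f} GiB used of {total / 2**30:.1f} GiB")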
