[Bug] OOM, torch.OutOfMemoryError: seems to only use one GPU on A800-80G, 40 GB available on each card #1463
Comments
You need to use --tp 2 so the model is sharded across both GPUs.
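A minimal sketch of the corrected launch, reusing the model path and flags from the report and adding tensor parallelism over the two visible GPUs (the exact command used by the reporter is not shown in the thread):
python3 -m sglang.launch_server --model-path /workspace/model/Qwen2.5-72-int4 --host 0.0.0.0 --port 30000 --quantization gptq_marlin --mem-fraction-static 0.9 --disable-cuda-graph --tp 2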
Thank you so much, it works like a charm:
[00:19:29] server_args=ServerArgs(model_path='/workspace/model/Qwen2.5-72-int4', tokenizer_path='/workspace/model/Qwen2.5-72-int4', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', dtype='auto', kv_cache_dtype='auto', trust_remote_code=False, context_length=None, quantization='gptq_marlin', served_model_name='/workspace/model/Qwen2.5-72-int4', chat_template=None, is_embedding=False, host='0.0.0.0', port=30000, additional_ports=[30001, 30002, 30003, 30004], mem_fraction_static=0.9, max_running_requests=None, max_num_reqs=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=2, stream_interval=1, random_seed=595017997, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', dp_size=1, load_balance_method='round_robin', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=True, disable_cuda_graph_padding=False, disable_disk_cache=False, disable_custom_all_reduce=False, enable_mixed_chunk=False, enable_torch_compile=False, enable_p2p_check=False, enable_mla=False, triton_attention_reduce_in_fp32=False, nccl_init_addr=None, nnodes=1, node_rank=None)
[00:23:56 TP0] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=14.86 GB
See hyperparameter_tuning.md for guidance on tuning hyperparameters for better performance.
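For example, if memory pressure remains an issue, the static memory fraction and request concurrency can be adjusted at launch. A sketch with illustrative values only, not settings recommended in this thread: --mem-fraction-static controls the share of GPU memory reserved for weights plus KV cache, and --max-running-requests caps concurrent requests.
python3 -m sglang.launch_server --model-path /workspace/model/Qwen2.5-72-int4 --tp 2 --mem-fraction-static 0.85 --max-running-requests 32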
Thank you!
Closing the issue as it has been solved.
Checklist
Describe the bug
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py", line 79, in torch_function
return func(*args, **kwargs)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.32 GiB. GPU 0 has a total capacity of 79.25 GiB of which 2.27 GiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 36.46 GiB is allocated by PyTorch, and 20.98 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
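If fragmentation rather than total capacity were the cause, the allocator hint from the error message could be passed into the container as an extra environment flag. This is only a sketch of that suggestion; the actual fix in this thread was enabling tensor parallelism, as shown in the comments above.
docker run --gpus all --env PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True ... lmsysorg/sglang:latest python3 -m sglang.launch_server ...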
Reproduction
docker run --gpus all -it -p 30000:30000 -v ~/.cache/huggingface:/root/.cache/huggingface -v /data/xgp:/workspace -v /data/llm:/workspace/model --env "HF_TOKEN=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" --env CUDA_VISIBLE_DEVICES=0,1 --ipc=host lmsysorg/sglang:latest python3 -m sglang.launch_server --model-path /workspace/model/Qwen2.5-72-int4 --host 0.0.0.0 --port 30000 --quantization gptq_marlin --mem-fraction-static 0.9 --disable-cuda-graph
Environment
Python: 3.12.4 | packaged by Anaconda, Inc. | (main, Jun 18 2024, 15:12:24) [GCC 11.2.0]
CUDA available: True
GPU 0,1: NVIDIA A800 80GB PCIe
GPU 0,1 Compute Capability: 8.0
CUDA_HOME: /usr
NVCC: Cuda compilation tools, release 11.5, V11.5.119
CUDA Driver Version: 560.35.03
PyTorch: 2.4.0+cu121
sglang: 0.3.0
flashinfer: Module Not Found
triton: 3.0.0
transformers: 4.44.2
requests: 2.32.3
tqdm: 4.66.5
numpy: 1.26.4
aiohttp: 3.10.5
fastapi: 0.113.0
hf_transfer: Module Not Found
huggingface_hub: 0.24.6
interegular: 0.3.3
packaging: 24.1
PIL: 10.4.0
psutil: 6.0.0
pydantic: 2.9.0
uvicorn: 0.30.6
uvloop: 0.20.0
zmq: 26.2.0
vllm: 0.6.0
multipart: Module Not Found
openai: 1.43.1
anthropic: Module Not Found
litellm: Module Not Found
NVIDIA Topology:
GPU0 GPU1 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NODE 0-23,48-71 0 N/A
GPU1 NODE X 0-23,48-71 0 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
ulimit soft: 1024