[Bug] RuntimeError: Failed to allocate memory for batch_prefill_tmp_v with size 458752000 and alignment 16 in AlignedAllocator #1405
Comments
We will take a look soon. In the meantime, you can try to increase this value: sglang/python/sglang/global_config.py (line 26 in c33d82a).
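For anyone who wants to try this while the fix is pending, here is a minimal sketch of the override. The attribute name flashinfer_workspace_size is taken from this thread; the import path, the presence of a module-level global_config instance, and the default value are assumptions that may differ across versions.

```python
# Workaround sketch, not an official fix: enlarge the FlashInfer workspace buffer
# that backs temporary allocations such as batch_prefill_tmp_v.
# Assumption: sglang.global_config exposes a module-level `global_config` instance
# with a `flashinfer_workspace_size` attribute (the value at global_config.py line 26).
from sglang.global_config import global_config

print(global_config.flashinfer_workspace_size)       # inspect the current default
global_config.flashinfer_workspace_size = 768 << 20  # e.g. 768 MiB, illustrative value only

# launch_server runs in its own process, so in practice the simplest way to apply
# this is to edit the default directly in sglang/python/sglang/global_config.py
# before starting the server.
```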
Ok, I'll take a look ASAP.
@josephydu Can you try it again with sglang v0.3.1.post3? I ran the same command on 8x H100 and did not find any issues.
Same here. I use 2x A100, sglang v0.3.1.post3, with CUDA graph disabled.
I still got the problem on 8x A100. But when I try to increase
@merrymercy I am also getting the same issue when running Llama 405B FP8 from neuralmagic on 8x H100s.
This is how I launch the server:
python3 -m sglang.launch_server --model /models/neuralmagic-Meta-Llama-3.1-405B-Instruct-FP8/ --tp 8 --disable-radix
and this is the error I get:
"RuntimeError: Failed to allocate memory for batch_prefill_tmp_v with size 550502400 and alignment 16 in AlignedAllocator"
I get the same error with the following command variations as well:
python3 -m sglang.launch_server --model /models/neuralmagic-Meta-Llama-3.1-405B-Instruct-FP8/ --tp 8 --disable-radix --disable-mla
python3 -m sglang.launch_server --model /models/neuralmagic-Meta-Llama-3.1-405B-Instruct-FP8/ --tp 8 --disable-radix --disable-mla --disable-cuda-graph
python3 -m sglang.launch_server --model /projects/xlab/ZLM/models/neuralmagic-Meta-Llama-3.1-405B-Instruct-FP8/ --tp 8 --disable-radix --mem-fraction-static 0.7
I checked, and this problem does not happen in 0.2.7, but it does in 0.2.14 and onwards. I am not sure about the versions between 0.2.7 and 0.2.14.
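For context, a quick conversion of the allocation sizes reported in this thread; this is plain arithmetic, not sglang code, and the default workspace size it is compared against is an assumption.

```python
# Convert the allocation sizes reported in this issue to MiB.
for name, nbytes in [
    ("batch_prefill_tmp_v (Qwen2-7B repro, issue title)", 458_752_000),
    ("batch_prefill_tmp_v (Llama 405B FP8, 8x H100)", 550_502_400),
]:
    print(f"{name}: {nbytes / (1 << 20):.1f} MiB")

# Prints 437.5 MiB and 525.0 MiB. If the default flashinfer_workspace_size were,
# say, 384 MiB (an assumed figure), either request would overflow it, which is why
# increasing that value is suggested in this thread.
```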
Maybe you can try to increase
@josephydu I think I found a pattern, which may help you in debugging this. On 0.3.0 and up, if I remove "disable-radix-cache", I do not get the error, i.e. the same launch command succeeds without that flag but fails with it.
Changing the size of flashinfer_workspace_size gave me a different issue.
Checklist
Describe the bug
I run the same benchmark script on the following two commits:
old: cb99ba4
new: c33d82a
It fails on the new commit but succeeds on the old commit.
I get the following error output:
Reproduction
server:
python3 -m sglang.launch_server --model-path Qwen/Qwen2-7B --host 127.0.0.1 --port 8080 --mem-fraction-static 0.8 --dp-size 2 --load-balance-method round_robin
benchmark:
python3 -m sglang.bench_serving --backend sglang --host 127.0.0.1 --port 8080 --dataset-name random --tokenizer Qwen/Qwen2-7B --model Qwen/Qwen2-7B --random-output-len 1024 --random-input-len 4096 --random-range-ratio 0.5 --seed 1234 --request-rate 15.7 --num-prompts 200
Environment
I run the script on 8x A100 40G GPUs.