[Bug] RuntimeError: Failed to allocate memory for batch_prefill_tmp_v with size 458752000 and alignment 16 in AlignedAllocator #1405

Open

josephydu opened this issue Sep 12, 2024 · 9 comments

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

I ran the same benchmark script on the following two commits:
old: cb99ba4
new: c33d82a
It fails on the new commit but succeeds on the old one.
I get the following error output:
[screenshot of the error output quoted in the issue title]

Reproduction

server:
python3 -m sglang.launch_server --model-path Qwen/Qwen2-7B --host 127.0.0.1 --port 8080 --mem-fraction-static 0.8 --dp-size 2 --load-balance-method round_robin
benchmark:
python3 -m sglang.bench_serving --backend sglang --host 127.0.0.1 --port 8080 --dataset-name random --tokenizer Qwen/Qwen2-7B --model Qwen/Qwen2-7B --random-output-len 1024 --random-input-len 4096 --random-range-ratio 0.5 --seed 1234 --request-rate 15.7 --num-prompts 200

Environment

I ran the script on 8x A100 40G GPUs.

yukavio mentioned this issue Sep 12, 2024
@merrymercy
Contributor

cc @yzh119 @zhyncs

@merrymercy
Contributor

We will take a look soon. In the meantime, you can try increasing this value:

self.flashinfer_workspace_size = 384 * 1024 * 1024
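
For reference, a minimal sketch of that workaround (my own illustration, not an official fix): edit the line in sglang/python/sglang/global_config.py, the file quoted later in this thread; the exact location may differ across versions, but the self. prefix suggests it sits inside GlobalConfig.__init__.

# sglang/python/sglang/global_config.py -- sketch of the workaround only
# Doubling the FlashInfer workspace leaves room for the ~437.5 MiB (458,752,000-byte)
# batch_prefill_tmp_v buffer reported in this issue.
self.flashinfer_workspace_size = 768 * 1024 * 1024  # default shown above: 384 * 1024 * 1024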

@zhyncs zhyncs self-assigned this Sep 12, 2024
@zhyncs
Member

zhyncs commented Sep 12, 2024

Ok I’ll take a look asap

@merrymercy
Contributor

@josephydu Can you try it again with sglang v0.3.1.post3?

I ran the same command on 8x H100 and did not find any issues.

@York-Cheung

York-Cheung commented Sep 25, 2024

Same here. I am using 2x A100, sglang v0.3.1.post3, with CUDA graph disabled.

@josephydu
Author

@josephydu Can you try it again with sglang v0.3.1.post3?

I ran the same command on 8x H100 and did not find any issues.

I still get the problem on 8x A100. But when I increase flashinfer_workspace_size to 384 * 1024 * 1024 * 2, it works.
However, I still don't understand why the default value of flashinfer_workspace_size only needed to be 192 * 1024 * 1024 in the old version, but needs to be 384 * 1024 * 1024 in the new version.
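
A quick back-of-the-envelope check (my own arithmetic, not taken from the sglang code) of why the 384 MiB default is no longer enough for the sizes reported in this thread:

MiB = 1024 * 1024
default_workspace = 384 * MiB      # 402,653,184 bytes
print(458752000 / MiB)             # 437.5 -> batch_prefill_tmp_v size in this issue
print(550502400 / MiB)             # 525.0 -> size in the 405B FP8 report further down
print(458752000 > default_workspace, 550502400 > default_workspace)  # True True
# Both requested temp buffers exceed the 384 MiB default, so doubling
# flashinfer_workspace_size works around the error; why the buffer requirement grew
# between versions is still the open question.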

@dmakhervaks

dmakhervaks commented Sep 26, 2024

@merrymercy I am also getting the same issue when running Llama 405B FP8 from neuralmagic on 8x H100s.

This is how I launch the server: python3 -m sglang.launch_server --model /models/neuralmagic-Meta-Llama-3.1-405B-Instruct-FP8/ --tp 8 --disable-radix

and this is the error I get:

"RuntimeError: Failed to allocate memory for batch_prefill_tmp_v with size 550502400 and alignment 16 in AlignedAllocator"

I get the same error with the following command variations as well

python3 -m sglang.launch_server --model /models/neuralmagic-Meta-Llama-3.1-405B-Instruct-FP8/ --tp 8 --disable-radix --disable-mla

python3 -m sglang.launch_server --model /models/neuralmagic-Meta-Llama-3.1-405B-Instruct-FP8/ --tp 8 --disable-radix --disable-mla --disable-cuda-graph

python3 -m sglang.launch_server --model /projects/xlab/ZLM/models/neuralmagic-Meta-Llama-3.1-405B-Instruct-FP8/ --tp 8 --disable-radix --mem-fraction-static 0.7

I checked, and this problem does not happen in 0.2.7, but from 0.2.14 onwards it does.

Not sure about the versions in between 0.2.7 and 0.2.14.

@josephydu
Author

@merrymercy I am also getting the same issue when running Llama 405B FP8 from neuralmagic on 8x H100s. […]

Maybe you can try increasing flashinfer_workspace_size. It can temporarily solve the problem, but the real cause is still unknown:
sglang/python/sglang/global_config.py
self.flashinfer_workspace_size = 384 * 1024 * 1024

@dmakhervaks

dmakhervaks commented Sep 27, 2024

@josephydu I think I found a pattern, which may help you in debugging this.

On 0.3.0 and up, if I remove --disable-radix-cache, I do not get the error.

i.e., if I run this:

python3 -m sglang.launch_server --model /projects/xlab/ZLM/models/neuralmagic-Meta-Llama-3.1-405B-Instruct-FP8/ --tp 8

instead of

python3 -m sglang.launch_server --model /projects/xlab/ZLM/models/neuralmagic-Meta-Llama-3.1-405B-Instruct-FP8/ --tp 8 --disable-radix-cache

Changing the size of flashinfer_workspace_size gave me a different issue.
