
model : do not repack if a GPU device is present #12498


Merged: 1 commit into master on Mar 21, 2025

Conversation

ggerganov
Member

fix #12481
fix #12490
ref #12481 (comment)

Disable repacking (i.e. extra buffer types) if a GPU device is going to be available.
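The gist of the change can be sketched as follows. This is an illustrative model only, not the actual ggml/llama.cpp API: the enum and function names here are invented for the sketch.

```cpp
#include <string>
#include <vector>

// Hypothetical sketch: the real logic lives in llama.cpp's buffer-type
// selection code; these identifiers are illustrative.
enum class dev_type { ACCEL, GPU, CPU };

// Build the priority list of buffer types for CPU-resident weights.
// With this PR, the CPU "extra" (repacked) buffer types are skipped
// whenever a GPU device is present.
std::vector<std::string> make_cpu_buft_list(const std::vector<dev_type> & devices) {
    bool has_gpu = false;
    for (dev_type d : devices) {
        if (d == dev_type::GPU) {
            has_gpu = true;
        }
    }

    std::vector<std::string> bufts = {"ACCEL"};
    if (!has_gpu) {
        bufts.push_back("CPU_extra"); // repacked formats: CPU-only runs
    }
    bufts.push_back("GPU_host");
    bufts.push_back("CPU");
    return bufts;
}
```

With a GPU present, `CPU_extra` never enters the list, so weights that stay on the CPU remain in their original layout and can still be offloaded dynamically.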

ggml-ci
@ggerganov ggerganov requested a review from slaren March 21, 2025 13:39
@ggerganov ggerganov merged commit af04481 into master Mar 21, 2025
56 of 58 checks passed
@ggerganov ggerganov deleted the gg/repack-skip-if-gpu branch March 21, 2025 14:14
Ivy233 pushed a commit to Ivy233/llama.cpp that referenced this pull request Mar 23, 2025
@Djip007
Contributor

Djip007 commented Mar 24, 2025

Why disable it rather than simply change the order:
// CPU: ACCEL -> CPU extra -> GPU host -> CPU
to
// CPU: ACCEL -> GPU host -> CPU extra -> CPU
?
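The reordering proposed here could look like the sketch below. Again a hypothetical illustration rather than the real llama.cpp code; the buffer-type names are placeholders.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical sketch of the alternative: keep the CPU extra (repacked)
// buffer types, but move GPU host memory ahead of them in the priority
// list, so it is preferred whenever a GPU is available.
std::vector<std::string> reorder_buft_list(std::vector<std::string> bufts) {
    auto extra = std::find(bufts.begin(), bufts.end(), "CPU_extra");
    auto host  = std::find(bufts.begin(), bufts.end(), "GPU_host");
    if (extra != bufts.end() && host != bufts.end() && extra < host) {
        std::iter_swap(extra, host); // GPU_host now outranks CPU_extra
    }
    return bufts;
}
```

The difference from disabling: tensors that no GPU host buffer supports can still fall back to the repacked CPU formats instead of the plain CPU layout.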

@ggerganov
Member Author

I think keeping the non-offloaded layers unrepacked would allow the ops to be moved dynamically to the GPU, which generally ends up more efficient than using the optimized repacked CPU implementations.

@Djip007
Contributor

Djip007 commented Mar 26, 2025

Yes, I understood that you prefer to move the ops to the GPU dynamically.
But wouldn't changing the order in which the buffer types are added (their priority) have the same effect in that case, without completely disabling them?

@ggerganov
Member Author

Hm, I guess you might be right. Do you want to give this a try, or should I go ahead and create a PR?

@Djip007
Contributor

Djip007 commented Mar 28, 2025

OK, I'll give it a try!

=> #12632 🤞

Djip007 pushed a commit to Djip007/llama.cpp that referenced this pull request Mar 28, 2025
This allows using the GPU host buffer over CPU repack when possible.
It has the same effect of resolving the issue (ggml-org#12498) without
completely disabling the CPU extra buffer types.
slaren pushed a commit that referenced this pull request Mar 29, 2025
… CPU (#12632)

This allows using the GPU host buffer over CPU repack when possible.
It has the same effect of resolving the issue (#12498) without
completely disabling the CPU extra buffer types.

Co-authored-by: philou <philou@framework>