
Load all MoE experts during warmup #11571

Merged: fairydreaming merged 4 commits into ggml-org:master on Mar 14, 2025

Conversation

@fairydreaming (Collaborator) commented Feb 1, 2025

This PR adds a new API call that allows enabling and disabling model warmup mode:

```cpp
LLAMA_API void llama_set_warmup(struct llama_context * ctx, bool warmup);
```

This PR is a somewhat crude hack that allows loading all experts of MoE models during warmup.

The hacky part is the warmup detection - I explicitly examine the ubatch tokens to detect the warmup.
I couldn't find a better way to do it, let me know if one exists.

If warmup mode is enabled, n_expert_used is set to n_expert, which causes all existing experts to be loaded into memory during the llama_decode() call.
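
In other words, the expert-routing code takes the per-token expert count from the context's warmup state rather than only from the model hyperparameters. A minimal sketch of the idea (illustrative only, not the exact llama.cpp code; `cparams.warmup` stands in for however the context stores the flag set by `llama_set_warmup()`):

```cpp
// Decide how many experts each token is routed to when building the graph.
// During warmup, route to every expert so that all expert tensors are read
// (and paged in) at least once; otherwise keep the model's configured value.
const uint32_t n_expert      = hparams.n_expert;
const uint32_t n_expert_used = cparams.warmup ? hparams.n_expert : hparams.n_expert_used;
```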

Fixes #11163

@cpumaxx (Contributor) commented Feb 3, 2025

A quick test with R1 on llama-server shows all experts loaded into memory during warmup. Inference started immediately once the web interface was available.
I will try a test on a non-MoE large model as well to make sure there are no regressions in that case.
Thanks for this fix!

@jukofyork (Contributor)

I can confirm this is working for me, and it loads a couple of times faster than letting it warm up "naturally" (I can see it uses ~2.5 cores instead of ~0.5 cores, so possibly due to avoiding random access on the SSD?).

@ggerganov (Member)

> The hacky part is the warmup detection - I explicitly examine the ubatch tokens to detect the warmup. I couldn't find a better way to do it, let me know if one exists.

I'll consider adding proper support for this in #11213.

@fairydreaming (Collaborator, Author)

> > The hacky part is the warmup detection - I explicitly examine the ubatch tokens to detect the warmup. I couldn't find a better way to do it, let me know if one exists.
>
> I'll consider adding proper support for this in #11213.

@ggerganov If you are going to work on warmup, take a look at this: #11733

TL;DR: Using a 1-token-long sequence (instead of the current 2 tokens, BOS and EOS) in the warmup batch fixes a token generation performance bottleneck (+80% tg t/s with llama-3.1 70B f16) on dual Epyc systems.
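
For illustration, the difference amounts to something like the following (a hedged sketch of typical warmup code, not the exact diff in #11733; `bos`, `eos`, and the surrounding setup are assumed):

```cpp
// Current warmup: decode a 2-token batch containing BOS and EOS.
// llama_token tmp2[] = { bos, eos };
// llama_decode(ctx, llama_batch_get_one(tmp2, 2));

// Proposed in #11733: decode a single-token sequence instead, which avoids
// the dual-socket token-generation slowdown described above.
llama_token tmp1[] = { bos };
llama_decode(ctx, llama_batch_get_one(tmp1, 1));
```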

@jukofyork (Contributor)

@fairydreaming Any chance you can resolve the conflicts for this PR?

I was just about to do the final tests on the MLA PR but need this and #11397 to do it! :)

@fairydreaming (Collaborator, Author)

@jukofyork It's not a matter of resolving the conflicts. Since #12181 is now merged, the code on which I based this little hack is no longer there. It would basically have to be implemented from scratch on top of the current code.

I guess I will close it for now, as it's no longer a valid solution.

@jukofyork (Contributor)

@fairydreaming yeah, I realised after asking just how extensive the changes have been! 😮

I've just resorted to capturing a copy of master from before all the changes and will wait until things settle down.

fairydreaming reopened this on Mar 14, 2025
@fairydreaming (Collaborator, Author)

I reimplemented this on the current master. This time I added a proper API call for enabling warmup mode:

```cpp
LLAMA_API void llama_set_warmup(struct llama_context * ctx, bool warmup);
```
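
For context, a caller wraps its initial throwaway decode with the new call, roughly like this (a sketch using assumed current API names such as `llama_vocab_bos()` and `llama_batch_get_one()`, not the exact change made in common):

```cpp
// Enable warmup mode so MoE models route every token to all experts,
// forcing each expert tensor to be read once.
llama_set_warmup(ctx, true);

// Throwaway decode just to touch the weights.
llama_token bos = llama_vocab_bos(vocab);   // vocab obtained via llama_model_get_vocab(model)
llama_decode(ctx, llama_batch_get_one(&bos, 1));

// Back to normal expert routing for real inference.
llama_set_warmup(ctx, false);
```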

fairydreaming merged commit 8fcb563 into ggml-org:master on Mar 14, 2025
47 checks passed
jpohhhh pushed a commit to Telosnex/llama.cpp that referenced this pull request Mar 14, 2025
* llama : introduce llama_set_warmup() API call that controls warmup mode; use all MoE experts during warmup

* common : use new API to enable warmup mode during model warmup

---------

Co-authored-by: Stanisław Szymczyk <[email protected]>
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Mar 19, 2025
* llama : introduce llama_set_warmup() API call that controls warmup mode; use all MoE experts during warmup

* common : use new API to enable warmup mode during model warmup

---------

Co-authored-by: Stanisław Szymczyk <[email protected]>
Successfully merging this pull request may close these issues:

Misc. bug: model warmup doesn't work correctly for MoE models (#11163)