Load all MoE experts during warmup #11571
Conversation
A quick test with R1 on llama-server shows all experts loaded into memory during warmup. Inference started immediately once the web interface was available.
I can confirm this is working for me, and it loads a couple of times faster than letting it warm up "naturally" (I can see it uses ~2.5 cores instead of ~0.5 cores, so possibly due to avoiding random access on the SSD?)
I'll consider adding proper support for this in #11213.
@ggerganov if you are going to work on warmup then take a look at this: #11733. TL;DR: using a 1-token sequence (instead of the current 2-token BOS+EOS batch) in the warmup batch fixes a token-generation performance bottleneck (+80% to tg t/s with llama-3.1 70b f16) on dual Epyc systems.
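For illustration only, here is a fragment sketching what the suggested change could look like in the warmup routine. It assumes the two-argument `llama_batch_get_one()` signature from master at the time; helper names such as `llama_kv_cache_clear()` and `llama_token_bos()` vary across versions, so treat this as a sketch rather than the exact patch:

```cpp
// fragment of the warmup path, after the model and context are created:
// decode a single BOS token instead of the { BOS, EOS } pair
std::vector<llama_token> tmp = { llama_token_bos(model) };
llama_decode(ctx, llama_batch_get_one(tmp.data(), (int32_t) tmp.size()));

// discard the warmup state afterwards (helper name varies by version)
llama_kv_cache_clear(ctx);
```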
@fairydreaming Any chance you can resolve the conflicts for this PR? I was just about to do the final tests on the MLA PR but need this and #11397 to do it! :)
@jukofyork It's not a matter of resolving the conflicts. Since #12181 is now merged, the code on which I based this little hack is no longer there. It would basically have to be reimplemented from scratch on top of the current code. I guess I will close it for now, as it's no longer a valid solution.
@fairydreaming yeah, I realised after asking just how extensive the changes have been! 😮 I've just resorted to capturing a copy of the master before all the changes and gonna wait until things settle down. |
I reimplemented this on the current master. This time I added a proper API call for enabling warmup mode:
* llama : introduce llama_set_warmup() API call that controls warmup mode; use all MoE experts during warmup
* common : use new API to enable warmup mode during model warmup

Co-authored-by: Stanisław Szymczyk <[email protected]>
This PR adds a new API call that allows enabling and disabling model warmup mode:
LLAMA_API void llama_set_warmup(struct llama_context * ctx, bool warmup);
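For illustration, a minimal caller-side sketch of how the call could be used; only `llama_set_warmup()` itself is from this PR, the surrounding warmup decode is assumed to already exist in the caller:

```cpp
// fragment: around the warmup decode in the caller (e.g. common's warmup)
llama_set_warmup(ctx, true);   // MoE graphs now route through all experts

// ... run the usual short warmup llama_decode() here ...

llama_set_warmup(ctx, false);  // restore normal top-k expert routing
```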
The original version of this PR was a somewhat crude hack that allowed loading all experts in MoE models during warmup. The hacky part was the warmup detection: I explicitly examined the ubatch tokens to detect warmup. I couldn't find a better way to do it; let me know if one exists.
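Roughly, that detection amounted to something like the following. This is a reconstruction, not the exact code; the field names follow `llama_ubatch`, but the helper itself is hypothetical:

```cpp
// heuristic from the original hack: the default warmup batch was exactly
// { BOS, EOS }, so a ubatch matching that pattern was treated as warmup
static bool is_warmup_ubatch(const llama_ubatch & ubatch,
                             llama_token tok_bos, llama_token tok_eos) {
    return ubatch.n_tokens == 2 &&
           ubatch.token[0] == tok_bos &&
           ubatch.token[1] == tok_eos;
}
```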
If warmup mode is enabled, n_expert_used is set to n_expert, which causes all existing experts to be loaded into memory during the llama_decode() call.

Fixes #11163
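Conceptually, the override reduces to something like the sketch below, placed where the MoE FFN graph is built. The variable names follow llama.cpp's hparams, but the exact location and form are version-dependent:

```cpp
// sketch: in warmup mode every expert is selected, so the first decode
// pages all expert tensors into memory instead of just the top-k ones
const uint32_t n_expert      = hparams.n_expert;
const uint32_t n_expert_used = warmup ? n_expert : hparams.n_expert_used;
```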