
MoE: load only the activated experts to GPU and keep unused experts off-GPU, for DeepSeek-R1 or V3 inference on a consumer GPU #492

Closed
marvin-0042 opened this issue Jan 30, 2025 · 1 comment

Comments

@marvin-0042

Running DeepSeek-R1 or V3 inference requires 8×H100 80GB GPUs due to the huge memory footprint, so it is very challenging to run R1 or V3 inference, with 685B MoE parameters, on a single consumer GPU (e.g., a 24GB 4090) plus limited CPU memory (say 32GB), even with low-bit quantization.

But since V3 and R1 activate only 37B parameters per token (37B weights at INT4 is about 18.5GB), is it possible for MoE inference to load only the activated experts' weights into GPU memory, keep some of the non-activated experts' weights in CPU memory (e.g., 32GB), and leave the majority of the weights on disk (since CPU memory is also limited), loading/unloading these weights only when they are actually used? A rough sketch of what I mean is below.
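For concreteness, here is a minimal sketch of the idea, assuming expert weights are sharded into one file per expert on disk. This is not DeepSeek's actual implementation or checkpoint format: the `expert_NNN.pt` layout, the `ExpertLRUCache` class, and the FFN shapes are all hypothetical names for illustration. A small LRU cache keeps the most recently used experts on the GPU, evicts the least recently used one when full, and loads an expert from disk only when the router selects it.

```python
# Minimal sketch (not DeepSeek's implementation) of on-demand expert
# offloading: each expert's weights live in their own file on disk, a
# small LRU cache keeps the most recently used experts on the GPU, and
# the router only pays the transfer cost for experts it actually selects.
# The file layout (expert_000.pt, ...) and module shapes are hypothetical.
from collections import OrderedDict
from pathlib import Path

import torch
import torch.nn as nn


class ExpertLRUCache:
    def __init__(self, weight_dir: Path, max_gpu_experts: int, device="cuda"):
        self.weight_dir = weight_dir
        self.max_gpu_experts = max_gpu_experts  # cap chosen to fit GPU RAM
        self.device = device
        self.cache: OrderedDict[int, nn.Module] = OrderedDict()  # id -> module

    def get(self, expert_id: int) -> nn.Module:
        # Cache hit: mark the expert as most recently used and reuse it.
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)
            return self.cache[expert_id]
        # Cache miss: evict the least recently used expert if the GPU is full.
        if len(self.cache) >= self.max_gpu_experts:
            _, evicted = self.cache.popitem(last=False)
            del evicted  # GPU memory is freed once the module is unreferenced
        # Load this expert's weights from disk directly onto the GPU.
        state = torch.load(
            self.weight_dir / f"expert_{expert_id:03d}.pt",
            map_location=self.device,
        )
        expert = nn.Sequential(  # stand-in for a real FFN expert
            nn.Linear(1024, 4096), nn.SiLU(), nn.Linear(4096, 1024)
        ).to(self.device)
        expert.load_state_dict(state)
        self.cache[expert_id] = expert
        return expert


def moe_forward(x, router_logits, cache: ExpertLRUCache, top_k: int = 2):
    # Route each token to its top-k experts, fetching experts on demand.
    weights, ids = torch.topk(router_logits.softmax(dim=-1), top_k, dim=-1)
    out = torch.zeros_like(x)
    for k in range(top_k):
        for expert_id in ids[:, k].unique().tolist():
            mask = ids[:, k] == expert_id
            expert = cache.get(expert_id)  # hits disk only on a cache miss
            out[mask] += weights[mask, k : k + 1] * expert(x[mask])
    return out
```

In practice the PCIe transfer on every cache miss would dominate latency, so a real implementation would presumably overlap weight loads with compute, prefetch based on router statistics, and pin frequently routed ("hot") experts on the GPU.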

I'm wondering whether a similar feature is available, or in the works, in the DeepSeek-V3 GitHub repo or in any popular inference framework?

Really appreciate your help!

@mowentian
Contributor

We do not offer private deployment or related support services. Please seek assistance from other communities.
