
MoE: load only the activated experts to GPU and keep unused experts off-GPU, for DeepSeek-R1 or V3 inference on a consumer GPU #492

Closed
marvin-0042 opened this issue Jan 30, 2025 · 1 comment

Comments

@marvin-0042

Running DeepSeek-R1 or V3 inference requires 8×H100 80GB GPUs due to the huge memory footprint, so it is very challenging to run R1 or V3 inference, with 685B MoE parameters, on a single consumer GPU (e.g., a 24GB 4090) plus limited CPU memory (say 32GB), even with low-bit quantization.

But since V3 and R1 activate only 37B parameters per token (37B weights at INT4 is about 18.5GB), is it possible for MoE inference to load only the activated experts' weights into GPU memory, keep some of the non-activated experts' weights in CPU memory (e.g., 32GB), and leave the majority of the weights on disk (since CPU memory is also limited), loading/unloading these weights only when they are actually used? A rough sketch of what I mean is below.
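For concreteness, here is a minimal sketch of the idea, assuming expert weights are sharded into one file per expert on disk. This is not DeepSeek's actual implementation or checkpoint format: the `expert_NNN.pt` layout, the `ExpertLRUCache` class, and the FFN shapes are all hypothetical names for illustration. A small LRU cache keeps the most recently used experts on the GPU, evicts the least recently used one when full, and loads an expert from disk only when the router selects it.

```python
# Minimal sketch (not DeepSeek's implementation) of on-demand expert
# offloading: each expert's weights live in their own file on disk, a
# small LRU cache keeps the most recently used experts on the GPU, and
# the router only pays the transfer cost for experts it actually selects.
# The file layout (expert_000.pt, ...) and module shapes are hypothetical.
from collections import OrderedDict
from pathlib import Path

import torch
import torch.nn as nn


class ExpertLRUCache:
    def __init__(self, weight_dir: Path, max_gpu_experts: int, device="cuda"):
        self.weight_dir = weight_dir
        self.max_gpu_experts = max_gpu_experts  # cap chosen to fit GPU RAM
        self.device = device
        self.cache: OrderedDict[int, nn.Module] = OrderedDict()  # id -> module

    def get(self, expert_id: int) -> nn.Module:
        # Cache hit: mark the expert as most recently used and reuse it.
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)
            return self.cache[expert_id]
        # Cache miss: evict the least recently used expert if the GPU is full.
        if len(self.cache) >= self.max_gpu_experts:
            _, evicted = self.cache.popitem(last=False)
            del evicted  # GPU memory is freed once the module is unreferenced
        # Load this expert's weights from disk directly onto the GPU.
        state = torch.load(
            self.weight_dir / f"expert_{expert_id:03d}.pt",
            map_location=self.device,
        )
        expert = nn.Sequential(  # stand-in for a real FFN expert
            nn.Linear(1024, 4096), nn.SiLU(), nn.Linear(4096, 1024)
        ).to(self.device)
        expert.load_state_dict(state)
        self.cache[expert_id] = expert
        return expert


def moe_forward(x, router_logits, cache: ExpertLRUCache, top_k: int = 2):
    # Route each token to its top-k experts, fetching experts on demand.
    weights, ids = torch.topk(router_logits.softmax(dim=-1), top_k, dim=-1)
    out = torch.zeros_like(x)
    for k in range(top_k):
        for expert_id in ids[:, k].unique().tolist():
            mask = ids[:, k] == expert_id
            expert = cache.get(expert_id)  # hits disk only on a cache miss
            out[mask] += weights[mask, k : k + 1] * expert(x[mask])
    return out
```

In practice the PCIe transfer on every cache miss would dominate latency, so a real implementation would presumably overlap weight loads with compute, prefetch based on router statistics, and pin frequently routed ("hot") experts on the GPU.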

I'm wondering whether a similar feature is available, or in the works, in the DeepSeek-V3 GitHub repo or in any popular inference framework?

Really appreciate your help!

@mowentian
Contributor

We do not offer private deployment or related support services. Please seek assistance from other communities.
