
Feature Request: Additional command-line arguments for llama.cpp (custom offloading of MoE layers) #1190

@MichK-tech

Description


Normally, -ot ".ffn_.*_exps.=CPU" offloads all MoE layers to the CPU. This effectively allows you to fit all non-MoE layers on a single GPU, improving generation speed. You can customize the regex to keep more layers on the GPU if you have more GPU capacity.

If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU". This offloads only the up and down projection MoE layers, keeping the gate projections on the GPU.

Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory. This offloads only the up projection MoE layers.

You can also customize the regex further. For example, -ot ".(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9]).ffn_(gate|up|down)_exps.=CPU" offloads the gate, up, and down MoE layers, but only from the 6th layer onwards.

From here: https://docs.unsloth.ai/models/gpt-oss-how-to-run-and-fine-tune#improving-generation-speed
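
For context, here is a minimal sketch of what a full invocation could look like. The model path, context size, and -ngl value below are placeholders (not taken from this issue); only the -ot regex itself comes from the tips above.

```bash
# Hypothetical example (model path, context size, and -ngl value are placeholders).
# -ngl 99 offloads all layers to the GPU by default; the -ot regex then overrides
# the MoE expert tensors so they stay in CPU memory.
./llama-server \
  -m ./models/model-Q4_K_M.gguf \
  -ngl 99 \
  -ot ".ffn_.*_exps.=CPU" \
  -c 8192
```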

Please also consider: the latest llama.cpp release also introduces a high-throughput mode; use llama-parallel. Read more about it here. You can also quantize the KV cache (to 4 bits, for example) to reduce VRAM/RAM movement, which can also speed up generation.
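
As a rough sketch of the KV-cache suggestion (flag names and accepted values can vary between llama.cpp releases, so check llama-server --help on your build; the model path and other values are again placeholders):

```bash
# Hypothetical example: quantize both the K and V caches to 4-bit (q4_0).
# Quantizing the V cache generally requires flash attention to be enabled (-fa).
./llama-server \
  -m ./models/model-Q4_K_M.gguf \
  -ngl 99 \
  -ot ".ffn_.*_exps.=CPU" \
  -fa \
  --cache-type-k q4_0 \
  --cache-type-v q4_0
```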
