
Conversation

@DocShotgun
Contributor

#18530 / #17026

Makes the min_batch_size for triggering op offload configurable via an env var, defaulting to the prior hardcoded value of 32 when unset so current behavior is kept intact.

This is helpful when running large MoEs with a significant portion of their weights stored in host buffers on the CPU, which creates a bottleneck when op offload triggers for small batches that are still larger than the default of 32. The optimal value, or "break-even point", depends on the characteristics of the hardware and model, and is best determined empirically (ref: #17026 (comment)).

* makes the min_batch_size for triggering op offload configurable via env var, defaulting to the prior hardcoded value of 32
@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs Vulkan Issues specific to the Vulkan backend ggml changes relating to the ggml tensor library for machine learning SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language Apple Metal https://en.wikipedia.org/wiki/Metal_(API) Ascend NPU issues specific to Ascend NPUs labels Jan 2, 2026
Member

@ggerganov ggerganov left a comment

The env var should be read only once upon device initialization and then queried from the device context.

@DocShotgun
Contributor Author

DocShotgun commented Jan 2, 2026

Took a crack at this, let me know if you'd recommend doing anything differently.

AI-assisted with searching for the relevant code in the Metal backend and with debugging compile failures.

  • For CUDA, CANN, SYCL, and Vulkan, added op_offload_min_batch_size to the device context struct. We read the env var once prior to the loop that creates the device context(s), and then assign this value to the context for each device.
  • For Metal we instead add the field to the device props, which we can then fetch from the offload op check.
  • dev is no longer flagged as unused in the backend offload op checks. In Metal, op was also previously flagged as unused.
  • CANN had an issue where ggml_backend_cann_offload_op was declared before ggml_backend_cann_device_context. This caused no problems previously, when the device context was unused. I moved it down to roughly match the other backends.

I tested CUDA locally on a Qwen3-Coder-30B-A3B-Instruct-IQ4_XS.gguf on my 7950X + 4090 Windows machine with -b 4096 -ub 4096 and --cpu-moe and it seems to work as expected:

  • GGML_OP_OFFLOAD_MIN_BATCH=50000 with 1532 tokens prompt, op offload not triggered -> PP 146.81 T/s
  • GGML_OP_OFFLOAD_MIN_BATCH=64 with 1532 tokens prompt, op offload triggered -> PP 2320.88 T/s
  • No env var set with 1532 tokens prompt, defaults to 32, op offload triggered -> PP 2322.76 T/s
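The runs above can be reproduced along these lines; the binary name and model path are placeholders, and the `${VAR:-32}` expansion below simply mirrors the in-code fallback:

```shell
# Example invocations (paths are placeholders):
#   GGML_OP_OFFLOAD_MIN_BATCH=50000 ./llama-cli -m model.gguf -b 4096 -ub 4096 --cpu-moe
#   GGML_OP_OFFLOAD_MIN_BATCH=64    ./llama-cli -m model.gguf -b 4096 -ub 4096 --cpu-moe

# Leaving the variable unset preserves the old behavior; this shell
# default mirrors the hardcoded fallback of 32:
min_batch="${GGML_OP_OFFLOAD_MIN_BATCH:-32}"
echo "min batch: ${min_batch}"
```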

@taronaeo taronaeo linked an issue Jan 3, 2026 that may be closed by this pull request
Collaborator

@NeoZhangJianyu NeoZhangJianyu left a comment

Looks good to me for SYCL backend part.

@0cc4m
Collaborator

0cc4m commented Jan 5, 2026

The Vulkan changes are fine.

@am17an
Collaborator

am17an commented Jan 8, 2026

@ggerganov merge?

@ggerganov ggerganov merged commit 9a5724d into ggml-org:master Jan 8, 2026
79 of 80 checks passed
gary149 pushed a commit to gary149/llama-agent that referenced this pull request Jan 8, 2026
* ggml: add env var GGML_OP_OFFLOAD_MIN_BATCH
* makes the min_batch_size for triggering op offload configurable via env var, defaulting to the prior hardcoded value of 32

* ggml: read GGML_OP_OFFLOAD_MIN_BATCH once and store to dev ctx

* cann: forward declaration of device context struct

* cann: move offload op check after device context declaration

* cuda: fix whitespace

Co-authored-by: Aman Gupta <[email protected]>

---------

Co-authored-by: Aman Gupta <[email protected]>
Development

Successfully merging this pull request may close these issues.

Feature Request: Add configurable op offload min batch size