@am17an (Collaborator) commented on Jan 2, 2026:

Cache the quantized activations and the mul_mat_id_helper output so they can be reused for the gate tensor. This gives modest TG and PP gains.
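
A minimal sketch of the caching idea, assuming a per-graph map keyed on the source tensor. The names here (`quantized_src1_cache`, `get_or_quantize`) are illustrative, not the PR's actual code, and a real implementation would hold device buffers rather than host vectors:

```cpp
// Hedged sketch: cache the quantized activations so that two matmuls that
// consume the same src1 (e.g. the MoE "up" and "gate" projections) only pay
// the quantization cost once per graph evaluation.
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct quantized_src1_cache {
    struct entry {
        std::vector<uint8_t> q_data;    // quantized activations (device buffer in real code)
        size_t               nbytes = 0;
    };

    // Keyed on the source tensor's address; only valid within one graph
    // evaluation, after which the map is cleared.
    std::unordered_map<const void *, entry> entries;

    // Returns the cached buffer if this src1 was already quantized during the
    // current graph, otherwise runs the supplied quantization callback once
    // and stores the result.
    template <typename QuantizeFn>
    const entry & get_or_quantize(const void * src1, size_t nbytes, QuantizeFn && quantize) {
        auto it = entries.find(src1);
        if (it != entries.end() && it->second.nbytes == nbytes) {
            return it->second;          // hit: skip the second quantization pass
        }
        entry e;
        e.q_data.resize(nbytes);
        quantize(e.q_data.data());      // miss: quantize once and remember it
        e.nbytes = nbytes;
        return entries[src1] = std::move(e);
    }

    void clear() { entries.clear(); }   // call once per graph evaluation
};
```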

@github-actions bot added the "Nvidia GPU" (Issues specific to Nvidia GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning) labels on Jan 2, 2026.
@JohannesGaessler (Collaborator) commented:

I think the bigger potential of quantizing the activations lies in doing the quantization prior to copying them to other GPUs in a multi-GPU scenario. For --split-mode row this has made a significant difference, though the quantization is still duplicated. It's still unclear, though, how much benefit there would be with a proper parallelization across multiple GPUs.
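
A rough sketch of that idea, not code from this PR: quantize once on the GPU that owns the activations and broadcast the compact buffer with cudaMemcpyPeerAsync, instead of shipping FP32 and re-quantizing on every device. The kernel name `quantize_activations_q8_1` and the buffer layout are placeholders.

```cpp
// Hedged sketch of quantize-before-copy for multi-GPU (--split-mode row style).
#include <cuda_runtime.h>
#include <cstddef>

// Placeholder prototype standing in for a Q8_1-style activation quantization kernel.
void quantize_activations_q8_1(const float * x_f32, void * x_q8, size_t n_elems, cudaStream_t stream);

void broadcast_quantized_src1(
        const float * x_f32_dev0,   // FP32 activations resident on device 0
        void ** x_q8_per_dev,       // pre-allocated quantized buffers, one per device
        size_t n_elems,
        size_t q8_nbytes,           // quantized size: roughly n_elems bytes plus per-block scales
        int n_devices,
        cudaStream_t stream_dev0) {
    // 1) Quantize exactly once, on the device that owns the activations.
    quantize_activations_q8_1(x_f32_dev0, x_q8_per_dev[0], n_elems, stream_dev0);

    // 2) Ship the compact buffer to every other GPU; the transfer is roughly
    //    4x smaller than copying the FP32 activations would be.
    for (int dev = 1; dev < n_devices; ++dev) {
        cudaMemcpyPeerAsync(x_q8_per_dev[dev], dev,
                            x_q8_per_dev[0],  0,
                            q8_nbytes, stream_dev0);
    }
}
```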

@am17an (Collaborator, Author) commented on Jan 3, 2026:

Yes, we can reuse the same structure for copying over the quantized activations as well. Some perf numbers on a 5090:

| Model | Test | t/s (c8a3798) | t/s (cuda-cache) | Speedup |
| --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | pp2048 | 14084.00 | 14407.10 | 1.02 |
| gpt-oss 20B MXFP4 MoE | pp4096 | 14023.45 | 14320.17 | 1.02 |
| gpt-oss 20B MXFP4 MoE | pp8192 | 13704.75 | 14011.17 | 1.02 |
| qwen3moe 30B.A3B Q4_0 | pp2048 | 7856.41 | 8047.80 | 1.02 |
| qwen3moe 30B.A3B Q4_0 | pp4096 | 7663.71 | 7846.45 | 1.02 |
| qwen3moe 30B.A3B Q4_0 | pp8192 | 7283.03 | 7444.40 | 1.02 |

@am17an (Collaborator, Author) commented on Jan 6, 2026:

Adding the cache to mmvq helps with TG as well:

| Model | Test | t/s (c8a3798) | t/s (cuda-cache) | Speedup |
| --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | tg128 | 389.55 | 396.06 | 1.02 |
| mistral3 14B Q8_0 | tg128 | 100.54 | 101.12 | 1.01 |
| qwen3moe 30B.A3B Q4_0 | tg128 | 325.67 | 332.69 | 1.02 |
