@am17an (Collaborator) commented on Jan 2, 2026:

Cache the quantized activations and the mul_mat_id_helper output so they can be reused for the gate tensor. This gives modest TG and PP gains.
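
A minimal sketch of the caching idea, assuming a per-graph map keyed on the source tensor. The names here (`quantized_src1_cache`, `get_or_quantize`) are illustrative, not the PR's actual code, and a real implementation would hold device buffers rather than host vectors:

```cpp
// Hedged sketch: cache the quantized activations so that two matmuls that
// consume the same src1 (e.g. the MoE "up" and "gate" projections) only pay
// the quantization cost once per graph evaluation.
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct quantized_src1_cache {
    struct entry {
        std::vector<uint8_t> q_data;    // quantized activations (device buffer in real code)
        size_t               nbytes = 0;
    };

    // Keyed on the source tensor's address; only valid within one graph
    // evaluation, after which the map is cleared.
    std::unordered_map<const void *, entry> entries;

    // Returns the cached buffer if this src1 was already quantized during the
    // current graph, otherwise runs the supplied quantization callback once
    // and stores the result.
    template <typename QuantizeFn>
    const entry & get_or_quantize(const void * src1, size_t nbytes, QuantizeFn && quantize) {
        auto it = entries.find(src1);
        if (it != entries.end() && it->second.nbytes == nbytes) {
            return it->second;          // hit: skip the second quantization pass
        }
        entry e;
        e.q_data.resize(nbytes);
        quantize(e.q_data.data());      // miss: quantize once and remember it
        e.nbytes = nbytes;
        return entries[src1] = std::move(e);
    }

    void clear() { entries.clear(); }   // call once per graph evaluation
};
```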

@github-actions bot added the "Nvidia GPU" (Issues specific to Nvidia GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning) labels on Jan 2, 2026.
@JohannesGaessler (Collaborator) commented:

I think the bigger potential of quantizing the activations lies in doing the quantization prior to copying them to other GPUs in a multi-GPU scenario. For --split-mode row this has made a significant difference, though the quantization is still duplicated. It's still unclear, though, how much benefit there would be with a proper parallelization across multiple GPUs.
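
A rough sketch of that idea, not code from this PR: quantize once on the GPU that owns the activations and broadcast the compact buffer with cudaMemcpyPeerAsync, instead of shipping FP32 and re-quantizing on every device. The kernel name `quantize_activations_q8_1` and the buffer layout are placeholders.

```cpp
// Hedged sketch of quantize-before-copy for multi-GPU (--split-mode row style).
#include <cuda_runtime.h>
#include <cstddef>

// Placeholder prototype standing in for a Q8_1-style activation quantization kernel.
void quantize_activations_q8_1(const float * x_f32, void * x_q8, size_t n_elems, cudaStream_t stream);

void broadcast_quantized_src1(
        const float * x_f32_dev0,   // FP32 activations resident on device 0
        void ** x_q8_per_dev,       // pre-allocated quantized buffers, one per device
        size_t n_elems,
        size_t q8_nbytes,           // quantized size: roughly n_elems bytes plus per-block scales
        int n_devices,
        cudaStream_t stream_dev0) {
    // 1) Quantize exactly once, on the device that owns the activations.
    quantize_activations_q8_1(x_f32_dev0, x_q8_per_dev[0], n_elems, stream_dev0);

    // 2) Ship the compact buffer to every other GPU; the transfer is roughly
    //    4x smaller than copying the FP32 activations would be.
    for (int dev = 1; dev < n_devices; ++dev) {
        cudaMemcpyPeerAsync(x_q8_per_dev[dev], dev,
                            x_q8_per_dev[0],  0,
                            q8_nbytes, stream_dev0);
    }
}
```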

@am17an (Collaborator, Author) commented on Jan 3, 2026:

Yes, we can reuse the same structure for copying over the quantized activations as well. Some perf numbers on a 5090:

| Model | Test | t/s (c8a3798) | t/s (cuda-cache) | Speedup |
| --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | pp2048 | 14084.00 | 14407.10 | 1.02 |
| gpt-oss 20B MXFP4 MoE | pp4096 | 14023.45 | 14320.17 | 1.02 |
| gpt-oss 20B MXFP4 MoE | pp8192 | 13704.75 | 14011.17 | 1.02 |
| qwen3moe 30B.A3B Q4_0 | pp2048 | 7856.41 | 8047.80 | 1.02 |
| qwen3moe 30B.A3B Q4_0 | pp4096 | 7663.71 | 7846.45 | 1.02 |
| qwen3moe 30B.A3B Q4_0 | pp8192 | 7283.03 | 7444.40 | 1.02 |

@am17an (Collaborator, Author) commented on Jan 6, 2026:

Adding the cache to mmvq helps with TG as well:

| Model | Test | t/s (c8a3798) | t/s (cuda-cache) | Speedup |
| --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | tg128 | 389.55 | 396.06 | 1.02 |
| mistral3 14B Q8_0 | tg128 | 100.54 | 101.12 | 1.01 |
| qwen3moe 30B.A3B Q4_0 | tg128 | 325.67 | 332.69 | 1.02 |
