Llama.cpp: Bringing Power of Local AI to Everyday Consumer Setups #16713

engrtipusultan · 2025-10-22T05:21:58Z

engrtipusultan
Oct 22, 2025

Hi, I have a moderate setup without any dedicated GPU. My main purpose of buying this setup was to buy something within my budget for experimentation while keeping running cost low as well (15W to 35W TDP).

MoE models and llama.cpp providing vulkan back-end is only inference engine which enables AI inference accessible to everyday users.

I am sharing some benchmarks of running models at Q8 (Almost full precision) which everyday consumers might be able to run on their setups. If you have more models to share please go ahead add awareness for other people.

llama.cpp build: fb34984 (6812) Vulkan Backend

My Setup:

Operating System: Ubuntu 24.04.3 LTS
Kernel: Linux 6.14.0-33-generic
Vulkan: Mesa 25.2.5 (apiVersion= 1.4.318)
Hardware: GMKtec M5 PLUS (Mini PC)
CPU: AMD Ryzen 7 5825U (8 cores, 16 Threads)
GPU: Radeon Vega 8 (gfx_target_version= gfx90c)
RAM: 64GB DDR4-3200 (32GB x 2)
Storage: 512 GB M.2 2280 PCIe Gen 3

Conclusion thus far:


Model @ Q8	pp512 (Packet Processing) Token / sec	tg128 (Token Generation) Token / sec	Comments
Qwen3-Coder-30B-A3B	95.76	12.97	Maybe the best option at my setup
Qwen3-30B-A3B-Instruct-2507	95.76	12.97
Qwen3-30B-A3B-Thinking-2507	95.76	12.97
gpt-oss-20b	131.74	11.55
Granite-4.0-h-tiny	201.17	21.15	Best option in terms of memory utilization requirements and speed.
Ling-mini-2.0	227.23	34.29	Fastest option
Ring-mini-2.0	227.23	34.29

Details of benchmarks ran

Model: Qwen3-Coder-30B-A3B same for (Qwen3-30B-A3B-Instruct-2507 and Qwen3-30B-A3B-Thinking-2507)

llama-bench -m /home/tipu/AI/models/ggml-org/Qwen3-Coder-30B-A3B/Qwen3-Coder-30B-A3B-Instruct-Q8_0.gguf --ubatch-size 4096 --batch-size 512 --threads 4 --mmap 0 -r 8

model	size	params	backend	ngl	threads	n_batch	n_ubatch	mmap	test	t/s
qwen3moe 30B.A3B Q8_0	30.25 GiB	30.53 B	Vulkan	99	4	512	4096	0	pp512	95.76 ± 0.78
qwen3moe 30B.A3B Q8_0	30.25 GiB	30.53 B	Vulkan	99	4	512	4096	0	tg128	12.97 ± 0.02

Model: gpt-oss-20b
llama-bench -m /home/tipu/AI/models/other/jinx-gpt-oss/jinx-gpt-oss-20b-mxfp4.gguf --ubatch-size 4096 --batch-size 512 --threads 4 --mmap 0 -r 8

model	size	params	backend	ngl	threads	n_batch	n_ubatch	mmap	test	t/s
gpt-oss 20B F16	12.83 GiB	20.91 B	Vulkan	99	4	512	4096	0	pp512	131.74 ± 0.81
gpt-oss 20B F16	12.83 GiB	20.91 B	Vulkan	99	4	512	4096	0	tg128	11.55 ± 0.01

Model: Granite-4.0-h-tiny
llama-bench -m /home/tipu/AI/models/other/granite-4.0-h-tiny/granite-4.0-h-tiny-Q8_0.gguf --ubatch-size 4096 --batch-size 512 --threads 4 --mmap 0 -r 8

model	size	params	backend	ngl	threads	n_batch	n_ubatch	mmap	test	t/s
granitehybrid 7B.A1B Q8_0	6.88 GiB	6.94 B	Vulkan	99	4	512	4096	0	pp512	201.17 ± 1.52
granitehybrid 7B.A1B Q8_0	6.88 GiB	6.94 B	Vulkan	99	4	512	4096	0	tg128	21.15 ± 0.04

Model: Ling-mini-2.0
llama-bench -m /home/tipu/AI/models/other/Huihui-Ling-mini-2.0/Huihui-Ling-mini-2.0-abliterated-q8_0.gguf --ubatch-size 4096 --batch-size 512 --threads 4 --mmap 0 -r 8

model	size	params	backend	ngl	threads	n_batch	n_ubatch	mmap	test	t/s
bailingmoe2 16B.A1B Q8_0	16.11 GiB	16.26 B	Vulkan	99	4	512	4096	0	pp512	227.23 ± 2.13
bailingmoe2 16B.A1B Q8_0	16.11 GiB	16.26 B	Vulkan	99	4	512	4096	0	tg128	34.29 ± 0.04

engrtipusultan · 2025-10-23T11:03:55Z

engrtipusultan
Oct 23, 2025
Author

Sharing some of my understanding for new comers.

For most of the models with the increase in context packet processing speed decreases. So keep that in mind while choosing your model. Similarly for bigger response generation speed also decreases. Following are some benchmarks:

Qwen3-Coder 30B.A3B

model	size	params	backend	ngl	n_batch	n_ubatch	test	t/s
qwen3moe 30B.A3B Q8_0	24.64 GiB	24.87 B	Vulkan	99	512	4096	pp512	105.52 ± 0.00
qwen3moe 30B.A3B Q8_0	24.64 GiB	24.87 B	Vulkan	99	512	4096	pp32768	27.93 ± 0.00
qwen3moe 30B.A3B Q8_0	24.64 GiB	24.87 B	Vulkan	99	512	4096	tg512	12.70 ± 0.00
qwen3moe 30B.A3B Q8_0	24.64 GiB	24.87 B	Vulkan	99	512	4096	tg16768	5.88 ± 0.00

Ling-mini-2.0

model	size	params	backend	ngl	n_batch	n_ubatch	test	t/s
bailingmoe2 16B.A1B Q8_0	16.11 GiB	16.26 B	Vulkan	99	512	4096	pp512	228.53 ± 2.52
bailingmoe2 16B.A1B Q8_0	16.11 GiB	16.26 B	Vulkan	99	512	4096	pp32768	101.08 ± 0.20
bailingmoe2 16B.A1B Q8_0	16.11 GiB	16.26 B	Vulkan	99	512	4096	tg512	33.82 ± 0.01
bailingmoe2 16B.A1B Q8_0	16.11 GiB	16.26 B	Vulkan	99	512	4096	tg32768	12.87 ± 0.01

granite-4.0-h-tiny
Granite 4.0 introduces a hybrid Mamba-2/transformer architecture. They are said to be have better throughput at higher contexts and larger text generations. Same shows in benchmark.

model	size	params	backend	ngl	n_batch	n_ubatch	test	t/s
granitehybrid 7B.A1B Q8_0	6.88 GiB	6.94 B	Vulkan	99	512	4096	pp512	202.28 ± 0.00
granitehybrid 7B.A1B Q8_0	6.88 GiB	6.94 B	Vulkan	99	512	4096	pp32768	171.45 ± 0.00
granitehybrid 7B.A1B Q8_0	6.88 GiB	6.94 B	Vulkan	99	512	4096	tg512	21.16 ± 0.00
granitehybrid 7B.A1B Q8_0	6.88 GiB	6.94 B	Vulkan	99	512	4096	tg16768	19.50 ± 0.00

Ling linear and Qwen3 next are not support at the moment in llama.cpp (I believe in progress). They are suppose to be better at higher context and larger generation.

0 replies

engrtipusultan · 2025-10-27T05:48:35Z

engrtipusultan
Oct 27, 2025
Author

One more observation. Maybe some expert can explain why.

I have higher packet processing at Q8 as compared to Q4. But lower token generation at Q8 as compared to Q4.
Token generation behavior is expected.
@CISC is it possible for you to spare some time point to any documentation or brief explanation why pocket processing is higher speed at Q8 as compared to Q4.

1 reply

CISC Oct 27, 2025
Collaborator

Quite simply because it's more complex to dequantize Q4 than Q8, which a lot of time is spent doing at prompt processing.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Llama.cpp: Bringing Power of Local AI to Everyday Consumer Setups #16713

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 2 comments 1 reply

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Llama.cpp: Bringing Power of Local AI to Everyday Consumer Setups #16713

Uh oh!

Uh oh!

engrtipusultan Oct 22, 2025

Replies: 2 comments · 1 reply

Uh oh!

Uh oh!

engrtipusultan Oct 23, 2025 Author

Uh oh!

engrtipusultan Oct 27, 2025 Author

Uh oh!

CISC Oct 27, 2025 Collaborator

engrtipusultan
Oct 22, 2025

Replies: 2 comments 1 reply

engrtipusultan
Oct 23, 2025
Author

engrtipusultan
Oct 27, 2025
Author

CISC Oct 27, 2025
Collaborator