Llama.cpp: Bringing Power of Local AI to Everyday Consumer Setups #16713
Replies: 1 comment
Sharing some of my understanding for newcomers. For most models, prompt processing speed decreases as the context grows, so keep that in mind when choosing your model. Similarly, generation speed decreases as the response gets longer. Following are some benchmarks:
Qwen3-Coder-30B-A3B
Ling-mini-2.0
granite-4.0-h-tiny
Ling linear and Qwen3-Next are not supported in llama.cpp at the moment (I believe support is in progress). They are supposed to be better at higher context and longer generation.
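One way to see the context effect on your own hardware is to ask llama-bench to repeat its tests at several context depths. The sketch below assumes a llama-bench build recent enough to have the `-d` (`--n-depth`) option. The model path, the depth values, and the `-p`, `-n`, and `-r` values are illustrative placeholders, not settings from this thread. It only prints the command unless `RUN=1` is set.

```shell
#!/bin/sh
# Sketch: compare speeds at several context depths with llama-bench.
# MODEL and DEPTHS are hypothetical placeholders; adjust them to your setup.
MODEL="${MODEL:-/path/to/model.gguf}"
DEPTHS="${DEPTHS:-0,4096,16384}"

# Build the llama-bench command for a depth sweep.
# $1 = model path, $2 = comma-separated context depths.
# Prints the command by default; runs it only when RUN=1.
depth_sweep() {
    set -- llama-bench -m "$1" -d "$2" -p 512 -n 128 --threads 4 --mmap 0 -r 3
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

depth_sweep "$MODEL" "$DEPTHS"
```

Each depth should appear as its own row in the llama-bench output, so the drop in prompt processing and generation speed can be read directly across rows.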
Hi, I have a moderate setup without a dedicated GPU. I bought it to have something within my budget for experimentation while keeping running costs low (15W to 35W TDP).
Between MoE models and its Vulkan back-end, llama.cpp is the only inference engine that makes AI inference accessible to everyday users.
I am sharing some benchmarks of models at Q8 (almost full precision) that everyday consumers might be able to run on their setups. If you have results for more models, please go ahead and share them so other people are aware.
llama.cpp build: fb34984 (6812) Vulkan Backend
My Setup:
Operating System: Ubuntu 24.04.3 LTS
Kernel: Linux 6.14.0-33-generic
Vulkan: Mesa 25.2.5 (apiVersion= 1.4.318)
Hardware: GMKtec M5 PLUS (Mini PC)
CPU: AMD Ryzen 7 5825U (8 cores, 16 Threads)
GPU: Radeon Vega 8 (gfx_target_version= gfx90c)
RAM: 64GB DDR4-3200 (32GB x 2)
Storage: 512 GB M.2 2280 PCIe Gen 3
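A rough calculation suggests why MoE models suit a setup like this: token generation is usually limited by how fast the weights can be read from memory. The sketch below is a back-of-envelope ceiling, not a measurement. It assumes the two DIMMs run in dual-channel mode, that an "A3B" model touches roughly 3 billion parameters per token, and that every active weight is read once per token. The 8.5 bits per weight figure comes from the Q8_0 layout (32 weights stored in 34 bytes).

```shell
#!/bin/sh
# Back-of-envelope ceiling on generation speed from memory bandwidth.
# Assumptions (not from the original post): dual-channel RAM, ~3e9 active
# parameters per token for an "A3B" model, each active weight read once per token.

# $1 = memory speed in MT/s, $2 = channels, $3 = active parameters in billions.
# bandwidth (GB/s)  = MT/s * 8 bytes per transfer * channels / 1000
# GB read per token = active params (billions) * 8.5 bits / 8   (Q8_0)
# ceiling (tok/s)   = bandwidth / GB read per token
tps_ceiling() {
    awk -v mts="$1" -v ch="$2" -v p="$3" 'BEGIN {
        bw = mts * 8 * ch / 1000
        gb = p * 8.5 / 8
        printf "%.1f\n", bw / gb
    }'
}

echo "MoE with ~3B active params: $(tps_ceiling 3200 2 3) tok/s ceiling"
echo "Dense 30B model:            $(tps_ceiling 3200 2 30) tok/s ceiling"
```

Under these assumptions the MoE ceiling comes out about ten times higher than that of a dense model of the same total size. Real throughput will be lower, since the KV cache, shared layers, and compute overhead are ignored here.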
Conclusion thus far:
Details of the benchmarks run
Model: Qwen3-Coder-30B-A3B (same for Qwen3-30B-A3B-Instruct-2507 and Qwen3-30B-A3B-Thinking-2507)
llama-bench -m /home/tipu/AI/models/ggml-org/Qwen3-Coder-30B-A3B/Qwen3-Coder-30B-A3B-Instruct-Q8_0.gguf --ubatch-size 4096 --batch-size 512 --threads 4 --mmap 0 -r 8
Model: gpt-oss-20b
llama-bench -m /home/tipu/AI/models/other/jinx-gpt-oss/jinx-gpt-oss-20b-mxfp4.gguf --ubatch-size 4096 --batch-size 512 --threads 4 --mmap 0 -r 8
Model: Granite-4.0-h-tiny
llama-bench -m /home/tipu/AI/models/other/granite-4.0-h-tiny/granite-4.0-h-tiny-Q8_0.gguf --ubatch-size 4096 --batch-size 512 --threads 4 --mmap 0 -r 8
Model: Ling-mini-2.0
llama-bench -m /home/tipu/AI/models/other/Huihui-Ling-mini-2.0/Huihui-Ling-mini-2.0-abliterated-q8_0.gguf --ubatch-size 4096 --batch-size 512 --threads 4 --mmap 0 -r 8
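Since every model above is benchmarked with the same flags, the runs can be wrapped in a small loop to make them easier to repeat. This is a minimal sketch. The function names `bench_one` and `bench_all` are made up for illustration, the flags are copied from the commands above, and the model paths are whatever you pass in. It prints the commands unless `RUN=1` is set.

```shell
#!/bin/sh
# Sketch: run llama-bench with identical settings over several model files.
# bench_one and bench_all are hypothetical helper names, not part of llama.cpp.

# $1 = path to a GGUF model. Prints the command by default; runs it when RUN=1.
bench_one() {
    set -- llama-bench -m "$1" --ubatch-size 4096 --batch-size 512 --threads 4 --mmap 0 -r 8
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Benchmark every model path given as an argument, one after another.
bench_all() {
    for model in "$@"; do
        bench_one "$model"
    done
}

# When the script is called with model paths, benchmark each of them.
if [ "$#" -gt 0 ]; then
    bench_all "$@"
fi
```

Saved as a script, it would be called with the model files as arguments, with `RUN=1` in the environment to actually start the benchmarks.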