IQ quant performance #17842

ThatGuyWhoAsked · 2025-12-07T10:26:58Z

The prior sentiment was that IQ quants were slower on Apple Silicon, but is that still true? My benchmarks show IQ4_XS being faster than the similar-quality Q4_K_S quantisation:

ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.007 sec
ggml_metal_device_init: GPU name:   Apple M3
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 12713.12 MB
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen3vl 8B IQ4_XS - 4.25 bpw   |   4.24 GiB |     8.19 B | Metal,BLAS |       4 |           pp512 |        190.79 ± 2.77 |
| qwen3vl 8B IQ4_XS - 4.25 bpw   |   4.24 GiB |     8.19 B | Metal,BLAS |       4 |           tg128 |         19.25 ± 0.13 |

and

ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.006 sec
ggml_metal_device_init: GPU name:   Apple M3
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 12713.12 MB
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen3vl 8B Q4_K - Small        |   4.47 GiB |     8.19 B | Metal,BLAS |       4 |           pp512 |        187.90 ± 0.98 |
| qwen3vl 8B Q4_K - Small        |   4.47 GiB |     8.19 B | Metal,BLAS |       4 |           tg128 |         18.68 ± 0.04 |

build: ecf74a841 (7220)
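
To put the two runs side by side, here is a rough sketch that computes the relative gap and how large it is compared to the reported spread. It assumes the `±` values in the tables are standard deviations and that the two runs are independent; the `compare` helper and the `results` dict are just names made up for this illustration, not part of llama.cpp.

```python
# Figures copied from the two llama-bench tables above: (mean t/s, reported spread).
results = {
    "pp512": {"IQ4_XS": (190.79, 2.77), "Q4_K_S": (187.90, 0.98)},
    "tg128": {"IQ4_XS": (19.25, 0.13), "Q4_K_S": (18.68, 0.04)},
}


def compare(iq, q4k):
    """Return (relative speedup of iq over q4k in %, gap in combined std devs).

    Assumes the spreads are standard deviations of independent runs,
    so they combine as the square root of the sum of squares.
    """
    diff = iq[0] - q4k[0]
    combined_sd = (iq[1] ** 2 + q4k[1] ** 2) ** 0.5
    return 100.0 * diff / q4k[0], diff / combined_sd


for test, runs in results.items():
    pct, sigmas = compare(runs["IQ4_XS"], runs["Q4_K_S"])
    print(f"{test}: IQ4_XS is {pct:+.2f}% vs Q4_K_S ({sigmas:.1f} combined std devs)")
```

Under those assumptions, the pp512 gap is about 1.5% and roughly one combined standard deviation, so it may just be noise. The tg128 gap is about 3% and roughly four combined standard deviations, so that one looks like a real difference.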

I also read that Apple GPU family 9 (M3 and M4) is faster at this (I have an M3). Does anyone have an idea why it is faster?
Prior Discussion: #5617
