
ggml-cpu: faster AVX2 variant for IQ1_M #12216

Merged (1 commit into ggml-org:master, Mar 7, 2025)

Conversation

remyoudompheng (Contributor)

This is an additional AVX2 optimization for IQ1_M, following #12154.
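For context, the sketch below is not the code from this PR; it is only a minimal, self-contained illustration of the general AVX2 pattern used for low-bit dot products in ggml-style CPU kernels: multiply unsigned 8-bit weights against signed 8-bit activations with `_mm256_maddubs_epi16`, widen the pairwise sums with `_mm256_madd_epi16`, then reduce horizontally. The function name `dot32_avx2` and the flat u8/i8 layout are assumptions for illustration; the real IQ1_M kernel additionally handles the 1.75-bpw packing, block scales, and delta signs.

```c
// Minimal AVX2 dot-product sketch (illustrative only, not the IQ1_M kernel).
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

// Dot product of 32 unsigned 8-bit weights (x) with 32 signed 8-bit values (y).
static int32_t dot32_avx2(const uint8_t *x, const int8_t *y) {
    __m256i vx = _mm256_loadu_si256((const __m256i *) x);
    __m256i vy = _mm256_loadu_si256((const __m256i *) y);

    // u8 * s8 -> adjacent pairs summed into signed 16-bit lanes
    __m256i prod16 = _mm256_maddubs_epi16(vx, vy);
    // s16 * 1 -> adjacent pairs summed into signed 32-bit lanes
    __m256i prod32 = _mm256_madd_epi16(prod16, _mm256_set1_epi16(1));

    // horizontal sum of the eight 32-bit lanes
    __m128i lo  = _mm256_castsi256_si128(prod32);
    __m128i hi  = _mm256_extracti128_si256(prod32, 1);
    __m128i sum = _mm_add_epi32(lo, hi);
    sum = _mm_hadd_epi32(sum, sum);
    sum = _mm_hadd_epi32(sum, sum);
    return _mm_cvtsi128_si32(sum);
}

int main(void) {
    uint8_t x[32];
    int8_t  y[32];
    int32_t ref = 0;
    for (int i = 0; i < 32; i++) {
        x[i] = (uint8_t) i;
        y[i] = (int8_t) (i - 16);
        ref += x[i] * y[i];
    }
    printf("avx2 = %d, scalar = %d\n", dot32_avx2(x, y), ref);
    return 0;
}
```

Compile with `-mavx2`; the AVX2 and scalar results should match.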

Benchmarks on AMD Zen 4:

Before
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  1932 runs -   532.51 us/run - 117.44 MFLOP/run - 220.54 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   980 runs -  1057.22 us/run - 234.88 MFLOP/run - 222.17 GFLOPS

After
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  2553 runs -   394.78 us/run - 117.44 MFLOP/run - 297.48 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  1225 runs -   816.50 us/run - 234.88 MFLOP/run - 287.67 GFLOPS
| model | size | params | ngl | test | t/s (master) | t/s (PR) |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 7B IQ1_M - 1.75 bpw | 1.90 GiB | 7.62 B | 0 | pp512 | 204.84 ± 0.31 | 204.07 ± 0.58 |
| qwen2 7B IQ1_M - 1.75 bpw | 1.90 GiB | 7.62 B | 0 | tg128 | 14.48 ± 0.02 | 17.71 ± 0.18 |

The github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Mar 6, 2025.
slaren (Member) commented on Mar 6, 2025:

On an Intel Core i9-13900K:

| Model | Threads | Test | t/s master | t/s optim-x86-2 | Speedup |
| --- | --- | --- | --- | --- | --- |
| llama 8B IQ1_M - 1.75 bpw | 8 | pp128 | 20.85 | 25.11 | 1.20 |
| llama 8B IQ1_M - 1.75 bpw | 8 | tg32 | 16.34 | 19.54 | 1.20 |
| llama 8B IQ1_M - 1.75 bpw | 16 | pp128 | 29.23 | 35.90 | 1.23 |
| llama 8B IQ1_M - 1.75 bpw | 16 | tg32 | 20.84 | 24.17 | 1.16 |
| llama 8B IQ1_M - 1.75 bpw | 24 | pp128 | 32.41 | 39.21 | 1.21 |
| llama 8B IQ1_M - 1.75 bpw | 24 | tg32 | 22.41 | 25.95 | 1.16 |
| llama 8B IQ1_M - 1.75 bpw | 32 | pp128 | 37.96 | 45.32 | 1.19 |
| llama 8B IQ1_M - 1.75 bpw | 32 | tg32 | 22.78 | 26.94 | 1.18 |

ggerganov merged commit 68d0027 into ggml-org:master on Mar 7, 2025 (47 checks passed).
mglambda pushed a commit to mglambda/llama.cpp that referenced this pull request Mar 8, 2025
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Mar 19, 2025