
ggml-cpu: Faster IQ1 mul_mat_vec on AVX2 using BMI2 instructions #12154

Merged
merged 6 commits into ggml-org:master from remyoudompheng:optim-x86 on Mar 6, 2025

Conversation

remyoudompheng
Contributor

AFAIK the CPU backend does not contain any x86 BMI2 instructions yet.
Is it fine to introduce code using BMI2 instructions?
Is it fine to simply rely on the __BMI2__ macro, since the "NATIVE" build is now the standard?
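
For context, a minimal sketch (not the actual ggml kernel) of why BMI2 helps here: PDEP can scatter packed low-bit index fields, like those used by the IQ1 formats, into separate byte lanes in a single instruction, replacing a chain of shifts and masks. The field width and mask below are illustrative only.

```c
#include <stdint.h>
#if defined(__BMI2__)
#include <immintrin.h>
#endif

// Expand eight packed 3-bit fields into the low 3 bits of eight bytes.
// Illustrative only; the real IQ1 kernels use different layouts.
static inline uint64_t expand_3bit_fields(uint32_t packed) {
#if defined(__BMI2__)
    // One PDEP deposits each 3-bit field into its own byte lane.
    return _pdep_u64(packed, 0x0707070707070707ULL);
#else
    // Portable fallback: shift and mask each field separately.
    uint64_t out = 0;
    for (int i = 0; i < 8; ++i) {
        out |= (uint64_t)((packed >> (3 * i)) & 0x7) << (8 * i);
    }
    return out;
#endif
}
```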

Some numbers on Zen 4 (the new code is about 50% faster):

master (gcc 14.2):
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   1725 runs -   588.98 us/run - 117.44 MFLOP/run - 199.40 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   1242 runs -   821.32 us/run - 117.44 MFLOP/run - 142.99 GFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    875 runs -  1161.02 us/run - 234.88 MFLOP/run - 202.31 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    630 runs -  1630.03 us/run - 234.88 MFLOP/run - 144.10 GFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    4 runs - 292929.00 us/run -  60.13 GFLOP/run - 205.27 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    3 runs - 412216.00 us/run -  60.13 GFLOP/run - 145.87 GFLOPS

master (clang 19):
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   1725 runs -   585.87 us/run - 117.44 MFLOP/run - 200.45 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   1449 runs -   721.34 us/run - 117.44 MFLOP/run - 162.81 GFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   1015 runs -  1013.40 us/run - 234.88 MFLOP/run - 231.78 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    700 runs -  1490.70 us/run - 234.88 MFLOP/run - 157.56 GFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    4 runs - 267433.50 us/run -  60.13 GFLOP/run - 224.84 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    3 runs - 375016.67 us/run -  60.13 GFLOP/run - 160.34 GFLOPS

This PR (gcc 14.2):   
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   2622 runs -   388.58 us/run - 117.44 MFLOP/run - 302.23 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   1932 runs -   532.51 us/run - 117.44 MFLOP/run - 220.54 GFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   1295 runs -   783.70 us/run - 234.88 MFLOP/run - 299.71 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    980 runs -  1057.22 us/run - 234.88 MFLOP/run - 222.17 GFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    6 runs - 195505.17 us/run -  60.13 GFLOP/run - 307.56 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    4 runs - 271548.50 us/run -  60.13 GFLOP/run - 221.43 GFLOPS

This PR (clang 19):   
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   2070 runs -   490.61 us/run - 117.44 MFLOP/run - 239.38 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   1656 runs -   613.57 us/run - 117.44 MFLOP/run - 191.41 GFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   1015 runs -  1009.67 us/run - 234.88 MFLOP/run - 232.63 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    945 runs -  1071.45 us/run - 234.88 MFLOP/run - 219.22 GFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    5 runs - 247839.00 us/run -  60.13 GFLOP/run - 242.62 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    4 runs - 264696.00 us/run -  60.13 GFLOP/run - 227.16 GFLOPS

Note that some old CPUs (AMD Zen 2 and older) support BMI2 but implement the PDEP/PEXT instructions in microcode, resulting in catastrophic slowdowns: owners of such hardware would need to manually disable BMI2 in the compiler using -mno-bmi2.

Before:
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):  966 runs -  1076.29 us/run - 117.44 MFLOP/run - 109.12 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):  690 runs -  1596.64 us/run - 117.44 MFLOP/run -  73.55 GFLOPS

After:
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):  138 runs - 11684.07 us/run - 117.44 MFLOP/run -  10.05 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   69 runs - 16669.00 us/run - 117.44 MFLOP/run -   7.05 GFLOPS

@github-actions github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Mar 2, 2025
@slaren
Member

slaren commented Mar 2, 2025

> Is it fine to simply rely on the __BMI2__ macro, since the "NATIVE" build is now the standard?

Please also add an option to enable it manually, add a check in cpu-feats-x86.cpp, and add it to the CPU variant list in:

ggml_add_cpu_backend_variant(sandybridge AVX)
ggml_add_cpu_backend_variant(haswell AVX F16C AVX2 FMA)
ggml_add_cpu_backend_variant(skylakex AVX F16C AVX2 FMA AVX512)
ggml_add_cpu_backend_variant(icelake AVX F16C AVX2 FMA AVX512 AVX512_VBMI AVX512_VNNI)
ggml_add_cpu_backend_variant(alderlake AVX F16C AVX2 FMA AVX_VNNI)

You could also check for Zen 2 in cpu-feats-x86.cpp, and if necessary add a variant for Zen 2 that excludes this feature.
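
For readers following along, a rough sketch of the runtime check involved (cpu-feats-x86.cpp has its own CPUID helpers, so this only illustrates which bit is tested):

```c
#include <stdbool.h>
#if defined(_MSC_VER)
#include <intrin.h>
#else
#include <cpuid.h>
#endif

// BMI2 is reported in CPUID leaf 7, sub-leaf 0, EBX bit 8.
static bool cpu_has_bmi2(void) {
#if defined(_MSC_VER)
    int regs[4] = {0};
    __cpuidex(regs, 7, 0);
    return (regs[1] >> 8) & 1;
#else
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        return false;
    }
    return (ebx >> 8) & 1;
#endif
}
```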

@JohnLoveJoy

https://github.com/zwegner/zp7

Integrating something like the ZP7 (Zach's Peppy Parallel-Prefix-Popcountin' PEXT/PDEP Polyfill) into llama.cpp could be a smart way to address the performance issues with PDEP and PEXT on AMD Zen 2 and earlier CPUs while maintaining compatibility and efficiency across platforms. Just a polite suggestion.
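
For reference, what such a polyfill replaces is the PDEP/PEXT hardware behaviour. ZP7 does this with a fast parallel-prefix scheme; even a naive software PDEP shows the semantics involved (this loop is illustrative only and much slower than ZP7):

```c
#include <stdint.h>

// Naive software PDEP: deposit the low bits of src at the positions of
// the set bits of mask, from least to most significant bit.
static uint64_t pdep_u64_soft(uint64_t src, uint64_t mask) {
    uint64_t out = 0;
    for (uint64_t bb = 1; mask != 0; bb <<= 1) {
        if (src & bb) {
            out |= mask & -mask;  // lowest set bit of mask
        }
        mask &= mask - 1;         // clear that bit
    }
    return out;
}
```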

@remyoudompheng
Contributor Author

Updated with CMakeLists changes (no Zen 2-specific case; maybe a separate PR can add AMD microarchitectures).

@slaren
Member

slaren commented Mar 4, 2025

Looks good, thanks.

It would also be necessary to add a ggml_cpu_has_bmi2 function and report it in ggml_backend_cpu_get_features:

static ggml_backend_feature * ggml_backend_cpu_get_features(ggml_backend_reg_t reg) {

I suspect that MSVC will enable BMI2 with /arch:AVX2 or higher. After you make this change, you can check in the "system info" string whether BMI2 is being enabled. If so, the definition would also need to be added here:

elseif (GGML_AVX2)
list(APPEND ARCH_FLAGS /arch:AVX2)
list(APPEND ARCH_DEFINITIONS GGML_AVX2 GGML_FMA GGML_F16C)
elseif (GGML_AVX)

I can check for you if you don't have access to a machine with MSVC.
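
As an aside, the requested helper would presumably be a compile-time check in the same style as the other ggml_cpu_has_* getters; a hedged sketch (the actual implementation may differ):

```c
// Sketch only: reports whether this CPU-backend variant was compiled
// with BMI2 enabled.
int ggml_cpu_has_bmi2(void) {
#if defined(__BMI2__)
    return 1;
#else
    return 0;
#endif
}
```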

@remyoudompheng
Contributor Author

Done.
For MSVC it seems that a manual define is needed, as with AVXVNNI (tests on godbolt.org suggest that MSVC always compiles the intrinsics regardless of the /arch flag).
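
Since MSVC never defines __BMI2__ on its own, one possible shape of such a workaround (an assumption for illustration, not necessarily the exact change in this PR) is to map a build-system definition onto __BMI2__ so the kernels keep a single code path:

```c
// Hypothetical mapping; the PR may instead add the definition directly
// from CMake. GGML_BMI2 stands for whatever macro the build system
// passes when its BMI2 option is enabled.
#if defined(_MSC_VER) && defined(GGML_BMI2) && !defined(__BMI2__)
#define __BMI2__ 1
#endif
```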

@slaren
Member

slaren commented Mar 6, 2025

13900k:

| Model | Threads | Test | t/s master | t/s optim-x86 | Speedup |
| --- | ---: | --- | ---: | ---: | ---: |
| llama 8B IQ1_M - 1.75 bpw | 8 | pp128 | 15.84 | 21.98 | 1.39 |
| llama 8B IQ1_M - 1.75 bpw | 8 | tg32 | 13.45 | 17.31 | 1.29 |
| llama 8B IQ1_M - 1.75 bpw | 16 | pp128 | 19.43 | 30.45 | 1.57 |
| llama 8B IQ1_M - 1.75 bpw | 16 | tg32 | 16.81 | 22.04 | 1.31 |
| llama 8B IQ1_M - 1.75 bpw | 24 | pp128 | 25.88 | 28.29 | 1.09 |
| llama 8B IQ1_M - 1.75 bpw | 24 | tg32 | 18.73 | 23.52 | 1.26 |
| llama 8B IQ1_M - 1.75 bpw | 32 | pp128 | 30.42 | 34.84 | 1.15 |
| llama 8B IQ1_M - 1.75 bpw | 32 | tg32 | 19.85 | 24.61 | 1.24 |
| llama 8B IQ1_S - 1.5625 bpw | 8 | pp128 | 19.30 | 29.50 | 1.53 |
| llama 8B IQ1_S - 1.5625 bpw | 8 | tg32 | 17.57 | 21.46 | 1.22 |
| llama 8B IQ1_S - 1.5625 bpw | 16 | pp128 | 32.00 | 41.51 | 1.30 |
| llama 8B IQ1_S - 1.5625 bpw | 16 | tg32 | 22.62 | 27.48 | 1.21 |
| llama 8B IQ1_S - 1.5625 bpw | 24 | pp128 | 35.53 | 46.19 | 1.30 |
| llama 8B IQ1_S - 1.5625 bpw | 24 | tg32 | 24.82 | 27.44 | 1.11 |
| llama 8B IQ1_S - 1.5625 bpw | 32 | pp128 | 41.95 | 52.90 | 1.26 |
| llama 8B IQ1_S - 1.5625 bpw | 32 | tg32 | 25.55 | 26.71 | 1.05 |

@slaren slaren merged commit 07d1572 into ggml-org:master Mar 6, 2025
46 of 47 checks passed
@remyoudompheng remyoudompheng deleted the optim-x86 branch March 6, 2025 05:16
mglambda pushed a commit to mglambda/llama.cpp that referenced this pull request Mar 8, 2025
ggml-cpu: Faster IQ1 mul_mat_vec on AVX2 using BMI2 instructions (ggml-org#12154)

* ggml-cpu: Faster IQ1 mul_mat_vec on AVX2 using BMI2 instructions

* cmake: Add GGML_BMI2 build option

* ggml: enable BMI2 on relevant CPU variants

* ggml-cpu: include BMI2 in backend score

* ggml-cpu: register BMI2 in ggml_backend_cpu_get_features

* ggml-cpu: add __BMI2__ define when using MSVC
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Mar 19, 2025
ggml-cpu: Faster IQ1 mul_mat_vec on AVX2 using BMI2 instructions (ggml-org#12154)

* ggml-cpu: Faster IQ1 mul_mat_vec on AVX2 using BMI2 instructions

* cmake: Add GGML_BMI2 build option

* ggml: enable BMI2 on relevant CPU variants

* ggml-cpu: include BMI2 in backend score

* ggml-cpu: register BMI2 in ggml_backend_cpu_get_features

* ggml-cpu: add __BMI2__ define when using MSVC
@sandrohanea
Contributor

Hello @slaren , @remyoudompheng ,

It seems that after this PR, the x86 (Win32) AVX2 build with MSVC is failing:

(screenshot of the failing build log)
https://github.com/sandrohanea/whisper.net/actions/runs/13965684322/job/39095442481

cmake command:

cmake -S . -DGGML_NATIVE=OFF -A Win32 -DGGML_AVX=ON -DGGML_AVX2=ON -DGGML_FMA=ON -DGGML_F16C=ON -B build/win-x86

Do you have any recommendation on how to fix this issue?

@sandrohanea
Contributor


Never mind, I just disabled BMI2 support for the Win32 build using -DGGML_BMI2=OFF.

@rudiservo
Contributor

Hey guys, I'm having issues with this commit and I don't know why. I've put together all the relevant information and what I could find about the issue, and I tried compiling with various CUDA versions, which is how I worked my way back to this commit.
I'm seeing the issues on an old Bulldozer CPU.

@rudiservo
Contributor

Just a heads up: I can confirm that the BMI2 detection is probably wrong, because it is forcing BMI2 on a non-BMI2 CPU.
On the Docker CUDA builds this is breaking things. Should BMI2 not be compiled by default, or is there something else that needs to be done with CPUID so that it detects BMI2 correctly and fixes the issue I have?
