Benchmark results #89
Comments
Results for
Results for Ryzen 5 4500U 6C/6T laptop CPU (I've just included one result for 8 threads as Encode time is much higher when threads > CPU cores).
This performance is impressive!
Yes, there is a huge performance boost due to using the built-in BLAS implementation on these devices. I will soon add OpenBLAS support for x86 architectures and see how this compares. By the way, AVX-512 is not supported on
Compiled with MinGW64 gcc 11.3
Valve Jupiter (AMD Custom APU 0405, Zen 2 microarch, 4c8t, 16GB DDR5 @ 5200 MT/s)
Compiled with
The performance gains on jfk.wav since the last test (two weeks or so ago) are extremely impressive: a ~10-20x speedup, from 40 down to 2-4 seconds.
@trholding You can generate a table with performance results by simply running the extra/bench-all.sh script. Regarding the threads - yes, it seems that going beyond 8 threads does not help regardless of how many cores you have. My guess is that the computation is memory-bound, so using more threads does not improve the performance.
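For anyone reproducing this, here is a minimal sketch of generating the results table and sweeping thread counts (the script path and bench flags match the versions discussed here; the model path is illustrative):

```sh
# Build the bench tool and generate the results table for all models
make bench
./extra/bench-all.sh

# Illustrative thread-count sweep for a single model
for t in 1 2 4 8 16; do
    ./bench -m models/ggml-base.en.bin -t $t
done
```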
Hey, sorry. That didn't pan out well: I did the benchmark three times and my account got deleted without notice. Couldn't get the logs as it was a web terminal. On the other hand, I am happy that this happened - I was giving serious thought to purchasing a GPU+CPU plan there, so a performance check of the CPU was equally important. Probably, or technically, it was my fault - I probably shouldn't have used a reverse shell and done benchmarks on a free trial, but how does one know if a service is really good or all just vapor...
Dell Precision 5560 laptop results:
Compiled with VS 2022. Something is off, right?
Yup - you are missing the
OK, the
Compiled with VS 2022
From the stream repo
I still haven't worked out the little (0-3) / big (4-7) cores on this thing, as if I pin to the big cores
I tried to compile with OpenBLAS but that seemed to kill the make. From the master repo, as I didn't think about the repo after trying the streaming input
8 threads seemed to be the fastest. However, I managed to squeeze a bit more performance by pinning the CPU.
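For reference, pinning can be done with taskset; a minimal sketch assuming the big cores are 4-7 as described above (core numbering differs between SoCs, and the model path is illustrative):

```sh
# Pin the benchmark to the four big cores (4-7) and use one thread per pinned core
taskset -c 4-7 ./bench -m models/ggml-base.en.bin -t 4
```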
Results for the AWS Graviton 3 processor. Compiled with
@matth Do you observe a significant performance difference with / without
@ggerganov
Results without any
I have tried to improve this by using OpenBLAS and
Are there any possibilities for further optimisations in
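In case it helps, a hedged sketch of an OpenBLAS-enabled build on an ARM server such as Graviton (the WHISPER_OPENBLAS make flag is the one I'm aware of for older checkouts; the package name is a Debian/Ubuntu assumption):

```sh
# Install the OpenBLAS development package (Debian/Ubuntu)
sudo apt-get install libopenblas-dev

# Rebuild whisper.cpp with OpenBLAS enabled and re-run the benchmark
make clean
WHISPER_OPENBLAS=1 make -j
./bench -m models/ggml-base.en.bin -t 8
```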
Results for the new Raspberry Pi 5. Tests performed on a board with the active cooler.
These results are 4.5 to 6.2 times faster than the Raspberry Pi 4. NOTE: The packaged version of OpenBLAS has not been recompiled for the new CPU architecture, so it is about 50% slower than The
CPU details: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
Running benchmark for all models
What's happening with commit 8a2bee6?
Opi5 4GB
@nickovs These are some very interesting results. Looking forward to the OpenBLAS results as well. @StuartIanNaylor The PP timing is the "prompt processing" time for a prompt of 256 tokens. As we transcribe with whisper, the context (i.e. the previously transcribed text) grows up to
By way of comparison to the benchmarks I posted above, here are the matrix multiplication numbers for the same Raspberry Pi 5 using OpenBLAS. It is notable that whisper.cpp's native NEON code outperforms OpenBLAS on the Pi 5 for everything except FP32, where OpenBLAS wins by some margin.
I have not tried all the tuning options in OpenBLAS, but the options I did try didn't really change the performance compared to the pre-packaged version.
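For completeness, a hedged sketch of building OpenBLAS from source for the Pi 5's Cortex-A76 and rebuilding whisper.cpp against it (the TARGET name and install prefix are assumptions; check OpenBLAS's TargetList.txt and your whisper.cpp version's BLAS flags):

```sh
# Build and install OpenBLAS tuned for the Cortex-A76 (TARGET name: see TargetList.txt)
git clone https://github.com/xianyi/OpenBLAS
cd OpenBLAS
make TARGET=CORTEXA76 -j4
sudo make install PREFIX=/usr/local

# Rebuild whisper.cpp with OpenBLAS enabled; this assumes the build picks up
# the library from /usr/local (adjust include/library paths for your setup)
cd ../whisper.cpp
make clean
WHISPER_OPENBLAS=1 make -j4
```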
I think this is where we benefit from ARMv8.2 and from being in the same family as the first-class citizen, Apple Silicon - optimized via ARM NEON. These results are 4.5 to 6.2 times faster than the Raspberry Pi 4. Linux ubuntu 6.6.0 #1 SMP PREEMPT, Opi5 4GB, performance governor
Linux raspberrypi 6.1.0-rpi4-rpi-2712 Rpi5 4GB performance governor
To be honest I don't know why the GFLOPS figure is higher, but whilst the
@StuartIanNaylor Here is a straight up comparison of the same 54c978c commit between the Pi4 and the Pi5, both running the code compiled on the Pi4 on the Pi5 and then also recompiling the same commit on the Pi5.
This suggests that there is a little better than a 2-fold performance improvement on encode, and more like a 2.8-fold improvement on decode, just from moving the code from the Pi4 to the Pi5. Recompiling on the Pi5 raises the encode performance to between 4.74 and 6.54 times faster than on the Pi4, but the decode performance remains only about 2.8 times faster than the Pi4 and doesn't benefit a great deal from the recompilation. (Note that this table hits GitHub's 10 column limit, so the decode speedup may not be displayed, but the numbers are in the comment source.) The key thing here as far as I'm concerned is that on the Pi5 the
It would be great to have a test results db for this. I'm thinking similar to what DRM info does |
@jwinarske That would be great, maybe as a separate repo of fixed commits, as we are not benching the software but the hardware. Linux ubuntu 6.6.0 #1 SMP PREEMPT, Opi5 4GB, performance governor, 54c978c
@nickovs Dunno; as before, since the A76 gets vector mat/mul and the code is optimised for ARMv8.2+, the poor Pi4 with OpenBLAS was just under 5 times slower than an RK3588S.
"lib" is needed for windows. With this change, you can build whisper.cpp with OpenBLAS's prebuilt DLL. 1. extract a zip from https://github.com/xianyi/OpenBLAS/releases 2. copy the headers in (openblas)/include to the root directory of whisper.cpp 3. invoke cmake with -DCMAKE_LIBRARY_PATH=(openblas)\lib -DWHISPER_SUPPORT_OPENBLAS=ON 4. copy (openblas)/bin/libopenblas.dll to the same directory of whisper.dll after msbuild ggerganov/whisper.cpp#89 (comment)
Update whisper.cpp
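Condensing the steps above into commands (the C:/openblas paths are placeholders for wherever the release zip was extracted; run from a Visual Studio developer prompt):

```sh
# From the whisper.cpp root, after copying the OpenBLAS headers as described in step 2
cmake -B build -DCMAKE_LIBRARY_PATH="C:/openblas/lib" -DWHISPER_SUPPORT_OPENBLAS=ON
cmake --build build --config Release
# Then copy C:/openblas/bin/libopenblas.dll next to the built whisper.dll (step 4)
```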
Here is the result for an NVIDIA GeForce GT 755M on Debian GNU/Linux 12 Bookworm, using a GCC 12.2.0 build with -DWHISPER_CLBLAST=ON:
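For reference, a minimal sketch of a CLBlast-enabled build like the one above (assumes the CLBlast and OpenCL development packages are installed; the bench binary location depends on the version):

```sh
sudo apt-get install libclblast-dev ocl-icd-opencl-dev
cmake -B build -DWHISPER_CLBLAST=ON
cmake --build build -j
./build/bin/bench -m models/ggml-base.en.bin
```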
Benchmark result with an 11th Gen Intel Core(TM) i7-11700F @ 2.50GHz + Ubuntu 20.04 + gcc version 9.4.0
There is an impressive benchmark result (compared to the above bench result from a PC purchased for RMB 12,000, about USD 1,700, a few years ago) with the Xiaomi 14's powerful mobile SoC: Qualcomm SM8650-AB Snapdragon 8 Gen 3 (4 nm) + Xiaomi's HyperOS (derived from Android 14) + Android NDK r21e. Updated on 03-20-2024: Xiaomi 14 + Android NDK r26c (NDK r26c is required for a special build optimization: https://github.com/cdeos/kantv/blob/master/external/whispercpp/CMakeLists.txt#L60)
Different results for different code commits - the older version is much faster! CPU: AMD Ryzen 9 7950X3D 16-Core
system_info: n_threads = 4 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0
whisper_print_timings: load time = 64.61 ms
system_info: n_threads = 4 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | COREML = 0 | OPENVINO = 0 |
whisper_print_timings: load time = 83.24 ms
A quick question: when would you want us to run this / report results? For context, we're looking at using space on one of our old nodes to run a large number of files through Whisper (cpp). It's a server with multiple RTX 2080 Tis clustered together. I just don't know if knowing that whisper.cpp runs fast on this out-of-date (but highly spec'd for its time) setup is useful. Thanks!
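If you do end up running it, here is a hedged sketch of a CUDA-enabled build; the flag has been renamed across versions (older trees use WHISPER_CUBLAS), so check the README of your checkout:

```sh
# Older make-based build with cuBLAS support (flag name depends on the version)
WHISPER_CUBLAS=1 make -j

# Run the benchmark on a medium model (model choice is illustrative)
./bench -m models/ggml-medium.en.bin -t 8
```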
Hello all, I'm trying to benchmark all Whisper backends but am having trouble benchmarking whisper.cpp. Since I'm unfamiliar with "compiling", I'm forced to use Python bindings. I'm only aware of the following bindings, but they all either haven't been updated in a long time or don't implement GPU acceleration:
Also, does whisper.cpp have "batching" by chance? Here's a sample graph I've created. Any feedback would be welcome regarding both how I'm graphing and how to test fairly with identical parameters and whatnot. Thanks! P.S. faster-whisper doesn't have batching yet so, obviously, that's why there's only one graph for it...
Excuse me, may I ask how you generated the benchmark app? I am now worried because I am not able to benchmark on my phone. Thanks for your answer.
Running the original bench (generated by the original build system in the whisper.cpp project) on x86 Linux (Ubuntu 20.04). Benchmarking on an Android phone is another topic and scenario; the official whisper.cpp project doesn't cover that: they focus on the core implementation/improvements and on macOS (iOS) / Windows / Linux (I personally think Android is just another special Linux distribution). I maintain a dedicated ggml learning & study project focused on Android, and some benchmark items are also provided in that project accordingly. BTW, the code of the above two benchmark items is essentially/technically the same as the original benchmark code in the whisper.cpp project.
https://github.com/ggerganov/whisper.cpp?tab=readme-ov-file#quick-start
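For what it's worth, no Python bindings are needed for the native benchmark; the quick start boils down to roughly the following (the model name is just an example):

```sh
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
./models/download-ggml-model.sh base.en
make -j bench
./bench -m models/ggml-base.en.bin -t 8
```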
System Info
memcpy
./bench -w 1 -t 1
memcpy: 4.48 GB/s (heat-up)
memcpy: 5.13 GB/s ( 1 thread)
memcpy: 5.48 GB/s ( 1 thread)
sum: -1535998239.000000
ggml_mul_mat
./bench -w 2 -t 1
64 x 64: Q4_0 2.6 GFLOPS (128 runs) | Q4_1 2.6 GFLOPS (128 runs)
64 x 64: Q5_0 2.4 GFLOPS (128 runs) | Q5_1 2.3 GFLOPS (128 runs) | Q8_0 2.8 GFLOPS (128 runs)
64 x 64: F16 3.2 GFLOPS (128 runs) | F32 0.7 GFLOPS (128 runs)
128 x 128: Q4_0 4.3 GFLOPS (128 runs) | Q4_1 4.5 GFLOPS (128 runs)
128 x 128: Q5_0 4.2 GFLOPS (128 runs) | Q5_1 4.0 GFLOPS (128 runs) | Q8_0 5.4 GFLOPS (128 runs)
128 x 128: F16 5.7 GFLOPS (128 runs) | F32 2.9 GFLOPS (128 runs)
256 x 256: Q4_0 6.9 GFLOPS (128 runs) | Q4_1 6.0 GFLOPS (128 runs)
256 x 256: Q5_0 6.0 GFLOPS (128 runs) | Q5_1 5.4 GFLOPS (128 runs) | Q8_0 9.5 GFLOPS (128 runs)
256 x 256: F16 8.3 GFLOPS (128 runs) | F32 5.4 GFLOPS (128 runs)
512 x 512: Q4_0 9.2 GFLOPS ( 35 runs) | Q4_1 8.0 GFLOPS ( 30 runs)
512 x 512: Q5_0 7.1 GFLOPS ( 27 runs) | Q5_1 7.1 GFLOPS ( 27 runs) | Q8_0 11.2 GFLOPS ( 42 runs)
512 x 512: F16 9.0 GFLOPS ( 34 runs) | F32 5.0 GFLOPS ( 19 runs)
1024 x 1024: Q4_0 10.2 GFLOPS ( 5 runs) | Q4_1 9.1 GFLOPS ( 5 runs)
1024 x 1024: Q5_0 8.4 GFLOPS ( 4 runs) | Q5_1 8.1 GFLOPS ( 4 runs) | Q8_0 13.4 GFLOPS ( 7 runs)
1024 x 1024: F16 8.8 GFLOPS ( 5 runs) | F32 4.0 GFLOPS ( 3 runs)
2048 x 2048: Q4_0 11.4 GFLOPS ( 3 runs) | Q4_1 10.2 GFLOPS ( 3 runs)
2048 x 2048: Q5_0 7.9 GFLOPS ( 3 runs) | Q5_1 7.5 GFLOPS ( 3 runs) | Q8_0 11.3 GFLOPS ( 3 runs)
2048 x 2048: F16 7.8 GFLOPS ( 3 runs) | F32 4.4 GFLOPS ( 3 runs)
4096 x 4096: Q4_0 9.7 GFLOPS ( 3 runs) | Q4_1 9.7 GFLOPS ( 3 runs)
4096 x 4096: Q5_0 7.9 GFLOPS ( 3 runs) | Q5_1 7.4 GFLOPS ( 3 runs) | Q8_0 11.5 GFLOPS ( 3 runs)
System Info
memcpy
./bench -w 1 -t 1
memcpy: 13.44 GB/s (heat-up)
memcpy: 13.53 GB/s ( 1 thread)
memcpy: 13.49 GB/s ( 1 thread)
sum: -1535998239.000000
ggml_mul_mat
./bench -w 2 -t 1
64 x 64: Q4_0 10.3 GFLOPS (128 runs) | Q4_1 9.8 GFLOPS (128 runs)
64 x 64: Q5_0 9.3 GFLOPS (128 runs) | Q5_1 8.7 GFLOPS (128 runs) | Q8_0 11.0 GFLOPS (128 runs)
64 x 64: F16 11.0 GFLOPS (128 runs) | F32 3.0 GFLOPS (128 runs)
128 x 128: Q4_0 15.5 GFLOPS (128 runs) | Q4_1 15.1 GFLOPS (128 runs)
128 x 128: Q5_0 13.7 GFLOPS (128 runs) | Q5_1 13.2 GFLOPS (128 runs) | Q8_0 17.6 GFLOPS (128 runs)
128 x 128: F16 15.6 GFLOPS (128 runs) | F32 9.7 GFLOPS (128 runs)
256 x 256: Q4_0 20.0 GFLOPS (128 runs) | Q4_1 19.1 GFLOPS (128 runs)
256 x 256: Q5_0 16.5 GFLOPS (128 runs) | Q5_1 16.0 GFLOPS (128 runs) | Q8_0 23.3 GFLOPS (128 runs)
256 x 256: F16 19.4 GFLOPS (128 runs) | F32 14.5 GFLOPS (128 runs)
512 x 512: Q4_0 24.0 GFLOPS ( 90 runs) | Q4_1 23.8 GFLOPS ( 89 runs)
512 x 512: Q5_0 20.1 GFLOPS ( 76 runs) | Q5_1 19.7 GFLOPS ( 74 runs) | Q8_0 27.8 GFLOPS (104 runs)
512 x 512: F16 22.9 GFLOPS ( 86 runs) | F32 13.6 GFLOPS ( 51 runs)
1024 x 1024: Q4_0 26.6 GFLOPS ( 13 runs) | Q4_1 27.1 GFLOPS ( 13 runs)
1024 x 1024: Q5_0 21.7 GFLOPS ( 11 runs) | Q5_1 21.5 GFLOPS ( 11 runs) | Q8_0 32.3 GFLOPS ( 16 runs)
1024 x 1024: F16 23.9 GFLOPS ( 12 runs) | F32 13.2 GFLOPS ( 7 runs)
2048 x 2048: Q4_0 28.0 GFLOPS ( 3 runs) | Q4_1 29.1 GFLOPS ( 3 runs)
2048 x 2048: Q5_0 22.4 GFLOPS ( 3 runs) | Q5_1 23.3 GFLOPS ( 3 runs) | Q8_0 34.4 GFLOPS ( 3 runs)
2048 x 2048: F16 24.6 GFLOPS ( 3 runs) | F32 12.7 GFLOPS ( 3 runs)
4096 x 4096: Q4_0 29.3 GFLOPS ( 3 runs) | Q4_1 30.3 GFLOPS ( 3 runs)
4096 x 4096: Q5_0 22.9 GFLOPS ( 3 runs) | Q5_1 24.0 GFLOPS ( 3 runs) | Q8_0 35.3 GFLOPS ( 3 runs)
4096 x 4096: F16 24.4 GFLOPS ( 3 runs) | F32 11.1 GFLOPS ( 3 runs)
System Info
memcpy
./bench -w 1 -t 1
memcpy: 13.62 GB/s (heat-up)
memcpy: 13.54 GB/s ( 1 thread)
memcpy: 13.62 GB/s ( 1 thread)
sum: -1535998239.000000
ggml_mul_mat
./bench -w 2 -t 1
64 x 64: Q4_0 12.2 GFLOPS (128 runs) | Q4_1 11.3 GFLOPS (128 runs)
64 x 64: Q5_0 11.3 GFLOPS (128 runs) | Q5_1 10.1 GFLOPS (128 runs) | Q8_0 13.2 GFLOPS (128 runs)
64 x 64: F16 15.3 GFLOPS (128 runs) | F32 3.6 GFLOPS (128 runs)
128 x 128: Q4_0 19.4 GFLOPS (128 runs) | Q4_1 16.9 GFLOPS (128 runs)
128 x 128: Q5_0 17.0 GFLOPS (128 runs) | Q5_1 15.5 GFLOPS (128 runs) | Q8_0 21.5 GFLOPS (128 runs)
128 x 128: F16 22.3 GFLOPS (128 runs) | F32 10.7 GFLOPS (128 runs)
256 x 256: Q4_0 24.7 GFLOPS (128 runs) | Q4_1 20.5 GFLOPS (128 runs)
256 x 256: Q5_0 20.4 GFLOPS (128 runs) | Q5_1 18.8 GFLOPS (128 runs) | Q8_0 28.2 GFLOPS (128 runs)
256 x 256: F16 29.2 GFLOPS (128 runs) | F32 15.4 GFLOPS (128 runs)
512 x 512: Q4_0 28.9 GFLOPS (108 runs) | Q4_1 25.7 GFLOPS ( 96 runs)
512 x 512: Q5_0 24.9 GFLOPS ( 93 runs) | Q5_1 23.4 GFLOPS ( 87 runs) | Q8_0 34.3 GFLOPS (128 runs)
512 x 512: F16 35.0 GFLOPS (128 runs) | F32 13.8 GFLOPS ( 52 runs)
1024 x 1024: Q4_0 33.6 GFLOPS ( 16 runs) | Q4_1 30.2 GFLOPS ( 15 runs)
1024 x 1024: Q5_0 28.3 GFLOPS ( 14 runs) | Q5_1 26.9 GFLOPS ( 13 runs) | Q8_0 40.4 GFLOPS ( 19 runs)
1024 x 1024: F16 33.3 GFLOPS ( 16 runs) | F32 12.9 GFLOPS ( 7 runs)
2048 x 2048: Q4_0 36.1 GFLOPS ( 3 runs) | Q4_1 32.8 GFLOPS ( 3 runs)
2048 x 2048: Q5_0 29.5 GFLOPS ( 3 runs) | Q5_1 28.5 GFLOPS ( 3 runs) | Q8_0 42.6 GFLOPS ( 3 runs)
2048 x 2048: F16 31.0 GFLOPS ( 3 runs) | F32 12.2 GFLOPS ( 3 runs)
4096 x 4096: Q4_0 36.6 GFLOPS ( 3 runs) | Q4_1 33.6 GFLOPS ( 3 runs)
4096 x 4096: Q5_0 30.7 GFLOPS ( 3 runs) | Q5_1 29.5 GFLOPS ( 3 runs) | Q8_0 42.9 GFLOPS ( 3 runs)
4096 x 4096: F16 30.1 GFLOPS ( 3 runs) | F32 11.7 GFLOPS ( 3 runs)
What is faster on a Mac M1: turbo compiled with CoreML, or turbo_q5 without it?
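One way to answer that empirically is to build both variants and run bench on each; a hedged sketch (the turbo model/file names and the CoreML generation step are assumptions, see the models/ README, and the CoreML path needs the coremltools Python environment described there):

```sh
# CoreML variant: generate the Core ML encoder, then build with CoreML support
./models/generate-coreml-model.sh large-v3-turbo
WHISPER_COREML=1 make -j
./bench -m models/ggml-large-v3-turbo.bin

# Quantized variant: produce a q5_0 model and benchmark it with the default build
make quantize
./quantize models/ggml-large-v3-turbo.bin models/ggml-large-v3-turbo-q5_0.bin q5_0
./bench -m models/ggml-large-v3-turbo-q5_0.bin
```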
M4 Mac Mini (Base Model) CoreML flags
M1 Ultra 48 Core GPU 64 GB - Standard Metal
i5-14600k 4070 Ti Super 16GB (555 drivers), 32GB, Ubuntu 24.04 - CUDA Version
What is strange is in the standard
So the M4 is quite a beefy CPU; the ANE is nice, though limiting in what it can do, and the GPU when running MLX models is about 2x M1 performance. E.g. getting 24 tokens per second on M1 vs 45 on M4, vs 120 on M1 Ultra using Llama 3.2 3B 4-bit MLX. I'm surprised that the 4_k quant running on a 4070 Ti Super also gets about 120 tokens/s.
Encoder
Collection of bench results for various platforms and devices.
If you want to submit info about your device, simply run the bench tool or the extra/bench-all.sh script and report the results in the comments below.
Suggestions for a better summary of the results are welcome.
memcpy
MacBook M1 Pro
Ryzen 9 5950X
ggml_mul_mat
MacBook M1 Pro
Ryzen 9 5950X