Let's say you're interested in performing LLM inference on Apple hardware. You care about speed, but don't know which model or framework to pick.
Do you:
- use PyTorch with the Metal Performance Shaders backend,
- use Apple's MLX, built directly for Metal,
- use LM Studio and its `llama.cpp` engine for Metal,
- use Ollama,
- or use `llama.cpp` directly?
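For a quick sanity check that the first two options can even see your Metal GPU, a minimal sketch (assuming both `torch` and `mlx` are installed) looks like this:

```python
# Check that PyTorch's MPS backend and MLX can use the Metal GPU.
import mlx.core as mx
import torch

print("torch MPS available:", torch.backends.mps.is_available())
print("MLX default device: ", mx.default_device())  # the GPU, by default, on Apple silicon
```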
We aim to help you make this choice by benchmarking inference for a few common models and operators. Results can be found at https://aukejw.github.io/mlx_transformers_benchmark/.
Before you start, you will need:
To (optionally) benchmark Metal+llama.cpp models in common interfaces, you may also need:
To get started:
- Clone the repo:

  ```
  git clone [email protected]:aukejw/mlx_transformers_benchmark.git
  cd mlx_transformers_benchmark
  ```

- Set up a python3.11 virtual environment using `uv`:

  ```
  make setup
  ```

- For good measure, run the tests. This also tells you whether we can use the GPU.

  ```
  make test
  ```

- Run benchmarking, here for the 0.5B parameter `Qwen2.5` model:

  ```
  uv run python scripts/run_llm_benchmarks.py \
      --run_only_benchmarks qwen-2.5-0.5b-it \
      --dtypes \["int4","int8"\] \
      --num_iterations 3
  ```

  This creates a new result in the `measurements` folder (see the sketch after these steps).

  Optionally, to run a full benchmark for the `bfloat16`, `int8`, and `int4` datatypes, you can use:

  ```
  make run-llm-benchmarks
  ```

  This will take longer however, so make sure you aren't busy!

- To create a HTML report of all available measurements and open the index page:

  ```
  make show-llm-benchmarks
  ```

  This should open a page similar to https://aukejw.github.io/mlx_transformers_benchmark/.
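To see what a run produced, you can list the most recently written files under `measurements`. The exact file layout is not specified here, so treat this as a rough sketch:

```python
from pathlib import Path

# Print the five most recently modified result files under measurements/.
measurements = Path("measurements")
result_files = sorted(
    (p for p in measurements.rglob("*") if p.is_file()),
    key=lambda p: p.stat().st_mtime,
)
for path in result_files[-5:]:
    print(path)
```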
If you have an Apple device, additional measurements are always welcome! The easiest way to contribute is to fork the repo, and run benchmarks for common LLMs and/or operators.
See CONTRIBUTING.md for more info.
As Apple machines share memory with other background processes, these benchmarks are not exact, certainly not for MacBooks. Still, the numbers should give a decent idea of the performance to expect.
Although the default parameters do not cause thermal throttling on a MacBook M4 Pro, older machines may struggle with the heavier models and operators. We do try to skip large models, but you may still have too little RAM and fall back on swap space. If you see high memory pressure or outlier measurements, do take a closer look!
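One way to spot this before a long run is to check free memory and swap usage up front. A rough sketch, assuming the third-party `psutil` package is installed:

```python
import psutil

# Compare available RAM against the size of the model you are about to load.
gib = 1024**3
vm = psutil.virtual_memory()
swap = psutil.swap_memory()
print(f"available RAM: {vm.available / gib:.1f} GiB")
print(f"swap in use:   {swap.used / gib:.1f} GiB")
# If available RAM is smaller than the model weights, expect swapping
# and outlier measurements rather than representative numbers.
```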
> **Note**: For a large number of iterations, the GPU will certainly heat up. If needed, you can increase the cooldown period using the `cooldown_time_fraction` argument. Monitoring GPU temperature programmatically requires admin privileges, but you can use third-party apps like stats, also available via homebrew.
Apple silicon is fairly cost-effective for LLM inference due to its unified memory architecture. As LLM inference is mostly memory-bound for low batch sizes, devices with high memory bandwidth typically obtain high tokens/sec in inference benchmarks.
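As a back-of-envelope estimate: for batch-size-1 decoding, each generated token requires streaming roughly all model weights from memory once, so memory bandwidth divided by model size gives an optimistic upper bound on tokens/sec. The numbers below are illustrative assumptions, not measurements:

```python
# Rough upper bound on decoding speed for a memory-bound model.
memory_bandwidth_gb_per_s = 273   # e.g. an M4 Pro configuration (assumed)
model_size_gb = 0.5               # 0.5B parameters at ~1 byte/param (int8)

upper_bound_tokens_per_s = memory_bandwidth_gb_per_s / model_size_gb
print(f"~{upper_bound_tokens_per_s:.0f} tokens/s, ignoring compute and the KV cache")
```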
This benchmark focuses on the inference time of easy-to-run LLMs and unquantized transformer ops, primarily useful when running inference locally, or when finetuning custom models for (or on!) Apple devices.
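For a flavour of what an op-level comparison looks like, here is a minimal sketch timing a single unquantized matmul in MLX and in PyTorch on the MPS backend. The sizes and iteration counts are arbitrary choices, not the settings used in this benchmark:

```python
import time

import mlx.core as mx
import torch

n, iters = 2048, 50

# MLX is lazy: mx.eval forces the computation to run on the GPU.
a = mx.random.normal((n, n))
b = mx.random.normal((n, n))
mx.eval(a @ b)  # warmup
start = time.perf_counter()
for _ in range(iters):
    mx.eval(a @ b)
print(f"mlx:       {(time.perf_counter() - start) / iters * 1e3:.2f} ms/matmul")

# PyTorch with the Metal Performance Shaders backend.
c = torch.randn(n, n, device="mps")
d = torch.randn(n, n, device="mps")
_ = c @ d
torch.mps.synchronize()  # warmup + wait for the GPU
start = time.perf_counter()
for _ in range(iters):
    _ = c @ d
torch.mps.synchronize()
print(f"torch+mps: {(time.perf_counter() - start) / iters * 1e3:.2f} ms/matmul")
```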
You may also be interested in:
- Tristan Bilot's comprehensive benchmark of fundamental operators for `mlx`, `torch`+`mps`, and `torch`+`cuda` (link). Placing both `mlx` and `torch` functions in a single benchmark class makes it easy to see the differences between the two, and we adopt the same strategy here.
- The work of Feng et al. comparing training on Nvidia cards vs Apple Silicon.