Let's say you're interested in performing LLM inference on Apple hardware. You care about speed, but don't know which model or framework to pick.
Do you:
- use PyTorch with the Metal Performance Shaders backend,
- use Apple's MLX, built directly for Metal,
- use LM Studio and its `llama.cpp` engine for Metal,
- use Ollama,
- or use `llama.cpp` directly?
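For a quick sanity check that the first two options can even see your Metal GPU, a minimal sketch (assuming both `torch` and `mlx` are installed) looks like this:

```python
# Check that PyTorch's MPS backend and MLX can use the Metal GPU.
import mlx.core as mx
import torch

print("torch MPS available:", torch.backends.mps.is_available())
print("MLX default device: ", mx.default_device())  # the GPU, by default, on Apple silicon
```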
We aim to help you make this choice by benchmarking inference for a few common models and operators. Results can be found at https://aukejw.github.io/mlx_transformers_benchmark/.
Before you start, you will need:
To (optionally) benchmark Metal+llama.cpp models in common interfaces, you may also need:
To get started:
- Clone the repo:

  ```
  git clone [email protected]:aukejw/mlx_transformers_benchmark.git
  cd mlx_transformers_benchmark
  ```

- Set up a python3.11 virtual environment using `uv`:

  ```
  make setup
  ```

- For good measure, run the tests. This also tells you whether we can use the GPU.

  ```
  make test
  ```

- Run benchmarking, here for the 0.5B parameter `Qwen2.5` model:

  ```
  uv run python scripts/run_llm_benchmarks.py \
      --run_only_benchmarks qwen-2.5-0.5b-it \
      --dtypes \["int4","int8"\] \
      --num_iterations 3
  ```

  This creates a new result in the `measurements` folder (see the sketch after these steps).

  Optionally, to run a full benchmark for the `bfloat16`, `int8`, and `int4` datatypes, you can use:

  ```
  make run-llm-benchmarks
  ```

  This will take longer however, so make sure you aren't busy!

- To create a HTML report of all available measurements and open the index page:

  ```
  make show-llm-benchmarks
  ```

  This should open a page similar to https://aukejw.github.io/mlx_transformers_benchmark/.
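To see what a run produced, you can list the most recently written files under `measurements`. The exact file layout is not specified here, so treat this as a rough sketch:

```python
from pathlib import Path

# Print the five most recently modified result files under measurements/.
measurements = Path("measurements")
result_files = sorted(
    (p for p in measurements.rglob("*") if p.is_file()),
    key=lambda p: p.stat().st_mtime,
)
for path in result_files[-5:]:
    print(path)
```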
If you have an Apple device, additional measurements are always welcome! The easiest way to contribute is to fork the repo, and run benchmarks for common LLMs and/or operators.
See CONTRIBUTING.md for more info.
As Apple machines share memory with other background processes, these benchmarks are not exact, certainly not for MacBooks. Still, the numbers should give a decent idea of the performance to expect.
Although the default parameters do not cause thermal throttling on a MacBook M4 Pro, older machines may struggle with the heavier models and operators. We do try to skip large models, but you may still have too little RAM and fall back on swap space. If you see high memory pressure or outlier measurements, do take a closer look!
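One way to spot this before a long run is to check free memory and swap usage up front. A rough sketch, assuming the third-party `psutil` package is installed:

```python
import psutil

# Compare available RAM against the size of the model you are about to load.
gib = 1024**3
vm = psutil.virtual_memory()
swap = psutil.swap_memory()
print(f"available RAM: {vm.available / gib:.1f} GiB")
print(f"swap in use:   {swap.used / gib:.1f} GiB")
# If available RAM is smaller than the model weights, expect swapping
# and outlier measurements rather than representative numbers.
```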
> **Note**: For a large number of iterations, the GPU will certainly heat up. If needed, you can increase the cooldown period using the `cooldown_time_fraction` argument. Monitoring GPU temperature programmatically requires admin privileges, but you can use third-party apps like stats, also available via homebrew.
Apple silicon is fairly cost-effective for LLM inference due to its unified memory architecture. As LLM inference is mostly memory-bound for low batch sizes, devices with high memory bandwidth typically obtain high tokens/sec in inference benchmarks.
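As a back-of-envelope estimate: for batch-size-1 decoding, each generated token requires streaming roughly all model weights from memory once, so memory bandwidth divided by model size gives an optimistic upper bound on tokens/sec. The numbers below are illustrative assumptions, not measurements:

```python
# Rough upper bound on decoding speed for a memory-bound model.
memory_bandwidth_gb_per_s = 273   # e.g. an M4 Pro configuration (assumed)
model_size_gb = 0.5               # 0.5B parameters at ~1 byte/param (int8)

upper_bound_tokens_per_s = memory_bandwidth_gb_per_s / model_size_gb
print(f"~{upper_bound_tokens_per_s:.0f} tokens/s, ignoring compute and the KV cache")
```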
This benchmark focuses on the inference time of easy-to-run LLMs and unquantized transformer ops, primarily useful when running inference locally, or when finetuning custom models for (or on!) Apple devices.
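For a flavour of what an op-level comparison looks like, here is a minimal sketch timing a single unquantized matmul in MLX and in PyTorch on the MPS backend. The sizes and iteration counts are arbitrary choices, not the settings used in this benchmark:

```python
import time

import mlx.core as mx
import torch

n, iters = 2048, 50

# MLX is lazy: mx.eval forces the computation to run on the GPU.
a = mx.random.normal((n, n))
b = mx.random.normal((n, n))
mx.eval(a @ b)  # warmup
start = time.perf_counter()
for _ in range(iters):
    mx.eval(a @ b)
print(f"mlx:       {(time.perf_counter() - start) / iters * 1e3:.2f} ms/matmul")

# PyTorch with the Metal Performance Shaders backend.
c = torch.randn(n, n, device="mps")
d = torch.randn(n, n, device="mps")
_ = c @ d
torch.mps.synchronize()  # warmup + wait for the GPU
start = time.perf_counter()
for _ in range(iters):
    _ = c @ d
torch.mps.synchronize()
print(f"torch+mps: {(time.perf_counter() - start) / iters * 1e3:.2f} ms/matmul")
```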
You may also be interested in:
- Tristan Bilot's comprehensive benchmark of fundamental operators for `mlx`, `torch`+`mps`, and `torch`+`cuda` (link). Placing both `mlx` and `torch` functions in a single benchmark class makes it easy to see the differences between the two, and we adopt the same strategy here.
- The work of Feng et al. comparing training on Nvidia cards vs Apple Silicon.