
Benchmarking transformer operators on Apple silicon


Let's say you're interested in performing LLM inference on Apple hardware. You care about speed, but don't know which model or framework to pick.

We aim to help you make this choice by benchmarking inference for a few common models and operators. Results can be found at https://aukejw.github.io/mlx_transformers_benchmark/.

Installation

Before you start, you will need:

  • uv to manage dependencies, available via Homebrew

To (optionally) benchmark Metal+llama.cpp models in common interfaces, you may also need:

To get started:

  1. Clone the repo:

    git clone [email protected]:aukejw/mlx_transformers_benchmark.git
    cd mlx_transformers_benchmark
    
  2. Set up a python3.11 virtual environment using uv:

    make setup
    
  3. For good measure, run the tests. This also tells you whether the GPU can be used; a quick manual check is also sketched below, after this list.

    make test
    
  4. Run benchmarking, here for the 0.5B parameter Qwen2.5 model:

    uv run python scripts/run_llm_benchmarks.py \
       --run_only_benchmarks qwen-2.5-0.5b-it \
       --dtypes '["int4","int8"]' \
       --num_iterations 3
    

    This creates a new result in the measurements folder.

    Optionally, to run the full benchmark for the bfloat16, int8, and int4 datatypes, you can use:

    make run-llm-benchmarks
    

    This will take quite a bit longer, however, so make sure you aren't busy!

  5. To create an HTML report of all available measurements and open the index page:

    make show-llm-benchmarks
    

    This should open a page similar to https://aukejw.github.io/mlx_transformers_benchmark/.
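
As a follow-up to step 3: if you want to check GPU availability by hand, outside the test suite, the short sketch below prints the relevant device information. It assumes torch and mlx are importable in the environment that make setup created.

    # Quick manual check, independent of the test suite: verify that torch
    # can see the MPS backend and that mlx reports a GPU as its default device.
    import mlx.core as mx
    import torch

    print("torch MPS available:", torch.backends.mps.is_available())
    print("mlx default device: ", mx.default_device())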

Contributing

If you have an Apple device, additional measurements are always welcome! The easiest way to contribute is to fork the repo, and run benchmarks for common LLMs and/or operators.

See CONTRIBUTING.md for more info.

On reproducibility

Because Apple machines share their unified memory between the GPU and other (background) processes, these benchmarks are not exact, certainly not for MacBooks. Still, the numbers should give a decent idea of the performance to expect.

Although the default parameters do not cause thermal throttling on a MacBook with an M4 Pro chip, older machines may have trouble with the heavier models and operators. We do try to skip overly large models, but you may still run out of RAM and fall back on swap space. If you see huge memory pressure or outlier measurements, do take a closer look!

Note

For a large number of iterations, the GPU will certainly heat up. If needed, you can increase the cooldown period using the cooldown_time_fraction argument. Monitoring GPU temperature programmatically requires admin privileges, but you can use third-party apps like stats, also available via Homebrew.
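
A note on how such a cooldown can work: the argument name suggests the pause scales with each iteration's runtime. The sketch below illustrates that idea; it is an assumption about the mechanism, not the repository's implementation.

    # Sketch of a runtime-proportional cooldown (an assumption about how
    # cooldown_time_fraction behaves, not the repository's implementation):
    # after each timed iteration, sleep for a fraction of the time it took,
    # giving the GPU a chance to shed heat before the next measurement.
    import time

    def run_with_cooldown(benchmark_fn, num_iterations=3, cooldown_time_fraction=0.5):
        durations = []
        for _ in range(num_iterations):
            start = time.perf_counter()
            benchmark_fn()
            elapsed = time.perf_counter() - start
            durations.append(elapsed)
            time.sleep(cooldown_time_fraction * elapsed)  # cooldown between runs
        return durations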

Notes

Apple silicon is fairly cost-effective for LLM inference due to its unified memory architecture. As LLM inference is mostly memory-bound for low batch sizes, devices with high memory bandwidth typically obtain high tokens/sec in inference benchmarks.
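
As a rough back-of-envelope illustration of why bandwidth dominates (the numbers below are approximate assumptions, not measurements): each generated token has to read essentially all model weights once, so decoding speed is bounded by memory bandwidth divided by model size.

    # Back-of-envelope roofline for memory-bound decoding (approximate
    # numbers, not measurements): each generated token reads roughly all
    # weights once, so tokens/sec is bounded by bandwidth / model size.
    bandwidth_gb_per_s = 273.0   # assumed unified-memory bandwidth, e.g. an M4 Pro
    model_size_gb = 0.5          # e.g. a 0.5B-parameter model quantized to int8

    upper_bound_tokens_per_s = bandwidth_gb_per_s / model_size_gb
    print(f"~{upper_bound_tokens_per_s:.0f} tokens/sec upper bound")  # ~546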

This benchmark focuses on the inference time of easy-to-run LLMs and unquantized transformer ops, primarily useful when running inference locally, or when finetuning custom models for (or on!) Apple devices.

You may also be interested in:

  • Tristan Bilot's comprehensive benchmark of fundamental operators for mlx, torch+mps, and torch+cuda (link). Placing both mlx and torch functions in a single benchmark class makes it easy to see the differences between the two, and we adopt the same strategy here (see the sketch after this list).

  • The work of Feng et al. comparing training on Nvidia cards vs Apple silicon.
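
To illustrate the single-benchmark-class strategy mentioned above, here is a minimal sketch (not the repository's actual code) that wraps a torch+mps and an mlx implementation of the same operator in one class, so both are timed under identical conditions:

    import time

    import mlx.core as mx
    import torch


    class SoftmaxBenchmark:
        """Holds a torch+mps and an mlx version of the same operator."""

        def __init__(self, shape=(1024, 1024)):
            # assumes an Apple GPU: torch tensors go to the MPS device
            self.torch_x = torch.randn(shape, device="mps")
            self.mlx_x = mx.random.normal(shape)

        def run_torch(self):
            out = torch.softmax(self.torch_x, dim=-1)
            torch.mps.synchronize()  # wait for the MPS command queue to finish
            return out

        def run_mlx(self):
            out = mx.softmax(self.mlx_x, axis=-1)
            mx.eval(out)  # mlx is lazy: force evaluation before timing stops
            return out


    # A real benchmark would warm up and average over many iterations;
    # a single timed call is shown here only to keep the sketch short.
    benchmark = SoftmaxBenchmark()
    for name, fn in [("torch+mps", benchmark.run_torch), ("mlx", benchmark.run_mlx)]:
        start = time.perf_counter()
        fn()
        print(f"{name}: {time.perf_counter() - start:.4f} s")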
