llama-optimus

Running Local AI?

llama-optimus will find the BEST llama.cpp performance flags for YOUR unique hardware

llama-optimus is a lightweight Python tool to automatically optimize llama.cpp performance flags for maximum throughput.

Maximize your tokens/s for prompt processing (pp) & token generation (tg).

Brings Bayesian optimization (Optuna) to your local & embedded AI models.


What does llama-optimus do?

  • Tunes llama.cpp parameters for maximum tokens/sec, using automated parameter search.
  • Bayesian optimization (Optuna) is used to maximize tokens/sec for prompt processing, token generation, or both.
  • Estimates the maximum GPU layer count (-ngl) for your hardware, unless you specify it yourself.
  • Supports override patterns for --override-tensor: lets you optimize advanced memory offloading for large models or low-VRAM systems.
  • Built-in system warmup to ensure benchmarking is done under real-world, “steady-state” conditions.
  • Grid search over categorical parameters (flags like --override-tensor and --flash-attn) combined with Bayesian tuning of the numerical ones.
  • CLI interface: All major parameters and paths are settable via command line or environment variable.
  • Built on: Optuna for hyperparameter optimization and llama.cpp for inference.
  • Adapts to Apple Silicon, Linux x86, and NVIDIA GPU systems
  • Outputs copy-paste-ready commands for llama-server and llama-bench
  • Automatically runs llama-bench at the end, comparing optimized vs. non-optimized results.
  • Quick & robust benchmarks: controllable via the --repeat and --n-tokens flags, which set the number of llama-bench repetitions per trial and the number of tokens used when estimating tokens/s for prompt processing (pp) and text generation (tg). A minimal sketch of one such benchmark call follows this list.
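
Under the hood, every feature above rests on one primitive: run llama-bench with a candidate set of flags and read back the averaged tokens/s. Here is a minimal, hedged sketch of that step (the helper name `measure_tps` and the `avg_ts` CSV column are assumptions, not the tool's actual code):

```python
# A minimal sketch, NOT the actual llama-optimus code: benchmark one flag
# configuration with llama-bench and return the averaged tokens/s.
import csv
import io
import subprocess

def measure_tps(llama_bin, model, flags, n_tokens=60, repeat=3):
    """Run llama-bench for one flag configuration and parse the tokens/s result."""
    cmd = [
        f"{llama_bin}/llama-bench",
        "--model", model,
        "-n", str(n_tokens),   # generation tokens (tg)
        "-p", "0",             # skip the prompt-processing test for a tg-only run
        "-r", str(repeat),     # repetitions averaged by llama-bench itself
        "-o", "csv",
    ] + flags                  # e.g. ["-t", "11", "-ngl", "97", "--flash-attn", "1"]
    out = subprocess.run(cmd, capture_output=True, text=True, timeout=600).stdout
    rows = list(csv.DictReader(io.StringIO(out)))
    return float(rows[-1]["avg_ts"])   # "avg_ts" column name is an assumption
```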

Requirements

  • Python 3.10+

  • Latest llama.cpp: the tool relies on llama-bench's --no-warmup flag (introduced in release b5706), so make sure your build is at least that recent

  • At least one GGUF model file


Installation I (recommended)

  1. Install llama-optimus from the PyPI distribution

    pip install llama-optimus
  2. Launch llama-optimus

    llama-optimus
  3. Provide paths to llama.cpp/build/bin and model.gguf. During launch, you will be prompted for the path to your ~/llama.cpp/build/bin directory and for the path to your AI model ~/model.gguf. Note: you must have the latest version of llama.cpp (follow the llama.cpp instructions.)

  4. For a quick test launch, skip the warmup and use only a few tokens per trial loop:

    llama-optimus --trials 5 --repeat 2 --no-warmup --n-tokens 20 --metric tg    

Installation II (dev option)

  1. Clone this repo:

    git clone https://github.com/BrunoArsioli/llama-optimus
    cd llama-optimus
  2. (Recommended) Create and activate a Python virtualenv:

    python3 -m venv .venv
    source .venv/bin/activate
  3. Install Python dependencies in editable/developer mode:

    pip install -e .

    Or, to install just requirements (if not developing):

    pip install -r requirements.txt
  4. Build llama.cpp:

    • Follow the llama.cpp instructions.
    • Note your full path to llama-bin (e.g., /your_path_to/llama.cpp/build/bin) for this tool to work.

⚙️ Configuration

  • All arguments are optional except --llama-bin and --model (if not set as env variables).
  • CLI flags override environment variables.

Option A: Set paths as environment variables

export LLAMA_BIN=/path_to/llama.cpp/build/bin
export MODEL_PATH=/path_to/model.gguf

If you set these environment variables, you will not need to pass the llama.cpp bin and model paths:

llama-optimus

For a more robust test, you can use more trials (default: 45) and more benchmark repetitions (default: 3):

llama-optimus --trials 70 -r 5 

To grid-search all override-tensor presets and flash-attn options, and to optimize the numerical flags:

llama-optimus --override-mode scan --trials 20 --metric tg

Quick Notes/FAQ:

What is a trial? A trial is one test of model performance (e.g., how many tokens/s the model can generate) for a given configuration of llama.cpp parameters.

For example, when running llama-optimus you will see output like this for every trial:

Trial 0 finished with value: 72.43 and parameters: {'batch': 32, 'flash_attn': 1, 'u_batch': 8, 'threads': 11, 'gpu_layers': 97}. Best is trial 0 with value: 72.43

Meaning: if you run llama-server with the flags --batch-size 32 --flash-attn --ubatch-size 8 --threads 11 -ngl 97, you will get about 72.43 tokens/s in token generation.
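
For illustration only (this is not llama-optimus internals), here is how such a parameter dictionary maps onto llama-server flags:

```python
# Hypothetical sketch: map a trial's parameter dict onto llama-server flags.
params = {"batch": 32, "flash_attn": 1, "u_batch": 8, "threads": 11, "gpu_layers": 97}

flags = " ".join(filter(None, [
    f"--batch-size {params['batch']}",
    "--flash-attn" if params["flash_attn"] else None,
    f"--ubatch-size {params['u_batch']}",
    f"--threads {params['threads']}",
    f"-ngl {params['gpu_layers']}",
]))
print(f"llama-server --model my_path_to/model.gguf {flags}")
# -> llama-server --model my_path_to/model.gguf --batch-size 32 --flash-attn --ubatch-size 8 --threads 11 -ngl 97
```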

What is a repetition, or -r ?

The result we got from Trial 0 (72.43 tokens/s) is a mean value; it was calculated by running the same Trial 0 configuration -r times.

Why do we need to repeat runs before calculating the result (tokens/s)?

Every time you run llama-bench (the llama.cpp benchmark) you get a slightly different estimate of the tokens/s metric. Because of this variability, we average over the repetitions so that the final decision is based on more reliable, robust results.
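
As a toy illustration (made-up numbers, not real measurements):

```python
# Three hypothetical llama-bench readings for the same configuration (-r 3).
from statistics import mean, stdev

runs = [71.9, 72.6, 72.8]   # tokens/s from repetitions 1..3 (illustrative values)
print(f"mean = {mean(runs):.2f} tok/s, spread = {stdev(runs):.2f}")
# -> mean = 72.43 tok/s, spread = 0.47
```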

Option B: Pass as CLI flags

llama-optimus --llama-bin ~my_path_to/llama.cpp/build/bin --model ~my_path_to/models/my-model.gguf

Option C: Source a helper script

You may create a script (e.g., set_local_paths.sh) with the content:

#!/usr/bin/env bash
export LLAMA_BIN=~my_path_to/llama.cpp/build/bin
export MODEL_PATH=~my_path_to/models/my-model.gguf

and source set_local_paths.sh before running llama-optimus.


Usage

Run llama-optimus with 25 trials, 3 repetitions per trial, optimizing only token generation (tg) speed:

llama-optimus --llama-bin my_path_to/llama.cpp/build/bin --model my_path_to/model.gguf --trials 25 -r 3 --metric tg

Or, if you prefer to use environment variables:

export LLAMA_BIN=my_path_to/llama.cpp/build/bin
export MODEL_PATH=my_path_to/model.gguf
llama-optimus --trials 25 -r 3 --metric tg

Common arguments:

  • --llama-bin  Path to your llama.cpp build/bin directory (or set $LLAMA_BIN)

  • --model   Path to your .gguf model file (or set $MODEL_PATH)

  • --trials   Number of optimization trials (default: 35)

  • --metric   Which throughput metric to optimize: tg = token generation speed, pp = prompt processing speed, mean = average of both

  • -r / --repeat How many repetitions per configuration (default: 2; use 1 for quick/dirty, 5 for robust)

  • --n-tokens Number of tokens to use for benchmarking. Larger = more stable measurements (default: 60).

  • --override-mode  How to treat --override-tensor (default: scan). Options: none = ignore this flag (do not scan the override-tensor space); scan = grid-search all preset patterns (from override_patterns.py); custom = user-defined override-tensor patterns ([TBD] future). A hypothetical example of a preset list is sketched after this list.

  • --n-warmup-tokens Number of tokens passed to llama-bench during each warmup loop

  • --warmup-runs Max warm-up iterations before optimization (default: 30; minimum: 4). For no warmup, use the --no-warmup flag.

  • --no-warmup Skip warmup phase (for test/debug purpose)
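
For reference, the preset patterns scanned by --override-mode scan live in override_patterns.py. The file's real contents are not reproduced here; the list below is a hypothetical sketch (only the last pattern is taken from this README, further down):

```python
# Hypothetical override-tensor preset list (the real one lives in override_patterns.py).
OVERRIDE_PATTERNS = [
    None,                                              # baseline: no --override-tensor at all
    r"blk\.\d+\.ffn_.*_exps\.=CPU",                    # illustrative: keep all expert FFN tensors on CPU
    r"blk\.(6|7|8|9|[1-9][0-9]+)\.ffn_.*_exps\.=CPU",  # pattern shown later in this README
]
# With --override-mode scan, each preset is benchmarked using the best numerical flags.
```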

See all options:

llama-optimus --help

How it works

  • By default, starts with a warm-up (see the warmup notes below).

  • By default, starts by estimating your hardware's max -ngl (GPU layers) for a given model. If you know your max, use --ngl-max to skip this step.

  • Uses Optuna to search for the best combination of llama.cpp flags (batch size, ubatch, threads, ngl, etc.).

  • Runs quick benchmarks (with llama-bench) for each config, parses results from the CSV, and feeds back to the optimizer.

  • Hierarchical optimization (a rough sketch of this staged flow follows this list):

    Stage 1: Bayesian search over numerical flags (batch_size, ubatch_size, threads, gpu_layers)

    Stage 2: Grid search over categorical flags (override-tensor, flash-attn), using the best numericals found

    Stage 3: Final fine-tuning over numerical flags with best categorical flags fixed

  • Finds the best flags in minutes —no need to try random combinations!

  • Handles errors (non-working configs are skipped).

  • You can abort the trials at any time and still keep the best result/configuration achieved so far.
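
The staged flow above can be pictured with the following sketch. It is an assumption-level outline, not the tool's actual code; it reuses the hypothetical measure_tps() helper and OVERRIDE_PATTERNS list sketched earlier, and LLAMA_BIN / MODEL stand for your own paths:

```python
# Sketch of the staged search idea (assumptions only; not llama-optimus's real code).
import itertools
import optuna

def numeric_flags(params):
    # Translate a parameter dict into llama-bench/llama-server style flags.
    return ["-t", str(params["threads"]), "-b", str(params["batch"]),
            "-ub", str(params["u_batch"]), "-ngl", str(params["gpu_layers"])]

def objective(trial, categorical):
    params = {
        "threads":    trial.suggest_int("threads", 1, 16),
        "batch":      trial.suggest_categorical("batch", [32, 256, 1024, 4096]),
        "u_batch":    trial.suggest_categorical("u_batch", [8, 64, 256, 1024]),
        "gpu_layers": trial.suggest_int("gpu_layers", 0, 99),
    }
    return measure_tps(LLAMA_BIN, MODEL, numeric_flags(params) + categorical)

# Stage 1: Bayesian (TPE) search over the numerical flags only.
study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler())
study.optimize(lambda t: objective(t, categorical=[]), n_trials=35)

# Stage 2: grid search over categorical flags, with the best numerical flags fixed.
best_cat, best_tps = [], study.best_value
for fa, pattern in itertools.product(["0", "1"], OVERRIDE_PATTERNS):
    cat = ["--flash-attn", fa] + (["--override-tensor", pattern] if pattern else [])
    tps = measure_tps(LLAMA_BIN, MODEL, numeric_flags(study.best_params) + cat)
    if tps > best_tps:
        best_cat, best_tps = cat, tps

# Stage 3: final fine-tuning of the numerical flags with the winning categoricals fixed.
study.optimize(lambda t: objective(t, categorical=best_cat), n_trials=10)
```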

Notes: llama-optimus will try to ensure your system is warmed up before starting the optimization, to avoid misleading cold-start results. Cold starts lead to inflated tokens/s counts; this is a source of confusion in the community, because cold-start results are often reported as "found a new configuration which improved tokens/s by xx%", which is misleading when comparing cold-start against warmed-up numbers.


Output

  • After running, you'll see:

    • The best configuration found for your hardware/model.
    • A ready-to-copy llama-server command line for running optimal inference.
    • A ready-to-copy llama-bench command for benchmarking both prompt processing and generation speeds.
    • Example output:
Best config: {'batch': 4096, 'flash': 1, 'u_batch': 1024, 'threads': 4, 'gpu_layers': 93}
Best tg tokens/sec: 73.5

# You are ready to run a local llama-server:

# For optimal inference:
llama-server --model my_path_to/model.gguf -t 4 --batch-size 4096 --ubatch-size 1024 -ngl 93 --flash-attn 

# To benchmark both generation and prompt processing speeds:
llama-bench --model my_path_to/model.gguf -t 4 --batch-size 4096 --ubatch-size 1024 -ngl 93 --flash-attn 1 -n 128 -p 128 -r 3 -o csv

# In case you want image-to-text capabilities on your server, you will need to add the flag --mmproj my_path_to/mmproj_file.gguf ; every image-to-text model has a projection file available.

Tip

The default values for prompt processing (-p 128) and token generation (-n 128) give fast trials.

You can control this via the --n-tokens parameter passed to llama-optimus at launch; larger values lead to more stable and robust results:

llama-optimus --n-tokens 256

Later, for a stable final score, re-run llama-bench with the best flags found (don't forget to warm-up first):

llama-bench ... -p 512 -n 256 -r 5 --progress

For example, launching a server with optimized flags will look like this:

llama-server --model my_path_to/model.gguf -t 4 --batch-size 4096 --ubatch-size 1024 -ngl 63 --flash-attn  --override-tensor "blk\.(6|7|8|9|[1-9][0-9]+)\.ffn_.*_exps\.=CPU"

Why Do a Warmup?

Initial runs are fast because hardware (especially Apple Silicon) is “cold”—no thermal throttling, RAM/cache is fresh, and CPU governor may be in turbo mode.

After several runs, temperature rises or RAM usage accumulates, causing the system to reduce clocks or evict caches.

If you start Trials (Stage 1 or 2) with “cold” hardware, the “best” config may just be lucky—chosen in a period of turbo clocks, giving misleading results.

For this reason, llama-optimus warms up before scanning the parameter space with its Bayesian TPESampler.

Keep in mind: never trust cold-start numbers. Warming up the system and waiting for stable, “saturated” (real-world) performance will make your optimizer results much more robust and grounded in real use cases.

Make sure your fans turn on with the default number of warmup runs (default: 40).

If you need more (or fewer) warmup runs for your system, consider passing the --n-warmup-runs flag when launching llama-optimus:

llama-optimus --n-warmup-runs 50
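
One simple way to picture the warmup phase (this is only an illustrative sketch of the idea, not llama-optimus's actual stopping criterion; the 2% tolerance is an arbitrary choice):

```python
# Illustrative warm-up loop: repeat short benchmark passes until successive
# tokens/s readings stop drifting, i.e. the machine has reached steady state.
def warm_up(measure, max_runs=30, tolerance=0.02):
    """`measure()` returns a tokens/s reading (e.g. a short llama-bench pass).
    Stop once two consecutive readings differ by less than `tolerance`
    (relative), or after `max_runs` iterations."""
    previous = measure()
    for _ in range(max_runs - 1):
        current = measure()
        if abs(current - previous) / previous < tolerance:
            break   # readings have stabilized; the system is "warm"
        previous = current
```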

🛟 Troubleshooting 🛟

  • llama-bench not found: Check your --llama-bin path or LLAMA_BIN env var.
  • MODEL_PATH not set: Use --model or set the env variable.
  • Zero tokens/s: Try reducing --ngl-max or verifying your model is compatible with your hardware.

Project Structure

llama-optimus/
│
├── src/
│    └── llama_optimus/
│           ├── __init__.py
│           ├── core.py              # all optimization/benchmark logic
│           ├── cli.py               # CLI interface (argparse, entrypoint)
│           ├── search_space.py      # the numerical search space
│           └── override_patterns.py # override-tensor patterns for use with llama.cpp
│
├── test/
│   └── test_core.py
│
├── assets/
│   └── llama.optimus_logo.png
│
├── requirements.txt
├── setup.py
└── README.md

Comments about other llama.cpp flags

| Flag | Why it matters | Suggested search values | Notes |
|------|----------------|-------------------------|-------|
| --mmap / --no-mmap (memory-map the model vs. fully load it) | On fast NVMe & Apple SSDs, --mmap 1 (the default) is fine. On slower HDDs/remote disks, disabling mmap (--no-mmap or --mmap 0) and loading the whole model into RAM often gives 10-20% faster generation (no page-fault stalls). | [0, 1] (boolean) | Keep the default 1; let Optuna see if 0 wins on a given box. |
| --cache-type-k / --cache-type-v | Setting the key/value cache to f16 vs. q4 or i8 trades RAM against speed. Most Apple-Metal & CUDA users stick with f16 (fast, larger). On low-RAM CPUs, no speed gain is possible if the system swaps; q4 can shrink the cache 2-3x at roughly a 3-5% speed cost. | ["f16", "q4"] for both k & v (skip i8 unless you target very tiny devices) | Only worth searching on CPU-only or small-VRAM GPU systems. You can gate this by detecting "CUDA not found" or VRAM < 8 GB. |
| --main-gpu / --gpu-split (or --tensor-split) | Relevant only for multi-GPU rigs (NVIDIA). Picking the right primary GPU or tensor split can cut VRAM fragmentation and enable a higher -ngl. | If multi-GPU is detected, expose [0, 1] for main-gpu and a handful of tensor-split presets (0,1; 0,0.5,0.5; etc.) | Keep disabled on single-GPU/Apple Silicon to avoid wasted trials. |
| [Preliminary] --flash-attn-type 0/1/2 (v0.2+ of llama.cpp) | Metal + CUDA now have multiple flash-attention kernels (0 ≈ old GEMM, 1 = FMHA, 2 = in-place FMHA). Note: not yet merged into llama.cpp main. Some M-series Macs get +5-8% with type 2 vs. 1. | [0, 1, 2], but only if the llama.cpp commit is from May 2025 or later | Add a version guard: skip the flag on older builds. |
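
If you want to experiment with any of the flags above, the FAQ below mentions a SEARCH_SPACE dictionary in search_space.py that can be extended. Its real layout is not reproduced here, so treat this as a hypothetical sketch using the suggested values from the table:

```python
# Hypothetical extension of a search-space dict with the flags from the table above.
SEARCH_SPACE_EXTRA = {
    "mmap":         [0, 1],          # --no-mmap vs. default memory mapping
    "cache_type_k": ["f16", "q4"],   # --cache-type-k (exact type names depend on your llama.cpp build)
    "cache_type_v": ["f16", "q4"],   # --cache-type-v
}

# Inside an Optuna objective, each entry would become a categorical suggestion, e.g.:
#   trial.suggest_categorical("cache_type_k", SEARCH_SPACE_EXTRA["cache_type_k"])
```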

FAQ

Q: Does this tune all llama.cpp options? A: No, just the most performance-relevant (batch sizes, threads, GPU layers, Flash-Attn). Feel free to extend the SEARCH_SPACE dictionary.

Q: Can I use this with non-Meta models? A: Yes! Any GGUF model supported by your llama.cpp build.

Q: Is it safe for my hardware? A: Yes. Non-working configs are skipped, and benchmarks use conservative timeouts.


Development & Testing

Write your own test scripts in /test (see test_core.py for simple mocking).

All major code logic is in llama_optimus/core.py.

Install in dev/edit mode: pip install -e .

Run CLI directly: llama-optimus ... or python -m llama_optimus.cli ...

Contributing

PRs, suggestions, and benchmarks welcome!
Open an issue or submit a PR.


License

MIT License (see LICENSE).


Acknowledgements

Built with Optuna

- [Optuna](https://optuna.org/)

Made for llama.cpp

- [llama.cpp](https://github.com/ggerganov/llama.cpp)

llama.optimus

- [llama.optimus](https://github.com/BrunoArsioli/llama-optimus)

PyPI

- [PyPI](https://pypi.org/)

Optimize your LLM inference—contribute your configs, and help the community improve!
