Running Local AI?
llama-optimus will find the BEST llama.cpp performance flags for YOUR unique hardware
llama-optimus is a lightweight Python tool to automatically optimize llama.cpp performance flags for maximum throughput.
Maximize your tokens/s for prompt processing (pp) & token generation (tg).
Brings Bayesian optimization (Optuna) to your local & embedded AI models.
- Tunes `llama.cpp` parameters for maximum tokens/sec, using automated parameter search.
- Bayesian optimization (Optuna) is used to maximize tokens/sec for prompt processing, generation, or both (see the sketch after this list).
- Estimates the maximum GPU layer count (`-ngl`) your hardware can handle for a given model.
- Supports override patterns for `--override-tensor`: lets you optimize advanced memory offloading for large models or low-VRAM systems.
- Built-in system warmup to ensure benchmarking is done under real-world, “steady-state” conditions.
- Grid search over categorical parameters (flags like `--override-tensor` and `--flash-attn`), combined with Bayesian tuning of numerical ones.
- CLI interface: all major parameters and paths can be set via the command line or environment variables.
- Built on Optuna for hyperparameter optimization and llama.cpp for inference.
- Adapts to Apple Silicon, Linux x86, and NVIDIA GPU systems.
- Outputs copy-paste-ready commands for `llama-server` and `llama-bench`.
- Automatically runs `llama-bench` at the end, comparing optimized vs. non-optimized results.
- Quick & robust benchmarks: controlled via the `-r`/`--repeat` and `--n-tokens` flags, which set the number of llama-bench repetitions per trial and the number of tokens used when estimating tokens/s for prompt processing (pp) and text generation (tg).
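To make the approach concrete, here is a minimal, self-contained sketch of the numerical-search idea: an Optuna study whose objective launches `llama-bench` with a candidate flag set and returns the measured tokens/s. The parameter ranges, token counts, and the `run_llama_bench` helper are illustrative, not the actual code in `core.py`; the CSV column picked at the end depends on your llama.cpp build.

```python
# Minimal sketch (illustrative, not the actual llama-optimus internals).
# Assumes LLAMA_BIN points at llama.cpp/build/bin and MODEL_PATH at a GGUF file.
import os, subprocess, optuna

LLAMA_BIN = os.environ["LLAMA_BIN"]      # e.g. /path_to/llama.cpp/build/bin
MODEL     = os.environ["MODEL_PATH"]     # e.g. /path_to/model.gguf

def run_llama_bench(extra_flags):
    """Run one llama-bench measurement and return generation tokens/s (hypothetical parser)."""
    cmd = [os.path.join(LLAMA_BIN, "llama-bench"), "-m", MODEL,
           "-n", "60", "-p", "0", "-r", "2", "-o", "csv"] + extra_flags
    out = subprocess.run(cmd, capture_output=True, text=True, timeout=600)
    last_row = out.stdout.strip().splitlines()[-1].split(",")
    return float(last_row[-2])           # column layout depends on your llama.cpp build

def objective(trial):
    flags = ["-t",   str(trial.suggest_int("threads", 2, 16)),
             "-b",   str(trial.suggest_categorical("batch", [256, 512, 1024, 2048, 4096])),
             "-ub",  str(trial.suggest_categorical("u_batch", [8, 64, 256, 1024])),
             "-ngl", str(trial.suggest_int("gpu_layers", 0, 99))]
    try:
        return run_llama_bench(flags)
    except Exception:
        return 0.0                       # non-working configs simply score zero here

study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=35)
print(study.best_params, study.best_value)
```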
- Python 3.10+
- Latest llama.cpp (release version > 3667, which incorporated the `--no-warmup` flag into llama-bench: b5706)
- At least one GGUF model file
- Make sure you have installed the latest version of llama.cpp (release version > 3667).
- Install llama-optimus from the PyPI distribution:
  `pip install llama-optimus`
- Launch llama-optimus:
  `llama-optimus`
- Provide the paths to `llama.cpp/build/bin` and `model.gguf`. During launch, you will be prompted for the path to your ~/llama.cpp/build/bin directory and the path to your AI model ~/model.gguf. Note: you must have the latest version of llama.cpp (follow the llama.cpp instructions).
- For a quick test launch with no warmup and only a few tokens per trial loop:
  `llama-optimus --trials 5 --repeat 2 --no-warmup --n-tokens 20 --metric tg`
- Clone this repo:
  `git clone https://github.com/BrunoArsioli/llama-optimus`
  `cd llama-optimus`
- (Recommended) Create and activate a Python virtualenv:
  `python3 -m venv .venv`
  `source .venv/bin/activate`
- Install Python dependencies in editable/developer mode:
  `pip install -e .`
  Or, to install just the requirements (if not developing):
  `pip install -r requirements.txt`
- Build llama.cpp:
  - Follow the llama.cpp instructions.
  - Note your full path to `llama-bin` (e.g., `/your_path_to/llama.cpp/build/bin`) for this tool to work.
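Before running the tool, it can help to confirm that the path really contains the llama-bench binary. A quick check like the following (illustrative only, not part of llama-optimus) avoids confusing "command not found" errors later:

```python
# Quick sanity check that LLAMA_BIN points at a llama.cpp build/bin directory.
import os, sys

llama_bin = os.environ.get("LLAMA_BIN", "/your_path_to/llama.cpp/build/bin")
bench = os.path.join(llama_bin, "llama-bench")

if not (os.path.isfile(bench) and os.access(bench, os.X_OK)):
    sys.exit(f"llama-bench not found or not executable at {bench}; check your build path")
print(f"OK: {bench}")
```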
- All arguments are optional except `--llama-bin` and `--model` (if not set as environment variables).
- CLI flags override environment variables.

`export LLAMA_BIN=/path_to/llama.cpp/build/bin`
`export MODEL_PATH=/path_to/model.gguf`

If you set those environment variables, you will not need to pass the llama bin and model paths; simply run:

`llama-optimus`

For a robust test, you can use more trials (default: 45) and more benchmark repetitions (default: 3):

`llama-optimus --trials 70 -r 5`

To grid search all override-tensor presets and flash-attn options, and optimize the numerical flags:

`llama-optimus --override-mode scan --trials 20 --metric tg`

What is a trial? It is a test of the model's performance (e.g., how many tokens/s it can generate) for a given configuration of llama.cpp parameters.
For example, when running llama-optimus you will see output for every trial, like this:
Trial 0 finished with value: 72.43 and parameters: {'batch': 32, 'flash_attn': 1, 'u_batch': 8, 'threads': 11, 'gpu_layers': 97}. Best is trial 0 with value: 72.43
This means that if you run a llama-server with the flags `--batch-size 32 --flash-attn --ubatch-size 8 --threads 11 -ngl 97`, you will get about 72.43 tokens/s in token generation.
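The mapping from a trial's parameter dictionary to the actual llama.cpp flags is mechanical. Here is a hedged sketch of what that translation could look like; the helper name and the flag mapping are illustrative, not the exact code used by llama-optimus:

```python
# Illustrative translation of a trial's parameter dict into llama-server flags.
def to_server_flags(params: dict) -> str:
    parts = [
        f"--batch-size {params['batch']}",
        f"--ubatch-size {params['u_batch']}",
        f"--threads {params['threads']}",
        f"-ngl {params['gpu_layers']}",
    ]
    if params.get("flash_attn"):        # boolean flag: present only when enabled
        parts.append("--flash-attn")
    return " ".join(parts)

best = {'batch': 32, 'flash_attn': 1, 'u_batch': 8, 'threads': 11, 'gpu_layers': 97}
print("llama-server --model model.gguf " + to_server_flags(best))
# -> llama-server --model model.gguf --batch-size 32 --ubatch-size 8 --threads 11 -ngl 97 --flash-attn
```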
What is a repetition, or `-r`?
The result we got from Trial 0 (72.43 tokens/s) is a mean value; it was calculated by running this same Trial 0 configuration `-r` times.
Why do we need to repeat runs before calculating the result (tokens/s)?
Because every time you run llama-bench (the llama.cpp benchmark) you get a slightly different estimate for the tokens/s metric. There is a degree of variability in the results, so we calculate a mean value to base the final decision on more reliable/robust numbers.
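The averaging principle is simple, whether llama-bench's own `-r` does it or it is done in Python. A small sketch, assuming a `measure_once()` callable that returns one noisy llama-bench reading (the helper is made up for illustration):

```python
# Averaging over repetitions to get a more stable tokens/s estimate (illustrative).
from statistics import mean, stdev

def measure_trial(measure_once, repeat: int = 3) -> float:
    """Run the same configuration `repeat` times and report the mean tokens/s."""
    samples = [measure_once() for _ in range(repeat)]
    if repeat > 1:
        print(f"samples={samples}  mean={mean(samples):.2f}  stdev={stdev(samples):.2f}")
    return mean(samples)
```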
You can also pass the paths directly on the command line:

`llama-optimus --llama-bin ~my_path_to/llama.cpp/build/bin --model ~my_path_to/models/my-model.gguf`

Alternatively, you may create a script (e.g., set_local_paths.sh) with the content:

#!/usr/bin/env bash
export LLAMA_BIN=~my_path_to/llama.cpp/build/bin
export MODEL_PATH=~my_path_to/models/my-model.gguf

and `source set_local_paths.sh` before running llama-optimus.
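Either way, the precedence rule (CLI flag first, environment variable second) can be expressed with an argparse default, roughly like this. The real CLI lives in cli.py and may differ; this is only a sketch of the fallback behaviour:

```python
# Illustrative CLI-over-environment precedence, similar in spirit to cli.py.
import argparse, os

parser = argparse.ArgumentParser(prog="llama-optimus")
parser.add_argument("--llama-bin", default=os.environ.get("LLAMA_BIN"),
                    help="Path to llama.cpp build/bin (falls back to $LLAMA_BIN)")
parser.add_argument("--model", default=os.environ.get("MODEL_PATH"),
                    help="Path to the .gguf model (falls back to $MODEL_PATH)")
args = parser.parse_args()

if not args.llama_bin or not args.model:
    parser.error("--llama-bin and --model are required (or set LLAMA_BIN / MODEL_PATH)")
```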
To run llama-optimus with 25 trials, 3 repetitions per trial, and only estimate token generation (tg) speed:

`llama-optimus --llama-bin my_path_to/llama.cpp/build/bin --model my_path_to/model.gguf --trials 25 -r 3 --metric tg`

Or, if you prefer to use environment variables:

export LLAMA_BIN=my_path_to/llama.cpp/build/bin
export MODEL_PATH=my_path_to/model.gguf
llama-optimus --trials 25 -r 3 --metric tg

Common arguments:
- `--llama-bin`  Path to your llama.cpp `build/bin` directory (or set `$LLAMA_BIN`)
- `--model`  Path to your `.gguf` model file (or set `$MODEL_PATH`)
- `--trials`  Number of optimization trials (default: 35)
- `--metric`  Which throughput metric to optimize: `tg` = token generation speed, `pp` = prompt processing speed, `mean` = average of both
- `-r` / `--repeat`  How many repetitions per configuration (default: 2; use 1 for quick/dirty, 5 for robust)
- `--n-tokens`  Number of tokens to use for benchmarking. Larger = more stable measurements (default: 60).
- `--override-mode`  How to treat `--override-tensor` (default: scan): `none` = ignore this flag (do not scan over the override-tensor space); `scan` = grid search all preset patterns (from override_patterns.py); `custom` = (future, TBD) user-defined override-tensor patterns
- `--n-warmup-tokens`  Number of tokens passed to llama-bench during each warmup loop
- `--warmup-runs`  Maximum warm-up iterations before optimization (default: 30; minimum is 4; for no warmup, use the `--no-warmup` flag)
- `--no-warmup`  Skip the warmup phase (for test/debug purposes)
See all options:

`llama-optimus --help`

- By default, starts with a warm-up (see the extra notes below).
- By default, starts by estimating your hardware's maximum `-ngl` (GPU layers) for a given model. If you already know your maximum, use `--ngl-max` to skip this step.
- Uses Optuna to search for the best combination of `llama.cpp` flags (batch size, ubatch, threads, ngl, etc.).
- Runs quick benchmarks (with `llama-bench`) for each config, parses the results from the CSV, and feeds them back to the optimizer.
- Hierarchical optimization (a rough Stage 2 sketch follows this list):
  - Stage 1: Bayesian search over numerical flags (batch_size, ubatch_size, threads, gpu_layers)
  - Stage 2: Grid search over categorical flags (override-tensor, flash-attn), using the best numericals found
  - Stage 3: Final fine-tuning of the numerical flags with the best categorical flags fixed
- Finds the best flags in minutes, with no need to try random combinations!
- Handles errors (non-working configs are skipped).
- You can abort the trials at any time and still keep the best result/configuration achieved so far.
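Stage 2 of the hierarchy can be pictured as a plain grid search layered on top of the Stage 1 winner. A rough sketch, with `benchmark(flags)` standing in for the llama-bench call and the preset list standing in for override_patterns.py (names and structure are illustrative):

```python
# Rough sketch of the Stage 2 idea: grid-search the categorical flags
# while keeping the best numerical flags from Stage 1 fixed.
from itertools import product

def stage2(benchmark, best_numerical: dict, override_presets: list[str]):
    best_score, best_cats = float("-inf"), None
    for fa, pattern in product([0, 1], override_presets + [None]):
        cats = {"flash_attn": fa, "override_tensor": pattern}
        score = benchmark({**best_numerical, **cats})   # one measurement per combination
        if score > best_score:
            best_score, best_cats = score, cats
    return best_cats, best_score
```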
Notes: llama-optimus tries to ensure your system is warmed up before starting optimization, to avoid misleading cold-start results. Cold starts lead to inflated tokens/s counts; this is a source of confusion in the community, because cold-start results are often reported as "found a new configuration which improved tokens/s by xx%", which is misleading when comparing cold-start vs. warmed-up numbers.
- After running, you'll see:
  - The best configuration found for your hardware/model.
  - A ready-to-copy `llama-server` command line for running optimal inference.
  - A ready-to-copy `llama-bench` command for benchmarking both prompt processing and generation speeds.
- Example output:
Best config: {'batch': 4096, 'flash': 1, 'u_batch': 1024, 'threads': 4, 'gpu_layers': 93}
Best tg tokens/sec: 73.5
# You are ready to run a local llama-server:
# For optimal inference:
llama-server --model my_path_to/model.gguf -t 4 --batch-size 4096 --ubatch-size 1024 -ngl 93 --flash-attn
# To benchmark both generation and prompt processing speeds:
llama-bench --model my_path_to/model.gguf -t 4 --batch-size 4096 --ubatch-size 1024 -ngl 93 --flash-attn 1 -n 128 -p 128 -r 3 -o csv
# In case you want image-to-text capabilities on your server, you will need to add a flag: --mmproj my_path_to/mmproj_file.gguf ; every image-to-text model has a projection file available.
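Since llama-bench can emit CSV (`-o csv`, as in the command above), pulling out the pp and tg throughput numbers is a small parsing job. A hedged sketch follows; the column names (`n_prompt`, `n_gen`, `avg_ts`) match recent llama.cpp builds, but check the header line your own build actually prints:

```python
# Parse llama-bench CSV output and report prompt-processing vs. generation tokens/s.
# Column names (n_prompt, n_gen, avg_ts) follow recent llama.cpp builds; verify against
# the header your build emits.
import csv, io

def parse_llama_bench_csv(csv_text: str) -> dict:
    speeds = {"pp": None, "tg": None}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if int(row["n_prompt"]) > 0 and int(row["n_gen"]) == 0:
            speeds["pp"] = float(row["avg_ts"])    # prompt-processing tokens/s
        elif int(row["n_gen"]) > 0 and int(row["n_prompt"]) == 0:
            speeds["tg"] = float(row["avg_ts"])    # token-generation tokens/s
    return speeds
```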
The default values for prompt processing (-p 128) and token generation (-n 128) give fast trials.
You can control this via the --n-tokens parameter at launch; larger values lead to more stable and robust results:

`llama-optimus --n-tokens 256`

Later, for a stable final score, re-run llama-bench with the best flags found (don't forget to warm up first):

`llama-bench ... -p 512 -n 256 -r 5 --progress`
An example of launching a server with optimized flags will look like this:

`llama-server --model my_path_to/model.gguf -t 4 --batch-size 4096 --ubatch-size 1024 -ngl 63 --flash-attn --override-tensor "blk\.(6|7|8|9|[1-9][0-9]+)\.ffn_.*_exps\.=CPU"`

Initial runs are fast because the hardware (especially Apple Silicon) is “cold”: no thermal throttling, RAM/cache is fresh, and the CPU governor may be in turbo mode.
After several runs, temperature rises or RAM usage accumulates, causing the system to reduce clocks or evict caches.
If you start Trials (Stage 1 or 2) with “cold” hardware, the “best” config may just be lucky—chosen in a period of turbo clocks, giving misleading results.
For this reason, llama-optimus warms up before scanning the parameter space with its Bayesian TPESampler.
Keep in mind: never trust cold-start numbers. Warming up the system and waiting for stable, “saturated” (real-world) performance will make your optimizer results much more robust and grounded to real use cases.
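One simple way to picture the warm-up logic: keep running short benchmarks until the reading stops drifting. The sketch below only illustrates that stopping rule; the threshold, window, and `measure_once` helper are made up, not the exact criterion llama-optimus uses.

```python
# Illustrative "steady-state" warm-up loop: stop when consecutive readings stop drifting.
def warm_up(measure_once, max_runs: int = 30, min_runs: int = 4, tol: float = 0.03):
    history = []
    for i in range(max_runs):
        tps = measure_once()                     # one short llama-bench reading
        history.append(tps)
        if i + 1 >= min_runs:
            prev, last = history[-2], history[-1]
            if prev > 0 and abs(last - prev) / prev < tol:   # < 3 % drift -> warmed up
                break
    return history
```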
Make sure your fans turn on with the default number of warmup runs (default: 40).
If you need more (or fewer) runs to warm up your system, consider passing the --n-warmup-runs flag when launching llama-optimus:

`llama-optimus --n-warmup-runs 50`

Troubleshooting:
- llama-bench not found: check your `--llama-bin` path or the `LLAMA_BIN` env var.
- MODEL_PATH not set: use `--model` or set the env variable.
- Zero tokens/s: try reducing `--ngl-max` or verify that your model is compatible with your hardware.
llama-optimus/
│
├── src/
│ └── llama_optimus/
│ ├── __init__.py
│ ├── core.py # all optimization/benchmark logic
│ └── cli.py # CLI interface (argparse, entrypoint)
│ └── search_space.py # the numerical search space
│ └── override_patterns.py # override-tensor patterns for use with llama.cpp.
│
├── test/
│ └── test_core.py
│
├── assets/
│ └── llama.optimus_logo.png
│
├── requirements.txt
├── setup.py
└── README.md

| Flag | Why it matters | Suggested search values | Notes |
|---|---|---|---|
| `--mmap` / `--no-mmap` (memory-map model vs. fully load) | On fast NVMe & Apple SSD, `--mmap 1` (default) is fine. On slower HDD/remote disks, disabling mmap (`--no-mmap` or `--mmap 0`) and loading the whole model into RAM often gives 10-20 % faster generation (no page-fault stalls). | `[0, 1]` (boolean) | Keep default 1; let Optuna see if 0 wins on a given box. |
| `--cache-type-k` / `--cache-type-v` | Setting the key/value cache to f16 vs. q4 or i8 trades RAM vs. speed. Most Apple-Metal & CUDA users stick with f16 (fast, larger). For low-RAM CPUs, speed gains are impossible if the system swaps; q4 can shrink the cache 2-3× at ~3-5 % speed cost. | `["f16","q4"]` for both k & v (skip i8 unless you target very tiny devices). | Only worth searching when the user is on CPU-only or a small-VRAM GPU. You can gate this by detecting “CUDA not found” or VRAM < 8 GB. |
| `--main-gpu` / `--gpu-split` (or `--tensor-split`) | Relevant only for multi-GPU rigs (NVIDIA). Picking the right primary GPU or tensor split can cut VRAM fragmentation and enable a higher `-ngl`. | If multi-GPU is detected, expose `[0,1]` for main-gpu and a handful of tensor-split presets (`0,1`, `0,0.5,0.5`, etc.). | Keep disabled on single-GPU/Apple Silicon to avoid wasted trials. |
| [Preliminary] `--flash-attn-type 0/1/2` (v0.2+ of llama.cpp) | Metal + CUDA now have two flash-attention kernels (0 ≈ old GEMM, 1 = FMHA, 2 = in-place FMHA). Note: not yet merged into llama.cpp main. Some M-series Macs get +5-8 % with type 2 vs. 1. | `[0,1,2]`, but only if the llama.cpp commit is ≥ May 2025. | Add a version guard: skip the flag on older builds. |
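If you want to experiment with any of the rows above, extending the search space could look roughly like this. The exact structure of SEARCH_SPACE in src/llama_optimus/search_space.py may differ; treat this as a shape, not the real dictionary:

```python
# Hypothetical additions to the categorical search space.
# The real SEARCH_SPACE in src/llama_optimus/search_space.py may use a different shape.
EXTRA_SEARCH_SPACE = {
    "mmap":         {"type": "categorical", "choices": [0, 1]},           # --mmap / --no-mmap
    "cache_type_k": {"type": "categorical", "choices": ["f16", "q4_0"]},  # table's "q4" ~ q4_0
    "cache_type_v": {"type": "categorical", "choices": ["f16", "q4_0"]},
}

def suggest_extra(trial):
    """Map the extra entries onto Optuna suggestions (illustrative)."""
    return {name: trial.suggest_categorical(name, spec["choices"])
            for name, spec in EXTRA_SEARCH_SPACE.items()}
```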
Q: Does this tune all llama.cpp options?
A: No, just the most performance-relevant (batch sizes, threads, GPU layers, Flash-Attn). Feel free to extend the SEARCH_SPACE dictionary.
Q: Can I use this with non-Meta models?
A: Yes! Any GGUF model supported by your llama.cpp build.

Q: Is it safe for my hardware?
A: Yes. Non-working configs are skipped, and benchmarks use conservative timeouts.
Write your own test scripts in /test (see test_core.py for simple mocking, and the example test below).
All major code logic is in `llama_optimus/core.py`.
Install in dev/edit mode: `pip install -e .`
Run the CLI directly: `llama-optimus ...` or `python -m llama_optimus.cli ...`
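A test along these lines keeps everything local (no real llama-bench call): it fakes the benchmark subprocess and checks the output handling. The `run_fake_bench` helper and the canned CSV are defined inside the test purely for illustration; adapt the import and assertions to whatever core.py actually exposes.

```python
# test/test_parse_example.py -- self-contained example; adapt to the real helpers in core.py.
import subprocess
from unittest.mock import patch, MagicMock

FAKE_CSV = "n_prompt,n_gen,avg_ts\n128,0,512.3\n0,128,72.4\n"

def run_fake_bench():
    """Stand-in for a helper that shells out to llama-bench."""
    out = subprocess.run(["llama-bench", "-o", "csv"], capture_output=True, text=True)
    return out.stdout

def test_bench_is_mocked():
    fake = MagicMock()
    fake.stdout = FAKE_CSV
    with patch("subprocess.run", return_value=fake) as mocked:
        assert "72.4" in run_fake_bench()   # the canned tg reading is passed through
        mocked.assert_called_once()
```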
PRs, suggestions, and benchmarks welcome!
Open an issue or submit a PR.
MIT License (see LICENSE).
Built with Optuna
- [Optuna](https://optuna.org/)

Made for llama.cpp
- [llama.cpp](https://github.com/ggerganov/llama.cpp)

llama.optimus
- [llama.optimus](https://github.com/BrunoArsioli/llama-optimus)

PyPI
- [PyPi](https://pypi.org/)

Optimize your LLM inference: contribute your configs and help the community improve!



