Measures how fast llama.cpp can run a given model on a given backend runtime (Vulkan, ROCm, CUDA, etc.), at small as well as medium/large context sizes, and dumps the results into a CSV.
AMD's Strix Halo platform is very interesting for its take on unified memory in the PC architecture world, with 128 GB of memory. Sadly, the kernel + GPU firmware + ROCm (AI acceleration layer) stack still needs work to do better on performance and reliability/quality.
Meanwhile, I highly recommend just using https://github.com/kyuz0/amd-strix-halo-toolboxes to simplify setup and remove most of the guess-work.
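If you go the toolbox route, setup is roughly one `toolbox create` per backend. The container names below match the runners used later in this README, but the image names/tags are my assumption; check the toolboxes repo for the actual, current images.

```bash
# Hypothetical toolbox setup; the image names/tags below are assumptions,
# see https://github.com/kyuz0/amd-strix-halo-toolboxes for the real ones.
toolbox create --image docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-7.2    llama-rocm-7.2
toolbox create --image docker.io/kyuz0/amd-strix-halo-toolboxes:vulkan-radv llama-vulkan-radv
```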
Right now it tests only token throughput (speed), not the intelligence or quality of those tokens.
Key metrics:
- pp = prompt processing, in tokens/second. This is how quickly the LLM can process the input before you see its first response. Agentic tasks usually carry far more context than chat (which often starts with just 'hi' or a sentence or two), so you do want to pay attention to this.
- tg = token generation, in tokens/second. This is how quickly the LLM can generate its response after processing your prompt. (See the quick example run after this list.)
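To make those two numbers concrete, here's roughly what a standalone llama-bench run looks like. This is just a sketch: the model path is a placeholder and it assumes you've already created the ROCm toolbox container referenced later in this README.

```bash
# Direct llama-bench run inside the ROCm toolbox (not via llama-thon.sh).
# -p 512 sizes the prompt-processing test -> reported as pp512, in tokens/s
# -n 128 sizes the token-generation test  -> reported as tg128, in tokens/s
toolbox run -c llama-rocm-7.2 -- llama-bench \
  -m /path/to/your/model.gguf \
  -p 512 -n 128
```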
- Maybe incorporate some sort of quality metric for the intelligence or quality of the token output.
- Across different models, maybe create a SPA that visualizes the test results?
- K-L divergence for quantized models, comparing base vs quantized (per the paper).
I've checked the actual results I got into the results-published/ folder; check those out. You can see the raw logs too if you need them for some reason(!). I wanted to understand some of the numbers myself, so I formatted them in Excel (the Excel files are also checked into the published results if you want to pivot-table away!).
All graphs show two scenarios (roughly mapped onto llama-bench arguments in the sketch after this list):
- Medium context scenario: 2048-token input prompt processing and 512-token output generation.
- Small context scenario: 512-token input prompt processing and 128-token output generation.
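For reference, here's roughly how those two scenarios map onto llama-bench arguments. This is a sketch based on the bench_arg_sets further down (the script's long-context arg set also adds depth/batch flags like -d and -ub); the model path is a placeholder.

```bash
# Small context: llama-bench defaults for prompt/generation size -> pp512 / tg128
llama-bench -m /path/to/your/model.gguf -fa 1 -ngl 99 -mmp 0

# Medium context: explicit prompt/generation sizes -> pp2048 / tg512
llama-bench -m /path/to/your/model.gguf -fa 1 -ngl 99 -mmp 0 -p 2048 -n 512
```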
This is a vision model, and it's fast. However, it strictly expects alternating turns like "AI-User-AI-User", while agentic layers can sometimes produce "AI-User-User-AI", which breaks it. Logically the fix is simple: concatenate the consecutive 'User' turns. But that requires changes to the agentic runtimes, which can unnecessarily get in the way (a sketch of the idea follows).
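Just to illustrate the turn-merging idea (this isn't something this repo does, and the JSON shape and file name are assumptions): given a chat transcript as an array of `{role, content}` objects, merging consecutive user turns could look like this with jq.

```bash
# Merge consecutive "user" messages in messages.json (hypothetical file name)
# into a single user turn, joining their contents with a newline.
jq 'reduce .[] as $m ([];
      if length > 0 and .[-1].role == "user" and $m.role == "user"
      then .[:-1] + [{role: "user", content: (.[-1].content + "\n" + $m.content)}]
      else . + [$m]
      end)' messages.json
```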
ROCm vs Vulkan: ROCm is consistently faster than Vulkan except for gpt-oss-20b and gpt-oss-120b models.
See MODEL_RECOMMENDATIONS.md for what AI/Claude thinks. Warning: I still have to verify that!
```bash
# Fedora/RHEL
sudo dnf install podman toolbox

# Ubuntu/Debian
sudo apt install podman
```

```bash
git clone https://github.com/SidShetye/llama-thon.git
cd llama-thon

# Edit configuration at top of script:
# - models[]         : paths to your GGUF files
# - runners[]        : llama-bench backends to test
# - bench_arg_sets[] : benchmark parameters

./llama-thon.sh
```

Edit the arrays at the top of `llama-thon.sh`:
```bash
# Models to test
models=(
  "/path/to/your/model.gguf"
)

# Backends (toolbox containers or local builds)
runners=(
  "toolbox run -c llama-rocm-7.2 -- llama-bench"
  "toolbox run -c llama-vulkan-radv -- llama-bench"
)

# Benchmark parameters
bench_arg_sets=(
  "-fa 1 -ngl 99 -mmp 0"                                   # Standard
  "-fa 1 -ngl 99 -mmp 0 -p 2048 -n 512 -d 32768 -ub 2048"  # Long context
)
```

Results are saved to `results/<model-name>/`:
- `results.csv` - Structured metrics (incremental, crash-safe)
- `*.log` - Raw llama-bench output per run
- `rerun_*.csv` - Failed tests for retry
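If you just want to eyeball the CSV in a terminal rather than Excel, something like this works (a convenience sketch; swap in one of your model folder names, and the column layout is whatever llama-thon.sh writes):

```bash
# Pretty-print the comma-separated results as an aligned table.
column -s, -t < results/your-model-name/results.csv | less -S
```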
You don't need to re-run every model with every argument and every runtime. Failures are recorded, so you can re-run just those specific tests with something like:

```bash
./llama-thon.sh results/rerun_20260125_143022.csv
```
MIT


