This file provides comprehensive guidance to AI agents (Claude Code, etc.) when working with this repository.
LLM benchmarking project for testing model inference performance using llama-bench across different GPU backends (ROCm versions, Vulkan) on AMD hardware (targeting Strix Halo). Results are stored as CSV and log files.
```
llama-thon/
├── llama-thon.sh              # Main benchmarking script
├── update_llama_cpp.py        # Helper to fetch/build llama.cpp variants
├── run-rocm-smoke-test.sh     # Quick ROCm sanity check
├── results/                   # Benchmark output (per-model subdirs)
│   ├── <model-name>/
│   │   ├── results.csv        # Aggregated results for this model
│   │   └── *.log              # Individual run logs
│   └── rerun_*.csv            # Auto-generated rerun files (if failures)
├── MODEL_RECOMMENDATIONS.md   # Analysis and model picks
├── AGENTS.md                  # This file (agent instructions)
├── CLAUDE.md                  # Points to AGENTS.md
└── README.md                  # Public documentation
```
Full run (Cartesian product of models x runners x bench_arg_sets):

```shell
./llama-thon.sh
```

Rerun specific failures from a CSV:

```shell
./llama-thon.sh results/rerun_20260125_143022.csv
```

Results are saved to `results/<model-name>/`:

- `results.csv` - Parsed benchmark data (written incrementally, survives interruption)
- `YYYYMMDD_HHMMSS_backend-quant-args.log` - Raw llama-bench output per run
- `rerun_YYYYMMDD_HHMMSS.csv` - Auto-generated rerun file (only if failures occur)
llama-thon.sh is organized into sections:

- CONFIGURATION (top) - Arrays to modify:
  - `models[]` - GGUF model paths to benchmark
  - `runners[]` - llama-bench invocations (toolbox or local)
  - `bench_arg_sets[]` - Different benchmark parameter sets
- HELPER FUNCTIONS - `run_in_dir`, `sanitize_for_filename`, GTT memory monitoring
- SYSTEM INFO FUNCTIONS - OS, kernel, firmware detection
- LOG PARSING FUNCTIONS - Extract build version, pp/tg values from llama-bench output
- INFERENCE FUNCTIONS - Derive backend, build type, model name, quantization from paths
- CSV FUNCTIONS - Write/append with immediate `sync` for durability
- BENCHMARK EXECUTION - Core function `run_single_benchmark` executes one (model, runner, bench_args) tuple:
  - Monitors GTT memory usage during the benchmark
  - Captures wall clock duration
  - Writes results to CSV immediately (crash-safe)
  - Logs failures to the rerun CSV automatically
- MAIN EXECUTION - Two modes:
  - Full run: triple loop (models x runners x bench_arg_sets)
  - Rerun mode: read tuples from a CSV file and execute each
  - Displays failure count and rerun command at the end
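A minimal structural sketch of full-run mode, assuming the `models`, `runners`, and `bench_arg_sets` arrays described above (illustrative values only; the real `run_single_benchmark` additionally handles GTT monitoring, durations, CSV writes, and failure logging):

```shell
# Hypothetical skeleton of the full-run mode; not the script's exact code.
# Illustrative values -- real configuration lives at the top of llama-thon.sh.
models=("/tmp/example.gguf")
runners=("echo llama-bench")            # 'echo' stands in for a real runner here
bench_arg_sets=("-fa 1 -ngl 99 -mmp 0")

run_single_benchmark() {
    local model=$1 runner=$2 bench_args=$3
    # runner and bench_args are intentionally unquoted so they word-split
    $runner -m "$model" $bench_args
}

for model in "${models[@]}"; do
    for runner in "${runners[@]}"; do
        for bench_args in "${bench_arg_sets[@]}"; do
            run_single_benchmark "$model" "$runner" "$bench_args"
        done
    done
done
```

With the stand-in `echo` runner above, the loop prints the command line that would have been executed, which is a convenient dry-run trick.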
Runners: llama-bench via toolbox containers (`toolbox run -c <container> -- llama-bench`) or direct paths. Examples:

```shell
"toolbox run -c llama-rocm-6.4.4 -- llama-bench"
"toolbox run -c llama-cpp-rocm-7.250122 -- /usr/bin/llama-bench"
"/home/sid/Projects/ai/llama.cpp-b7813-vulkan-prebuilt/llama-bench"
```

Models: GGUF files; model name and quantization are inferred from HuggingFace-style paths. Example:

```
/mnt/data/projects/ai/models/hub/models--unsloth--GLM-4.7-Flash-GGUF/.../GLM-4.7-Flash-UD-Q4_K_XL.gguf
```

Model name: "GLM-4.7-Flash (Unsloth)", Quantization: "Q4_K_XL"
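As an illustration, extracting the quantization tag from such a filename could look like the following bash sketch (`infer_quant` is a hypothetical name; the script's actual inference functions may differ):

```shell
# Hypothetical helper: pull a trailing quantization tag (Q4_K_XL, IQ3_XXS,
# Q6_K, Q8_0, ...) off a GGUF filename. Not the script's implementation.
infer_quant() {
    local base=${1##*/}      # strip directories
    base=${base%.gguf}       # strip extension
    if [[ $base =~ (I?Q[0-9]_[A-Z0-9_]+|I?Q[0-9])$ ]]; then
        echo "${BASH_REMATCH[1]}"
    else
        echo "unknown"
    fi
}

infer_quant "/mnt/models/GLM-4.7-Flash-UD-Q4_K_XL.gguf"   # prints Q4_K_XL
```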
Backends: ROCm (6.4.4, 7.1.1, 7.2, nightlies) and Vulkan (radv, amdvlk)
bench_arg_sets: Each string is a complete set of llama-bench args:

```shell
"-fa 1 -ngl 99 -mmp 0"                                   # Standard (pp512, tg128)
"-fa 1 -ngl 99 -mmp 0 -p 2048 -n 512 -d 32768 -ub 2048"  # Long context
```

CSV columns:

```
OS,Kernel,linux-firmware,llama-cpp,built,backend,model,quantization,pp_size,pp,tg_size,tg,gtt_peak_mb,duration_sec,params,notes
```
| Column | Description | Example |
|---|---|---|
| OS | Operating system | Fedora Linux 43 |
| Kernel | Kernel version | 6.18.3-200.fc43.x86_64 |
| linux-firmware | Firmware package version | 20260110-1.fc43.noarch |
| llama-cpp | llama.cpp build number | b7823 |
| built | Build type | toolbox, prebuilt, local |
| backend | GPU backend | rocm-7.2, vulkan-radv |
| model | Model name (inferred) | GLM-4.7-Flash (Unsloth) |
| quantization | Quantization level | Q4_K_M, Q6_K, IQ3_XXS |
| pp_size | Prompt processing batch size | 512, 2048 |
| pp | Prompt processing tokens/sec | 683.42 |
| tg_size | Token generation batch size | 128, 512 |
| tg | Token generation tokens/sec | 45.21 |
| gtt_peak_mb | Peak GTT memory increase (MB) | 12800 |
| duration_sec | Wall clock time (seconds) | 45.2 |
| params | llama-bench arguments used | -fa 1 -ngl 99 -mmp 0 |
| notes | Failures, warnings, log path | |
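The `gtt_peak_mb` column comes from sampling GTT usage while the benchmark runs. One way to read the current value is via amdgpu's standard sysfs node, as in this sketch (the script's actual monitor may differ):

```shell
# Read current GTT usage in MB from amdgpu's sysfs node, or 0 if absent.
# mem_info_gtt_used is the standard amdgpu sysfs attribute; this is a
# sketch, not llama-thon.sh's exact monitoring code.
gtt_used_mb() {
    local f
    for f in /sys/class/drm/card*/device/mem_info_gtt_used; do
        if [ -r "$f" ]; then
            echo $(( $(cat "$f") / 1024 / 1024 ))
            return
        fi
    done
    echo 0
}
```

A peak value can then be tracked by polling this function in a background loop while llama-bench runs and keeping the maximum seen.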
Auto-generated when failures occur, or create manually:

```
model,runner,bench_args
/path/to/model.gguf,toolbox run -c llama-rocm-7.1.1 -- llama-bench,-fa 1 -ngl 99 -mmp 0
```

update_llama_cpp.py is a helper script that fetches and builds llama.cpp variants for different acceleration backends.
- Prebuilt Vulkan
  - Source: https://github.com/ggml-org/llama.cpp/releases/latest
  - Asset: Linux Vulkan x64 archive
  - Install dir: `~/Apps/llama.cpp-<tag>-vulkan-prebuilt`
- Prebuilt ROCm
  - Source: https://github.com/lemonade-sdk/llamacpp-rocm/releases/latest
  - Asset: Linux ROCm gfx1151 x64 archive
  - Install dir: `~/Apps/llama.cpp-<tag>-rocm-prebuilt`
  - Version tag resolved from the release notes' llama.cpp commit hash
- Build from local source (Vulkan)
  - Source path: `/home/sid/Projects/ai/llama.cpp`
  - Install dir: `~/Apps/llama.cpp-<count>-vulkan`
  - Requires Vulkan SDK headers, loader, and `glslc`
- Build via toolbox Dockerfile (Vulkan)
  - Dockerfile: local, or fallback from GitHub
  - Container tag: `llama.cpp-<count>-vulkan`
- Build via toolbox Dockerfile (ROCm)
  - Dockerfile: local, or fallback from GitHub
  - Container tag: `llama.cpp-<count>-rocm-7nightlies`
- Update toolboxes
  - Repo: `/home/sid/Projects/ai/amd-strix-halo-toolboxes`
  - Runs `git pull` then `./refresh-toolboxes all`
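For reference, the local-source Vulkan build presumably amounts to a standard llama.cpp CMake build with the Vulkan backend enabled. A hedged sketch, using the source path above (update_llama_cpp.py may pass additional flags):

```shell
# Assumed build commands, not necessarily the helper's exact invocation.
cd /home/sid/Projects/ai/llama.cpp
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
# Binaries such as llama-bench end up under build/bin/
```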
```shell
# Prebuilt (latest release assets)
./update_llama_cpp.py vulkan-prebuilt
./update_llama_cpp.py rocm-prebuilt

# Build from local source
./update_llama_cpp.py vulkan-build

# Build via toolbox Dockerfiles
./update_llama_cpp.py vulkan-toolbox
./update_llama_cpp.py rocm-toolbox

# Update toolboxes
./update_llama_cpp.py update-toolboxes

# Run all steps
./update_llama_cpp.py all

# Overwrite existing install dirs
./update_llama_cpp.py --force vulkan-prebuilt
```

- Install base: `~/Apps`
- Container runtime: `podman` (fallback: `docker`); override with `LLAMA_CPP_CONTAINER_RUNTIME`
- Toolbox builds use the local context when available; remote Dockerfiles are the fallback
- Keep it simple: easy to read yet modular
- Crash-safe: results written incrementally with `sync`
- Reproducible: all system info captured in the CSV
- Resumable: automatic rerun CSV generation on failures