This project provides tools to benchmark Large Language Models (LLMs) using Ollama, measuring performance metrics such as inference time, tokens per second, CPU usage, memory consumption, and energy efficiency. It was created to benchmark models on edge devices (Raspberry Pi) in the context of the development of an LLM solution for the RSC Project: rsc.ee. This work is part of the master thesis: Benchmarking and Deploying Local Language Models for Social Educational Robots using Edge Devices.
You can download this code using this command:
```bash
git clone https://github.com/RobotStudyCompanion/Benchmarking_LLM.git
```
- Python 3.8 or higher
- Ollama (you can download it from: https://ollama.com/download)
You will also need the following Python packages (the versions listed are the ones used during the thesis):
- ollama 0.1.0
- psutil 5.9.0
- matplotlib 3.7.0
- pandas 2.0.0
- openai 1.0.0
- deepeval 0.21.0

You can install everything using this command:
```bash
pip install ollama psutil matplotlib pandas openai deepeval
```
Open the Excel_models file and add the models you want to benchmark to the table, following the example (you can find the ollama_name of each model in the library section of the Ollama website). Save this file as .csv into the folder.
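As a quick sanity check, a minimal sketch like the one below can verify that every model listed in the CSV is actually installed in Ollama. It assumes the column is named ollama_name; adjust it to the actual headers of the provided file.

```python
# Sanity check: which models from Excel_models.csv are already installed in Ollama?
# The column name "ollama_name" is an assumption; match it to the real CSV headers.
import pandas as pd
import ollama

wanted = pd.read_csv("Excel_models.csv")["ollama_name"].dropna().tolist()

# ollama.list() returns the locally installed models; the tag field is called
# "name" in older clients (0.1.x) and "model" in newer ones.
installed = {m.get("name") or m.get("model") for m in ollama.list()["models"]}

for tag in wanted:
    status = "installed" if tag in installed else f"missing (run: ollama pull {tag})"
    print(f"{tag}: {status}")
```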
A set of questions is used for the benchmark; you can add, delete, or modify any of them for your purpose. They are stored in questions.txt (follow the existing format).
If you want to test the teaching effectiveness of your models, you need to provide an OpenAI API key. You can get your API key from: https://platform.openai.com/api-keys
Windows (PowerShell):
```powershell
$env:OPENAI_API_KEY = "your-api-key-here"
```
Windows (CMD):
```cmd
set OPENAI_API_KEY=your-api-key-here
```
macOS/Linux:
```bash
export OPENAI_API_KEY='your-api-key-here'
```
Keep in mind that, depending on the device used, the benchmark can take a long time. There is no checkpoint saving during the process, so I strongly advise benchmarking only a few models at a time.
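Before launching a long run, you can quickly confirm that the key is visible to Python (assuming the scripts read it from the OPENAI_API_KEY environment variable set above):

```python
import os

# The key is expected in the OPENAI_API_KEY environment variable.
key = os.environ.get("OPENAI_API_KEY")
if key:
    print("OPENAI_API_KEY is set (ends with ...%s)" % key[-4:])
else:
    print("OPENAI_API_KEY is not set; the teaching-effectiveness rating cannot run.")
```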
You can now run benchmarking.py
What it does:
- Auto-detects models installed in Ollama that match Excel_models.csv
- Loads questions from questions.txt
- Runs each question through each model
- Measures performance metrics:
  - Tokens per second
  - Inference time
  - CPU usage
  - Memory consumption
  - Time to first token
  - Power consumption (on Raspberry Pi)
  - Energy efficiency (tokens per joule)
- Saves results to:
  - benchmark.csv - Summary results
  - results/benchmark_all_models_YYYYMMDD_HHMMSS.json - Detailed JSON
  - results/benchmark_all_models_YYYYMMDD_HHMMSS.csv - Detailed CSV
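For intuition, here is a minimal sketch (not benchmarking.py itself) of how a few of these metrics can be derived from a single Ollama call with the ollama and psutil packages; eval_count and eval_duration are fields of the Ollama response, reported in tokens and nanoseconds respectively.

```python
# Minimal sketch: time one prompt against one model and derive a few of the
# metrics listed above. This is an illustration, not benchmarking.py itself.
import time
import ollama
import psutil

model = "llama3.2:1b"          # example tag
question = "What is a robot?"  # example question

psutil.cpu_percent(interval=None)          # prime the CPU counter
start = time.time()
resp = ollama.chat(model=model, messages=[{"role": "user", "content": question}])
wall_time = time.time() - start

cpu = psutil.cpu_percent(interval=None)    # average CPU % since the call started
mem = psutil.virtual_memory().percent

# Ollama reports generation duration in nanoseconds and the token count produced.
gen_tokens = resp["eval_count"]
gen_seconds = resp["eval_duration"] / 1e9
print(f"tokens/s          : {gen_tokens / gen_seconds:.2f}")
print(f"inference time (s): {wall_time:.2f}")
print(f"CPU usage (%)     : {cpu:.1f}")
print(f"memory usage (%)  : {mem:.1f}")
# Power and tokens per joule require a Raspberry Pi power reading and are not shown here.
```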
You can run the MMLU benchmark to test the knowledge of each of your models. I recommend running it on a powerful machine, as going through every question requires a lot of computational power.
```bash
python MMLU.py
```
What it does:
- Tests models on 6 MMLU task categories:
  - Formal Logic
  - Global Facts
  - College Computer Science
  - College Mathematics
  - Marketing
  - High School Macroeconomics
- Uses 3-shot learning (provides 3 examples before each question; see the sketch after this list)
- Evaluates model accuracy on multiple-choice questions
- Displays the first 3 prompts and responses for debugging
- Saves progressive checkpoints after each task
- Saves results to:
  - MMLU/{model-name}_MMLU.json - Summary scores by task
  - MMLU/{model-name}_MMLU.csv - CSV format results
  - MMLU/checkpoints/ - Progressive checkpoints per task
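For reference, here is a minimal sketch of the 3-shot prompting and multiple-choice scoring idea, assuming questions are already available as (question, choices, answer) tuples; MMLU.py itself handles loading the real task data and checkpointing.

```python
# Sketch of 3-shot multiple-choice evaluation. The data layout is assumed;
# MMLU.py loads the real task data.
import ollama

LETTERS = "ABCD"

def format_question(question, choices, answer=None):
    block = question + "\n" + "\n".join(f"{l}. {c}" for l, c in zip(LETTERS, choices))
    block += "\nAnswer:"
    if answer is not None:            # shots include the correct letter
        block += f" {answer}\n\n"
    return block

def score(model, shots, questions):
    prefix = "".join(format_question(q, c, a) for q, c, a in shots)   # the 3 examples
    correct = 0
    for question, choices, answer in questions:
        prompt = prefix + format_question(question, choices)
        reply = ollama.chat(model=model,
                            messages=[{"role": "user", "content": prompt}])
        text = reply["message"]["content"].strip().upper()
        predicted = next((ch for ch in text if ch in LETTERS), None)  # first A-D letter
        correct += (predicted == answer)
    return correct / len(questions)
```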
Customizing MMLU tests: By default, the script runs a specific model:
```python
run_mmlu_single_model('granite4:1b-h')
```
To test a different model, edit this line, or modify the script to test all your models:
```python
# Option 1: Test a single model
run_mmlu_single_model('llama3.2:1b')

# Option 2: Test all available models
run_mmlu_for_all_models()
```
After running the benchmark on all the models you want to test, you can run analyse_results.py.
What it does:
- Loads all JSON results from the ./results/ directory
- Calculates summary statistics for each model
- Generates visualization graphs in ./analysis_graphs/:
  - tokens_per_second.png - Performance comparison
  - energy_efficiency.png - Energy metrics
  - inference_time_distribution.png - Timing distributions
  - response_vs_performance.png - Response length analysis
  - resource_usage.png - CPU, memory, temperature
  - model_radar_chart.png - Multi-dimensional comparison
- Exports summary to analysis_summary.csv
- (Optional) Rates teaching effectiveness using the OpenAI API:
  - Generates teaching_effectiveness_ratings.json
  - Creates teaching_effectiveness_scores.png and performance_vs_teaching.png
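As a rough illustration of the kind of aggregation analyse_results.py performs, here is a minimal sketch that averages tokens per second per model from the JSON results. The field names ("model", "tokens_per_second") and the list-of-records layout are assumptions, so check them against the files written by benchmarking.py.

```python
# Sketch: average tokens per second per model across all JSON result files.
# "model" and "tokens_per_second" are assumed field names, and each file is
# assumed to hold a list of per-question records; check against benchmarking.py.
import glob
import json
import os

import pandas as pd
import matplotlib.pyplot as plt

records = []
for path in glob.glob("results/*.json"):
    with open(path) as f:
        records.extend(json.load(f))

df = pd.DataFrame(records)
summary = df.groupby("model")["tokens_per_second"].mean().sort_values()

os.makedirs("analysis_graphs", exist_ok=True)
summary.plot(kind="barh", title="Average tokens per second per model")
plt.tight_layout()
plt.savefig("analysis_graphs/tokens_per_second_sketch.png")
```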
If you have MMLU benchmark results, you can visualize them using visualize_mmlu.py:
What it does:
- Loads MMLU results from the ./MMLU/ directory (JSON files ending with _MMLU.json)
- Categorizes models into small (<2B) and big (≥2B)
- Generates visualizations:
  - Overall score comparisons (all models, small models, big models)
  - Task-specific performance graphs
  - Individual model radar charts
- Saves graphs to the ./MMLU/ directory
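For reference, here is a minimal sketch of the small/big split and score averaging, assuming each {model}_MMLU.json maps task names to numeric scores and that the parameter count can be parsed from the model tag (e.g. '1b', '3b'); the actual logic in visualize_mmlu.py may differ.

```python
# Sketch: group models by size parsed from their Ollama tag and average MMLU scores.
# Assumes each {model}_MMLU.json maps task -> numeric score; visualize_mmlu.py's
# actual file layout and grouping logic may differ.
import glob
import json
import os
import re

def param_count_billions(model_name):
    """Parse e.g. '1b' or '3.8b' out of a tag like 'llama3.2:1b'; None if absent."""
    match = re.search(r"(\d+(?:\.\d+)?)b", model_name.lower())
    return float(match.group(1)) if match else None

for path in sorted(glob.glob("MMLU/*_MMLU.json")):
    model = os.path.basename(path)[: -len("_MMLU.json")]
    with open(path) as f:
        scores = json.load(f)
    average = sum(scores.values()) / len(scores)
    size = param_count_billions(model)
    group = "small (<2B)" if size is not None and size < 2 else "big (>=2B)"
    print(f"{model:<30} {group:<12} average score: {average:.3f}")
```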