NeMo Evaluator is an open-source platform for robust, reproducible, and scalable evaluation of Large Language Models. It enables you to run hundreds of benchmarks across popular evaluation harnesses against any OpenAI-compatible model API. Evaluations execute in open-source Docker containers for auditable and trustworthy results. The platform's containerized architecture allows for the rapid integration of public benchmarks and private datasets.
Tutorial | Supported Benchmarks | Configuration Examples | Contribution Guide
NeMo Evaluator is built on four core principles to provide a reliable and versatile evaluation experience.
- Reproducibility by Default -- All configurations, random seeds, and software provenance are captured automatically for auditable and repeatable evaluations.
- Scale Anywhere -- Run evaluations from a local machine to a Slurm cluster or cloud-native backends like Lepton AI without changing your workflow.
- State-of-the-Art Benchmarking -- Access a comprehensive suite of over 100 benchmarks from 18 popular open-source evaluation harnesses. See the full list of Supported benchmarks and evaluation harnesses.
- Extensible and Customizable -- Integrate new evaluation harnesses, add custom benchmarks with proprietary data, and define custom result exporters for existing MLOps tooling.
The platform consists of two main components:
- nemo-evaluator (the evaluation core engine): A Python library that manages the interaction between an evaluation harness and the model being tested.
- nemo-evaluator-launcher (the CLI and orchestration): The primary user interface and orchestration layer. It handles configuration, selects the execution environment, and launches the appropriate container to run the evaluation.

Most users only need to interact with nemo-evaluator-launcher, which serves as a universal gateway to the different benchmarks and harnesses. It is, however, possible to interact directly with nemo-evaluator by following this guide.
graph TD
A[User] --> B{NeMo Evaluator Launcher};
B -- " " --> C{Local};
B -- " " --> D{Slurm};
B -- " " --> E{Lepton};
subgraph Execution Environment
C -- "Launches Container" --> F[Evaluation Container];
D -- "Launches Container" --> F;
E -- "Launches Container" --> F;
end
subgraph F[Evaluation Container]
G[Nemo Evaluator] -- " Runs " --> H[Evaluation Harness]
end
H -- "Sends Requests To" --> I[🤖 Model Endpoint];
I -- "Returns Responses" --> H;
Get your first evaluation result in minutes. This guide uses your local machine to run a small benchmark against an OpenAI API-compatible endpoint.
The launcher is the only package required to get started.
pip install nemo-evaluator-launcher
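If you prefer to keep the installation isolated from other Python packages, a minimal sketch using a standard virtual environment (the .venv name is arbitrary):

```bash
# Create and activate an isolated environment, then install the launcher.
python3 -m venv .venv
source .venv/bin/activate
pip install nemo-evaluator-launcher
```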
NeMo Evaluator works with any model that exposes an OpenAI-compatible endpoint. For this quickstart, we will use a hosted endpoint that follows the OpenAI API specification (for example, one from build.nvidia.com).
What is an OpenAI-compatible endpoint? A server that exposes /v1/chat/completions and /v1/completions endpoints, matching the OpenAI API specification.
Options for model endpoints:
- Hosted endpoints (fastest): Use ready-to-use hosted models from providers like build.nvidia.com that expose OpenAI-compatible APIs with no hosting required.
- Self-hosted options: Host your own models using tools like NVIDIA NIM, vLLM, or TensorRT-LLM for full control over your evaluation environment.
For detailed setup instructions including self-hosted configurations, see the tutorial guide.
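To make the contract concrete, this is roughly what a chat completion request against such an endpoint looks like; the base URL, model name, and API key below are placeholders, not values from this repository:

```bash
# All values below are placeholders; substitute your endpoint, model, and key.
curl https://<your-endpoint>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
        "model": "<model-name>",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```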
Getting an NGC API Key for build.nvidia.com: To use out-of-the-box build.nvidia.com APIs, you need an API key:
- Register an account at build.nvidia.com
- In the Setup menu under Keys/Secrets, generate an API key
- Set the environment variable by executing:
export NGC_API_KEY=<YOUR_API_KEY>
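As a quick sanity check that the key is accepted, you can query the catalog's OpenAI-compatible API. This sketch assumes the commonly documented base URL https://integrate.api.nvidia.com/v1; confirm it on build.nvidia.com before relying on it:

```bash
# Assumed base URL for build.nvidia.com's OpenAI-compatible API; verify before use.
curl -s https://integrate.api.nvidia.com/v1/models \
  -H "Authorization: Bearer $NGC_API_KEY"
```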
Run a small evaluation on your local machine. The launcher automatically pulls the correct container and executes the benchmark. The list of benchmarks to run is configured directly in the YAML file.
Configuration Examples: Explore ready-to-use configuration files in packages/nemo-evaluator-launcher/examples/
for local, Lepton, and Slurm deployments with various model hosting options (vLLM, NIM, hosted endpoints).
Once you have the example configuration file (either by cloning this repository or by downloading, for example, the local_nvidia_nemotron_nano_9b_v2.yaml file directly), you can run the following command:
nemo-evaluator-launcher run --config-dir packages/nemo-evaluator-launcher/examples --config-name local_nvidia_nemotron_nano_9b_v2 --override execution.output_dir=<YOUR_OUTPUT_LOCAL_DIR>
Running this command prints a job_id, which you can use to track the job; the results and all logs will be available in your <YOUR_OUTPUT_LOCAL_DIR>.
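If you downloaded only the YAML file instead of cloning the repository, the same command works with whatever directory contains it; a sketch assuming the file sits in the current directory and results are written to ./results:

```bash
# Assumes local_nvidia_nemotron_nano_9b_v2.yaml is in the current directory.
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name local_nvidia_nemotron_nano_9b_v2 \
  --override execution.output_dir=./results
```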
Results, logs, and run configurations are saved locally. Inspect the status of the evaluation job by using the corresponding job id:
nemo-evaluator-launcher status <job_id_or_invocation_id>
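For example, to poll a running job every 30 seconds with a standard shell utility (the id below is a placeholder for the one printed by the run command):

```bash
# <job_id_or_invocation_id> is a placeholder; use the id printed by the run command.
watch -n 30 nemo-evaluator-launcher status <job_id_or_invocation_id>
```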
- List all supported benchmarks:
nemo-evaluator-launcher ls tasks
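The task list is long, so it can help to filter it with standard shell tools, for example:

```bash
# Show only tasks whose names mention mmlu (case-insensitive).
nemo-evaluator-launcher ls tasks | grep -i mmlu
```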
- Explore the Supported Benchmarks to see all available harnesses and benchmarks.
- Scale up your evaluations using the Slurm Executor or Lepton Executor.
- Learn to evaluate self-hosted models in the extended Tutorial guide for nemo-evaluator-launcher.
- Customize your workflow with Custom Exporters or by evaluating with proprietary data.
NeMo Evaluator Launcher provides pre-built evaluation containers for different evaluation harnesses through the NVIDIA NGC catalog. Each harness supports a variety of benchmarks, which can then be called via nemo-evaluator. This table provides a list of benchmark names per harness. A more detailed list of task names can be found in the list of NGC containers.
| Container | Description | NGC Catalog | Latest Tag | Supported benchmarks |
|---|---|---|---|---|
| agentic_eval | Agentic AI evaluation framework | Link | 25.08.1 | Agentic Eval Topic Adherence, Agentic Eval Tool Call, Agentic Eval Goal and Answer Accuracy |
| bfcl | Function calling | Link | 25.08.1 | BFCL v2 and v3 |
| bigcode-evaluation-harness | Code generation evaluation | Link | 25.08.1 | MBPP, MBPP-Plus, HumanEval, HumanEval+, Multiple (cpp, cs, d, go, java, jl, js, lua, php, pl, py, r, rb, rkt, rs, scala, sh, swift, ts) |
| garak | Safety and vulnerability testing | Link | 25.08.1 | Garak |
| helm | Holistic evaluation framework | Link | 25.08.1 | MedHelm |
| hle | Academic knowledge and problem solving | Link | 25.08.1 | HLE |
| ifbench | Instruction following | Link | 25.08.1 | IFBench |
| livecodebench | Coding | Link | 25.08.1 | LiveCodeBench (v1-v6, 0724_0125, 0824_0225) |
| lm-evaluation-harness | Language model benchmarks | Link | 25.08.1 | ARC Challenge (also multilingual), GSM8K, HumanEval, HumanEval+, MBPP, MINERVA MMMLU-Pro, RACE, TruthfulQA, AGIEval, BBH, BBQ, CSQA, Frames, Global MMMLU, GPQA-D, HellaSwag (also multilingual), IFEval, MGSM, MMMLU, MMMLU-Pro, MMMLU-ProX (de, es, fr, it, ja), MMLU-Redux, MUSR, OpenbookQA, Piqa, Social IQa, TruthfulQA, WikiLingua, WinoGrande |
| mmath | Multilingual math reasoning | Link | 25.08.1 | EN, ZH, AR, ES, FR, JA, KO, PT, TH, VI |
| mtbench | Multi-turn conversation evaluation | Link | 25.08.1 | MT-Bench |
| rag_retriever_eval | RAG system evaluation | Link | 25.08.1 | RAG, Retriever |
| safety-harness | Safety and bias evaluation | Link | 25.08.1 | Aegis v2, BBQ, WildGuard |
| scicode | Coding for scientific research | Link | 25.08.1 | SciCode |
| simple-evals | Common evaluation tasks | Link | 25.08.1 | GPQA-D, MATH-500, AIME 24 & 25, HumanEval, MGSM, MMMLU, MMMLU-Pro, MMMLU-lite (AR, BN, DE, EN, ES, FR, HI, ID, IT, JA, KO, MY, PT, SW, YO, ZH), SimpleQA |
| tooltalk | Tool usage evaluation | Link | 25.08.1 | ToolTalk |
| vlmevalkit | Vision-language model evaluation | Link | 25.08.1 | AI2D, ChartQA, OCRBench, SlideVQA |
We welcome community contributions. Please see our Contribution Guide for instructions on submitting pull requests, reporting issues, and suggesting features.