NeMo Evaluator is an open-source platform for robust, reproducible, and scalable evaluation of Large Language Models. It enables you to run hundreds of benchmarks across popular evaluation harnesses against any OpenAI-compatible model API. Evaluations execute in open-source Docker containers for auditable and trustworthy results. The platform's containerized architecture allows for the rapid integration of public benchmarks and private datasets.
Tutorial | Supported Benchmarks | Configuration Examples | Contribution Guide
NeMo Evaluator is built on four core principles to provide a reliable and versatile evaluation experience.
- Reproducibility by Default -- All configurations, random seeds, and software provenance are captured automatically for auditable and repeatable evaluations.
- Scale Anywhere -- Run evaluations from a local machine to a Slurm cluster or cloud-native backends like Lepton AI without changing your workflow.
- State-of-the-Art Benchmarking -- Access a comprehensive suite of over 100 benchmarks from 18 popular open-source evaluation harnesses. See the full list of Supported benchmarks and evaluation harnesses.
- Extensible and Customizable -- Integrate new evaluation harnesses, add custom benchmarks with proprietary data, and define custom result exporters for existing MLOps tooling.
The platform consists of two main components:
- nemo-evaluator (the evaluation core engine): A Python library that manages the interaction between an evaluation harness and the model being tested.
- nemo-evaluator-launcher (the CLI and orchestration): The primary user interface and orchestration layer. It handles configuration, selects the execution environment, and launches the appropriate container to run the evaluation.

Most users only need to interact with nemo-evaluator-launcher, which serves as a universal gateway to the different benchmarks and harnesses. It is, however, possible to interact directly with nemo-evaluator by following this guide.
graph TD
A[User] --> B{NeMo Evaluator Launcher};
B -- " " --> C{Local};
B -- " " --> D{Slurm};
B -- " " --> E{Lepton};
subgraph Execution Environment
C -- "Launches Container" --> F[Evaluation Container];
D -- "Launches Container" --> F;
E -- "Launches Container" --> F;
end
subgraph F[Evaluation Container]
G[Nemo Evaluator] -- " Runs " --> H[Evaluation Harness]
end
H -- "Sends Requests To" --> I[🤖 Model Endpoint];
I -- "Returns Responses" --> H;
Get your first evaluation result in minutes. This guide uses your local machine to run a small benchmark against an OpenAI API-compatible endpoint.
The launcher is the only package required to get started.
pip install nemo-evaluator-launcher
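If you prefer to keep the installation isolated from other Python packages, a minimal sketch using a standard virtual environment (the .venv name is arbitrary):

```bash
# Create and activate an isolated environment, then install the launcher.
python3 -m venv .venv
source .venv/bin/activate
pip install nemo-evaluator-launcher
```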
NeMo Evaluator works with any model that exposes an OpenAI-compatible endpoint. For this quickstart, we will use a hosted endpoint that follows the OpenAI API specification (for example, one from build.nvidia.com).
What is an OpenAI-compatible endpoint? A server that exposes /v1/chat/completions and /v1/completions endpoints, matching the OpenAI API specification.
Options for model endpoints:
- Hosted endpoints (fastest): Use ready-to-use hosted models from providers like build.nvidia.com that expose OpenAI-compatible APIs with no hosting required.
- Self-hosted options: Host your own models using tools like NVIDIA NIM, vLLM, or TensorRT-LLM for full control over your evaluation environment.
For detailed setup instructions including self-hosted configurations, see the tutorial guide.
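To make the contract concrete, this is roughly what a chat completion request against such an endpoint looks like; the base URL, model name, and API key below are placeholders, not values from this repository:

```bash
# All values below are placeholders; substitute your endpoint, model, and key.
curl https://<your-endpoint>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
        "model": "<model-name>",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```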
Getting an NGC API Key for build.nvidia.com: To use out-of-the-box build.nvidia.com APIs, you need an API key:
- Register an account at build.nvidia.com
- In the Setup menu under Keys/Secrets, generate an API key
- Set the environment variable by executing:
export NGC_API_KEY=<YOUR_API_KEY>
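As a quick sanity check that the key is accepted, you can query the catalog's OpenAI-compatible API. This sketch assumes the commonly documented base URL https://integrate.api.nvidia.com/v1; confirm it on build.nvidia.com before relying on it:

```bash
# Assumed base URL for build.nvidia.com's OpenAI-compatible API; verify before use.
curl -s https://integrate.api.nvidia.com/v1/models \
  -H "Authorization: Bearer $NGC_API_KEY"
```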
Run a small evaluation on your local machine. The launcher automatically pulls the correct container and executes the benchmark. The list of benchmarks to run is configured directly in the YAML file.
Configuration Examples: Explore ready-to-use configuration files in packages/nemo-evaluator-launcher/examples/
for local, Lepton, and Slurm deployments with various model hosting options (vLLM, NIM, hosted endpoints).
Once you have the example configuration file (either by cloning this repository or by downloading, for example, the local_nvidia_nemotron_nano_9b_v2.yaml file directly), you can run the following command:
nemo-evaluator-launcher run --config-dir packages/nemo-evaluator-launcher/examples --config-name local_nvidia_nemotron_nano_9b_v2 --override execution.output_dir=<YOUR_OUTPUT_LOCAL_DIR>
Running this command prints a job_id, which you can use to track the job; the results and all logs will be available in your <YOUR_OUTPUT_LOCAL_DIR>.
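If you downloaded only the YAML file instead of cloning the repository, the same command works with whatever directory contains it; a sketch assuming the file sits in the current directory and results are written to ./results:

```bash
# Assumes local_nvidia_nemotron_nano_9b_v2.yaml is in the current directory.
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name local_nvidia_nemotron_nano_9b_v2 \
  --override execution.output_dir=./results
```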
Results, logs, and run configurations are saved locally. Inspect the status of the evaluation job by using the corresponding job id:
nemo-evaluator-launcher status <job_id_or_invocation_id>
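For example, to poll a running job every 30 seconds with a standard shell utility (the id below is a placeholder for the one printed by the run command):

```bash
# <job_id_or_invocation_id> is a placeholder; use the id printed by the run command.
watch -n 30 nemo-evaluator-launcher status <job_id_or_invocation_id>
```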
- List all supported benchmarks:
nemo-evaluator-launcher ls tasks
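The task list is long, so it can help to filter it with standard shell tools, for example:

```bash
# Show only tasks whose names mention mmlu (case-insensitive).
nemo-evaluator-launcher ls tasks | grep -i mmlu
```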
- Explore the Supported Benchmarks to see all available harnesses and benchmarks.
- Scale up your evaluations using the Slurm Executor or Lepton Executor.
- Learn to evaluate self-hosted models in the extended Tutorial guide for nemo-evaluator-launcher.
- Customize your workflow with Custom Exporters or by evaluating with proprietary data.
NeMo Evaluator Launcher provides pre-built evaluation containers for different evaluation harnesses through the NVIDIA NGC catalog. Each harness supports a variety of benchmarks, which can then be called via nemo-evaluator. This table provides a list of benchmark names per harness. A more detailed list of task names can be found in the list of NGC containers.
| Container | Description | NGC Catalog | Latest Tag | Supported benchmarks |
|---|---|---|---|---|
| agentic_eval | Agentic AI evaluation framework | Link | 25.08.1 | Agentic Eval Topic Adherence, Agentic Eval Tool Call, Agentic Eval Goal and Answer Accuracy |
| bfcl | Function calling | Link | 25.08.1 | BFCL v2 and v3 |
| bigcode-evaluation-harness | Code generation evaluation | Link | 25.08.1 | MBPP, MBPP-Plus, HumanEval, HumanEval+, Multiple (cpp, cs, d, go, java, jl, js, lua, php, pl, py, r, rb, rkt, rs, scala, sh, swift, ts) |
| garak | Safety and vulnerability testing | Link | 25.08.1 | Garak |
| helm | Holistic evaluation framework | Link | 25.08.1 | MedHelm |
| hle | Academic knowledge and problem solving | Link | 25.08.1 | HLE |
| ifbench | Instruction following | Link | 25.08.1 | IFBench |
| livecodebench | Coding | Link | 25.08.1 | LiveCodeBench (v1-v6, 0724_0125, 0824_0225) |
| lm-evaluation-harness | Language model benchmarks | Link | 25.08.1 | ARC Challenge (also multilingual), GSM8K, HumanEval, HumanEval+, MBPP, MINERVA MMMLU-Pro, RACE, TruthfulQA, AGIEval, BBH, BBQ, CSQA, Frames, Global MMMLU, GPQA-D, HellaSwag (also multilingual), IFEval, MGSM, MMMLU, MMMLU-Pro, MMMLU-ProX (de, es, fr, it, ja), MMLU-Redux, MUSR, OpenbookQA, Piqa, Social IQa, TruthfulQA, WikiLingua, WinoGrande |
| mmath | Multilingual math reasoning | Link | 25.08.1 | EN, ZH, AR, ES, FR, JA, KO, PT, TH, VI |
| mtbench | Multi-turn conversation evaluation | Link | 25.08.1 | MT-Bench |
| rag_retriever_eval | RAG system evaluation | Link | 25.08.1 | RAG, Retriever |
| safety-harness | Safety and bias evaluation | Link | 25.08.1 | Aegis v2, BBQ, WildGuard |
| scicode | Coding for scientific research | Link | 25.08.1 | SciCode |
| simple-evals | Common evaluation tasks | Link | 25.08.1 | GPQA-D, MATH-500, AIME 24 & 25, HumanEval, MGSM, MMMLU, MMMLU-Pro, MMMLU-lite (AR, BN, DE, EN, ES, FR, HI, ID, IT, JA, KO, MY, PT, SW, YO, ZH), SimpleQA |
| tooltalk | Tool usage evaluation | Link | 25.08.1 | ToolTalk |
| vlmevalkit | Vision-language model evaluation | Link | 25.08.1 | AI2D, ChartQA, OCRBench, SlideVQA |
We welcome community contributions. Please see our Contribution Guide for instructions on submitting pull requests, reporting issues, and suggesting features.