trutheval is a modular framework for generating datasets with graded factual perturbations. These datasets are
designed to test and validate the effectiveness of factuality evaluation pipelines.
It provides:
- 📚 Tools to generate Q&A datasets with controlled factual perturbations (A0–A4 levels)
- 🧪 Evaluation scripts and interfaces for comparing human vs. model-based assessments (see UI)
- 📊 Data and metrics to support validation of factual scoring algorithms (see datasets)
Core components:
- truthbench: Generates factual QA datasets with controlled perturbations. It's intended to support the development, tuning, and validation of factuality metrics and hallucination detection systems.
- truthscore: A fast, open-weight alternative to RAGAS that uses NLI models for factual scoring (cheaper, faster, and comparably effective). It's intended to evaluate LLMs directly.
Our framework includes extensive experimental validation: we generate datasets with graded factual perturbations and benchmark diverse evaluation techniques, including open-weight LLMs and specialized pipelines, and observe a strong correlation between perturbation severity and factuality scores.
This work is described in detail in an accepted paper at the EvalLLM 2025 workshop (CORIA-TALN).
Generate datasets for validating truthness scores like this...
pip install truthbench[openai]
python -m spacy download en_core_web_sm
export OPENAI_API_KEY="your_openai_api_key_here"
truthbench --input-file path/to/input.json --output-dir path/to/output_dir
Example: Who did the United States win its independence from?
A0 (Reference)
Independence Day, commonly known as the Fourth of July or July Fourth, is a federal holiday in the United States
celebrating the adoption of the Declaration of Independence on July 4, 1776. On this day, the Continental
Congress announced that the thirteen American colonies considered themselves a new nation, called the United States of
America, and were no longer under British rule. Interestingly, the Congress had voted to declare independence
*two days* earlier, on July 2.
A1 (Low perturbation)
... celebrating the adoption of the Declaration of Independence ~~on July 4, 1776~~ **on August 5, 1776** ...
A2 (Medium perturbation)
... celebrating the Declaration of Independence on August 5, 1781. ~~On this day~~ **On that moment**, ...
A3 (High perturbation)
... is an unofficial event ... celebrating a proposal of the Declaration of Independence **on August 5, 1781** ...
A4 (Extreme perturbation)
... celebrating a proposal of the drafting of Independence on August 5, 1781 ... called the United States of
the Colonies, and were no longer under Spanish rule.
For more details on how to use this library, see the dedicated docs here.
Plug this metric into your RAGAS pipeline and get all the good stuff for cheaper...
pip install truthscore[open]

from langchain_community.llms import OllamaLLM
from ragas import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from truthscore import OpenFactualCorrectness
test_data = {
"user_input": "What happened in Q3 2024?",
"reference": "The company saw an 8% rise in Q3 2024, driven by strong marketing and product efforts.",
"response": "The company experienced an 8% increase in Q3 2024 due to effective marketing strategies and product efforts."
}
sample = SingleTurnSample(**test_data)
evaluator_llm = LangchainLLMWrapper(OllamaLLM(model="gemma3:27b", base_url="http://localhost:11434"))
metric = OpenFactualCorrectness(llm=evaluator_llm)
score = metric.single_turn_score(sample)
print(score)  # e.g. 1.0

For more details on how to use this library, see the dedicated docs here.
We evaluated how well different factuality scoring methods track increasing degrees of factual perturbation using 500 perturbed examples generated from 100 Q&A pairs from the Google Natural Questions dataset. The table below summarizes the correlation between the intended perturbation levels (A0 to A4) and the factuality scores assigned by each method.
| Method | LLM | Pearson r [95% CI] | Kendall tau | Kendall tau 95% CI |
|---|---|---|---|---|
| LLM-as-judge | gemma3:4b | -0.63 [-0.69, -0.58] | -0.79 | [-0.82, -0.77] |
| LLM-as-judge | llama3.3:70b | -0.74 [-0.78, -0.70] | -0.86 | [-0.88, -0.84] |
| LLM-as-judge | mistral-small3.1:24b | -0.71 [-0.75, -0.66] | -0.76 | [-0.79, -0.72] |
| LLM-as-judge | phi4:14b | -0.74 [-0.78, -0.70] | -0.81 | [-0.83, -0.78] |
| LLM-as-judge | prometheus-v2:7b | -0.62 [-0.67, -0.56] | -0.70 | [-0.75, -0.66] |
| LLM-as-judge | qwen2.5:7b | -0.63 [-0.68, -0.57] | -0.72 | [-0.76, -0.67] |
| RAGAS | gpt-4o-mini | -0.87 [-0.90, -0.85] | -0.95 | [-0.97, -0.93] |
| LLM + NLI | gemma3:12b | -0.82 [-0.85, -0.79] | -0.96 | [-0.98, -0.94] |
| LLM + NLI | llama3.3:70b | -0.83 [-0.86, -0.80] | -0.94 | [-0.96, -0.92] |
Key takeaways:
- Pipeline methods (RAGAS and LLM + NLI) outperform standalone LLM-as-judge models, showing stronger negative correlations that indicate better detection of factual errors.
- The RAGAS pipeline with GPT-4o-mini achieves the strongest Pearson correlation (-0.87) and a near-perfect Kendall’s tau (-0.95), reflecting both linear and rank-order accuracy.
- The LLM + NLI approach (i.e., truthscore) offers a strong open-weight alternative with competitive performance, enabling efficient and cost-effective factuality evaluation.
- Standalone LLM-as-judge methods exhibit weaker correlations (Pearson between -0.62 and -0.74), suggesting lower reliability in capturing factual degradation.
These results demonstrate how TruthBench’s perturbed datasets enable effective benchmarking and comparison of factuality evaluation pipelines, promoting development of accurate and scalable factual robustness assessment algorithms.
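As a rough illustration of the correlations reported in the table above, the sketch below computes a Pearson coefficient and a tie-aware Kendall tau (tau-b) between perturbation levels and factuality scores using only the standard library. The `scores` values are made up for illustration, the function names are hypothetical, and the choice of the tau-b variant is an assumption: this section does not state which variant or which confidence-interval procedure was used.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def kendall_tau_b(xs, ys):
    """Kendall's tau-b, which corrects for ties (simple O(n^2) pairwise version)."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from both denominator terms
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / sqrt((untied + ties_x_only) * (untied + ties_y_only))


# Hypothetical scores for two questions, each answered at levels A0..A4 (encoded 0..4).
levels = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
scores = [1.0, 0.9, 0.7, 0.4, 0.1, 1.0, 0.8, 0.8, 0.5, 0.2]
print(f"Pearson r   = {pearson(levels, scores):.2f}")
print(f"Kendall tau = {kendall_tau_b(levels, scores):.2f}")
```

A scoring method that tracks perturbation severity well should yield values close to -1 on both measures, since scores are expected to fall as the level rises.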
We are also open-sourcing the datasets we used to assess the quality of our pipeline. In short, we asked annotators to compare the perceived quality of expert-written and pipeline-generated answers. For each pair, the annotators decided which option aligned best with a specific set of guidelines (available in our paper; see Appendix C). They could also accept both options (if they perceived similar quality) or reject both (if neither complied with the guidelines).
├── datasets
│ ├── evaluation # datasets used for evaluating LLMs and other techniques (Section 5)
│ │ ├── dataset.json # the pipeline generated dataset (with A0 -> A4)
│ │ ├── factual_correctness_eval.jsonl # evaluation for fast-fc (our cost-efficient implementation) and ragas (default)
│ │ ├── gold-dataset.json # set of Question and ground truths sampled from Google's Natural Questions dataset
│ │ ├── llm_as_judge_eval.jsonl # evaluation of several LLMs for factual correctness
│ │ ├── report.json # detailed report of the question transformations
│ ├── human-assessment # datasets used for validating the quality of the pipeline (Section 4)
│ │ ├── assessment-dataset.json # set of Q&As manually fabricated by experts (including A0 -> A4) with alternative versions produced by our pipeline
│ │ ├── report.json # the pipeline report with details about the incremental changes when producing the "ai" responses in assessment-dataset.json
│ │ ├── results-evaluator-1.json # assessment from evaluator 1 (preferences)
│ │ ├── results-evaluator-2.json # assessment from evaluator 2 (preferences)
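The files listed above come in two flavours, `.json` and `.jsonl`. A minimal loader sketch follows; it assumes the `.jsonl` files hold one JSON record per line (the usual JSON Lines convention) and makes no assumption about the fields inside each record. The helper name `load_records` is hypothetical and not part of the library.

```python
import json
from pathlib import Path


def load_records(path):
    """Load a .json file as a single document, or a .jsonl file as a list of records."""
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    if path.suffix == ".jsonl":
        # JSON Lines: one JSON value per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    return json.loads(text)
```

For example, `load_records("datasets/evaluation/llm_as_judge_eval.jsonl")` would return a list with one entry per line, while `load_records("datasets/evaluation/dataset.json")` would return the parsed document as-is.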
We provide a user-friendly webapp to facilitate comparing A0-A4 responses generated by different sources. The UI provides a side-by-side visualization with diff capabilities. This tool was used by annotators to evaluate the quality of our pipeline.
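To give a feel for the kind of diff such a comparison relies on, here is a small word-level diff sketch built on the standard library's `difflib`. It is not the UI's actual implementation; the function name `word_diff` and the `~~deleted~~` / `**inserted**` rendering are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def word_diff(reference, perturbed):
    """Mark words removed from the reference with ~~...~~ and added words with **...**."""
    ref_words, new_words = reference.split(), perturbed.split()
    matcher = SequenceMatcher(a=ref_words, b=new_words, autojunk=False)
    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            parts.append(" ".join(ref_words[i1:i2]))
            continue
        if i1 < i2:  # words only in the reference ("delete" or "replace")
            parts.append("~~" + " ".join(ref_words[i1:i2]) + "~~")
        if j1 < j2:  # words only in the perturbed text ("insert" or "replace")
            parts.append("**" + " ".join(new_words[j1:j2]) + "**")
    return " ".join(parts)


a0 = "celebrating the adoption of the Declaration of Independence on July 4, 1776."
a1 = "celebrating the adoption of the Declaration of Independence on August 5, 1776."
print(word_diff(a0, a1))
```

On this pair, the changed date tokens are wrapped as `~~July 4,~~ **August 5,**` while the unchanged words pass through untouched.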
One must provide an input dataset that follows the schema shown at the end of this document.
After evaluation, the results are exported with the following format (dictionary keys are the ids from the previous file).
{
"0": {
"A0": "Both are bad",
"A1": "AI",
"A2": "Both are good",
"A3": "AI",
"A4": "Expert"
},
"1": {
"A0": "...",
"A1": "...",
"A2": "...",
"A3": "...",
"A4": "..."
},
// ...
}
To launch the tool, first install the dependencies with `pip install -r ui/requirements.txt`. Then, you can run
python ui/evaluation_interface.py ./datasets/human-assessment/assessment-dataset.json ./datasets/results-evaluator-x.json
The application will start at http://127.0.0.1:7860, which can be accessed with your browser.
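Once results have been exported in the format described above, they can be summarized with a few lines of Python. The sketch below tallies the chosen label per perturbation level and computes a simple raw agreement rate between two annotators. The helper names are hypothetical, and the second annotator's answers are invented for the example; only the labels themselves ("AI", "Expert", "Both are good", "Both are bad") come from the export format.

```python
from collections import Counter

LEVELS = ["A0", "A1", "A2", "A3", "A4"]


def tally_preferences(results):
    """Count annotator choices per perturbation level."""
    counts = {level: Counter() for level in LEVELS}
    for answers in results.values():
        for level in LEVELS:
            if level in answers:
                counts[level][answers[level]] += 1
    return counts


def agreement_rate(results_a, results_b):
    """Fraction of (question, level) items where two annotators chose the same label."""
    shared = [
        (qid, level)
        for qid in results_a.keys() & results_b.keys()
        for level in LEVELS
        if level in results_a[qid] and level in results_b[qid]
    ]
    if not shared:
        return 0.0
    same = sum(results_a[qid][level] == results_b[qid][level] for qid, level in shared)
    return same / len(shared)


# First annotator: taken from the export example; second annotator: hypothetical.
evaluator_1 = {"0": {"A0": "Both are bad", "A1": "AI", "A2": "Both are good", "A3": "AI", "A4": "Expert"}}
evaluator_2 = {"0": {"A0": "Both are bad", "A1": "AI", "A2": "Both are good", "A3": "AI", "A4": "AI"}}

print(tally_preferences(evaluator_1)["A4"])
print(agreement_rate(evaluator_1, evaluator_2))
```

Raw agreement ignores chance agreement, so a chance-corrected statistic may be preferable for reporting; this sketch only shows how the exported dictionaries can be traversed.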
- Giovanni Gatti
- Ilyana Guendouz
- Mariia Tokareva
- Adele Robaldo
- Sarra Gharsallah
- Raphael Troncy
@inproceedings{gharsallah2025peut,
title={Peut-on faire confiance aux juges? Validation de m{\'e}thodes d’{\'e}valuation de la factualit{\'e} par perturbation des r{\'e}ponses},
author={Gharsallah, Sarra and Robaldo, Ad{\`e}le and Tokareva, Mariia and Guendouz, Ilyana and Gatti Pinheiro, Giovanni and Troncy, Raphael and Papotti, Paolo and Michiardi, Pietro},
booktitle={Actes de l'atelier {\'E}valuation des mod{\`e}les g{\'e}n{\'e}ratifs (LLM) et challenge 2025 (EvalLLM)},
pages={228--252},
year={2025}
}


Input dataset schema for the evaluation UI:
{
  "questions": [
    {
      "id": 0,
      "question": "What are the main causes of climate change?",
      "ground_truth": "Climate change is primarily ...",
      "answers": {
        "A0": {
          "ai": "Human activities are the main drivers of climate change...",
          "human": "The primary driver of climate change is human activity..."
        },
        "A1": { "ai": "...", "human": "..." },
        "A2": { "ai": "...", "human": "..." },
        "A3": { "ai": "...", "human": "..." },
        "A4": { "ai": "...", "human": "..." }
      }
    },
    // ...
  ]
}
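Before launching the UI, it can help to check that a dataset matches the input schema shown above. The following is a minimal validation sketch, not part of the tool: the function name `validate_dataset` is hypothetical, and it assumes that every question must carry all five levels (A0 to A4) with both an "ai" and a "human" answer, as in the example.

```python
LEVELS = ("A0", "A1", "A2", "A3", "A4")
SOURCES = ("ai", "human")
REQUIRED_KEYS = ("id", "question", "ground_truth", "answers")


def validate_dataset(dataset):
    """Return a list of problems found; an empty list means the layout looks right."""
    questions = dataset.get("questions") if isinstance(dataset, dict) else None
    if not isinstance(questions, list):
        return ["top-level 'questions' must be a list"]
    problems = []
    for index, item in enumerate(questions):
        if not isinstance(item, dict):
            problems.append(f"questions[{index}]: must be an object")
            continue
        for key in REQUIRED_KEYS:
            if key not in item:
                problems.append(f"questions[{index}]: missing '{key}'")
        answers = item.get("answers")
        if not isinstance(answers, dict):
            answers = {}
        for level in LEVELS:
            level_answers = answers.get(level)
            if not isinstance(level_answers, dict):
                level_answers = {}
            for source in SOURCES:
                if not isinstance(level_answers.get(source), str):
                    problems.append(
                        f"questions[{index}]: answers.{level}.{source} must be a string"
                    )
    return problems


# Small in-memory example following the schema; the texts are placeholders.
sample = {
    "questions": [
        {
            "id": 0,
            "question": "What are the main causes of climate change?",
            "ground_truth": "Climate change is primarily ...",
            "answers": {level: {"ai": "...", "human": "..."} for level in LEVELS},
        }
    ]
}
print(validate_dataset(sample))
```

Returning a list of messages rather than raising on the first error makes it easier to fix a hand-edited dataset in one pass.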