- [2025/01/11] First upload of the DEFT toolkit to GitHub.
Deep Research Agents (DRAs) aim to automatically produce analyst-level reports, but current benchmarks often fail to capture the complexity of comprehensive report generation.
To address this, we present a unified framework for evaluating and diagnosing deep research agents:
- FINDER (Fine-grained DEepResearch bench): An enhanced benchmark consisting of 100 human-curated research tasks with 419 structured checklist items that standardize report structure, analytical depth, and factual grounding.
- DEFT (Deep rEsearch Failure Taxonomy): The first failure taxonomy for deep research agents, containing 14 fine-grained failure modes across Reasoning, Retrieval, and Generation dimensions.
- Checklist Pass Rate Evaluation: Measures the percentage of checklist items passed by agent outputs.
- Toolkit: A complete set of tools for taxonomy generation and automated evaluation.
The toolkit supports both taxonomy generation (for creating failure categories) and model evaluation (for scoring agent performance).
Run the following command to install the required dependencies from the requirements.txt file:

```bash
pip install -r requirements.txt
```

Set up API credentials using environment variables or a .env file:
Option A: Using .env file (Recommended)

```bash
# Copy the example configuration
cp env.example .env
# Edit .env file and fill in your API credentials
nano .env
```

Option B: Using environment variables

```bash
export API_KEY="your_api_key"
export MODEL_NAME="gpt-4o"
export BASE_URL="https://api.openai.com/v1"
```

A JSONL file containing records of the model performing deep research tasks, with each line a JSON object including at least `id`, `question`, and `article` fields.
Plain-text templates for each pipeline stage.
A Markdown list of starter categories in [level] Name: Description format.
A Markdown list of Level 1 and Level 2 categories in the final taxonomy (DEFT) for evaluation.
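Before running the pipeline, it can help to sanity-check the records file against the format described above. A minimal sketch using only the Python standard library (the helper name is ours, not part of the toolkit):

```python
import json

REQUIRED_FIELDS = {"id", "question", "article"}

def validate_records(path):
    """Check that every line of a records JSONL file parses as JSON
    and contains the required id/question/article fields."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                problems.append((lineno, "invalid JSON"))
                continue
            missing = REQUIRED_FIELDS - record.keys()
            if missing:
                problems.append((lineno, f"missing fields: {sorted(missing)}"))
    return problems
```

Running `validate_records("data/input/records.jsonl")` returns an empty list when the file is well-formed.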
The following steps demonstrate an example of forming the categories in DEFT.
```bash
python -m deft_toolkit.analyses_generation \
    --model gpt-4o \
    --data data/input/records.jsonl \
    --out_file data/input/records_with_gpt_analysis.jsonl \
    --prompt_en_file prompt/analyses_generation_en.txt \
    --prompt_zh_file prompt/analyses_generation_zh.txt \
    --max_workers 15
```

This adds a `failure_analysis` field to each record.
```bash
python -m deft_toolkit.modes_generation \
    --model gpt-4o \
    --data data/input/records_with_gpt_analysis.jsonl \
    --prompt_file prompt/modes_generation.txt \
    --seed_file prompt/seeds.md \
    --out_file data/output/files/generation_gpt.jsonl \
    --mode_file data/output/modes/modes_gpt.md
```

This updates the ModeTree with occurrence counts and descriptions, saving LLM responses per report.
```bash
python -m deft_toolkit.refinement \
    --model gpt-4o \
    --prompt_file prompt/refinement.txt \
    --mode_file data/output/modes/modes_gpt.md \
    --out_file data/output/refinement.md \
    --merge_threshold 0.6 \
    --remove_threshold 0.01
```

This stage merges similar categories and removes low-frequency categories below the threshold.
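The refinement itself is handled by the toolkit; purely to illustrate what `--remove_threshold` means, here is a sketch of frequency-based pruning (the function and data layout are illustrative, not the toolkit's internal API):

```python
def prune_rare_modes(mode_counts, remove_threshold=0.01):
    """Drop failure modes whose share of all occurrences falls below
    remove_threshold (e.g. 0.01 means fewer than 1% of assignments)."""
    total = sum(mode_counts.values())
    if total == 0:
        return {}
    return {mode: count for mode, count in mode_counts.items()
            if count / total >= remove_threshold}
```

For example, with counts `{"a": 980, "b": 19, "c": 1}` and the default threshold, mode `c` (a 0.1% share) would be removed while the others are kept.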
This repository provides three evaluation approaches for deep research agents. Prepare your agent's inference results (as JSONL files) before running evaluations.
Evaluate your agent using the DEFT taxonomy with the Taxonomy Positive Metric (S).
Place your agent's inference results in data/input/records.jsonl. Each line should be a JSON object with:
```json
{
  "id": "unique_identifier",
  "question": "research question",
  "article": "agent generated article"
}
```

```bash
python -m deft_toolkit.assignment \
    --model gpt-4o \
    --data data/input/records.jsonl \
    --out_file data/output/files/records_annotated.jsonl \
    --prompt_file prompt/assignment.txt \
    --mode_file data/output/modes/final_DEFT.md \
    --max_workers 15
```

```bash
python -m deft_toolkit.metrics \
    --data data/output/files/records_annotated.jsonl \
    --mode_file data/output/modes/final_DEFT.md \
    --output_col responses
```

The metric score will be printed to the console and saved in the output files.
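If you want to inspect the annotated records yourself before computing metrics, a simple tally of assigned failure modes can be useful. This is a sketch only: the field name `assigned_modes` is a placeholder and should be adjusted to match the actual annotated output schema:

```python
import json
from collections import Counter

def tally_failure_modes(path, field="assigned_modes"):
    """Count how often each failure mode appears across annotated
    records. `field` is a hypothetical column holding a list of mode
    names per record; change it to match your annotated JSONL."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                counts.update(json.loads(line).get(field, []))
    return counts
```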
Evaluate your agent against the FINDER (Deep Research Benchmark) framework. Please refer to the deep_research_bench documentation for detailed evaluation instructions and requirements.
Note: We have made several modifications and improvements to the original deep_research_bench to address certain issues.
Measure the percentage of checklist criteria met by your agent's outputs.
Place your agent's inference results in checklist_eval/data/. The file should be in JSONL format with either:
- an `article` field (for generated articles), or
- a `prediction` field (for predictions)

Each line should include an `id` field matching the checklist data.
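These input requirements can be checked up front. A minimal sketch (the helper name is ours, not part of the repository):

```python
import json

def check_checklist_inputs(path):
    """Verify each JSONL line has an 'id' and either an 'article' or
    a 'prediction' field, as the checklist evaluation requires."""
    bad_lines = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue
            rec = json.loads(line)
            if "id" not in rec or not ({"article", "prediction"} & rec.keys()):
                bad_lines.append(lineno)
    return bad_lines
```

An empty return value means every line satisfies the format.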
Set up the required API credentials for the evaluation model:

```bash
export API_KEY="your_api_key"
export MODEL_NAME="your_model_name"  # e.g., gpt-4o
export BASE_URL="your_api_base_url"
```

Basic usage with default settings:

```bash
cd checklist_eval
python llm_judge_v2.py
```

Or with custom parameters:
```bash
python llm_judge.py \
    --input_folder ./data \
    --output_folder ./my_results \
    --checklist_file ./data/checklist.jsonl \
    --max_concurrent_requests 20 \
    --request_timeout 180 \
    --max_retries 3
```

View help for all options:

```bash
python llm_judge.py --help
```

Evaluation results will be saved in checklist_eval/evaluation_checklist_results/ with:
- Individual item pass/fail status
- Overall pass rate statistics
- Per-article evaluation details
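The overall pass rate is simply the fraction of checklist items judged as passed. As a sketch of that computation (the field names here are illustrative, not the exact output schema):

```python
def checklist_pass_rate(item_results):
    """Compute the overall pass rate from per-item pass/fail results.
    item_results: list of dicts like {"title": ..., "passed": bool}."""
    if not item_results:
        return 0.0
    passed = sum(1 for item in item_results if item["passed"])
    return passed / len(item_results)
```

For example, an article passing 2 of 3 checklist items scores roughly 0.667.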
Input record format:

```json
{
  "id": "unique_identifier",
  "question": "research question text",
  "article": "agent generated article content"
}
```

Checklist data format:

```json
{
  "id": "unique_identifier",
  "topic": "research topic",
  "checklist": [
    {
      "title": "Criterion Name",
      "description": "Detailed description of what to check"
    }
  ]
}
```

We would like to express our sincere gratitude to the following open-source projects for their valuable contributions to this research:
- DeepResearch Bench: A comprehensive benchmark for systematically evaluating Deep Research Agents. The evaluation framework and methodology provided by this project have been instrumental in guiding our research approach.
- TopicGPT: A GPT-based topic modeling tool. The innovative methods for topic extraction and analysis from this project have provided valuable technical support for our work.
We welcome contributions! Please feel free to:
- Report bugs or issues
- Suggest new features or improvements
- Submit pull requests
- Improve documentation
This project is licensed under the MIT License - see the LICENSE file for details.
```bibtex
@misc{zhang2025fargenuinelyusefuldeep,
      title={How Far Are We from Genuinely Useful Deep Research Agents?},
      author={Dingling Zhang and He Zhu and Jincheng Ren and Kangqi Song and Xinran Zhou and Boyu Feng and Shudong Liu and Jiabin Luo and Weihao Xie and Zhaohui Wang and Tianrui Qin and King Zhu and Yuqing Wang and Qianben Chen and Yuchen Eleanor Jiang and Wei Wang and Jiaheng Liu and Wangchunshu Zhou},
      year={2025},
      eprint={2512.01948},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2512.01948},
}
```
