The following standardized evaluation suites are used to measure research quality.

| Benchmark | What it measures | Dataset size | Agents tested |
|---|---|---|---|
| FreshQA | Factual accuracy on current knowledge | 600 questions | Shallow, Full pipeline |
| Deep Research Bench | Report quality (RACE + FACT metrics) | 100 topics | Deep researcher |
| DeepSearchQA | Document QA across categories | 900 problems | Deep researcher |
:titlesonly:
freshqa.md
deep-research-bench.md
deepsearch-qa.md