Benchmarks

Standardized evaluation suites for measuring research quality.

| Benchmark | What it measures | Dataset size | Agents tested |
|---|---|---|---|
| FreshQA | Factual accuracy on current knowledge | 600 questions | Shallow, Full pipeline |
| Deep Research Bench | Report quality (RACE + FACT metrics) | 100 topics | Deep researcher |
| DeepSearchQA | Document QA across categories | 900 problems | Deep researcher |
- [FreshQA](freshqa.md)
- [Deep Research Bench](deep-research-bench.md)
- [DeepSearchQA](deepsearch-qa.md)