Needle In A Haystack (NIAH) Benchmark

Result: 250/250 — 100% Retrieval

mind-mem achieves 100% retrieval across all haystack sizes, burial depths, and needle types in the Needle In A Haystack benchmark.

Reconfirmed on v3.2.1 (2026-04-20): 250/250 passed in 18:50 (1130.13s). No regression from the v1.9.0 baseline (1128.02s) despite the v3.2.0 backend-protocol refactor and the v3.2.1 apply-engine BlockStore routing. The result holds on the same matrix (5 sizes × 5 depths × 10 needles) with the same hybrid BM25 + vector + RRF stack.

Methodology

A single "needle" fact is planted at a controlled depth within a haystack of semantically diverse filler memory blocks. The system must retrieve the needle in its top-5 results using only a natural-language query — no exact-match hints, no metadata filters.
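The pass criterion above can be sketched as a small helper. This is an illustrative sketch only: the function name `needle_in_top_k` and the block-ID strings are hypothetical and are not taken from the mind-mem test suite.

```python
def needle_in_top_k(ranked_block_ids, needle_id, k=5):
    """Return True if needle_id appears among the first k ranked results."""
    return needle_id in ranked_block_ids[:k]


# Hypothetical ranked output from a recall query, best match first.
ranked = ["filler-12", "filler-3", "needle", "filler-40", "filler-7", "filler-1"]

print(needle_in_top_k(ranked, "needle"))    # True: rank 3 is within the top 5
print(needle_in_top_k(ranked, "filler-1"))  # False: rank 6 is outside the top 5
```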

Test Matrix

Parameter	Values	Count
Haystack size	10, 50, 100, 250, 500 blocks	5
Burial depth	0%, 25%, 50%, 75%, 100%	5
Needle type	10 diverse domain facts	10
Total test cases		250
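
The 250 cases are the Cartesian product of the three parameters. A minimal sketch of that enumeration follows; the constant and function names are assumptions for illustration, and the actual test file may parametrize differently.

```python
from itertools import product

HAYSTACK_SIZES = [10, 50, 100, 250, 500]  # blocks
BURIAL_DEPTHS = [0, 25, 50, 75, 100]      # percent
NEEDLE_IDS = list(range(1, 11))           # needle numbers 1 through 10


def build_test_matrix():
    """Return every (size, depth, needle_id) combination."""
    return list(product(HAYSTACK_SIZES, BURIAL_DEPTHS, NEEDLE_IDS))


cases = build_test_matrix()
print(len(cases))  # 250
```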

Needle Types

The 10 needles span deliberately diverse domains to test cross-domain retrieval:

#	Domain	Needle
1	Finance/Projects	Project "Velvet Hammer" budget: $7,432,891
2	Chemistry	Compound XR-42 luminescent crystal at 127.3°C
3	IT/Infrastructure	Server rack 17B faulty DIMM in slot J3
4	Engineering	Fibonacci-optimized Picotronix pod timing (13th term, 233μs)
5	Chess/History	1987 Dubai Olympiad queen sacrifice, move 23
6	Biology/Ecology	Pyralidae moth migration, Zanskar Valley, August 17th
7	M&A/Legal	Helix-9 acquisition, $312M, FTC contingency
8	Environmental	Lake Karachay depth (1,847m), tritium (4.2 GBq/L)
9	ML/Hardware	Algorithm Z-Prime, 42.7% latency reduction on H100
10	History/Trade	Treaty of Novgorod-Seversky, 1618, amber trading rights

Haystack Composition

Filler blocks are generated from 20+ topic categories (astrophysics, culinary science, marine biology, urban planning, quantum computing, medieval history, etc.) with unique content per block. Each block contains domain-specific facts, measurements, and proper nouns to create realistic semantic interference.

Burial Depth

0% — Needle is the first block (easiest: recency bias helps)
25% — Needle at 1/4 depth
50% — Needle at midpoint
75% — Needle at 3/4 depth
100% — Needle is the last block (hardest: buried under entire haystack)
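
One way to turn a depth percentage into an insertion position is sketched below. The helper `bury_needle` is hypothetical, and the floor-division mapping is an assumption; the source does not state how the benchmark rounds intermediate depths or whether the haystack size counts the needle itself.

```python
def bury_needle(filler_blocks, needle, depth_pct):
    """Return a new list with needle inserted at depth_pct (0 to 100) of the filler."""
    if not 0 <= depth_pct <= 100:
        raise ValueError("depth_pct must be between 0 and 100")
    # 0% maps to index 0 (first block); 100% maps to len(filler_blocks) (last block).
    index = len(filler_blocks) * depth_pct // 100
    return filler_blocks[:index] + [needle] + filler_blocks[index:]


filler = ["f0", "f1", "f2", "f3"]
print(bury_needle(filler, "NEEDLE", 0))    # ['NEEDLE', 'f0', 'f1', 'f2', 'f3']
print(bury_needle(filler, "NEEDLE", 50))   # ['f0', 'f1', 'NEEDLE', 'f2', 'f3']
print(bury_needle(filler, "NEEDLE", 100))  # ['f0', 'f1', 'f2', 'f3', 'NEEDLE']
```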

Results

========================= 250 passed in 1128.02s (18:48) =========================

By Haystack Size

Size	Passed	Rate	Avg Time/Test
10 blocks	50/50	100%	~0.8s
50 blocks	50/50	100%	~2.5s
100 blocks	50/50	100%	~4.4s
250 blocks	50/50	100%	~8.5s
500 blocks	50/50	100%	~12s

By Burial Depth

Depth	Passed	Rate
0% (top)	50/50	100%
25%	50/50	100%
50% (middle)	50/50	100%
75%	50/50	100%
100% (bottom)	50/50	100%

By Needle Type

Needle	Passed	Rate
Velvet Hammer (finance)	25/25	100%
XR-42 (chemistry)	25/25	100%
Server rack 17B (IT)	25/25	100%
Picotronix timing (eng)	25/25	100%
Chess Olympiad (history)	25/25	100%
Moth migration (ecology)	25/25	100%
Helix-9 M&A (legal)	25/25	100%
Lake Karachay (environ)	25/25	100%
Z-Prime algorithm (ML)	25/25	100%
Novgorod treaty (history)	25/25	100%

Configuration

{
  "recall": {
    "backend": "hybrid",
    "rrf_k": 60,
    "bm25_weight": 1.0,
    "vector_weight": 1.0,
    "model": "BAAI/bge-large-en-v1.5",
    "vector_enabled": true,
    "onnx_backend": true,
    "provider": "sqlite_vec"
  }
}

Search Stack

BM25 (Porter stemming + RM3 query expansion)
  +
BAAI/bge-large-en-v1.5 (1024-dim dense vectors via sentence-transformers)
  ↓
Reciprocal Rank Fusion (k=60, equal weights)
  ↓
sqlite-vec (vector storage + ANN search)

Why Hybrid Search Achieves 100%

Neither BM25 nor vector search alone would achieve 100%:

BM25 alone fails on semantic queries whose terms do not lexically match the needle (e.g., querying "speedup" when the needle says "reduces latency").
Vector search alone can miss needles whose critical numeric details get washed out in dense embedding space (e.g., distinguishing "$7,432,891" from "$7,500,000").
Hybrid (BM25 + Vector + RRF) combines lexical precision with semantic recall, so both exact-term matches and meaning-based matches contribute to the final ranking.
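
The fusion step can be illustrated with a minimal Reciprocal Rank Fusion sketch. It uses the standard RRF formula, score(d) = Σ weight / (k + rank), with k=60 and equal weights as in the configuration above. The function name, the alphabetical tie-break, and the example block IDs are assumptions for illustration; this is not mind-mem's actual implementation.

```python
def reciprocal_rank_fusion(rankings, k=60, weights=None):
    """Fuse several ranked ID lists into one list, best first.

    Each list contributes weight / (k + rank) per ID, with rank starting at 1.
    """
    if weights is None:
        weights = [1.0] * len(rankings)
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, block_id in enumerate(ranking, start=1):
            scores[block_id] = scores.get(block_id, 0.0) + weight / (k + rank)
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda block_id: (-scores[block_id], block_id))


# Hypothetical rankings: BM25 puts the needle second, vector search puts it first.
bm25_ranking = ["filler-3", "needle", "filler-7"]
vector_ranking = ["needle", "filler-9", "filler-3"]

print(reciprocal_rank_fusion([bm25_ranking, vector_ranking]))
# ['needle', 'filler-3', 'filler-9', 'filler-7']
```

In this toy case the needle wins because it ranks highly in both lists, even though neither retriever's top result agrees with the other's.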

Hardware

CPU: Intel i7-5930K
GPU: NVIDIA RTX 3080 (10GB) — embeddings run on CPU via ONNX
RAM: 64GB DDR4
Storage: NVMe SSD

Reproducing

cd mind-mem
pip install -e ".[test]"
python -m pytest tests/test_niah.py -v

Comparison

System	NIAH Score	Search Method	Notes
mind-mem v3.2.1	100% (250/250)	Hybrid BM25+Vector+RRF	Reconfirmed 2026-04-20
mind-mem v1.9.0	100% (250/250)	Hybrid BM25+Vector+RRF	Original baseline
Mem0	—	Vector only	No published NIAH results
Zep	—	Vector only	No published NIAH results
LangMem	—	Vector only	No published NIAH results
OpenAI memory	—	Unknown	No published NIAH results

Note: Other systems have not published NIAH benchmark results on comparable test matrices. On the LoCoMo benchmark, mind-mem's scores (Mean: 77.9, Adversarial: 82.3, Temporal: 88.5) exceed those of Mem0 (66.9), Zep (66.0), and LangMem (58.1).

Attribution

The Needle In A Haystack test was originally created by Greg Kamradt (November 2023) for evaluating LLM context window recall. This benchmark adapts the methodology for persistent memory retrieval systems — testing recall from a stored memory corpus rather than a single context window.
