mind-mem achieves 100% retrieval across all haystack sizes, burial depths, and needle types in the Needle In A Haystack benchmark.
Reconfirmed on v3.2.1 (2026-04-20): 250/250 passed in 18:50 (1130.13s). Runtime is within 0.2% of the v1.9.0 baseline (1128.02s), so there is no regression despite the v3.2.0 backend-protocol refactor and the v3.2.1 apply-engine BlockStore routing. The result holds on the same matrix (5 sizes × 5 depths × 10 needles) with the same hybrid BM25 + vector + RRF stack.
A single "needle" fact is planted at a controlled depth within a haystack of semantically diverse filler memory blocks. The system must retrieve the needle in its top-5 results using only a natural-language query — no exact-match hints, no metadata filters.
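The pass criterion described above can be sketched as a small predicate. This is a minimal illustration, not the code in tests/test_niah.py; the function name `needle_in_top_k` and the block identifiers are hypothetical.

```python
from typing import Sequence


def needle_in_top_k(ranked_ids: Sequence[str], needle_id: str, k: int = 5) -> bool:
    """Return True if the needle is among the first k ranked results."""
    return needle_id in ranked_ids[:k]


# Hypothetical ranked output from a retrieval call, best match first.
ranked = ["filler-12", "filler-40", "needle-1", "filler-7", "filler-3", "filler-9"]

hit = needle_in_top_k(ranked, "needle-1")    # rank 3, inside the top 5
miss = needle_in_top_k(ranked, "filler-9")   # rank 6, outside the top 5
print(hit, miss)
```

A test case passes only when the predicate holds for the needle block at k=5.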
| Parameter | Values | Count |
|---|---|---|
| Haystack size | 10, 50, 100, 250, 500 blocks | 5 |
| Burial depth | 0%, 25%, 50%, 75%, 100% | 5 |
| Needle type | 10 diverse domain facts | 10 |
| Total test cases | 5 × 5 × 10 | 250 |
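The matrix in the table is a plain Cartesian product of the three parameters. The sketch below enumerates it to show where the 250 comes from; the constant and function names are assumptions for illustration and may not match the actual test suite.

```python
from itertools import product

# Values taken from the parameter table above.
HAYSTACK_SIZES = [10, 50, 100, 250, 500]
BURIAL_DEPTHS = [0, 25, 50, 75, 100]   # percent
NEEDLE_IDS = list(range(1, 11))        # needles 1 through 10


def build_matrix():
    """Return every (size, depth, needle) combination as a list of tuples."""
    return list(product(HAYSTACK_SIZES, BURIAL_DEPTHS, NEEDLE_IDS))


matrix = build_matrix()
print(len(matrix))
```

A list like this could be passed to `pytest.mark.parametrize` so that each combination is reported as its own test case.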
The 10 needles span deliberately diverse domains to test cross-domain retrieval:
| # | Domain | Needle |
|---|---|---|
| 1 | Finance/Projects | Project "Velvet Hammer" budget: $7,432,891 |
| 2 | Chemistry | Compound XR-42 luminescent crystal at 127.3°C |
| 3 | IT/Infrastructure | Server rack 17B faulty DIMM in slot J3 |
| 4 | Engineering | Fibonacci-optimized Picotronix pod timing (13th term, 233μs) |
| 5 | Chess/History | 1987 Dubai Olympiad queen sacrifice, move 23 |
| 6 | Biology/Ecology | Pyralidae moth migration, Zanskar Valley, August 17th |
| 7 | M&A/Legal | Helix-9 acquisition, $312M, FTC contingency |
| 8 | Environmental | Lake Karachay depth (1,847m), tritium (4.2 GBq/L) |
| 9 | ML/Hardware | Algorithm Z-Prime, 42.7% latency reduction on H100 |
| 10 | History/Trade | Treaty of Novgorod-Seversky, 1618, amber trading rights |
Filler blocks are generated from 20+ topic categories (astrophysics, culinary science, marine biology, urban planning, quantum computing, medieval history, etc.) with unique content per block. Each block contains domain-specific facts, measurements, and proper nouns to create realistic semantic interference.
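As a rough sketch of how unique filler could be produced, the snippet below cycles through the six topics named above and stamps each block with an invented label and measurement. The template, the `make_filler_blocks` name, and the seed handling are assumptions; the real generator draws on 20+ categories and richer content.

```python
import random

# Only the topics named in the text; the real suite uses 20+ categories.
TOPICS = [
    "astrophysics", "culinary science", "marine biology",
    "urban planning", "quantum computing", "medieval history",
]


def make_filler_blocks(n: int, seed: int = 0) -> list[str]:
    """Generate n distinct filler blocks, deterministic for a given seed."""
    rng = random.Random(seed)
    blocks = []
    for i in range(n):
        topic = TOPICS[i % len(TOPICS)]
        label = f"{topic.title()} Survey {i:04d}"   # invented proper noun
        value = round(rng.uniform(1.0, 999.0), 1)   # invented measurement
        blocks.append(
            f"[{topic}] {label} recorded a reading of {value} units in trial {i}."
        )
    return blocks


blocks = make_filler_blocks(500)
print(blocks[0])
```

Embedding the block index in the text guarantees uniqueness, and seeding the generator keeps haystacks reproducible across runs.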
- 0% — Needle is the first block (easiest: recency bias helps)
- 25% — Needle at 1/4 depth
- 50% — Needle at midpoint
- 75% — Needle at 3/4 depth
- 100% — Needle is the last block (hardest: buried under entire haystack)
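Depth placement can be sketched as mapping a percentage to an insertion index. The helper below is an assumption about how this could work, not the suite's actual code; it also assumes the haystack size counts the needle, so a 10-block haystack holds 9 filler blocks.

```python
def insert_needle(filler: list[str], needle: str, depth_pct: int) -> list[str]:
    """Return a new haystack with the needle placed at depth_pct percent.

    0 puts the needle first, 100 puts it last.
    """
    if not 0 <= depth_pct <= 100:
        raise ValueError("depth_pct must be between 0 and 100")
    index = (len(filler) * depth_pct) // 100   # integer math avoids rounding surprises
    return filler[:index] + [needle] + filler[index:]


filler = [f"filler-{i}" for i in range(9)]
top = insert_needle(filler, "NEEDLE", 0)
middle = insert_needle(filler, "NEEDLE", 50)
bottom = insert_needle(filler, "NEEDLE", 100)
print(top.index("NEEDLE"), middle.index("NEEDLE"), bottom.index("NEEDLE"))
```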
========================= 250 passed in 1128.02s (18:48) =========================
| Size | Passed | Failed | Rate | Avg Time/Test |
|---|---|---|---|---|
| 10 blocks | 50/50 | 0 | 100% | ~0.8s |
| 50 blocks | 50/50 | 0 | 100% | ~2.5s |
| 100 blocks | 50/50 | 0 | 100% | ~4.4s |
| 250 blocks | 50/50 | 0 | 100% | ~8.5s |
| 500 blocks | 50/50 | 0 | 100% | ~12s |
| Depth | Passed | Failed | Rate |
|---|---|---|---|
| 0% (top) | 50/50 | 0 | 100% |
| 25% | 50/50 | 0 | 100% |
| 50% (middle) | 50/50 | 0 | 100% |
| 75% | 50/50 | 0 | 100% |
| 100% (bottom) | 50/50 | 0 | 100% |
| Needle | Passed | Failed | Rate |
|---|---|---|---|
| Velvet Hammer (finance) | 25/25 | 0 | 100% |
| XR-42 (chemistry) | 25/25 | 0 | 100% |
| Server rack 17B (IT) | 25/25 | 0 | 100% |
| Picotronix timing (eng) | 25/25 | 0 | 100% |
| Chess Olympiad (history) | 25/25 | 0 | 100% |
| Moth migration (ecology) | 25/25 | 0 | 100% |
| Helix-9 M&A (legal) | 25/25 | 0 | 100% |
| Lake Karachay (environ) | 25/25 | 0 | 100% |
| Z-Prime algorithm (ML) | 25/25 | 0 | 100% |
| Novgorod treaty (history) | 25/25 | 0 | 100% |
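The three breakdown tables are different groupings of the same 250 outcomes. The sketch below shows that tally, assuming each result is a `(size, depth, needle, passed)` tuple; the all-pass data is reconstructed from the reported 250/250 outcome rather than read from real test logs.

```python
from collections import defaultdict
from itertools import product

SIZES = [10, 50, 100, 250, 500]
DEPTHS = [0, 25, 50, 75, 100]
NEEDLES = list(range(1, 11))

# Reconstructed from the reported outcome: every case passed.
results = [(s, d, n, True) for s, d, n in product(SIZES, DEPTHS, NEEDLES)]


def tally(rows, key_index):
    """Group rows by one dimension and return {key: (passed, total)}."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        key = row[key_index]
        total[key] += 1
        passed[key] += int(row[3])
    return {key: (passed[key], total[key]) for key in total}


by_size = tally(results, 0)
by_depth = tally(results, 1)
by_needle = tally(results, 2)
print(by_size[10], by_depth[100], by_needle[1])
```

Each size and each depth covers 50 cases (5 × 10), while each needle covers 25 (5 × 5), which matches the denominators in the tables.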
{
  "recall": {
    "backend": "hybrid",
    "rrf_k": 60,
    "bm25_weight": 1.0,
    "vector_weight": 1.0,
    "model": "BAAI/bge-large-en-v1.5",
    "vector_enabled": true,
    "onnx_backend": true,
    "provider": "sqlite_vec"
  }
}

BM25 (Porter stemming + RM3 query expansion)
+
BAAI/bge-large-en-v1.5 (1024-dim dense vectors via sentence-transformers)
↓
Reciprocal Rank Fusion (k=60, equal weights)
↓
sqlite-vec (vector storage + ANN search)
Neither BM25 nor vector search alone would achieve 100%:
- BM25 alone fails on semantic queries where the query terms don't lexically match the needle (e.g., querying "speedup" when the needle says "reduces latency")
- Vector search alone can miss needles with critical numeric details that get washed out in dense embedding space (e.g., distinguishing "$7,432,891" from "$7,500,000")
- Hybrid (BM25 + Vector + RRF) combines lexical precision with semantic recall, ensuring both exact-term matches and meaning-based matches contribute to ranking
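To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion using the standard formula, score(d) = Σ w / (k + rank), with k=60 and equal weights as in the configuration. The function name `rrf_fuse`, the way weights are applied, and the sample rankings are assumptions for illustration; mind-mem's internal implementation may differ.

```python
from collections import defaultdict


def rrf_fuse(rankings, weights=None, k=60):
    """Fuse several ranked lists of ids into one list, best first.

    Each list contributes weight / (k + rank) per id, with ranks starting at 1.
    """
    if weights is None:
        weights = [1.0] * len(rankings)
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings: BM25 puts a lexically similar filler block first,
# while the vector ranker puts the needle first.
bm25_ranking = ["filler-3", "needle", "filler-8", "filler-1"]
vector_ranking = ["needle", "filler-5", "filler-3", "filler-2"]

fused = rrf_fuse([bm25_ranking, vector_ranking], weights=[1.0, 1.0], k=60)
print(fused[:5])
```

In this toy case the needle is ranked 2nd and 1st (1/62 + 1/61), which edges out a filler block ranked 1st and 3rd (1/61 + 1/63). An id that ranks consistently well in both lists beats one that ranks well in only one.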
- CPU: Intel i7-5930K
- GPU: NVIDIA RTX 3080 (10GB) — embeddings run on CPU via ONNX
- RAM: 64GB DDR4
- Storage: NVMe SSD
cd mind-mem
pip install -e ".[test]"
python -m pytest tests/test_niah.py -v

| System | NIAH Score | Search Method | Notes |
|---|---|---|---|
| mind-mem v3.2.1 | 100% (250/250) | Hybrid BM25+Vector+RRF | Reconfirmed 2026-04-20 |
| mind-mem v1.9.0 | 100% (250/250) | Hybrid BM25+Vector+RRF | Original baseline |
| Mem0 | — | Vector only | No published NIAH results |
| Zep | — | Vector only | No published NIAH results |
| LangMem | — | Vector only | No published NIAH results |
| OpenAI memory | — | Unknown | No published NIAH results |
Note: Other systems have not published NIAH benchmark results on comparable test matrices. mind-mem's LoCoMo benchmark scores (Mean: 77.9, Adversarial: 82.3, Temporal: 88.5) already exceed Mem0 (66.9), Zep (66.0), and LangMem (58.1) on the established academic benchmark.
The Needle In A Haystack test was originally created by Greg Kamradt (November 2023) for evaluating LLM context window recall. This benchmark adapts the methodology for persistent memory retrieval systems — testing recall from a stored memory corpus rather than a single context window.
Benchmark run: 2026-03-21 | mind-mem v1.9.0 | Test suite: tests/test_niah.py

Copyright © 2026 STARGA, Inc. All rights reserved.