Official code for ATM-Bench: a benchmark for long-term multimodal personalized AI memory QA and retrieval.
ATM-Bench is the first benchmark for multimodal, multi-source personalized referential memory QA over long time horizons (~4 years) with evidence-grounded retrieval and answering.
Paper: According to Me: Long-Term Personalized Referential Memory QA
Project Page: https://atmbench.github.io/
- 2026-03-03: arXiv paper release (2603.01990)
- 2026-03-04: Initial codebase release, including baseline implementations for MMRAG, Oracle, NIAH, and four ported third-party baselines (A-Mem, HippoRAG2, mem0, MemoryOS).
- 2026-03-12: Initial General-Purpose Agent benchmark results release for Claude Code, Codex, and OpenCode.
- 2026-03-12: ATM-Bench data release on Hugging Face (Jingbiao/ATM-Bench).
- 2026-03-13: Fixed OpenCode token accounting and updated OpenClaw results.
- Coming soon: benchmarking support for General-Purpose Agents, including OpenClaw.
Initial General-Purpose Agent results on ATM-Bench-Hard are summarized below. The QS score here uses gpt-5-mini as the primary judge. Tokens/QS shows the token cost per percentage point of QS, so lower is more efficient.
| Agent | Model | QS | Total Tokens | Tokens/QS |
|---|---|---|---|---|
| Claude Code | Claude Opus 4.6 | 33.80% | 4.93M | 0.146M |
| Codex | GPT-5.2 | 39.70% | 15.46M | 0.389M |
| Codex | GPT-5.4* | 29.60% | 14.29M | 0.483M |
| OpenCode | GLM-5 | 27.00% | 16.89M | 0.626M |
| OpenCode | Qwen3.5-397B-A17B | 24.50% | 12.06M | 0.492M |
| OpenCode | Kimi K2.5 | 30.30% | 8.46M | 0.279M |
| OpenCode | MiniMax M2.5 | 22.90% | 14.5M | 0.633M |
| OpenCode | MiniMax M2.7 | 27.80% | 13.48M | 0.485M |
| OpenClaw | Kimi K2.5 | 25.40% | 9.63M | 0.379M |
*GPT-5.4 results may be unreliable because the Codex service was unstable during evaluation.
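The Tokens/QS column can be reproduced from the other two columns (the values below are taken from three rows of the table above):

```python
# Reproduce the Tokens/QS column: total tokens (in millions) divided by the
# QS score in percentage points. Lower means more token-efficient.
rows = [
    ("Claude Code / Claude Opus 4.6", 33.80, 4.93),
    ("Codex / GPT-5.2", 39.70, 15.46),
    ("OpenCode / Kimi K2.5", 30.30, 8.46),
]

for name, qs_pct, total_tokens_m in rows:
    tokens_per_qs = total_tokens_m / qs_pct
    print(f"{name}: {tokens_per_qs:.3f}M tokens per QS point")
```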
The coding agents still struggle on ATM-Bench-Hard, although they perform much better than various agentic memory baselines.
QS is reported with gpt-5-mini as the primary judge.
| Model | Setting | QS |
|---|---|---|
| GPT-5 | Raw | 72.12% |
| Qwen3-VL-8B-Instruct | Raw | 40.14% |
| Qwen3-VL-8B-Instruct | SGM | 27.98% |
| Qwen3-VL-8B-Instruct | D | 21.69% |
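QS is produced by an LLM judge. A minimal sketch of the scoring shape is below; the actual judge prompt, rubric, and any partial-credit scheme live in the codebase, and the binary verdict aggregation here is an assumption, not the benchmark's exact metric.

```python
# Sketch of LLM-as-judge scoring: a judge model (e.g. gpt-5-mini) returns a
# verdict string per question, which is mapped to a binary score and averaged.
# The judge call itself is omitted; only the aggregation shape is shown.

def parse_verdict(judge_output: str) -> int:
    """Map a judge verdict string to a binary score."""
    return 1 if judge_output.strip().upper() == "CORRECT" else 0

def aggregate_qs(verdicts) -> float:
    """QS as the percentage of questions judged correct."""
    scores = [parse_verdict(v) for v in verdicts]
    return 100.0 * sum(scores) / len(scores)

print(aggregate_qs(["CORRECT", "INCORRECT", "CORRECT", "correct"]))  # 75.0
```

Exact-match on the verdict string avoids the classic substring pitfall ("INCORRECT" contains "CORRECT").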
For NIAH, we compare the Qwen3-VL-8B-Instruct SGM and Raw settings at different haystack sizes.
| Model | Setting | QS | Avg. Context Tokens |
|---|---|---|---|
| Qwen3-VL-8B-Instruct | Raw, Oracle | 40.14% | 5.7k |
| Qwen3-VL-8B-Instruct | Raw, NIAH-25 | 25.43% | 15.9k |
| Qwen3-VL-8B-Instruct | Raw, NIAH-50 | 24.87% | 29.0k |
| Qwen3-VL-8B-Instruct | Raw, NIAH-100 | 10.90% | 56.0k |
| Qwen3-VL-8B-Instruct | SGM, Oracle | 27.98% | 4.6k |
| Qwen3-VL-8B-Instruct | SGM, NIAH-25 | 16.33% | 12.5k |
| Qwen3-VL-8B-Instruct | SGM, NIAH-50 | 15.77% | 23.9k |
| Qwen3-VL-8B-Instruct | SGM, NIAH-100 | 12.66% | 45.8k |
Existing long-term memory benchmarks focus primarily on dialogue history, failing to capture realistic personalized references grounded in lived experience. ATM-Bench addresses this gap with:
- Multimodal and multi-source data: Images, videos, emails
- Long-term horizon: ~4 years of personal memory
- Referential queries: Resolving personalized references (e.g., "Show me the moments where Grace was trying to be sneaky...")
- Evidence-grounded: Human-annotated QA pairs with ground-truth memory evidence
- Multi-evidence reasoning: Queries requiring evidence from multiple sources
- Conflicting evidence: Handling contradictory information
Memory Ingestion is decomposed into:
- Memory preprocessing (how each memory item is represented)
- Memory organization (how items are structured/linked)
We compare two preprocessing representations:
- Descriptive Memory (DM): each memory item is represented as one natural-language description.
- Schema-Guided Memory (SGM): each memory item is represented with fixed text-based key-value fields under a schema.
In SGM, schema fields are modality-aware. For example:

- Image/Video memory: `time`, `location`, `entities`, `ocr`, `tags`
- Email memory: `time`, `summary`, `body`
DM and SGM contain the same underlying information but use different formats.
In this codebase, DM is implemented as caption/description-style text, while SGM is implemented as schema-based key-value text fields.
For organization of the memory store:
- Piled Memory: items are stored without explicit links.
- Linked Memory: items are linked with inferred relations (graph structure); agentic systems can additionally update existing items during organization.
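A toy sketch of the two organizations follows; the link-inference rule here (shared entity implies a link) is a deliberately trivial stand-in, and the real system infers richer relations:

```python
# Piled vs. Linked memory stores. Link inference is a trivial stand-in:
# two items are linked if they share at least one entity.
from collections import defaultdict

items = {
    "m1": {"entities": {"Grace", "kite"}},
    "m2": {"entities": {"Grace", "cake"}},
    "m3": {"entities": {"passport"}},
}

# Piled Memory: a flat store, no explicit links.
piled = list(items)

# Linked Memory: add an edge between items sharing an entity.
links = defaultdict(set)
ids = list(items)
for i, a in enumerate(ids):
    for b in ids[i + 1:]:
        if items[a]["entities"] & items[b]["entities"]:
            links[a].add(b)
            links[b].add(a)

print(dict(links))  # {'m1': {'m2'}, 'm2': {'m1'}}
```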
In addition to end-to-end retrieval + generation evaluation, we provide NIAH (Needle In A Haystack):
- Each question is paired with a fixed evidence pool (`niah_evidence_ids`) that contains all ground-truth items.
- The rest of the pool is filled with realistic distractors.
- This isolates answer generation/reasoning quality from retrieval quality.
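The pool construction can be sketched as follows; `niah_evidence_ids` is the field named above, the pool size of 25 matches the NIAH-25 setting, and the uniform sampling and item IDs are illustrative assumptions:

```python
# Assemble a NIAH evidence pool: all ground-truth items plus sampled
# distractors, shuffled so the needles are hidden among the hay.
import random

def build_niah_pool(ground_truth_ids, distractor_ids, pool_size, seed=0):
    """Build a fixed evidence pool containing all ground-truth items."""
    assert len(ground_truth_ids) <= pool_size
    rng = random.Random(seed)
    n_fill = pool_size - len(ground_truth_ids)
    pool = list(ground_truth_ids) + rng.sample(distractor_ids, n_fill)
    rng.shuffle(pool)
    return pool

gt = ["img_0412", "email_0031"]           # hypothetical ground-truth IDs
distractors = [f"item_{i:04d}" for i in range(1000)]
pool = build_niah_pool(gt, distractors, pool_size=25)
print(len(pool))  # 25
```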
```bash
conda create -n atmbench python=3.11 -y
conda activate atmbench
pip install -r requirements.txt
pip install -e .
```

Set API keys via environment variables:

```bash
export OPENAI_API_KEY="your-key"
export VLLM_API_KEY="your-key"
```

Or use local key files (gitignored):

- `api_keys/.openai_key`
- `api_keys/.vllm_key`
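A small helper matching this convention (environment variable first, then the gitignored key file) might look like the following; the function name is hypothetical and not part of the codebase:

```python
# Resolve an API key: prefer the environment variable, fall back to a
# local gitignored key file, and fail loudly if neither is available.
import os
from pathlib import Path

def load_api_key(env_var: str, key_file: str) -> str:
    key = os.environ.get(env_var)
    if key:
        return key.strip()
    path = Path(key_file)
    if path.exists():
        return path.read_text().strip()
    raise RuntimeError(f"Set {env_var} or create {key_file}")

# openai_key = load_api_key("OPENAI_API_KEY", "api_keys/.openai_key")
# vllm_key = load_api_key("VLLM_API_KEY", "api_keys/.vllm_key")
```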
Before running MMRAG or Oracle, generate the image/video `batch_results.json` files:

```bash
# Optional but recommended: preload reverse-geocoding cache.
# Cache files are keyed by media filename stem, so the cache bundle must match
# the current image/video filenames.
bash scripts/memory_processor/image/copy_gps_cache.sh output/image/qwen3vl2b/cache
bash scripts/memory_processor/video/copy_gps_cache.sh output/video/qwen3vl2b/cache

# Generate memory itemization results
bash scripts/memory_processor/image/memory_itemize/run_qwen3vl2b.sh
bash scripts/memory_processor/video/memory_itemize/run_qwen3vl2b.sh
```

```bash
# MMRAG (runs both ATM-Bench and ATM-Bench-Hard)
bash scripts/QA_Agent/MMRAG/run.sh

# Oracle (upper bound; raw multimodal evidence)
bash scripts/QA_Agent/Oracle/run_oracle_qwen3vl8b_raw.sh
```
- Core baselines (`MMRAG`, `Oracle`, `NIAH`) are tested in the main `atmbench` environment.
- Third-party memory-system baselines in this repo include: `A-Mem`, `HippoRAG2`, `mem0`, `MemoryOS`.
- We strongly recommend running `MemoryOS` in a separate conda environment.
- `A-Mem`, `HippoRAG2`, and `mem0` have been tested as compatible with the core baseline environment, but separate environments are still safer for reproducibility and dependency isolation.
- Setup references for these baselines are under `third_party/`:
  - `third_party/A-mem/`
  - `third_party/HippoRAG/`
  - `third_party/mem0/`
  - `third_party/MemoryOS/`
- OpenClaw support is planned; we will shortly release the evaluation setup for all General-Purpose Agents (Claude Code, Codex, OpenCode, OpenClaw) on ATM-Bench.
For detailed setup, data layout, and reproducibility settings, see the documentation under `docs/`.
```
ATMBench/
├── memqa/        # Core memory QA implementation
├── scripts/      # Experiment scripts
├── docs/         # Documentation
├── data/         # Data directory (user-provided)
├── third_party/  # Vendored agentic memory systems
└── output/       # Experiment outputs (gitignored)
```
- `docs/README.md` - Getting started guide
- `docs/data.md` - Data format and preparation
- `docs/baseline.md` - Baseline implementations
- `docs/niah.md` - NIAH protocol and usage
- `docs/metrics.md` - Evaluation metrics
- `docs/reproducibility.md` - Reproduction instructions
- `docs/repo_structure.md` - Repository organization
If you use ATM-Bench in your research, please cite:

```bibtex
@article{mei2026atm,
  title={According to Me: Long-Term Personalized Referential Memory QA},
  author={Mei, Jingbiao and Chen, Jinghong and Yang, Guangyu and Hou, Xinyu and Li, Margaret and Byrne, Bill},
  journal={arXiv preprint arXiv:2603.01990},
  year={2026},
  url={https://arxiv.org/abs/2603.01990},
  doi={10.48550/arXiv.2603.01990}
}
```

- Paper: https://arxiv.org/abs/2603.01990
- Dataset: https://huggingface.co/datasets/Jingbiao/ATM-Bench
- Code: https://github.com/JingbiaoMei/ATM-Bench
- Issues: https://github.com/JingbiaoMei/ATM-Bench/issues
This project is licensed under the MIT License - see the LICENSE file for details.

