Official code for ATM-Bench: a benchmark for long-term multimodal personalized AI memory QA and retrieval.
ATM-Bench is the first benchmark for multimodal, multi-source personalized referential memory QA over long time horizons (~4 years) with evidence-grounded retrieval and answering.
Paper: According to Me: Long-Term Personalized Referential Memory QA
Project Page: https://atmbench.github.io/
- 2026-03-03: arXiv paper release (2603.01990)
- 2026-03-04: Initial codebase release, including baseline implementations for MMRAG, Oracle, NIAH, and four ported third-party baselines (A-Mem, HippoRAG2, mem0, MemoryOS).
- 2026-03-12: Initial General-Purpose Agent benchmark results release for Claude Code, Codex, and OpenCode.
- 2026-03-12: ATM-Bench data release on Hugging Face (Jingbiao/ATM-Bench).
- 2026-03-13: Fixed OpenCode token accounting and updated OpenClaw results.
- Coming soon: benchmarking support for General-Purpose Agents, including OpenClaw.
Initial General-Purpose Agent results on ATM-Bench-Hard are summarized below. The QS score here uses gpt-5-mini as the primary judge. Tokens/QS shows the token cost per percentage point of QS, so lower is more efficient.
| Agent | Model | QS | Total Tokens | Tokens/QS |
|---|---|---|---|---|
| Claude Code | Claude Opus 4.6 | 33.80% | 4.93M | 0.146M |
| Codex | GPT-5.2 | 39.70% | 15.46M | 0.389M |
| Codex | GPT-5.4* | 29.60% | 14.29M | 0.483M |
| OpenCode | GLM-5 | 27.00% | 16.89M | 0.626M |
| OpenCode | Qwen3.5-397B-A17B | 24.50% | 12.06M | 0.492M |
| OpenCode | Kimi K2.5 | 30.30% | 8.46M | 0.279M |
| OpenCode | MiniMax M2.5 | 22.90% | 14.5M | 0.633M |
| OpenCode | MiniMax M2.7 | 27.80% | 13.48M | 0.485M |
| OpenClaw | Kimi K2.5 | 25.40% | 9.63M | 0.379M |
*GPT-5.4 results may be unreliable because the Codex service was unstable during evaluation.
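The Tokens/QS column can be reproduced from the other two columns (the values below are taken from three rows of the table above):

```python
# Reproduce the Tokens/QS column: total tokens (in millions) divided by the
# QS score in percentage points. Lower means more token-efficient.
rows = [
    ("Claude Code / Claude Opus 4.6", 33.80, 4.93),
    ("Codex / GPT-5.2", 39.70, 15.46),
    ("OpenCode / Kimi K2.5", 30.30, 8.46),
]

for name, qs_pct, total_tokens_m in rows:
    tokens_per_qs = total_tokens_m / qs_pct
    print(f"{name}: {tokens_per_qs:.3f}M tokens per QS point")
```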
The coding agents still struggle on ATM-Bench-Hard, although they perform much better than various agentic memory baselines.
QS is reported with gpt-5-mini as the primary judge.
| Model | Setting | QS |
|---|---|---|
| GPT-5 | Raw | 72.12% |
| Qwen3-VL-8B-Instruct | Raw | 40.14% |
| Qwen3-VL-8B-Instruct | SGM | 27.98% |
| Qwen3-VL-8B-Instruct | D | 21.69% |
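QS is produced by an LLM judge. A minimal sketch of the scoring shape is below; the actual judge prompt, rubric, and any partial-credit scheme live in the codebase, and the binary verdict aggregation here is an assumption, not the benchmark's exact metric.

```python
# Sketch of LLM-as-judge scoring: a judge model (e.g. gpt-5-mini) returns a
# verdict string per question, which is mapped to a binary score and averaged.
# The judge call itself is omitted; only the aggregation shape is shown.

def parse_verdict(judge_output: str) -> int:
    """Map a judge verdict string to a binary score."""
    return 1 if judge_output.strip().upper() == "CORRECT" else 0

def aggregate_qs(verdicts) -> float:
    """QS as the percentage of questions judged correct."""
    scores = [parse_verdict(v) for v in verdicts]
    return 100.0 * sum(scores) / len(scores)

print(aggregate_qs(["CORRECT", "INCORRECT", "CORRECT", "correct"]))  # 75.0
```

Exact-match on the verdict string avoids the classic substring pitfall ("INCORRECT" contains "CORRECT").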
For NIAH, we compare the Qwen3-VL-8B-Instruct SGM and Raw settings at different haystack sizes.
| Model | Setting | QS | Avg. Context Tokens |
|---|---|---|---|
| Qwen3-VL-8B-Instruct | Raw, Oracle | 40.14% | 5.7k |
| Qwen3-VL-8B-Instruct | Raw, NIAH-25 | 25.43% | 15.9k |
| Qwen3-VL-8B-Instruct | Raw, NIAH-50 | 24.87% | 29.0k |
| Qwen3-VL-8B-Instruct | Raw, NIAH-100 | 10.90% | 56.0k |
| Qwen3-VL-8B-Instruct | SGM, Oracle | 27.98% | 4.6k |
| Qwen3-VL-8B-Instruct | SGM, NIAH-25 | 16.33% | 12.5k |
| Qwen3-VL-8B-Instruct | SGM, NIAH-50 | 15.77% | 23.9k |
| Qwen3-VL-8B-Instruct | SGM, NIAH-100 | 12.66% | 45.8k |
Existing long-term memory benchmarks focus primarily on dialogue history, failing to capture realistic personalized references grounded in lived experience. ATM-Bench addresses this gap with:
- Multimodal and multi-source data: Images, videos, emails
- Long-term horizon: ~4 years of personal memory
- Referential queries: Resolving personalized references (e.g., "Show me the moments where Grace was trying to be sneaky...")
- Evidence-grounded: Human-annotated QA pairs with ground-truth memory evidence
- Multi-evidence reasoning: Queries requiring evidence from multiple sources
- Conflicting evidence: Handling contradictory information
Memory Ingestion is decomposed into:
- Memory preprocessing (how each memory item is represented)
- Memory organization (how items are structured/linked)
We compare two preprocessing representations:
- Descriptive Memory (DM): each memory item is represented as one natural-language description.
- Schema-Guided Memory (SGM): each memory item is represented with fixed text-based key-value fields under a schema.
In SGM, schema fields are modality-aware. For example:

- Image/Video memory: `time`, `location`, `entities`, `ocr`, `tags`
- Email memory: `time`, `summary`, `body`
DM and SGM contain the same underlying information but use different formats.
In this codebase, DM is implemented as caption/description-style text, while SGM is implemented as schema-based key-value text fields.
For organization of the memory store:
- Piled Memory: items are stored without explicit links.
- Linked Memory: items are linked with inferred relations (graph structure); agentic systems can additionally update existing items during organization.
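A toy sketch of the two organizations follows; the link-inference rule here (shared entity implies a link) is a deliberately trivial stand-in, and the real system infers richer relations:

```python
# Piled vs. Linked memory stores. Link inference is a trivial stand-in:
# two items are linked if they share at least one entity.
from collections import defaultdict

items = {
    "m1": {"entities": {"Grace", "kite"}},
    "m2": {"entities": {"Grace", "cake"}},
    "m3": {"entities": {"passport"}},
}

# Piled Memory: a flat store, no explicit links.
piled = list(items)

# Linked Memory: add an edge between items sharing an entity.
links = defaultdict(set)
ids = list(items)
for i, a in enumerate(ids):
    for b in ids[i + 1:]:
        if items[a]["entities"] & items[b]["entities"]:
            links[a].add(b)
            links[b].add(a)

print(dict(links))  # {'m1': {'m2'}, 'm2': {'m1'}}
```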
In addition to end-to-end retrieval + generation evaluation, we provide NIAH (Needle In A Haystack):
- Each question is paired with a fixed evidence pool (`niah_evidence_ids`) that contains all ground-truth items.
- The rest of the pool is filled with realistic distractors.
- This isolates answer generation/reasoning quality from retrieval quality.
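The pool construction can be sketched as follows; `niah_evidence_ids` is the field named above, the pool size of 25 matches the NIAH-25 setting, and the uniform sampling and item IDs are illustrative assumptions:

```python
# Assemble a NIAH evidence pool: all ground-truth items plus sampled
# distractors, shuffled so the needles are hidden among the hay.
import random

def build_niah_pool(ground_truth_ids, distractor_ids, pool_size, seed=0):
    """Build a fixed evidence pool containing all ground-truth items."""
    assert len(ground_truth_ids) <= pool_size
    rng = random.Random(seed)
    n_fill = pool_size - len(ground_truth_ids)
    pool = list(ground_truth_ids) + rng.sample(distractor_ids, n_fill)
    rng.shuffle(pool)
    return pool

gt = ["img_0412", "email_0031"]           # hypothetical ground-truth IDs
distractors = [f"item_{i:04d}" for i in range(1000)]
pool = build_niah_pool(gt, distractors, pool_size=25)
print(len(pool))  # 25
```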
```bash
conda create -n atmbench python=3.11 -y
conda activate atmbench
pip install -r requirements.txt
pip install -e .
```

Set API keys via environment variables:

```bash
export OPENAI_API_KEY="your-key"
export VLLM_API_KEY="your-key"
```

Or use local key files (gitignored):

- `api_keys/.openai_key`
- `api_keys/.vllm_key`
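A small helper matching this convention (environment variable first, then the gitignored key file) might look like the following; the function name is hypothetical and not part of the codebase:

```python
# Resolve an API key: prefer the environment variable, fall back to a
# local gitignored key file, and fail loudly if neither is available.
import os
from pathlib import Path

def load_api_key(env_var: str, key_file: str) -> str:
    key = os.environ.get(env_var)
    if key:
        return key.strip()
    path = Path(key_file)
    if path.exists():
        return path.read_text().strip()
    raise RuntimeError(f"Set {env_var} or create {key_file}")

# openai_key = load_api_key("OPENAI_API_KEY", "api_keys/.openai_key")
# vllm_key = load_api_key("VLLM_API_KEY", "api_keys/.vllm_key")
```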
Before running MMRAG or Oracle, generate the image/video `batch_results.json` files:

```bash
# Optional but recommended: preload reverse-geocoding cache.
# Cache files are keyed by media filename stem, so the cache bundle must match
# the current image/video filenames.
bash scripts/memory_processor/image/copy_gps_cache.sh output/image/qwen3vl2b/cache
bash scripts/memory_processor/video/copy_gps_cache.sh output/video/qwen3vl2b/cache

# Generate memory itemization results
bash scripts/memory_processor/image/memory_itemize/run_qwen3vl2b.sh
bash scripts/memory_processor/video/memory_itemize/run_qwen3vl2b.sh
```

```bash
# MMRAG (runs both ATM-Bench and ATM-Bench-Hard)
bash scripts/QA_Agent/MMRAG/run.sh

# Oracle (upper bound; raw multimodal evidence)
bash scripts/QA_Agent/Oracle/run_oracle_qwen3vl8b_raw.sh
```
- Core baselines (`MMRAG`, `Oracle`, `NIAH`) are tested in the main `atmbench` environment.
- Third-party memory-system baselines in this repo include: `A-Mem`, `HippoRAG2`, `mem0`, `MemoryOS`.
- We strongly recommend running `MemoryOS` in a separate conda environment.
- `A-Mem`, `HippoRAG2`, and `mem0` have been tested as compatible with the core baseline environment, but separate environments are still safer for reproducibility and dependency isolation.
- Setup references for these baselines are under `third_party/`:
  - `third_party/A-mem/`
  - `third_party/HippoRAG/`
  - `third_party/mem0/`
  - `third_party/MemoryOS/`
- OpenClaw support is planned; we will shortly release the evaluation setup for all General-Purpose Agents (Claude Code, Codex, OpenCode, OpenClaw) on ATM-Bench.
For detailed setup, data layout, and reproducibility settings, see the documentation under `docs/`.
```
ATMBench/
├── memqa/        # Core memory QA implementation
├── scripts/      # Experiment scripts
├── docs/         # Documentation
├── data/         # Data directory (user-provided)
├── third_party/  # Vendored agentic memory systems
└── output/       # Experiment outputs (gitignored)
```
- `docs/README.md` - Getting started guide
- `docs/data.md` - Data format and preparation
- `docs/baseline.md` - Baseline implementations
- `docs/niah.md` - NIAH protocol and usage
- `docs/metrics.md` - Evaluation metrics
- `docs/reproducibility.md` - Reproduction instructions
- `docs/repo_structure.md` - Repository organization
If you use ATM-Bench in your research, please cite:

```bibtex
@article{mei2026atm,
  title={According to Me: Long-Term Personalized Referential Memory QA},
  author={Mei, Jingbiao and Chen, Jinghong and Yang, Guangyu and Hou, Xinyu and Li, Margaret and Byrne, Bill},
  journal={arXiv preprint arXiv:2603.01990},
  year={2026},
  url={https://arxiv.org/abs/2603.01990},
  doi={10.48550/arXiv.2603.01990}
}
```

- Paper: https://arxiv.org/abs/2603.01990
- Dataset: https://huggingface.co/datasets/Jingbiao/ATM-Bench
- Code: https://github.com/JingbiaoMei/ATM-Bench
- Issues: https://github.com/JingbiaoMei/ATM-Bench/issues
This project is licensed under the MIT License - see the LICENSE file for details.

