This doc describes how to reproduce ATMBench runs in a paper-compatible way.
Recommended setup:

```bash
conda create -n atmbench python=3.11 -y
conda activate atmbench
pip install -r requirements.txt
pip install -e .
```

See docs/data.md for the expected local layout:

```
data/atm-bench/                              (benchmark files)
data/raw_memory/                             (your raw memory)
output/image/qwen3vl2b/batch_results.json
output/video/qwen3vl2b/batch_results.json
```
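Before running anything, it can help to confirm that layout is in place. A minimal sanity-check loop (the paths mirror the list above; the script itself is not part of the repo):

```shell
# Hypothetical sanity check: confirm the layout from docs/data.md exists.
required=(
  "data/atm-bench"
  "data/raw_memory"
  "output/image/qwen3vl2b/batch_results.json"
  "output/video/qwen3vl2b/batch_results.json"
)
missing=0
for p in "${required[@]}"; do
  if [ ! -e "$p" ]; then
    echo "missing: $p"
    missing=1
  fi
done
[ "$missing" -eq 0 ] && echo "layout OK" || true
```

Run it from the repo root; any `missing:` line means the corresponding path still needs to be created or downloaded.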
- OpenAI judge/models: set `OPENAI_API_KEY` (or use `api_keys/.openai_key`)
- vLLM answerers: set `VLLM_ENDPOINT` (default: `http://127.0.0.1:8000/v1/chat/completions`)
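The wiring above can be sketched as follows. The API key value and the model name in the probe are placeholders, and the probe assumes a vLLM server exposing the standard OpenAI-compatible chat-completions route:

```shell
# Sketch of the environment wiring; key value and model name are placeholders.
export OPENAI_API_KEY="sk-..."   # or put the key in api_keys/.openai_key
export VLLM_ENDPOINT="http://127.0.0.1:8000/v1/chat/completions"

# Reachability probe; prints a fallback message instead of failing hard.
curl -s "$VLLM_ENDPOINT" \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-vl-8b", "messages": [{"role": "user", "content": "ping"}]}' \
  || echo "vLLM server not reachable at $VLLM_ENDPOINT"
```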
Run this first so the baseline wrappers can read `output/{image,video}/qwen3vl2b/batch_results.json`:

```bash
# GPS cache files are keyed by media filename stem.
bash scripts/memory_processor/image/copy_gps_cache.sh output/image/qwen3vl2b/cache
bash scripts/memory_processor/video/copy_gps_cache.sh output/video/qwen3vl2b/cache
bash scripts/memory_processor/image/memory_itemize/run_qwen3vl2b.sh
bash scripts/memory_processor/video/memory_itemize/run_qwen3vl2b.sh
```

The following commands run both ATM-bench and ATM-bench-hard:
```bash
bash scripts/QA_Agent/MMRAG/run.sh
bash scripts/QA_Agent/Oracle/run_oracle_qwen3vl8b_raw.sh
bash scripts/QA_Agent/Oracle/run_oracle_gpt5.sh
bash scripts/QA_Agent/NIAH/run_niah_qwen3vl8b_SGM.sh
bash scripts/QA_Agent/NIAH/run_niah_gpt5_SGM.sh
bash scripts/QA_Agent/NIAH/run_niah_qwen3vl8b_raw.sh
bash scripts/QA_Agent/NIAH/run_niah_gpt5_raw.sh
```

For additional memory-agent baselines (HippoRAG 2, MemoryOS, A‑Mem, Mem0), see docs/baseline.md.
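If you want all four NIAH variants in one pass, a simple driver loop (not part of the repo; the `logs/` directory is my own convention) keeps one log per script:

```shell
# Hypothetical driver: run each NIAH baseline sequentially with a log per
# script; a failing script is reported but does not abort the remaining runs.
mkdir -p logs
for s in \
  scripts/QA_Agent/NIAH/run_niah_qwen3vl8b_SGM.sh \
  scripts/QA_Agent/NIAH/run_niah_gpt5_SGM.sh \
  scripts/QA_Agent/NIAH/run_niah_qwen3vl8b_raw.sh \
  scripts/QA_Agent/NIAH/run_niah_gpt5_raw.sh
do
  name="$(basename "$s" .sh)"
  bash "$s" > "logs/${name}.log" 2>&1 || echo "failed: $name"
done
```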
Environment note:
- `MemoryOS` is strongly recommended to run in a separate conda environment.
- `HippoRAG 2`, `A‑Mem`, and `Mem0` are tested to be compatible with the core baseline environment, but separate environments are still safer for reproducibility.
Agent-system note:
- OpenClaw, OpenCode, and Codex baselines are compatible with this repo’s evaluation workflow, but each requires its own third-party software installation.
- `open_end` judge model: `gpt-5-mini` (default in this repo)
To summarize NIAH runs into a Markdown table:
```bash
python scripts/QA_Agent/NIAH/summarize_niah_results.py \
  --output-root output/QA_Agent/NIAH/hard
```

- API-based evaluation is not perfectly deterministic.
- Prefer keeping the same model versions, judge model, and decoding settings for comparisons.
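Since API-based evaluation is not perfectly deterministic, it helps to snapshot the run configuration next to the results so later comparisons can verify the settings matched. A minimal sketch; the file name and fields are my own convention, not something the repo produces:

```shell
# Hypothetical provenance snapshot: record versions and settings alongside the
# results of a run.
run_dir="output/QA_Agent/NIAH/hard"
mkdir -p "$run_dir"
{
  echo "date: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
  echo "python: $(python --version 2>&1)"
  echo "judge_model: gpt-5-mini"   # the repo default noted above
} > "$run_dir/provenance.txt"
```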