Releases: Parslee-ai/statebench
Memgine v1.1.0 - Deterministic Memory Engine
Memgine: A Deterministic Memory Engine for Stateful AI Agents
Implements the full state-based specification on top of StateBench v1.0.
Key Results (3-run mean ± std)
| Configuration | Decision Accuracy |
|---|---|
| memgine / Opus 4.6 | 97.3% ± 0.5% |
| memgine / GPT-5.2 | 95.8% ± 0.4% |
| state_based_no_supersession / GPT-5.2 | 90.7% ± 0.3% |
| transcript_replay / GPT-5.2 | 81.2% ± 0.8% |
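A minimal sketch of how a "3-run mean ± std" figure like the ones above can be computed. The per-run values below are hypothetical placeholders, not numbers from this release, and the use of the sample standard deviation (rather than the population one) is an assumption, since the release notes do not say which was used.

```python
from statistics import mean, stdev


def summarize(run_accuracies):
    """Return (mean, sample standard deviation) of per-run accuracies in percent."""
    return mean(run_accuracies), stdev(run_accuracies)


# Hypothetical per-run accuracies (percent); NOT taken from the release.
runs = [90.0, 91.0, 92.0]
m, s = summarize(runs)
print(f"{m:.1f}% ± {s:.1f}%")
```

With only three runs the sample and population standard deviations differ noticeably, so it is worth stating which one a reported figure uses.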
What's New
- Query-relevance sorting — most relevant facts appear last, exploiting LLM recency attention
- Engine-level access control — restricted/scoped facts never reach the model (leak rate: 13% → 0%)
- Adaptive inline repair — stale conclusions placed next to corrected parent facts
- Compaction architecture — threshold-based with layer-specific rules, validated at 2.2× compression
- Test split validation — 92.6% (GPT-5.2) and 96.0% (Opus 4.6) on held-out data
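The query-relevance sorting idea from the list above can be sketched as follows. This is an illustration only: the `relevance` function is a hypothetical word-overlap score chosen for simplicity, and nothing here reflects how Memgine actually scores or stores facts.

```python
def relevance(query, fact):
    """Hypothetical relevance score: number of lowercase words shared with the query."""
    query_words = set(query.lower().split())
    fact_words = set(fact.lower().split())
    return len(query_words & fact_words)


def order_facts(query, facts):
    """Sort facts in ascending relevance so the most relevant fact comes last."""
    # sorted() is stable, so facts with equal scores keep their original order.
    return sorted(facts, key=lambda fact: relevance(query, fact))


facts = [
    "the shipping address is 12 Elm Street",
    "the customer prefers email contact",
    "the invoice is due on Friday",
]
ordered = order_facts("when is the invoice due", facts)
for fact in ordered:
    print(fact)
```

Placing the highest-scoring fact at the end of the context puts it closest to the question, which is the position the release notes describe as benefiting from recency attention.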
Paper
See docs/memgine-deterministic-memory-engine.pdf for the full paper.
Key Finding
Architectural enforcement beats prompt engineering. When restricted facts are filtered by the engine rather than guarded by system prompt instructions, information leakage drops from 13% to 0%.
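To make the distinction concrete, here is a minimal sketch of engine-level filtering, assuming a hypothetical `Fact` record with an `allowed_roles` field. The names and the role model are illustrative assumptions, not Memgine's API. The point is that a restricted fact is dropped before the prompt is assembled, so the model cannot leak what it never receives.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    text: str
    # Hypothetical access model: an empty set means visible to every role.
    allowed_roles: frozenset = frozenset()


def visible_facts(facts, role):
    """Keep only the facts the given role is allowed to see."""
    return [f for f in facts if not f.allowed_roles or role in f.allowed_roles]


def build_context(facts, role):
    """Assemble the model context from permitted facts only."""
    return "\n".join(f.text for f in visible_facts(facts, role))


facts = [
    Fact("The office closes at 6 pm."),
    Fact("The CFO's salary is confidential figure X.", frozenset({"hr"})),
]
support_context = build_context(facts, role="support")
hr_context = build_context(facts, role="hr")
print(support_context)
```

The alternative, passing every fact to the model and asking it in the system prompt not to reveal restricted ones, depends on the model complying; the filter above does not.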
StateBench v1.0
A benchmark for evaluating LLM memory systems on state-sensitive reasoning.
Dataset
- 1,400 timelines across 14 evaluation tracks
- Train/dev/test/hidden splits with canary contamination detection
- Covers supersession, commitment durability, authority hierarchy, privacy, and more
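The release notes do not describe how the canary check works, so the following is only a generic sketch of canary-based contamination detection. The canary string and function name are hypothetical. The idea is that a unique marker is embedded in the benchmark data, and a model that reproduces it has probably seen that data during training.

```python
# Hypothetical canary value; the real StateBench canary is not given in the release.
CANARY = "statebench-canary-0000"


def contains_canary(model_output, canary=CANARY):
    """Return True if the model output reproduces the canary string (case-insensitive)."""
    return canary.lower() in model_output.lower()


clean = contains_canary("The meeting was moved to Thursday.")
leaked = contains_canary("... STATEBENCH-CANARY-0000 ...")
print(clean, leaked)
```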
Baseline Results (gpt-5.2, 50 timelines)
| Baseline | Decision Accuracy | SFRR |
|---|---|---|
| state_based | 80.3% | 34.4% |
| rolling_summary | 72.1% | 21.3% |
| fact_extraction | 63.9% | 27.9% |
| transcript_replay | 60.7% | 24.6% |
| no_memory | 26.2% | 19.7% |
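For readers unfamiliar with the terms, here is a rough sketch of what a state-based memory with supersession might look like, as opposed to replaying the full transcript. The `StateStore` class and its methods are hypothetical and are not the `state_based` baseline's implementation: each key holds one current value, and a newer assertion supersedes the older one instead of sitting alongside it.

```python
class StateStore:
    """Hypothetical key-value memory in which newer facts supersede older ones."""

    def __init__(self):
        self._current = {}      # key -> (value, turn)
        self._superseded = []   # list of (key, old_value, old_turn)

    def assert_fact(self, key, value, turn):
        """Record a fact; any previous value for the key is marked superseded."""
        if key in self._current:
            old_value, old_turn = self._current[key]
            self._superseded.append((key, old_value, old_turn))
        self._current[key] = (value, turn)

    def current(self):
        """Return only the currently valid value for each key."""
        return {key: value for key, (value, _turn) in self._current.items()}

    def superseded(self):
        """Return the history of values that have been replaced."""
        return list(self._superseded)


store = StateStore()
store.assert_fact("delivery_date", "Monday", turn=1)
store.assert_fact("delivery_date", "Wednesday", turn=5)
print(store.current())
```

A transcript replay would present both "Monday" and "Wednesday" to the model and leave it to work out which one still holds; the store above exposes only the current value.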
Quick Start
pip install statebench
# Generate benchmark
statebench generate --tracks supersession commitment_durability --count 100
# Evaluate baseline
statebench evaluate -d data.jsonl -b state_based -m gpt-4o -p openai
# Compare all baselines
statebench compare -d data.jsonl -m gpt-4o -l 50
Tracks
- supersession, supersession_detection
- commitment_durability, interruption_resumption
- scope_permission, environmental_freshness
- authority_hierarchy, enterprise_privacy
- identity, time_decay, confidentiality
- contradiction, detection, adversarial