Releases: Parslee-ai/statebench

Memgine v1.1.0 - Deterministic Memory Engine

21 Feb 19:50

Memgine: A Deterministic Memory Engine for Stateful AI Agents

Implements the full state-based specification on top of StateBench v1.0.

Key Results (3-run mean ± std)

| Configuration | Decision Accuracy |
|---|---|
| memgine / Opus 4.6 | 97.3% ± 0.5% |
| memgine / GPT-5.2 | 95.8% ± 0.4% |
| state_based_no_supersession / GPT-5.2 | 90.7% ± 0.3% |
| transcript_replay / GPT-5.2 | 81.2% ± 0.8% |
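
For readers reproducing the "mean ± std" format, the following is a minimal sketch of how three per-run accuracies could be reduced to one such figure. The function name `summarize_runs` is hypothetical, the input numbers are made up for illustration, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the release notes do not say which estimator was used.

```python
from statistics import mean, stdev


def summarize_runs(accuracies):
    """Format per-run accuracies (fractions in [0, 1]) as 'mean ± std' percentages."""
    avg = mean(accuracies)
    # Sample standard deviation (n - 1 denominator). This is an assumption;
    # the release does not state which estimator it reports.
    spread = stdev(accuracies)
    return f"{avg * 100:.1f}% ± {spread * 100:.1f}%"


# Made-up per-run accuracies for illustration only; not figures from this release.
example_runs = [0.90, 0.92, 0.94]
summary = summarize_runs(example_runs)
print(summary)
```

With only three runs the spread estimate is coarse, so small differences between configurations are best read alongside the per-run numbers in the paper.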

What's New

  • Query-relevance sorting — most relevant facts appear last, exploiting LLM recency attention
  • Engine-level access control — restricted/scoped facts never reach the model (leak rate: 13% → 0%)
  • Adaptive inline repair — stale conclusions placed next to corrected parent facts
  • Compaction architecture — threshold-based with layer-specific rules, validated at 2.2× compression
  • Test split validation — 92.6% (GPT-5.2) and 96.0% (Opus 4.6) on held-out data
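
The query-relevance ordering described in the first bullet can be sketched as follows. This is an illustrative sketch only, not Memgine's actual code: the `Fact` class, the `relevance` score, and `order_for_prompt` are hypothetical names, and how the engine computes relevance is not specified in these notes.

```python
from dataclasses import dataclass


@dataclass
class Fact:
    text: str
    relevance: float  # hypothetical score; higher means more relevant to the query


def order_for_prompt(facts):
    """Sort ascending by relevance so the most relevant fact is rendered last."""
    return sorted(facts, key=lambda fact: fact.relevance)


# Illustrative facts with made-up scores.
facts = [
    Fact("Budget approved at 50k", 0.9),
    Fact("Office closed on Friday", 0.1),
    Fact("Vendor shortlist has three names", 0.5),
]

ordered = order_for_prompt(facts)
context = "\n".join(f"- {fact.text}" for fact in ordered)
print(context)
```

Placing the highest-scoring fact at the end of the context puts it closest to the question that follows it in the prompt, which is the position the release notes describe as favored by recency.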

Paper

See docs/memgine-deterministic-memory-engine.pdf for the full paper.

Key Finding

Architectural enforcement beats prompt engineering. When restricted facts are filtered by the engine rather than guarded by system prompt instructions, information leakage drops from 13% to 0%.
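
A minimal sketch of what filtering at the engine level could look like, as opposed to instructing the model not to reveal something. All names here (`ScopedFact`, `allowed_roles`, `build_context`) are hypothetical and chosen for illustration; the release does not describe Memgine's data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScopedFact:
    text: str
    allowed_roles: frozenset  # hypothetical: roles permitted to see this fact


def build_context(facts, requester_role):
    """Assemble prompt context from only the facts the requester may see."""
    # Restricted facts are dropped here, before any prompt text exists,
    # so the model cannot leak what it was never given.
    visible = [fact for fact in facts if requester_role in fact.allowed_roles]
    return "\n".join(fact.text for fact in visible)


# Illustrative facts, not taken from the benchmark.
facts = [
    ScopedFact("Launch date is 1 March", frozenset({"manager", "contractor"})),
    ScopedFact("Acquisition talks are ongoing", frozenset({"manager"})),
]

contractor_context = build_context(facts, "contractor")
manager_context = build_context(facts, "manager")
print(contractor_context)
```

The design point is that the guarantee comes from the code path rather than from model compliance: a fact excluded from the context cannot appear in the response regardless of how the prompt is phrased.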

StateBench v1.0

25 Dec 02:18

A benchmark for evaluating LLM memory systems on state-sensitive reasoning.

Dataset

  • 1,400 timelines across 14 evaluation tracks
  • Train/dev/test/hidden splits with canary contamination detection
  • Covers supersession, commitment durability, authority hierarchy, privacy, and more
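
Canary-based contamination detection generally works by embedding a unique marker string in the benchmark files and then checking whether that marker shows up in model output. The sketch below shows the general idea under stated assumptions: the `CANARY` value and both function names are hypothetical, and the notes above do not describe how StateBench implements its own check.

```python
# Hypothetical marker for illustration; not the canary used by StateBench.
CANARY = "EXAMPLE-CANARY-0000-DO-NOT-TRAIN"


def contains_canary(text, canary=CANARY):
    """Return True if the canary string appears in the text (case-insensitive)."""
    return canary.lower() in text.lower()


def flag_contaminated(outputs, canary=CANARY):
    """Return the indices of model outputs that reproduce the canary."""
    return [i for i, out in enumerate(outputs) if contains_canary(out, canary)]


# Made-up model outputs for illustration.
outputs = [
    "The current approver is Dana.",
    "As seen in example-canary-0000-do-not-train, the limit is 10.",
]
flagged = flag_contaminated(outputs)
print(flagged)
```

A model that reproduces the marker has likely seen the benchmark files during training, so scores from that model on the affected split should be treated with caution.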

Baseline Results (gpt-5.2, 50 timelines)

| Baseline | Decision Accuracy | SFRR |
|---|---|---|
| state_based | 80.3% | 34.4% |
| rolling_summary | 72.1% | 21.3% |
| fact_extraction | 63.9% | 27.9% |
| transcript_replay | 60.7% | 24.6% |
| no_memory | 26.2% | 19.7% |

Quick Start

pip install statebench

# Generate benchmark
statebench generate --tracks supersession commitment_durability --count 100

# Evaluate baseline
statebench evaluate -d data.jsonl -b state_based -m gpt-4o -p openai

# Compare all baselines
statebench compare -d data.jsonl -m gpt-4o -l 50

Tracks

  • supersession, supersession_detection
  • commitment_durability, interruption_resumption
  • scope_permission, environmental_freshness
  • authority_hierarchy, enterprise_privacy
  • identity, time_decay, confidentiality
  • contradiction, detection, adversarial