We Gave AI a Mirror. Now It Measures What It Believes.
Epistemic infrastructure for AI — measurement, memory, and calibration across sessions.
Empirica tracks what AI knows, gates what it does, and compounds learning across session boundaries. It measures the gap between what AI predicts and what's true — making AI agents measurably more reliable.
Training & Guides | CLI Reference | Architecture
Important: Empirica is an AI measurement framework. It has no cryptocurrency, token, coin, or blockchain component. Any token using the Empirica name (including "$EMPIRICA" on Solana) is unauthorized and not affiliated with this project or Empirica AI GmbH.
AI coding agents today have no self-awareness about what they know:
- Forgets between sessions — same questions, same dead ends, every time
- Acts before understanding — edits your code without knowing the architecture
- Can't tell you when it's guessing — no distinction between knowledge and confabulation
- No audit trail — reasoning evaporates with the context window
| Capability | What You Experience |
|---|---|
| Measures before acting | AI investigates your codebase before touching it. The Sentinel gate blocks edits until understanding is demonstrated |
| Remembers across sessions | Findings, dead-ends, and learnings persist in a 4-layer memory system. Session 3 starts where Session 2 left off |
| Prevents confident mistakes | The CHECK gate uses domain-aware thresholds scaled by criticality — cybersec/high is stricter than default/low |
| Shows confidence in real-time | Live statusline in your terminal: [empirica] ⚡94% ↕70% │ 🎯3 │ POST 🔍92% │ K:95% C:92% |
| Calibrates against reality | Three-vector model: self-assessed, observed (from deterministic checks), and AI-reasoned grounded state with rationale. Domain compliance loops iterate until all checks pass |
| Tracks your codebase | Temporal entity model auto-extracts functions, classes, and imports from every file edit — the AI knows what's alive and what's stale |
| Works through natural language | You describe tasks normally. The AI operates the measurement system automatically |
You talk to your AI normally. Empirica works in the background:
You: "Fix the authentication bug in the login flow"
Empirica: [AI investigates → logs findings → passes Sentinel gate → implements fix → measures learning]
You see: ⚡87% ↕70% │ 🎯1 │ POST 🔍85% │ K:88% C:82% │ Δ +K
You direct. The AI measures.
Empirica's CLI has 150+ commands spanning investigation, measurement, calibration, and memory — like a cockpit instrument panel. You don't need to learn any of them. The AI reads the instruments, operates the controls, and reports back in natural language. The statusline gives you the flight data at a glance.
For power users, direct CLI access is always available: empirica goals-list, empirica calibration-report, empirica project-search --task "...", and more.
Learn the full workflow: getempirica.com has interactive training, guides, and deep explanations of every concept.
pip install empirica
empirica setup-claude-code

Then just start working. The hooks, Sentinel, system prompt, statusline, and MCP server are all configured automatically. See Claude Code Setup for details.
Already have Claude Code configured? Use --force to replace your default Claude Code settings with Empirica's epistemic hooks. Without --force, setup only writes files that don't already exist — so if you've already used Claude Code, the default internals stay in place and Empirica's hooks won't activate.
empirica setup-claude-code --force

--force replaces hooks in settings.json but only removes Empirica's own hooks; hooks from other plugins (Railway, Superpowers, etc.) are preserved.
Homebrew (macOS)
brew tap nubaeon/tap
brew install empirica
empirica setup-claude-code

Docker
# Security-hardened Alpine image (~276MB, recommended)
docker pull nubaeon/empirica:1.9.2-alpine
# Standard image (Debian slim, ~414MB)
docker pull nubaeon/empirica:1.9.2
# Run
docker run -it -v $(pwd)/.empirica:/data/.empirica nubaeon/empirica:1.9.2 /bin/bash

Manual / Other AI Platforms
pip install empirica
pip install empirica-mcp # MCP Server (for Cursor, Cline, etc.)
cd your-project && empirica project-init

The CLI works standalone on any platform. The full epistemic workflow (epistemic transactions, Sentinel, calibration) requires loading the system prompt into your AI. See System Prompts for Claude, Copilot, Gemini, Qwen, and Roo Code.
empirica onboard # Interactive walkthrough of the full workflow

Or just start working: with Claude Code hooks active, the AI manages the epistemic workflow automatically.
Empirica works through nested abstraction layers:
Plan
├── Transaction 1 (Goal A)
│   ├── NOETIC: investigate, search, read → findings, unknowns, dead-ends
│   ├── CHECK: Sentinel gate → proceed / investigate more
│   ├── PRAXIC: implement, write, commit → goals completed
│   └── POSTFLIGHT: measure learning delta → persists to memory
└── Transaction 2 (Goal B, informed by T1's findings)
    └── ...
Plans decompose into transactions — one per goal or Claude Code task. Each transaction is a noetic-praxic loop: investigate first (noetic), then act (praxic), with the Sentinel gating the transition. Along the way, the AI collects and reads artifacts (findings, unknowns, assumptions, dead-ends, decisions) while using semantic search to surface relevant epistemic patterns and anti-patterns from the project's history. Top artifacts are ranked by confidence and fed into each project's MEMORY.md as a hot cache.
PREFLIGHT ────────► CHECK ────────► POSTFLIGHT
    │                 │                 │
 Baseline          Sentinel          Learning
Assessment           Gate             Delta
    │                 │                 │
"What do I       "Am I ready       "What did I
 know now?"        to act?"          learn?"
- PREFLIGHT: AI assesses its knowledge state before starting work.
- CHECK: Sentinel gate validates readiness before allowing code edits.
- POSTFLIGHT: AI measures what it learned, creating a delta that persists.
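The three checkpoints can be sketched in a few lines of Python. This is an illustrative sketch, not Empirica's implementation: the function names, the dict-of-floats vector shape, and the rule that the gate compares the `know` vector against a single threshold are all assumptions made for the example.

```python
# Illustrative sketch only: names and data shapes are assumptions, not Empirica's API.

def check_gate(vectors: dict[str, float], know_threshold: float = 0.70) -> str:
    """Return 'proceed' when the know vector meets the threshold, else 'investigate'."""
    return "proceed" if vectors.get("know", 0.0) >= know_threshold else "investigate"


def learning_delta(preflight: dict[str, float], postflight: dict[str, float]) -> dict[str, float]:
    """Per-vector change between the PREFLIGHT baseline and the POSTFLIGHT assessment."""
    return {
        name: round(postflight[name] - preflight[name], 4)
        for name in preflight
        if name in postflight
    }


preflight = {"know": 0.55, "context": 0.60}
postflight = {"know": 0.88, "context": 0.82}

print(check_gate(preflight))    # know is below 0.70, so the sketch asks for more investigation
print(check_gate(postflight))
print(learning_delta(preflight, postflight))
```

The delta is computed only over vectors present in both assessments, so a vector that was never baselined does not show up as spurious "learning".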
With Claude Code hooks enabled, you see the AI's epistemic state in real-time:
[empirica] ⚡94% ↕70% │ 🎯3 ❓12/5 │ POST 🔍92% │ K:95% C:92% │ Δ +K +C
| Signal | Meaning |
|---|---|
| ⚡94% | Overall epistemic confidence |
| ↕70% | Sentinel threshold (know gate) — user-facing only |
| 🎯3 ❓12/5 | Open goals (3), unknowns (12 total, 5 blocking) |
| POST 🔍92% | Transaction phase + work state (🔍 investigating / 🔨 acting) with composite score |
| K:95% C:92% | Knowledge and Context vectors (color-coded by gap to threshold) |
| Δ +K +C | Learning delta (POSTFLIGHT only) — which vectors improved |
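If you want to consume the statusline from a script, the signals in the table can be pulled out with regular expressions. The parser below is a hypothetical helper written for this README's example string; the field names are assumptions, and Empirica itself may expose this data in a more direct form.

```python
import re

# Hypothetical parser for the statusline format shown above; field names are assumptions.
STATUSLINE_PATTERNS = {
    "confidence": r"⚡\ufe0f?(\d+)%",
    "threshold": r"↕\ufe0f?(\d+)%",
    "open_goals": r"🎯\ufe0f?(\d+)",
    "knowledge": r"K:(\d+)%",
    "context": r"C:(\d+)%",
}


def parse_statusline(line: str) -> dict[str, int]:
    """Extract the numeric fields that are present in a statusline string."""
    fields: dict[str, int] = {}
    for name, pattern in STATUSLINE_PATTERNS.items():
        match = re.search(pattern, line)
        if match:
            fields[name] = int(match.group(1))
    # Unknowns are reported as total/blocking, e.g. "❓12/5".
    unknowns = re.search(r"❓\ufe0f?(\d+)/(\d+)", line)
    if unknowns:
        fields["unknowns_total"] = int(unknowns.group(1))
        fields["unknowns_blocking"] = int(unknowns.group(2))
    return fields


line = "[empirica] ⚡94% ↕70% │ 🎯3 ❓12/5 │ POST 🔍92% │ K:95% C:92% │ Δ +K +C"
print(parse_statusline(line))
```

Fields that are absent from a given line (for example, unknowns outside an active transaction) are simply omitted from the result rather than defaulted.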
These vectors emerged from 600+ real working sessions across multiple AI systems. They measure the dimensions that consistently predict success or failure in complex tasks.
| Tier | Vector | What It Measures |
|---|---|---|
| Gate | engagement | Is the AI actively processing or disengaged? |
| Foundation | know | Domain knowledge depth |
| Foundation | do | Execution capability |
| Foundation | context | Access to relevant information |
| Comprehension | clarity | How clear is the understanding? |
| Comprehension | coherence | Do the pieces fit together? |
| Comprehension | signal | Signal-to-noise in available information |
| Comprehension | density | Information richness |
| Execution | state | Current working state |
| Execution | change | Rate of progress/change |
| Execution | completion | Task completion level |
| Execution | impact | Significance of the work |
| Meta | uncertainty | Explicit doubt tracking |
Deep dive: Epistemic Vectors Explained
Empirica doesn't replace or reinvent anything Claude Code already does. Claude Code owns tasks, plans, memory, and projects. Empirica adds the measurement layer on top:
| Claude Code Does | Empirica Adds |
|---|---|
| Task management | Epistemic goals with measurable completion |
| Plan mode | Investigation phase with Sentinel gating — no edits until understanding is verified |
| MEMORY.md | Auto-curated hot cache ranked by epistemic confidence |
| Context window | 4-layer memory that survives compaction and persists across sessions |
| Code editing | Grounded calibration — was the AI's confidence justified by test results? |
| Subagent spawning | Bounded autonomy with delegated work counting and budget tracking |
The result: Claude Code's native capabilities, enhanced with measurement, gating, and calibration feedback that compounds over time.
| Platform | Integration Level | What You Get |
|---|---|---|
| Claude Code | Full (production) | Hooks, Sentinel gate, skills, agents, statusline, MCP |
| Cursor, Cline | MCP server | Epistemic transaction workflow, memory, calibration via MCP tools |
| Gemini CLI, Copilot | Experimental | System prompt + CLI |
| Any AI | CLI + prompt | Full measurement via CLI commands and system prompt |
| Resource | What It Covers |
|---|---|
| getempirica.com | Training course, interactive guides, deep explanations |
| Natural Language Guide | How to collaborate with AI using Empirica |
| Getting Started | First-time setup and concepts |
| CLI Reference | All 150+ commands documented |
| Architecture | Technical reference for contributors |
| System Prompts | AI prompts for Claude, Copilot, Gemini, Qwen, Roo |
| Project | Description | Status |
|---|---|---|
| Empirica | Core measurement system — epistemic transactions, Sentinel, calibration, 13 vectors | Open source |
| Empirica Iris | Epistemic browser automation with SVG spatial indexing — Sentinel gating for visual interactions | Open source |
| Docpistemic | Epistemic documentation coverage assessment — know what your docs know | Open source |
| Breadcrumbs | Survive context compacts with git notes — dead simple session continuity | Open source |
| Empirica Cortex | Cross-project intelligence layer — serves verified predictions and accumulated learnings to condition future work | Proprietary |
| Empirica Workspace | Entity Knowledge Graph, Epistemic Prompt Engine, CRM, portfolio dashboard | Proprietary |
Building something with Empirica? Open an issue to get listed.
Three-circle bootstrap aggregator (v0.6 spec) — replaces uniform-decay artifact surfacing with a model that captures different kinds of relevance, not just recency.
- Circle 1, `active_state`: recency-decayed via per-type half-lives (∞ for in-progress goals, 30d for findings/decisions, 14d for dead-ends/mistakes). Tiebreaker only; the circle is small.
- Circle 2, `persistent_reference`: never decays, fixed budgets. Decisions with active outcome (rationale still load-bearing), verified or falsified assumptions (now ground truth), sources (citation base).
- Circle 3, `topic_relevant_backlog`: Qdrant cosine similarity to active topic. Surfaces open backlog plus completed-on-topic / resolved-on-topic / dead-ends-on-topic for anti-clobber.
- Active topic detection is a deterministic 3-step fallback: `transaction.task_context` + `active_goal.objective` → recent (7d) high-impact findings → none.
- Public API: `build_bootstrap_payload()` consumed by CLI hooks (post-compact / session-init), daemon `GET /api/v1/bootstrap`, MCP tool `mcp__empirica__bootstrap_context`, and the new `empirica bootstrap-context` CLI verb.
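The per-type half-lives in Circle 1 correspond to ordinary exponential decay. The sketch below shows that arithmetic under stated assumptions: the half-life values come from the list above, but the dictionary keys, the function name, and the exact formula Empirica uses are not confirmed by this README.

```python
import math

# Half-lives in days, taken from the Circle 1 description; keys and layout are assumptions.
HALF_LIFE_DAYS = {
    "goal_in_progress": math.inf,  # never decays
    "finding": 30.0,
    "decision": 30.0,
    "dead_end": 14.0,
    "mistake": 14.0,
}


def recency_weight(artifact_type: str, age_days: float) -> float:
    """Exponential half-life decay: the weight halves once per half-life."""
    half_life = HALF_LIFE_DAYS[artifact_type]
    return 0.5 ** (age_days / half_life)


for kind, age in [("finding", 30), ("dead_end", 28), ("goal_in_progress", 365)]:
    print(kind, age, recency_weight(kind, age))
```

An infinite half-life makes the exponent zero, so in-progress goals keep a weight of 1.0 regardless of age without needing a special case.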
Bootstrap injection trio — three new surfaces that surface relevant artifacts at the moments the AI is making decisions, not just at session start.
- `*-log` response → `suggested_links`: every `finding-log` / `decision-log` / `deadend-log` / etc. now returns up to 5 semantically similar existing artifacts so the AI can immediately anchor edges via `--related-to <id>`. Closes the "AI doesn't think to link artifacts" gap.
- PreToolUse → `FILE-RELEVANCE` nudge: when the AI is about to Edit/Write/MultiEdit a file, the sentinel surfaces a one-line summary of artifacts already referencing it: `2 findings, 3 dead-ends reference this file`. SQLite LIKE search, ~50ms hot-path budget.
- UserPromptSubmit → `<prior-context>` block: every substantive prompt triggers an embed → semantic search → top-3 most-similar artifacts injected as additionalContext. The AI's first response is conditioned on prior project knowledge rather than internal weights alone. ~200ms budget.
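Both `suggested_links` and the `<prior-context>` block reduce to "rank stored artifacts by similarity to a query embedding and keep the top few". The toy version below does that in memory with plain lists. It is only a sketch of the ranking idea: the real system uses Qdrant and an embedding model, and the function names and two-dimensional vectors here are made up for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_similar(query: list[float], artifacts: dict[str, list[float]], k: int = 3) -> list[str]:
    """Return the ids of the k artifacts whose embeddings are most similar to the query."""
    ranked = sorted(
        artifacts.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [artifact_id for artifact_id, _ in ranked[:k]]


# Hypothetical artifact ids with toy 2-d embeddings.
artifacts = {
    "finding-1": [1.0, 0.0],
    "deadend-2": [0.0, 1.0],
    "decision-3": [0.9, 0.1],
}
print(top_k_similar([1.0, 0.0], artifacts, k=2))
```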
Compliance + lint
- `empirica docs-link-check`: general broken-link checker for tech docs with tier-prioritized output. Standalone CLI verb plus opt-in compliance check (`tech_docs_links`, separate from `tech_docs` coverage).
- `repo_hygiene` version_file: accepts Rust `Cargo.toml` and Node `package.json` shapes alongside Python `pyproject.toml` for cross-language ecosystem repos.
- `rust-docs-assess`: Rust-aware tech_docs check that understands `cargo doc` semantics so Rust crates aren't penalized for missing Python-style docstrings.
- Tx-AG investigation-proportionality budget: sentinel-side runtime enforcement of the per-prompt search budget (the soft block was empirically ineffective; this is a hard constraint).
- Tx-AJ `EMPIRICA_SENTINEL_FAIL_CLOSED`: opt-in fail-closed mode for hardened deployments. Default unchanged (fail-open) for dev.
- `empirica-mcp/` brought into lint scope: it was previously outside `tool.ruff.include`. 25 ruff errors cleaned in the process.
Side-fix surfaced by the trio: Qdrant payloads in three embed functions
previously omitted artifact_id, silently breaking
circle_3._qdrant_similarity_pull. Fixed; the SQLite reverse-hash fallback in
suggested_links resolves pre-fix points without requiring a project-embed
rebuild.
84 new tests across the bootstrap surface, full suite 2293 passed. See PROPOSAL_BOOTSTRAP_AGGREGATOR.md for the design rationale.
Goal-criterion bridge — quality gates that auto-evaluate
- `criterion_evaluators` package: validation_method-keyed registry. Goals declare `quality_gate:<metric>@<op>:<threshold>` and the bridge routes to the right evaluator at POSTFLIGHT.
- `EvidenceMetricEvaluator`: auto-evaluates any criterion whose metric matches an evidence bundle key (test pass-rate, ruff violations, stylometry drift, etc.).
- Typed criterion parser: `goals-create --success-criteria "quality_gate:test_pass_rate@>=:0.95"` parses to a typed `CriterionDeclaration`.
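To make the `quality_gate:<metric>@<op>:<threshold>` syntax concrete, here is a minimal parser and evaluator sketch. It is not the project's `CriterionDeclaration`: the `ParsedCriterion` class, the `parse_criterion` function, and the set of supported operators are assumptions for illustration only.

```python
import operator
import re
from dataclasses import dataclass

# Assumed operator set; the README only shows ">=" in its example.
OPERATORS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}

_CRITERION_RE = re.compile(
    r"^quality_gate:(?P<metric>[\w.-]+)@(?P<op>>=|<=|==|>|<):(?P<threshold>-?\d+(?:\.\d+)?)$"
)


@dataclass(frozen=True)
class ParsedCriterion:  # hypothetical stand-in, not Empirica's CriterionDeclaration
    metric: str
    op: str
    threshold: float

    def passes(self, observed: float) -> bool:
        """Compare an observed metric value against the declared threshold."""
        return OPERATORS[self.op](observed, self.threshold)


def parse_criterion(text: str) -> ParsedCriterion:
    """Parse 'quality_gate:<metric>@<op>:<threshold>' into a typed record."""
    match = _CRITERION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a quality_gate criterion: {text!r}")
    return ParsedCriterion(match["metric"], match["op"], float(match["threshold"]))


criterion = parse_criterion("quality_gate:test_pass_rate@>=:0.95")
print(criterion, criterion.passes(0.97))
```

Parsing into a frozen dataclass up front means a malformed criterion fails at goal creation time instead of surfacing later during evaluation.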
Stylometric drift collector — voice consistency for outreach work
- 12 prosodic markers (contractions, MTLD, sentence-length stdev, etc.)
- Voice fingerprints at `~/.empirica/voice/<name>.fingerprint.json`
- Drift direction inference (formal_pull / informal_pull / mixed / within_tolerance)
Content-aware source provenance nudge — fires at moment of artifact
creation when text shows citation but no --source. Closes 0% adoption gap.
Bulk project-link CLI — projects-discover / projects-list /
projects-bulk-register (Cortex-dependent).
Live-scan semantic index — semantic_index.json regenerates when source
docs are newer than the cache.
Sentinel quote-aware shell parsing — false-positive > in quoted code
fixed (_has_dangerous_redirects now uses _contains_outside_quotes).
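The idea behind quote-aware parsing can be shown with a small state machine that tracks whether the scanner is inside single or double quotes. The function below is an independent sketch of that technique, not the project's `_contains_outside_quotes`; it handles basic POSIX-style quoting and backslash escapes only.

```python
def contains_outside_quotes(command: str, target: str) -> bool:
    """Return True if the single character `target` appears outside quotes in `command`."""
    quote = None      # the quote character we are currently inside, if any
    escaped = False   # True when the previous character was an active backslash
    for ch in command:
        if escaped:
            escaped = False
            continue
        # Backslash escapes the next character, except inside single quotes.
        if ch == "\\" and quote != "'":
            escaped = True
            continue
        if quote:
            if ch == quote:
                quote = None
            continue
        if ch in ("'", '"'):
            quote = ch
            continue
        if ch == target:
            return True
    return False


print(contains_outside_quotes('echo "a > b"', ">"))      # quoted, not a redirect
print(contains_outside_quotes("echo a > out.txt", ">"))  # real redirect
```

A plain substring check would flag both commands; tracking quote state is what removes the false positive on the first one.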
Template version parameterization (Philipp #100) — CLAUDE.md and
empirica-system-prompt-lean.md use {{ empirica_version }} and
{{ generated_date }} placeholders. Drift cannot recur.
Documentation refresh — UPGRADE_TO_1.9.md (replaces 1.7), full rewrite
of PROJECT_SWITCHING_FOR_AIS.md, TMUX_MULTI_PANE_GUIDE.md cockpit section.
- `empirica commit-context <sha>` (new CLI). Aggregates artifacts
- `--depth N` recursive walker. Walks edges from each artifact's
- Inline edge declaration on individual `*-log` commands. All six
- `edge_density_nudge`: POSTFLIGHT retrospective +
- `sources_discipline_nudge`: same shape, counts artifacts
- `--status {planned|in_progress|completed|all|drift}` flag
- `drift` mode surfaces rows where the `status` text and
- Default open count now uses `is_completed = 0` as the canonical
- Listener subsystem: sister to cron loops, event-driven not scheduled. `empirica listener register/heartbeat/list` + cockpit E binding + project.yaml install hook.
- Mechanical pause for loops: pause now cancels the next-fire CronCreate token so paused really means silent (no token bleed).
- Cockpit sweep: domain·criticality chip per row, compliance panel with green/yellow/red glyph, services panel for scanner snapshots.
- #95 root-cause cluster closed: Cortex sync reads project_id from session row (no CWD); `_run_grounded_verification` accepts `project_path`; `resolve_project_id` raises `ProjectNotFoundError` instead of `sys.exit(1)`. SystemExit-walks-through-Exception hazard closed at the source.
- Per-project compliance.yaml: projects can `skip_checks`, declare `extra_checks` with regulatory mapping, override `repo_hygiene` sub-checks. Non-CLI/server projects no longer fail tech_docs.
- KNOWN_ISSUES 11.29 + 11.30: instance_isolation audit-trail entries for the subagent CLI bleed fix and the SystemExit propagation chain.
- Validate-and-heal `session.project_id` at session boundaries: catches the ghost-project_id pattern (cross-project `--resume`, ambiguous folder_name match, tmux pane reuse). Heals at post-compact CONTINUE_TRANSACTION + NEW_SESSION_PREFLIGHT and at session-init resume. Workspace.db `trajectory_path` is the canonical lookup, never folder_name (no 11.10/11.27 regression).
- Voice CLI: `empirica voice list / show / apply` loads prosodic profiles for outreach drafting. Profiles in `~/.empirica/voice/*.yaml` with project-local override at `.empirica/voice/`. Voice samples themselves stay in Cortex/Qdrant; this CLI is the calling surface.
- PREFLIGHT `voice_guidance` block: when `work_type=comms` or the new `voice` field / `--voice` flag is set, response includes voice tendencies + anti-patterns scoped to platform register (mirrors the `noetic_guidance` pattern).
- Subagent CLI bleed fix (#95 Issue 1): `subagent-start` now writes `~/.empirica/active_work_<subagent_uuid>.json` with `is_subagent: true` so the subagent's CLI calls resolve to their own `child_session_id` instead of falling through to the parent's via TTY. `sentinel-gate._detect_subagent` reads the flag. `subagent-stop` cleans up.
- POSTFLIGHT pipeline restructure (#95 Issue 3): Stage 0 pre-validates session row + project_id BEFORE any state mutation; failure → early return with `loop_state: "open"`. Stages 5-7 are wrapped in `_soft_run`, so failures accumulate into `result["warnings"]` without erasing the closed-loop reflex. No more half-success.
- Notify dispatcher: single CLI verb (`empirica notify emit/config/backends/test`) every loop and hook calls. Three v1 backends (stdout, rotating JSONL log, ntfy) with first-match-wins routing and fail-loud fallback to stdout when a backend isn't configured. Always-on audit at `~/.empirica/notify-dispatcher.jsonl`. Cockpit + TUI surface 5 most recent emits, backend status, 24h fallback count, and a failure banner. See `docs/architecture/NOTIFY.md`.
- Project-scoped TUI notifications: per-instance notifications strip now reads `~/.empirica/enp/pending.json` (the file the ENP watcher actually writes). Top-bar `⊕N` shows total unacked across all projects.
- `empirica goals-prune`: bulk goal cleanup with four modes (test-pollution, planned, auto-stale, duplicates). Dry-run by default.
- Empirica Cockpit: multi-instance state visibility + per-instance controls. `empirica status [--all]` overview, `empirica tui` interactive Textual app, `empirica sentinel|loop|instance` subcommand groups. See `docs/architecture/COCKPIT.md`.
- Loop exponential backoff: empty fires lengthen the gap; found/fail snap back to base (15m → 30m → 1h → 2h → 4h cap).
- `noetic-batch` CLI primitive: bundles N reads/greps/globs/investigate into one Sentinel-noetic call.
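The loop backoff schedule above (15m → 30m → 1h → 2h → 4h cap) is a doubling interval with a ceiling and a reset. A minimal sketch, assuming outcomes are reported as the strings `"empty"`, `"found"`, and `"fail"` (those labels and the function name are illustrative, not Empirica's API):

```python
BASE_MINUTES = 15
CAP_MINUTES = 240  # 4 hours


def next_interval(current_minutes: int, outcome: str) -> int:
    """Double the gap after an empty fire (up to the cap); any other outcome resets to base."""
    if outcome == "empty":
        return min(current_minutes * 2, CAP_MINUTES)
    return BASE_MINUTES


interval = BASE_MINUTES
schedule = []
for outcome in ["empty", "empty", "empty", "empty", "empty", "found"]:
    interval = next_interval(interval, outcome)
    schedule.append(interval)
print(schedule)
```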
The Sentinel is a compliance loop coordinator. Deterministic services produce information; the AI synthesizes the grounded epistemic state.
- Domain Registry: `(work_type, domain, criticality)` tuples map to compliance checklists. 4 built-in domains: `default`, `remote-ops`, `cybersec`, `docs`. CLI: `domain-list`, `domain-show`, `domain-resolve`
- Domain-aware CHECK gate: uncertainty threshold scales by criticality. `cybersec/high` is stricter than `default/low`
- Three-vector model: `self_assessed`, `observed` (from deterministic checks), and AI-reasoned `grounded` state with rationale
- Compliance loop: POSTFLIGHT runs domain checklist, reports status, advises on follow-up for failed checks
- Check-outcome Brier: AI predicts P(check passes), Brier measures against actual outcomes. Falsifiable calibration
- Real check runners: pytest, ruff, and git status execute as subprocess checks (not stubs)
- Test isolation: tests no longer pollute live sessions via TMUX_PANE inheritance
- Empirica Constitution: 12-section governance framework routing situations to mechanisms
- Epistemic Persistence Protocol (EPP): calibrated position-holding under pushback, replacing AAP
- Lean Core Prompt: 81% reduction in always-loaded context. `setup-claude-code --lean`
- Cross-Project Search: `--global` searches ALL projects' Qdrant collections
- Cross-Project Artifact Writing: `finding-log --project-id <name>` writes to another project
- Plugin Renamed: `empirica-integration` → `empirica`. Run `setup-claude-code --force`
- Brier Score Calibration: proper scoring rule with dynamic thresholds
- Profile Management: `profile-sync`, `profile-prune`, `profile-status`
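The check-outcome Brier item is easy to make concrete: the Brier score is the mean squared difference between the predicted probability that a check passes and what actually happened (1 for pass, 0 for fail). The function below is a generic sketch of that scoring rule; how Empirica stores predictions and outcomes is not specified here, so the list-based inputs are an assumption.

```python
def brier_score(predictions: list[float], outcomes: list[bool]) -> float:
    """Mean squared error between predicted pass probabilities and actual pass/fail outcomes."""
    if not predictions or len(predictions) != len(outcomes):
        raise ValueError("predictions and outcomes must be non-empty and the same length")
    squared_errors = [(p - float(passed)) ** 2 for p, passed in zip(predictions, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# The AI predicted 90% and 80% pass probability; the first check passed, the second failed.
print(brier_score([0.9, 0.8], [True, False]))
```

Lower is better: 0.0 means every prediction was certain and correct, while always answering 0.5 scores 0.25, so a confident wrong prediction is penalized more heavily than an honest "not sure".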
Your data stays local:
- `.empirica/`: local SQLite database (gitignored by default)
- `.git/refs/notes/empirica/*`: epistemic checkpoints (local unless you push)
- Qdrant runs locally if enabled
No cloud dependencies. No telemetry. Your epistemic data is yours.
- Website: getempirica.com
- Issues: GitHub Issues
- Discussions: GitHub Discussions
MIT License — see LICENSE for details.
Author: David S. L. Van Assche
Version: 1.9.2
Turtles all the way down — built with its own epistemic framework, measuring what it knows at every step.