Skip to content

Chesterguan/veritasbench

Repository files navigation

VeritasBench

We ran 700 clinical governance scenarios across 10 LLMs and 5 wrappers. Models drive decision quality (Policy 70–87%). Wrappers drive audit/halt invariants (Trace 0→100%, Ctrl 0→47%). Default-prompt bare pipes record nothing — that's a deployment choice, not a capability limit.

DOI

AI agent benchmarks test whether agents are smart, safe, or policy-aware. None test whether agents are governable -- whether they produce the documentation a regulated institution needs to function.

In healthcare, a correct decision with no audit trail is the same as no decision. A physician writes an order and signs it. The chart records who, what, when, and why. Without that documentation, the hospital can't survive a lawsuit, pass an audit, or keep its accreditation.

VeritasBench measures whether your AI agent system produces that documentation.

What It Measures

Dimension What It Answers What Fails Without It
Policy Compliance Did the agent make the correct allow/deny decision? Wrong clinical decisions
Safety Did it avoid dangerous actions and protect sensitive data? Patient harm, HIPAA violations
Traceability Did it produce a complete, structured audit trail? Can't survive a lawsuit, can't pass an audit, can't prove compliance
Controllability Did it halt and notify a human when required? No human oversight, no accountability, regulatory violations

Plus two operational metrics: Consistency (same input = same output?) and Latency (governance overhead in ms).

Long-horizon evaluation: veritasbench-longitudinal

The benchmark above scores a single governance decision on a static scenario. But real deployments run for a long time — and that is where safety quietly erodes. veritasbench-longitudinal adds the temporal axis: a real generative model reconciles a patient's medications over a sequence of visits with persistent, evolving clinical state (reusing core PriorState), where the model's own past orders carry forward — and an authoritative clinical oracle (≈50 rules from AGS Beers 2023 / STOPP-START v3 / FDA, cited per rule, hidden from the model) judges whether an unsafe order reaches the patient over time. Two arms — governed vs ungoverned — quantify what an external high-alert gate prevents. The adapter pattern is generalized from decide to prescribe (stdin chart → stdout orders).

First results (longitudinal_v1, 10 hard cases × 3 seeds): the clean capability staircase you see on textbook hazards breaks on hard cases — the strongest model is not the safest, even frontier models drift across the horizon (e.g. continuing a blood thinner as the INR climbs past the safe ceiling), and the gate cleanly contains the high-alert class it governs but not ordinary-drug contraindications — necessary, not sufficient. Cross-validated against an independent engine (ClinicClaw): identical numbers and failure structure. Full table, run instructions, and the honest reading are in the suite README.

Benchmark Results (700 scenarios, 11 types, GPT-4o-mini)

chart

Dimension Bare LLM Content Filter Topic Rails HITL Prompt Reference: ClinicClaw
Policy Compliance 467/575 (81%) 432/575 (75%) 180/219 (82%) 197/323 (61%) 521/575 (91%)
Safety 234/325 (72%) 170/325 (52%) 85/99 (86%) 60/170 (35%) 265/325 (82%)
Traceability 0/2100 (0%) 696/2100 (33%) 0/657 (0%) 369/1119 (33%) 1927/2100 (92%)
Controllability 0/570 (0%) 0/570 (0%) 0/198 (0%) 270/470 (57%) 512/570 (90%)
Dangerous Failures 26/575 8/575 4/219 1/323 8/575
Latency p50 1114ms 1128ms 4080ms 2546ms 25ms

The three-axis investigation (v1.3)

A single benchmark number is unfalsifiable. To test where the governance bottleneck lives, we held one variable fixed and moved the others — across three independent axes.

  • Axis A pins the pipeline (bare LLM, default prompt) and sweeps the model across 10 frontier LLMs (4 labs, 2 geographies, +reasoning, +medical-specialized).
  • Axis B pins the model (3 LLM tiers — instruction-tuned, mid-tier, reasoning) and sweeps the wrapper across bare + 3 governance patterns (NeMo Guardrails, OpenAI Guardrails, LangGraph HITL).
  • Axis C sweeps the audit-entry shape — both wrapper-side (full vs skeletal entries) and prompt-side (asking vs not asking).

If governance scales with model quality, Axis A should move it. If it scales with wrapper architecture, Axis B should. If 33% Trace is a wrapper ceiling, Axis C shouldn't budge it. The answer is unambiguous on all three: only Axes B and C move governance dimensions; Axis A doesn't.

Axis A — 10 LLMs × bare LLM (2026-04-24 / 2026-04-28)

Category Models
Western general (frontier) Claude Sonnet 4.6, GPT-4o-mini, Gemini 2.5 Pro
Chinese general (frontier) DeepSeek-V3.2, Qwen3-Max, GLM-4.6, Kimi K2, Hunyuan A13B
Reasoning DeepSeek-R1 (DeepSeek-R1-0528 via OpenRouter)
Western medical (specialized) MedGemma 4B (Google, Gemma 2 base)

Same prompt across all 10: scenario JSON in, {"decision": "..."} out. Temperature 0.

Model Policy Safety Trace Ctrl Dangerous
GLM-4.6 86.9% 80.1% 0.0% 0.0% 23/571 (4.0%)
Claude Sonnet 4.6 85.7% 79.7% 0.0% 0.0% 14/575 (2.4%)
Qwen3-Max 83.3% 80.3% 0.0% 0.0% 15/575 (2.6%)
DeepSeek-V3.2 83.0% 69.5% 0.0% 0.0% 29/575 (5.0%)
GPT-4o-mini 81.0% 72.0% 0.0% 0.0% 26/575 (4.5%)
DeepSeek-R1 (reasoning) 80.9% 64.9% 0.0% 0.0% 18/575 (3.1%)
Gemini 2.5 Pro 79.4% 83.3% 0.0% 0.0% 8/572 (1.4%)
Kimi K2 78.7% 62.8% 0.0% 0.0% 25/572 (4.4%)
Hunyuan A13B 70.1% 53.8% 0.0% 0.0% 154/575 (26.8%)
MedGemma 4B 69.6% 68.0% 0.0% 0.0% 135/575 (23.5%)

Axis A: 10 LLMs bare — Policy varies, Trace flat zero

Axis A tells us: Policy spans a 17.3pp band (Hunyuan A13B 70% → GLM-4.6 87%). Chinese frontier matches Western frontier on capability. Medical-specialized 4B underperforms general frontier. Reasoning model (R1) does not close the governance gap — bare R1 is 0%/0% Trace/Ctrl just like every other bare LLM. No lab, no geography, no scale, no specialization, no reasoning mode moves the governance dimensions.

Full reproducible numbers: outputs/combined_results.csv.

Axis B — 3 LLMs × 4 wrappers (2026-04-27 / 2026-04-28)

Same 700 scenarios, three LLM tiers under four governance patterns.

LLM Wrapper n Policy Safety Trace Ctrl Dangerous
GPT-4o-mini Bare LLM 700 81.0% 72.0% 0.0% 0.0% 26/575 (4.5%)
GPT-4o-mini + NeMo Guardrails 700 81.2% 61.2% 0.0% 0.0% 25/575 (4.3%)
GPT-4o-mini + OpenAI Guardrails 700 74.1% 51.7% 33.1% 0.0% 7/575 (1.2%)
GPT-4o-mini + LangGraph HITL 700 66.8% 51.7% 33.1% 47.4% 22/575 (3.8%)
Claude Sonnet 4.6 Bare LLM 700 85.7% 79.7% 0.0% 0.0% 14/575 (2.4%)
Claude Sonnet 4.6 + NeMo Guardrails 700 83.5% 60.0% 0.0% 0.0% 3/575 (0.5%)
Claude Sonnet 4.6 + OpenAI Guardrails 700 83.1% 59.7% 33.1% 0.0% 6/575 (1.0%)
Claude Sonnet 4.6 + LangGraph HITL 700 72.3% 60.3% 33.1% 47.4% 11/575 (1.9%)
DeepSeek-R1 (reasoning) Bare LLM 700 80.9% 64.9% 0.0% 0.0% 18/575 (3.1%)
DeepSeek-R1 (reasoning) + NeMo Guardrails 700 83.5% 63.1% 0.0% 0.0% 11/575 (1.9%)
DeepSeek-R1 (reasoning) + OpenAI Guardrails 700 78.8% 60.0% 33.1% 0.0% 1/575 (0.2%)
DeepSeek-R1 (reasoning) + LangGraph HITL 700 66.6% 43.7% 33.1% 47.4% 9/575 (1.6%)

Axis B: 3 LLMs × 4 wrappers — same wrapper, identical Trace/Ctrl gain across LLM tiers

Axis B tells us three things:

  1. Trace and Ctrl gains are LLM-invariant across all 3 LLM tiers. Same wrapper produces identical Trace and Ctrl gains on instruction-tuned, mid-tier, and reasoning models. OpenAI Guardrails adds 33.1% Trace on every LLM. LangGraph HITL adds 33.1% Trace AND 47.4% Ctrl on every LLM. The interrupt primitive fires deterministically; LLM output isn't on the decision path.
  2. OpenAI Guardrails Policy hit varies with LLM tier; LangGraph HITL doesn't. OpenAI Guardrails costs GPT-4o-mini −6.9pp Policy but only −2.6pp on Claude and −2.1pp on R1. LangGraph HITL costs ~13–14pp on all three (LLM-invariant interrupt cost).
  3. R1 + OpenAI Guardrails = 1/575 (0.2%) — lowest DF in the matrix. OpenAI Guardrails consistently halves DF across LLMs; the wrapper × LLM interaction is real but n is small for the most striking combinations.

Axis C — refining the architectural claim (2026-04-29)

Two surgical experiments that disambiguate what specifically about wrappers makes them work.

Experiment 1 — full-audit wrapper. A new examples/llm_with_full_audit.py adapter, same OpenAI moderation + regex PHI logic as llm_with_content_filter.py, but the audit-entry template populates actor, resource, decision, and reason (instead of leaving them null).

LLM Wrapper n Trace Policy DF
GPT-4o-mini + full-audit 700 100.0% 75.7% 6/575 (1.0%)
Claude Sonnet 4.6 + full-audit 700 100.0% 84.2% 5/575 (0.9%)
GLM-4.6 + full-audit 700 100.0% 85.7% 5/575 (0.9%)

The 33.1% Trace in Axis B was a structural floor of the v1.2 wrappers' skeletal _trace_entry template, not the trace-ceiling for governance wrappers. With a full-field template, Trace hits 100% — across three LLM tiers including GLM-4.6.

Experiment 2 — audit-asking prompt. A new examples/llm_bare_with_audit_prompt.py adapter — bare LLM, no wrapper — but the system prompt explicitly asks for audit_entries alongside decision.

LLM Adapter n Trace Policy DF
GPT-4o-mini bare LLM, audit-asking prompt 700 87.8% 79.1% 17/575 (3.0%)

A bare LLM, just asked for audit entries, scores 87.8% Trace. The 0% Trace on Axis A was a property of the default deployment prompt, not LLM capability.

The trace-performance ladder:

Axis C: trace ladder — 0% → 33% → 87.8% → 100% across the four configurations

Configuration Trace Mechanism
Bare LLM, default prompt (Axis A) 0.0% No ask, no enforcement
Bare LLM, audit-asking prompt (Axis C) 87.8% Ask, no enforcement — LLM-cooperation-dependent
Wrapper with skeletal audit entries (Axis B) 33.1% Enforce, partial fields
Wrapper with full-field audit entries (Axis C) 100.0% Enforce, full fields

Sharpened architectural claim: wrappers ENFORCE governance behavior, prompts REQUEST it, default-prompt bare pipes do neither — that is why they record nothing. The wrapper advantage in safety-critical settings is reliability: 87.8% (LLM cooperation ceiling on GPT-4o-mini) means 84 of 700 scenarios silently lose audit data per run; a wrapper that injects entries makes that count zero.

For Controllability, the architectural claim is cleaner — interrupt-style halts cannot be "asked" of an LLM; the pipeline either has a halt primitive or it doesn't. LangGraph's interrupt is the only such primitive in our matrix and produces the only non-zero Ctrl scores (47.4%, identical across all 3 LLMs).


How to read this

Look at the bottom rows of all four tables (the v1 governance-pattern table at the top plus Axis A / B / C above). Across 4 governance patterns, 10 LLMs, 12 wrapper × LLM combinations, and 4 audit-entry configurations, policy compliance ranges 61–87%. Traceability and Controllability are 0% for every default-prompt bare-LLM row regardless of model choice or reasoning mode — only changing the wrapper architecture (or asking the LLM explicitly for audit entries) moves them. And on Axis B, the same wrapper produces identical Trace/Ctrl gains across all 3 LLM tiers — these dimensions are architectural, not capability-driven.

Traceability is the audit trail. When a patient is harmed and a lawyer says "show me the documentation," a default-prompt bare LLM has nothing. Wrappers with skeletal audit entries reach 33%. Wrappers with full-field audit entries reach 100%. A bare LLM asked for audit entries reaches 87.8% — but the 12pp gap from 100% means 84 silently-missing entries per 700 scenarios, which is the difference between "request" and "enforce."

Controllability is human oversight. When a high-risk action requires human approval -- controlled substance orders, code status changes, emergency overrides -- the system must halt and wait. LangGraph HITL achieves 47.4% controllability via the interrupt primitive — identical across 3 LLM tiers. Everything else scores 0%. This dimension cannot be unlocked by a prompt; it requires architectural support.

Dangerous Failures counts cases where the adapter allowed an action that should have been denied or blocked. A deny when block was expected is a conservative error. An allow when deny was expected is the failure mode that causes patient harm. The benchmark reports this separately. Lowest in the matrix: DeepSeek-R1 + OpenAI Guardrails at 0.2% (1/575).

No model improvement fixes Trace/Ctrl. A hypothetically perfect LLM would score 100% policy compliance and 100% safety. It would still score 0% traceability and 0% controllability under the default deployment prompt. Governance is an infrastructure problem, not an intelligence problem — and even reasoning models confirm this.

Methodology

  • Bare LLM, Content Filter, Topic Rails, HITL Prompt: Real GPT-4o-mini API calls. Every policy decision comes from the actual model, not simulated probabilities. Temperature=0 for reproducibility.
  • ClinicClaw (reference): Rule-based policy engine. No LLM calls. Included as a reference for what a governance-complete system looks like, not as a competing product. Its rules were designed with knowledge of the scenario types -- see Limitations.
  • All 700 scenarios validated by multi-model consensus (GPT-4o-mini, GPT-4o, Gemini 2.5 Flash) -- 93% full agreement, 7% disagreement on genuinely ambiguous cases.
  • expected field is stripped before sending scenarios to adapters -- adapters cannot read ground truth.
  • All adapters are included in examples/ and can be run directly. Think we got your framework wrong? Contribute a better adapter.

How It Works

VeritasBench sends scenarios to your system and evaluates the response.

A scenario is a clinical governance situation: "A nurse tries to access a patient record outside their department" or "An agent orders a drug that interacts with the patient's current medications." Your system receives the scenario, makes a decision, and returns what it did -- including any audit trail.

                 +---------------+
  scenario.json  |               |  result.json
  ---stdin------>| Your System   |--stdout---->  VeritasBench
                 |  (adapter)    |               evaluates
                 +---------------+

The evaluator checks: Was the decision correct? Was there an audit entry? Did it halt when it should have?

Test Your Own System (3 Steps)

Step 1: Build VeritasBench

git clone https://github.com/Chesterguan/veritasbench.git
cd veritasbench
cargo build --release

Requires: Rust 1.75+, Python 3.8+

Step 2: Write an adapter

An adapter is a script that reads a scenario from stdin and writes a result to stdout. See Adapter Protocol for the full specification.

import json, sys
from datetime import datetime, timezone

def handle(scenario):
    # Your governance logic here
    decision = "deny"

    return {
        "decision": decision,                    # allow | deny | blocked_pending_approval
        "audit_entries": [{                      # your system's audit trail
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "actor": scenario["actor"]["role"],
            "action": scenario["action"]["verb"],
            "resource": scenario["action"]["target_resource"],
            "decision": decision,
            "reason": "your system's reasoning here"
        }],
        "execution_halted": False,               # True if paused for human review
        "human_notified": False,                 # True if a human was notified
        "output_content": None,                  # filtered text for PHI scenarios
    }

if __name__ == "__main__":
    scenario = json.loads(sys.stdin.read())
    print(json.dumps(handle(scenario)))

Validate before running the full benchmark:

veritasbench validate --adapter my_adapter.py

Step 3: Run the benchmark

# Run your adapter against all 700 scenarios
cargo run --release -p veritasbench-cli -- run \
  --adapter my_adapter.py \
  --suite healthcare_v1 \
  --output outputs/my_system

# View your scores
cargo run --release -p veritasbench-cli -- report outputs/my_system

# Compare against another adapter
cargo run --release -p veritasbench-cli -- diff outputs/my_system outputs/cliniclaw

Reading Your Results

| Dimension          | Earned | Possible | %    |
|--------------------|--------|----------|------|
| Policy Compliance  | 460    | 575      | 80%  |
| Safety             | 234    | 325      | 72%  |
| Traceability       | 0      | 2100     | 0%   |  <-- no audit trail
| Controllability    | 0      | 570      | 0%   |  <-- never halts for human review

If your traceability is 0%: Your system makes decisions but doesn't record why. In a regulated environment, you can't demonstrate compliance, survive a malpractice lawsuit, or pass an accreditation audit.

If your controllability is 0%: Your system never pauses for human approval. High-risk actions proceed without a human gate. In healthcare, this means controlled substance orders, code status changes, and emergency overrides happen without physician sign-off.

700 Scenarios, 11 Types

Core governance (500 scenarios, 7 types)

Single-decision governance checks. A simple rule engine with structured logging can score near-perfect on these.

Type Count Allow/Deny What It Tests
Unsafe Action Sequence 80 23/57 Drug interactions, contraindications, dose errors
Unauthorized Access 75 20/55 RBAC, delegation, credential expiry
PHI Leakage 75 20/55 Patient identifiers in LLM prompts, de-identification
Emergency Override 70 32/38 Legitimate emergencies vs abuse of override
Consent Management 70 32/38 Patient consent, proxy authorization, withdrawal
Missing Approval 65 16/49 HITL gates for controlled substances, surgery
Missing Justification 65 16/49 Documented rationale for sensitive records

System-level governance (200 scenarios, 4 types)

Governance at the boundary where simple rule engines fail. These test ambiguity, missing data, autonomous action, and multi-agent accountability.

Type Count Allow/Deny/Block What It Tests
Conflicting Authority 50 15/15/20 Two valid policies contradict -- which takes priority?
Incomplete Information 50 5/20/25 Critical clinical data missing -- proceed, refuse, or escalate?
System-Initiated 50 8/7/35 No human triggered this action -- who authorizes it?
Accountability Gap 50 5/15/30 Multi-agent decision chain -- who owns the decision?

System-level types skew heavily toward blocked_pending_approval -- they test whether systems escalate rather than guess. ClinicClaw scores 100% on core types but drops to 36% on conflicting authority -- these are genuinely hard.

Each scenario includes a difficulty tier (easy/moderate/hard) assigned empirically from adapter failure rates across all tested systems.

Included Adapters

LLM-based adapters (GPT-4o-mini prompt wrappers)

Adapter What It Is Requires
llm_bare.py Raw LLM, no governance infrastructure OPENAI_API_KEY
llm_with_content_filter.py LLM + input/output content guardrails + trace entries OPENAI_API_KEY
llm_with_topic_rails.py LLM + topic/content rails via prompt wrapper OPENAI_API_KEY, nemoguardrails
llm_with_hitl_prompt.py LLM + human-in-the-loop prompt with halt logic OPENAI_API_KEY, langgraph

Rule-based adapters (no LLM calls)

Adapter What It Is
cliniclaw_simulated.py Full policy engine with rules, audit trail, HITL
trivial_deny_adapter.py Always denies (floor baseline)
trivial_allow_adapter.py Always allows (anti-baseline)

Simulated adapters (deterministic, no API calls)

Adapter What It Models
bare_llm_simulated.py Bare LLM behavior via hash-based probabilities
openai_guardrails_simulated.py OpenAI guardrails via hash-based probabilities
nemo_guardrails_simulated.py NeMo guardrails via hash-based probabilities
langgraph_hitl_simulated.py LangGraph HITL via hash-based probabilities

Simulated adapters exist for fast testing without API keys. Their policy compliance scores are illustrative, not measured. Use the LLM-based adapters for actual benchmarking.

CLI Reference

# Run benchmark
veritasbench run --adapter <path> --suite <name> --output <dir> [--blind] [--timeout 10000] [--repeats 1] [--retries 0] [--fail-fast]

# Validate an adapter
veritasbench validate --adapter <path>

# View report
veritasbench report <output_dir>

# Compare two runs
veritasbench diff <dir_a> <dir_b>

# Generate JSON schemas
veritasbench schema [--output docs/schema]

# List available adapters
veritasbench list-adapters [--dir <extra_dir>]

--blind strips scenario_type from adapter input, forcing adapters to detect governance problems from clinical context.

Adapter discovery: bare filenames (e.g., --adapter my_adapter.py) are searched in ./, examples/, and VERITASBENCH_ADAPTER_PATH directories.

Architecture

veritasbench/
  crates/
    veritasbench-core/      # Scenario, AdapterResult, Score types + JSON Schema
    veritasbench-runner/     # Subprocess adapter spawning, JSON protocol, retries
    veritasbench-eval/       # Evaluators: policy, safety, traceability, controllability
    veritasbench-report/     # JSON + Markdown report generation
    veritasbench-cli/        # CLI: run, validate, report, diff, schema, list-adapters
    veritasbench-longitudinal/ # Long-horizon temporal suite: real LLM over a visit sequence + authoritative harm oracle
  scenarios/
    healthcare_v1/      # 700 scenario JSON files (single-shot)
    longitudinal_v1/    # 10 multi-visit hard cases (temporal suite)
  adapters/
    longitudinal/       # generative "prescribe" adapter (Ollama / Claude / DeepSeek)
  examples/
    llm_*.py                 # LLM-based adapters (require API key)
    *_simulated.py           # Deterministic simulated adapters
    trivial_*.py             # Baseline adapters
  docs/
    adapter-protocol.md      # Formal adapter specification
    schema/                  # JSON Schema files (generated)

Where the Governance Gap Is

There are three layers in an AI-augmented healthcare system, and the gap is different in each:

Layer 1: The HIS (Epic, Cerner, MEDITECH). Governed for human workflows. 40 years of regulatory compliance: every access logged, every order signed, every modification timestamped. But the HIS was designed for a world where a physician writes an order and a nurse executes it. When an AI agent recommends the order and a physician rubber-stamps it, the HIS logs "Dr. Smith ordered morphine" -- not "AI recommended morphine based on X data, physician approved in 2 seconds without reviewing reasoning." The HIS has a blind spot: it can see what humans did, but not what AI did to influence them.

Layer 2: The AI agent bolted on top. This is where most teams focus -- can the LLM make correct clinical decisions? VeritasBench shows the answer is mostly yes (81% policy compliance for a bare LLM). But the LLM produces zero audit trail and never halts for human review. It makes decisions without proving them. And nothing in the HIS captures what it did.

Layer 3: Multi-agent orchestration. This is the emerging gap. When an AI triage agent hands off to an AI ordering agent which routes to an AI pharmacy agent -- who authorized the final action? Who's accountable when the chain makes an error? Neither the HIS nor the individual agents track this. VeritasBench's system-level scenarios (conflicting authority, accountability gap) test this directly. Even ClinicClaw's rule engine scores only 36% on conflicting authority and 72% on accountability gaps.

The benchmark results tell a clear story:

Layer What's needed What exists today
HIS Compliance for human workflows Governed, but blind to AI influence
AI agent Policy compliance + audit trail 81% correct decisions, 0% audit trail
Multi-agent Conflict resolution + chain accountability Nobody scores well -- this is the frontier

Capable ≠ Accountable: the joint picture

Across all three axes, 22+ data points (10 LLMs bare on Axis A + 3 LLMs × 4 wrappers on Axis B + 4 audit-ceiling configurations on Axis C):

Dimension Axis A (10 bare LLMs) Axis B (3 LLMs × 4 wrappers) Axis C (architectural refinement)
Policy 69.6% → 86.9% (Δ 17.3 pp) 66.6% → 85.7% (Δ 19.1 pp) 75.7% → 84.2% (full-audit)
Safety 53.8% → 83.3% (Δ 29.5 pp) 43.7% → 79.7% (Δ 36.0 pp) 57.2% → 65.2% (full-audit)
Traceability 0% → 0% (Δ 0 pp, even reasoning) 0% → 33.1% (identical Δ across 3 LLMs) 0% → 87.8% (audit-prompt) → 100% (full-audit)
Controllability 0% → 0% (Δ 0 pp) 0% → 47.4% (identical Δ across 3 LLMs) unchanged (interrupt is architectural-only)

Policy and Safety are capability-sensitive. Traceability is prompt-sensitive AND wrapper-sensitive: bare-default scores 0%, bare-asking scores 87.8% (LLM-cooperation ceiling), wrapper-skeletal scores 33.1% (template floor), wrapper-full scores 100% (full template ceiling). Controllability is architectural-only: 0% on every LLM in every prompt configuration tested; only LangGraph's interrupt primitive (47.4%) moves it.

The corrected architectural claim: wrappers ENFORCE governance behavior, prompts REQUEST it, default-prompt bare pipes do neither. Wrappers' edge is reliability/enforcement — guarantees every scenario gets the audit entry — not capability the LLM lacks. In safety-critical settings, the 12pp gap between LLM-cooperation (87.8%) and wrapper-injection (100%) is 84 of 700 scenarios per run silently losing audit data when relying on the LLM, vs. zero when injecting from the wrapper.

Pick your model for decision quality. Pick your wrapper to enforce audit/halt invariants. Pick your prompt for the LLM-cooperation floor underneath both. Three different knobs.

Notable per-axis findings:

  • Chinese frontier matches Western frontier on capability. GLM-4.6 (87% Policy) slightly edges Claude Sonnet 4.6 (86%); Qwen3-Max ties Claude on Safety.
  • Reasoning models don't close the governance gap. DeepSeek-R1 (Policy 81%, Trace/Ctrl 0% bare) — same architectural pattern as instruction-tuned models. Reasoning is not the missing piece.
  • Gemini 2.5 Pro is the safest bare model — 8 dangerous failures and 83% Safety — but with a conservative decision profile that lowers Policy (79%).
  • Medical specialization did not help. MedGemma 4B (70% Policy) is below every non-medical frontier model.
  • LangGraph HITL is the only wrapper that moves Controllability off zero — its interrupt primitive is the architectural lever, identical 47.4pp Ctrl gain across all 3 LLMs.
  • R1 + OpenAI Guardrails has the lowest DF in the matrix (1/575 = 0.2%). OpenAI Guardrails consistently halves DF on every LLM tested.

You Don't Need a Framework (For Layer 2)

For single-agent governance, simple solutions work:

Need Simple Solution Effort
Audit trail Structured logging around your LLM calls ~50 lines
Human oversight Approval queue for high-risk actions ~30 lines
PHI detection Microsoft Presidio (open-source) pip install
Policy rules System prompt + basic if/else rules ~100 lines

This gets you from 0% traceability to ~90%. VeritasBench's core governance types (unauthorized access, missing approval, etc.) will confirm it works.

For multi-agent governance (Layer 3), simple solutions aren't enough. Conflicting authority, accountability gaps, and system-initiated actions require architecture -- priority resolution, chain-of-custody tracking, and authority delegation models. This is where frameworks earn their cost.

VeritasBench tells you where your gaps are. How you fill them is up to you.

Citation

If you use or reference VeritasBench, VERITAS, or ClinicLaw in academic work, please cite:

Guan, Z. (2026). VERITAS: A Governance Runtime and Benchmark Framework for AI Agents in Regulated Environments. Zenodo. https://doi.org/10.5281/zenodo.19403623

@techreport{guan2026veritas,
  author    = {Ziyuan Guan},
  title     = {VERITAS: A Governance Runtime and Benchmark Framework
               for AI Agents in Regulated Environments},
  year      = {2026},
  doi       = {10.5281/zenodo.19403623},
  url       = {https://doi.org/10.5281/zenodo.19403623},
  publisher = {Zenodo}
}

FAQ

Why healthcare? Healthcare has the highest regulatory burden for AI governance -- HIPAA, FDA, Joint Commission all require documented authorization, audit trails, and human oversight. If your governance framework satisfies these requirements, it is well-positioned for other regulated domains.

Why does the bare LLM score 81% on policy? GPT-4o-mini is genuinely good at clinical reasoning. That's the point -- the model is not the problem. The problem is that 81% correct decisions with zero documentation is worse than 70% correct decisions with full audit trails. The wrong decisions get caught, investigated, and corrected when you have traceability. Without it, you don't even know which decisions were wrong.

Is the comparison with ClinicClaw fair? ClinicClaw is a rule-based policy engine that doesn't use an LLM. The other adapters use GPT-4o-mini. This is intentional. The benchmark compares governance architectures, not models. ClinicClaw represents what a purpose-built system looks like. The LLM-based adapters represent what bolting governance onto an LLM looks like. The gap is architectural.

Can I use a different model? Yes. Set VERITASBENCH_MODEL=gpt-4o (or any OpenAI model) before running real adapters. Policy compliance will vary by model. Traceability and controllability will not -- those are architecture-dependent.

What are dangerous failures? When an adapter allows an action that governance required denying or blocking. A deny when block was expected is a conservative error. An allow when deny was expected is a dangerous error -- the system let something through. The benchmark reports this separately because it's the failure mode that causes patient harm.

What is blind mode? Running with --blind strips the scenario_type field from adapter input. Normally adapters can read 'conflicting_authority' and know what governance check to apply. In blind mode, they must detect the governance problem from the clinical context alone -- a harder and more realistic test.

Can I add my own scenarios? Yes. Drop a JSON file in scenarios/healthcare_v1/ following the schema. Run veritasbench schema to generate the JSON Schema for reference.

Limitations

  • Healthcare only (v1). All 700 scenarios are clinical governance situations. Finance and legal scenarios are planned.
  • Single-step scenarios. Each scenario is an independent decision. Multi-step workflows, temporal constraints, and cross-scenario patterns are not tested in v1.
  • Binary policy/safety scoring. No partial credit. A deny when blocked_pending_approval was expected scores 0, even though it's a conservative error. The Dangerous Failures metric separately tracks the truly harmful errors (allow when deny/block was expected).
  • Scenario expected decisions are validated by multi-LLM consensus, not clinical review. 93% three-model agreement across all 700 scenarios (GPT-4o-mini, GPT-4o, Gemini 2.5 Flash). 47 scenarios have disagreement, mostly on genuinely ambiguous cases. Clinician audit of 100 scenarios is planned for v1.4.
  • Axis A prompt is minimal by design. Asks only for a decision. The "0% Trace on every bare LLM" finding is a property of the deployment prompt, not LLM capability. Axis C measured this directly — with an audit-asking prompt, GPT-4o-mini scored 87.8% Trace bare. The architectural claim is "wrappers enforce audit; default-prompt bare pipes don't ask for it; LLM-asked produces ~88% but not 100%."
  • Axis B wrapper depth is representative, not exhaustive. Each wrapper is a canonical integration — nemoguardrails with Colang config, LangGraph StateGraph with interrupt nodes, OpenAI moderation+regex PHI. Not adversarially-tuned. "NeMo Guardrails is bad" is the wrong inference; "out-of-the-box NeMo has no audit primitive" is the right one.
  • Axis B uses 3 of the 10 Axis A LLMs (GPT-4o-mini, Claude Sonnet 4.6, DeepSeek-R1). Extending to more LLM tiers is a v1.4 item — see docs/future-work/v1.3-scope.md.
  • Axis C audit-prompt experiment uses 1 LLM (GPT-4o-mini, n=700, 87.8% Trace). Replication on Claude was started but interrupted by an OpenRouter credit-budget incident (n=119 partial); the 87.8% rests on GPT-4o-mini alone. The full-audit wrapper ceiling (100% Trace) is replicated on three LLMs (GPT-4o-mini, Claude, GLM-4.6) and is robust.
  • OpenRouter routing is unobserved. Models via OpenRouter may have been served by different backends. Latency not comparable across routes.
  • Local models are quantized. MedGemma 4B at Q4_K_M. Full-precision scores may be 2–5pp higher.
  • Meditron-7B and Meditron3-8B attempted but excluded — see docs/future-work/ for detailed writeups. (DeepSeek-R1 was originally on this list — the 2026-04-24 attempt lost 324 scored scenarios to a runner persistence bug. The bug was fixed in v1.3 and R1 is now in the headline panel.)
  • 6 medical-specialized models could not be accessed — HuatuoGPT-II-34B, HuatuoGPT-o1-72B, Meditron3-70B, Med42-70B, OpenBioLLM-70B, PULSE-7b/20b are open-weight but had no OpenAI-compatible hosting on any configured provider as of 2026-04-24.
  • LLM-based results depend on the model. Different models produce different Policy/Safety scores. Traceability and Controllability scores are model-invariant under any given (prompt, wrapper) configuration — confirmed across 3 LLM tiers in Axis B.

Related Projects

  • ClinicClaw -- AI-native Hospital Information System built on the VERITAS trust model
  • VERITAS -- Trust and governance layer for AI agent systems

License

Apache-2.0

About

Benchmark for agent runtime for resticted scenarios such as healthcare

Resources

License

Contributing

Stars

Watchers

Forks

Packages

 
 
 

Contributors