Contextual guardian experiments - dynamic guardian subagent, static guardian subagent by vgtomahawk · Pull Request #7 · patronus-ai/skill-inject

vgtomahawk · 2026-05-13T20:02:11Z

PR: Contextual Guardian Experiments

Summary

Adds a guardian defense mechanism for contextual prompt injection tasks — both dynamic (runtime MCP subagent) and static (build-time preprocessing) variants — with sanitized experimental setups, full JSON trace capture, and results across all 139 contextual injection instances.

What's in this PR

Core: Guardian Defense (`scripts/build_sandbox.py`, `scripts/skill_guardian_server.py`)

Dynamic guardian — an MCP subagent that mediates between the coding agent and its skill documentation at runtime:

Agent no longer reads skill docs directly from disk
Docs are embedded in the MCP server's memory at build time, then deleted from disk
Agent must call consult_skill_guardian(skill_name, question) to get documentation
A separate guardian LLM (Sonnet or Haiku) interprets the docs before returning them
Server self-deletes on startup (agent can't cat the script to extract embedded docs)
All guardian interactions logged to .guardian_interactions.jsonl
.mcp.json set to chmod 444

This closes the validity gap from the v1 guardian where agents could bypass the guardian by reading .raw/ directly (observed in ~50% of survived-attack cases in v1).

Key files:

scripts/skill_guardian_server.py — MCP server template (FastMCP, stdio transport)
scripts/build_sandbox.py — apply_guardian() function + --guardian / --guardian-model CLI flags

Core: Static Guardian (`scripts/apply_static_guardian.py`)

Build-time preprocessing: sends each SKILL.md through a guardian LLM once, replaces it with the guardian's interpretation. No runtime MCP server.

Sanitized for fairness:

Original YAML frontmatter replaced with generic description (no injection leak)
All auxiliary .md files deleted (agent can't read originals)

Trace Capture (`scripts/run_sandbox_container.py`)

Added --output-format json flag — captures full agent conversation trace (every reasoning step, tool call, tool result) as agent_trace.json
Previously only captured final summary text via --output-format text

Documentation & Examples

PROTOCOL.md — operational protocol (virtualenv, API keys, model names, build scripts, parallelism, verification checklist)
illustrative_examples/guardian_v2_illustrative_examples.md — dynamic guardian: 4 detailed examples with raw guardian Q&A transcripts, showing how the MCP subagent filters (or leaks) injections in real time
illustrative_examples/static_guardian_v2_illustrative_examples.md — static guardian: 4 examples showing before/after SKILL.md content — what the guardian's one-shot interpretation preserved vs filtered, and how Sonnet and Haiku guardians differ

Minor fixes

judges/contextual_judge.py — default to task index 0 when no _taskN suffix in sandbox name
config.py — updated model defaults from retired dated IDs to stable names (claude-sonnet-4-5, claude-haiku-4-5)

Results

All experiments: Sonnet agent on 139 contextual injection tasks.

Condition	ASR	Task Success	Notes
Vanilla (no defense)	38.8% (54/139)	58.3% (81/139)	Baseline from prior runs
Dynamic guardian (Sonnet)	7.2% (10/139)	54.7% (76/139)	Runtime MCP subagent
Dynamic guardian (Haiku)	13.0% (18/139)	56.1% (78/139)	Cheaper guardian, slightly weaker
Static guardian (Sonnet)	4.3% (6/139)	18.0% (25/139)	Strongest defense, but task success drops
Static guardian (Haiku)	2.2% (3/139)	18.0% (25/139)	Lowest ASR overall

Takeaway: The dynamic guardian is the best tradeoff — strong injection filtering (7-13% ASR vs 39% baseline) while preserving task completion (55-56%). The static guardian filters more aggressively but loses too much detail for the agent to complete tasks.

What's NOT integrated yet

Guardian + Escalation/User-Sim

The escalation experiments (experiments/escalation.py, experiments/escalation_usersim.py) don't yet pass --guardian through to build_sandboxes(). Integration is straightforward (~50 LOC):

Add --guardian / --guardian-model args to ExperimentRunner
Pass through in build_sandboxes() CLI construction
Possibly add "guardian" to _extra_results_suffixes()

This is a natural next step but out of scope for this PR.

Open LLM experiments

OpenCode agent support is wired up but no runs have been done yet with DeepSeek V4 Flash or Nemotron 120B. The infra is ready (config, container setup, provider routing).

Vestigial / Experimental (not core to this PR)

The following files are from earlier exploration of fragment-based injection variants (horcruxes, parting shots, skill extension). These experiments were largely inconclusive — the injection strategies didn't produce strong or consistent results compared to contextual injections. They're included because they were developed on this branch, but they are not part of the guardian defense narrative and can be ignored for review purposes.

Vestigial files (can be moved or cleaned up separately):

HORCRUXES_DESIGN.md — design doc for fragment injection approach
data/horcruxes_injections.json — fragment injection definitions
data/parting_shot_injections.json — parting shot injection definitions
data/skill_extension_injections_peer_split.json — peer-split injection data
experiments/horcruxes.py — experiment runner for horcruxes
experiments/parting_shot.py — experiment runner for parting shots
experiments/skill_extension.py — experiment runner for skill extensions
scripts/build_horcrux_sandboxes.py — sandbox builder for fragment injections
scripts/generate_variant_injections.py — LLM-based injection generation
scripts/generate_peer_split_injections.py — peer-split injection generation
scripts/improve_peer_split_injections.py — iterative injection refinement
scripts/improve_peer_split_injections_v3.py — v3 of above
scripts/compare_contextual_vs_parting.py — comparison analysis script

Total: ~6,800 LOC of vestigial code. None of this affects the guardian experiments or existing contextual injection pipeline.

File-by-file review guide

Review carefully (core changes):

scripts/skill_guardian_server.py — new file, MCP server
scripts/build_sandbox.py — apply_guardian() + CLI changes
scripts/apply_static_guardian.py — static guardian preprocessing
scripts/run_sandbox_container.py — --output-format json, opencode support
config.py — opencode agent, model name fixes
judges/contextual_judge.py — task index default fix

Skim (documentation):

PROTOCOL.md
illustrative_examples/guardian_v2_illustrative_examples.md (dynamic guardian examples with raw transcripts)
illustrative_examples/static_guardian_v2_illustrative_examples.md (static guardian before/after comparisons)

Skip (vestigial):

Everything listed in the vestigial section above

Implement two new attack variants that test whether dispersing malicious instructions across multiple linked files affects attack success rate: - Horcrux (split-skill): task-critical content moved to auxiliary file alongside malicious instruction, forcing the agent to read it - Parting Shot (auxiliary-doc): SKILL.md references optional auxiliary file containing malicious instruction, agent can skip it Includes: - 5 attacks × 2 variants injection definitions (horcruxes/parting_shot) - Fragment-based sandbox builder (replace_lines, line, new_file) - Experiment runners mirroring contextual.py pattern - --injection-id and --first-task-only flags for controlled comparisons - Design document (HORCRUXES_DESIGN.md) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…parser bug - Add --max-tasks flag to build_sandbox.py, build_horcrux_sandboxes.py, and experiment runners for precise instance-level control - Wire --injection-id and --first-task-only through horcrux/parting_shot experiment runners - Remove fabricated task1 from injection ID 4 (base 15) in both horcruxes and parting_shot JSONs to match contextual's single task - Document verdict parser bug in judges/_base.py (TODO discuss-with-yoshi): under-reports attack_success and task_success when judge evidence is long Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Rebuild injection definitions from 5 to 12 matched instances covering base_ids 1,2,3,5,13,15,23,26,42 across pptx/docx/email-api/calendar/ hf-llm-trainer. Horcrux places malicious step as Step 4/6 in mandatory auxiliary workflow; parting shot references optional policy document. Results (5 Claude models): horcrux 0% ASR, parting shot 17-42% ASR, vs contextual baseline 100% on these instances. Indirection weakens attacks—opposite of initial hypothesis. Add generator script and analysis scripts for result comparison. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Guardian experiment: MCP-based subagent that mediates agent access to skill documentation through a conversational LLM interpreter. Includes runtime guardian (--guardian flag in build_sandbox.py) and static preprocessing (apply_static_guardian.py). Results show 50% ASR reduction with neutral guardian prompt on contextual injections. Also adds peer split injection pipeline (generate + improve scripts) and skill extension experiment runner. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Close two validity gaps in the guardian experiment: 1. Embed skill docs in server memory instead of .raw/ directory — agent can no longer bypass guardian by reading files directly. Server self-deletes on startup to prevent extraction from source. 2. Log all guardian interactions to .guardian_interactions.jsonl for post-hoc analysis of what the guardian passed through. Also: fix static guardian to use generic YAML frontmatter and delete auxiliary .md files for fair comparison, update model defaults to non-dated IDs, add PROTOCOL.md and illustrative examples doc. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

… + run-suffix)

- static_guardian_v2_illustrative_examples.md: before/after SKILL.md comparisons showing what Sonnet vs Haiku guardians filter vs leak - PR_DESCRIPTION.md: full PR description with results, review guide, and vestigial file callouts for Yoshi's review Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…nts) Extended both dynamic and static guardian to support OpenAI-compatible APIs alongside Anthropic, enabling experiments with open models (DeepSeek V4 Flash, Nemotron) via vLLM/ngrok endpoints. Changes: - skill_guardian_server.py: Added GUARDIAN_PROVIDER/BASE_URL/API_KEY_ENV config, branching logic in _call_guardian_llm() for openai vs anthropic - apply_static_guardian.py: Added --provider/--base-url/--api-key-env flags, OpenAI chat/completions support in call_guardian() - build_sandbox.py: Added --guardian-provider/--guardian-base-url/ --guardian-api-key-env flags, template injection for all 3 new placeholders - run_sandbox_container.py: Added DEEPSEEK_API_KEY/NEMOTRON_API_KEY to standard env var forwarding (for guardian MCP server), added deepseek agent type - config.py: Added deepseek agent config (model, parallelism, CLI mapping) Fully backward-compatible: defaults to anthropic provider with existing behavior. DeepSeek V4 Flash results (139 contextual injections): - Vanilla: 46.8% ASR, 60.4% task success - Dynamic guardian (DeepSeek): 2.9% ASR, 57.6% task success - Static guardian (DeepSeek): 3.6% ASR, 56.1% task success Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

yoshinari-patronus · 2026-05-15T19:59:58Z

+    server_content = server_content.replace(
+        "__EMBEDDED_SKILL_DOCS__", repr(embedded_docs)
+    )
+    guardian_dst.write_text(server_content, encoding="utf-8")


Can you patch this?

Note from varun: Tried, but was leading to some regression, so reverted back, given short time might just keep it as it is to preserve experimental validitity etc.

+            "response_preview": response[:500],
+        }
+        with _INTERACTION_LOG.open("a", encoding="utf-8") as f:
+            f.write(json.dumps(record) + "\n")


yoshinari-patronus · 2026-05-15T19:53:47Z

+        p.add_argument("--injection-id", action="append", default=[],
+                       help="Only build/run these injection IDs (repeatable, comma-separated)")
+        p.add_argument("--first-task-only", action="store_true",
+                       help="Only build the first task per injection")
+        p.add_argument("--max-tasks", type=int, default=None,


These are nice 👍

yoshinari-patronus · 2026-05-15T19:59:58Z

+    server_content = server_content.replace(
+        "__EMBEDDED_SKILL_DOCS__", repr(embedded_docs)
+    )
+    guardian_dst.write_text(server_content, encoding="utf-8")


Can you patch this?

Note from varun: Tried, but was leading to some regression, so reverted back, given short time might just keep it as it is to preserve experimental validitity etc.

yoshinari-patronus · 2026-05-15T20:09:13Z

+        # Anthropic Messages API (default)
+        body = json.dumps({
+            "model": GUARDIAN_MODEL,
+            "max_tokens": 8192,


Maybe bump this by 2x?

Gives more headroom for verbose skill docs and guardian responses, reducing risk of truncation for complex documentation. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- When agent stdout is empty but stderr has content, use stderr as fallback for agent_stdout.txt (handles OpenCode routing execution output to stderr) - Fix contextual judge to default to task index 0 when sandbox name has no _taskN suffix, enabling task evaluation for single-task sandboxes Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

vgtomahawk and others added 9 commits April 24, 2026 13:20

Merge branch 'main' into horcruxes-attack

a050015

Merge main into horcrux-skill-extension (structured tool-call capture…

18d4a66

… + run-suffix)

Move illustrative examples to dedicated directory

dcf851e

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

vgtomahawk requested a review from yoshinari-patronus May 13, 2026 20:02

github-advanced-security AI found potential problems May 15, 2026

View reviewed changes

yoshinari-patronus approved these changes May 15, 2026

View reviewed changes

yoshinari-patronus reviewed May 15, 2026

View reviewed changes

vgtomahawk and others added 2 commits May 15, 2026 21:37

Increase guardian max_tokens from 8192 to 16384

5b319d4

Gives more headroom for verbose skill docs and guardian responses, reducing risk of truncation for complex documentation. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

vgtomahawk merged commit aba291b into main May 20, 2026
3 of 7 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Contextual guardian experiments - dynamic guardian subagent, static guardian subagent#7

Contextual guardian experiments - dynamic guardian subagent, static guardian subagent#7
vgtomahawk merged 12 commits into
mainfrom
contextual-guardian-experiments

vgtomahawk commented May 13, 2026 •

edited

Loading

Uh oh!

yoshinari-patronus May 15, 2026 •

edited by vgtomahawk

Loading

Uh oh!

yoshinari-patronus May 15, 2026

Uh oh!

vgtomahawk May 18, 2026

Uh oh!

yoshinari-patronus May 15, 2026 •

edited by vgtomahawk

Loading

Uh oh!

yoshinari-patronus May 15, 2026

Uh oh!

vgtomahawk May 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

vgtomahawk commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR: Contextual Guardian Experiments

Summary

What's in this PR

Core: Guardian Defense (scripts/build_sandbox.py, scripts/skill_guardian_server.py)

Core: Static Guardian (scripts/apply_static_guardian.py)

Trace Capture (scripts/run_sandbox_container.py)

Documentation & Examples

Minor fixes

Results

What's NOT integrated yet

Guardian + Escalation/User-Sim

Open LLM experiments

Vestigial / Experimental (not core to this PR)

File-by-file review guide

Uh oh!

yoshinari-patronus May 15, 2026 • edited by vgtomahawk Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yoshinari-patronus May 15, 2026

Choose a reason for hiding this comment

Uh oh!

vgtomahawk May 18, 2026

Choose a reason for hiding this comment

Uh oh!

yoshinari-patronus May 15, 2026 • edited by vgtomahawk Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yoshinari-patronus May 15, 2026

Choose a reason for hiding this comment

Uh oh!

vgtomahawk May 18, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

vgtomahawk commented May 13, 2026 •

edited

Loading

Core: Guardian Defense (`scripts/build_sandbox.py`, `scripts/skill_guardian_server.py`)

Core: Static Guardian (`scripts/apply_static_guardian.py`)

Trace Capture (`scripts/run_sandbox_container.py`)

yoshinari-patronus May 15, 2026 •

edited by vgtomahawk

Loading

yoshinari-patronus May 15, 2026 •

edited by vgtomahawk

Loading