Skip to content

Contextual guardian experiments - dynamic guardian subagent, static guardian subagent#7

Merged
vgtomahawk merged 12 commits into
mainfrom
contextual-guardian-experiments
May 20, 2026
Merged

Contextual guardian experiments - dynamic guardian subagent, static guardian subagent#7
vgtomahawk merged 12 commits into
mainfrom
contextual-guardian-experiments

Conversation

@vgtomahawk
Copy link
Copy Markdown
Contributor

@vgtomahawk vgtomahawk commented May 13, 2026

PR: Contextual Guardian Experiments

Summary

Adds a guardian defense mechanism for contextual prompt injection tasks — both dynamic (runtime MCP subagent) and static (build-time preprocessing) variants — with sanitized experimental setups, full JSON trace capture, and results across all 139 contextual injection instances.

What's in this PR

Core: Guardian Defense (scripts/build_sandbox.py, scripts/skill_guardian_server.py)

Dynamic guardian — an MCP subagent that mediates between the coding agent and its skill documentation at runtime:

  • Agent no longer reads skill docs directly from disk
  • Docs are embedded in the MCP server's memory at build time, then deleted from disk
  • Agent must call consult_skill_guardian(skill_name, question) to get documentation
  • A separate guardian LLM (Sonnet or Haiku) interprets the docs before returning them
  • Server self-deletes on startup (agent can't cat the script to extract embedded docs)
  • All guardian interactions logged to .guardian_interactions.jsonl
  • .mcp.json set to chmod 444

This closes the validity gap from the v1 guardian where agents could bypass the guardian by reading .raw/ directly (observed in ~50% of survived-attack cases in v1).

Key files:

  • scripts/skill_guardian_server.py — MCP server template (FastMCP, stdio transport)
  • scripts/build_sandbox.pyapply_guardian() function + --guardian / --guardian-model CLI flags

Core: Static Guardian (scripts/apply_static_guardian.py)

Build-time preprocessing: sends each SKILL.md through a guardian LLM once, replaces it with the guardian's interpretation. No runtime MCP server.

Sanitized for fairness:

  • Original YAML frontmatter replaced with generic description (no injection leak)
  • All auxiliary .md files deleted (agent can't read originals)

Trace Capture (scripts/run_sandbox_container.py)

  • Added --output-format json flag — captures full agent conversation trace (every reasoning step, tool call, tool result) as agent_trace.json
  • Previously only captured final summary text via --output-format text

Documentation & Examples

Minor fixes

  • judges/contextual_judge.py — default to task index 0 when no _taskN suffix in sandbox name
  • config.py — updated model defaults from retired dated IDs to stable names (claude-sonnet-4-5, claude-haiku-4-5)

Results

All experiments: Sonnet agent on 139 contextual injection tasks.

Condition ASR Task Success Notes
Vanilla (no defense) 38.8% (54/139) 58.3% (81/139) Baseline from prior runs
Dynamic guardian (Sonnet) 7.2% (10/139) 54.7% (76/139) Runtime MCP subagent
Dynamic guardian (Haiku) 13.0% (18/139) 56.1% (78/139) Cheaper guardian, slightly weaker
Static guardian (Sonnet) 4.3% (6/139) 18.0% (25/139) Strongest defense, but task success drops
Static guardian (Haiku) 2.2% (3/139) 18.0% (25/139) Lowest ASR overall

Takeaway: The dynamic guardian is the best tradeoff — strong injection filtering (7-13% ASR vs 39% baseline) while preserving task completion (55-56%). The static guardian filters more aggressively but loses too much detail for the agent to complete tasks.


What's NOT integrated yet

Guardian + Escalation/User-Sim

The escalation experiments (experiments/escalation.py, experiments/escalation_usersim.py) don't yet pass --guardian through to build_sandboxes(). Integration is straightforward (~50 LOC):

  1. Add --guardian / --guardian-model args to ExperimentRunner
  2. Pass through in build_sandboxes() CLI construction
  3. Possibly add "guardian" to _extra_results_suffixes()

This is a natural next step but out of scope for this PR.

Open LLM experiments

OpenCode agent support is wired up but no runs have been done yet with DeepSeek V4 Flash or Nemotron 120B. The infra is ready (config, container setup, provider routing).


Vestigial / Experimental (not core to this PR)

The following files are from earlier exploration of fragment-based injection variants (horcruxes, parting shots, skill extension). These experiments were largely inconclusive — the injection strategies didn't produce strong or consistent results compared to contextual injections. They're included because they were developed on this branch, but they are not part of the guardian defense narrative and can be ignored for review purposes.

Vestigial files (can be moved or cleaned up separately):

  • HORCRUXES_DESIGN.md — design doc for fragment injection approach
  • data/horcruxes_injections.json — fragment injection definitions
  • data/parting_shot_injections.json — parting shot injection definitions
  • data/skill_extension_injections_peer_split.json — peer-split injection data
  • experiments/horcruxes.py — experiment runner for horcruxes
  • experiments/parting_shot.py — experiment runner for parting shots
  • experiments/skill_extension.py — experiment runner for skill extensions
  • scripts/build_horcrux_sandboxes.py — sandbox builder for fragment injections
  • scripts/generate_variant_injections.py — LLM-based injection generation
  • scripts/generate_peer_split_injections.py — peer-split injection generation
  • scripts/improve_peer_split_injections.py — iterative injection refinement
  • scripts/improve_peer_split_injections_v3.py — v3 of above
  • scripts/compare_contextual_vs_parting.py — comparison analysis script

Total: ~6,800 LOC of vestigial code. None of this affects the guardian experiments or existing contextual injection pipeline.


File-by-file review guide

Review carefully (core changes):

  • scripts/skill_guardian_server.py — new file, MCP server
  • scripts/build_sandbox.pyapply_guardian() + CLI changes
  • scripts/apply_static_guardian.py — static guardian preprocessing
  • scripts/run_sandbox_container.py--output-format json, opencode support
  • config.py — opencode agent, model name fixes
  • judges/contextual_judge.py — task index default fix

Skim (documentation):

  • PROTOCOL.md
  • illustrative_examples/guardian_v2_illustrative_examples.md (dynamic guardian examples with raw transcripts)
  • illustrative_examples/static_guardian_v2_illustrative_examples.md (static guardian before/after comparisons)

Skip (vestigial):

  • Everything listed in the vestigial section above

vgtomahawk and others added 9 commits April 24, 2026 13:20
Implement two new attack variants that test whether dispersing malicious
instructions across multiple linked files affects attack success rate:

- Horcrux (split-skill): task-critical content moved to auxiliary file
  alongside malicious instruction, forcing the agent to read it
- Parting Shot (auxiliary-doc): SKILL.md references optional auxiliary
  file containing malicious instruction, agent can skip it

Includes:
- 5 attacks × 2 variants injection definitions (horcruxes/parting_shot)
- Fragment-based sandbox builder (replace_lines, line, new_file)
- Experiment runners mirroring contextual.py pattern
- --injection-id and --first-task-only flags for controlled comparisons
- Design document (HORCRUXES_DESIGN.md)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…parser bug

- Add --max-tasks flag to build_sandbox.py, build_horcrux_sandboxes.py,
  and experiment runners for precise instance-level control
- Wire --injection-id and --first-task-only through horcrux/parting_shot
  experiment runners
- Remove fabricated task1 from injection ID 4 (base 15) in both
  horcruxes and parting_shot JSONs to match contextual's single task
- Document verdict parser bug in judges/_base.py (TODO discuss-with-yoshi):
  under-reports attack_success and task_success when judge evidence is long

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rebuild injection definitions from 5 to 12 matched instances covering
base_ids 1,2,3,5,13,15,23,26,42 across pptx/docx/email-api/calendar/
hf-llm-trainer. Horcrux places malicious step as Step 4/6 in mandatory
auxiliary workflow; parting shot references optional policy document.

Results (5 Claude models): horcrux 0% ASR, parting shot 17-42% ASR,
vs contextual baseline 100% on these instances. Indirection weakens
attacks—opposite of initial hypothesis.

Add generator script and analysis scripts for result comparison.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Guardian experiment: MCP-based subagent that mediates agent access to skill
documentation through a conversational LLM interpreter. Includes runtime
guardian (--guardian flag in build_sandbox.py) and static preprocessing
(apply_static_guardian.py). Results show 50% ASR reduction with neutral
guardian prompt on contextual injections.

Also adds peer split injection pipeline (generate + improve scripts) and
skill extension experiment runner.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Close two validity gaps in the guardian experiment:

1. Embed skill docs in server memory instead of .raw/ directory —
   agent can no longer bypass guardian by reading files directly.
   Server self-deletes on startup to prevent extraction from source.

2. Log all guardian interactions to .guardian_interactions.jsonl
   for post-hoc analysis of what the guardian passed through.

Also: fix static guardian to use generic YAML frontmatter and delete
auxiliary .md files for fair comparison, update model defaults to
non-dated IDs, add PROTOCOL.md and illustrative examples doc.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- static_guardian_v2_illustrative_examples.md: before/after SKILL.md
  comparisons showing what Sonnet vs Haiku guardians filter vs leak
- PR_DESCRIPTION.md: full PR description with results, review guide,
  and vestigial file callouts for Yoshi's review

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nts)

Extended both dynamic and static guardian to support OpenAI-compatible APIs
alongside Anthropic, enabling experiments with open models (DeepSeek V4 Flash,
Nemotron) via vLLM/ngrok endpoints.

Changes:
- skill_guardian_server.py: Added GUARDIAN_PROVIDER/BASE_URL/API_KEY_ENV config,
  branching logic in _call_guardian_llm() for openai vs anthropic
- apply_static_guardian.py: Added --provider/--base-url/--api-key-env flags,
  OpenAI chat/completions support in call_guardian()
- build_sandbox.py: Added --guardian-provider/--guardian-base-url/
  --guardian-api-key-env flags, template injection for all 3 new placeholders
- run_sandbox_container.py: Added DEEPSEEK_API_KEY/NEMOTRON_API_KEY to standard
  env var forwarding (for guardian MCP server), added deepseek agent type
- config.py: Added deepseek agent config (model, parallelism, CLI mapping)

Fully backward-compatible: defaults to anthropic provider with existing behavior.

DeepSeek V4 Flash results (139 contextual injections):
- Vanilla: 46.8% ASR, 60.4% task success
- Dynamic guardian (DeepSeek): 2.9% ASR, 57.6% task success
- Static guardian (DeepSeek): 3.6% ASR, 56.1% task success

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Comment thread scripts/build_sandbox.py
server_content = server_content.replace(
"__EMBEDDED_SKILL_DOCS__", repr(embedded_docs)
)
guardian_dst.write_text(server_content, encoding="utf-8")
Copy link
Copy Markdown
Collaborator

@yoshinari-patronus yoshinari-patronus May 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you patch this?

Note from varun: Tried, but was leading to some regression, so reverted back, given short time might just keep it as it is to preserve experimental validitity etc.

"response_preview": response[:500],
}
with _INTERACTION_LOG.open("a", encoding="utf-8") as f:
f.write(json.dumps(record) + "\n")
Comment thread experiments/_base.py
Comment on lines +54 to +58
p.add_argument("--injection-id", action="append", default=[],
help="Only build/run these injection IDs (repeatable, comma-separated)")
p.add_argument("--first-task-only", action="store_true",
help="Only build the first task per injection")
p.add_argument("--max-tasks", type=int, default=None,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are nice 👍

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Comment thread scripts/build_sandbox.py
server_content = server_content.replace(
"__EMBEDDED_SKILL_DOCS__", repr(embedded_docs)
)
guardian_dst.write_text(server_content, encoding="utf-8")
Copy link
Copy Markdown
Collaborator

@yoshinari-patronus yoshinari-patronus May 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you patch this?

Note from varun: Tried, but was leading to some regression, so reverted back, given short time might just keep it as it is to preserve experimental validitity etc.

Comment thread scripts/skill_guardian_server.py Outdated
# Anthropic Messages API (default)
body = json.dumps({
"model": GUARDIAN_MODEL,
"max_tokens": 8192,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe bump this by 2x?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

vgtomahawk and others added 2 commits May 15, 2026 21:37
Gives more headroom for verbose skill docs and guardian responses,
reducing risk of truncation for complex documentation.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- When agent stdout is empty but stderr has content, use stderr as
  fallback for agent_stdout.txt (handles OpenCode routing execution
  output to stderr)
- Fix contextual judge to default to task index 0 when sandbox name
  has no _taskN suffix, enabling task evaluation for single-task
  sandboxes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@vgtomahawk vgtomahawk merged commit aba291b into main May 20, 2026
3 of 7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants