Contextual guardian experiments - dynamic guardian subagent, static guardian subagent#7
Merged
Merged
Conversation
Implement two new attack variants that test whether dispersing malicious instructions across multiple linked files affects attack success rate: - Horcrux (split-skill): task-critical content moved to auxiliary file alongside malicious instruction, forcing the agent to read it - Parting Shot (auxiliary-doc): SKILL.md references optional auxiliary file containing malicious instruction, agent can skip it Includes: - 5 attacks × 2 variants injection definitions (horcruxes/parting_shot) - Fragment-based sandbox builder (replace_lines, line, new_file) - Experiment runners mirroring contextual.py pattern - --injection-id and --first-task-only flags for controlled comparisons - Design document (HORCRUXES_DESIGN.md) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…parser bug - Add --max-tasks flag to build_sandbox.py, build_horcrux_sandboxes.py, and experiment runners for precise instance-level control - Wire --injection-id and --first-task-only through horcrux/parting_shot experiment runners - Remove fabricated task1 from injection ID 4 (base 15) in both horcruxes and parting_shot JSONs to match contextual's single task - Document verdict parser bug in judges/_base.py (TODO discuss-with-yoshi): under-reports attack_success and task_success when judge evidence is long Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rebuild injection definitions from 5 to 12 matched instances covering base_ids 1,2,3,5,13,15,23,26,42 across pptx/docx/email-api/calendar/ hf-llm-trainer. Horcrux places malicious step as Step 4/6 in mandatory auxiliary workflow; parting shot references optional policy document. Results (5 Claude models): horcrux 0% ASR, parting shot 17-42% ASR, vs contextual baseline 100% on these instances. Indirection weakens attacks—opposite of initial hypothesis. Add generator script and analysis scripts for result comparison. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Guardian experiment: MCP-based subagent that mediates agent access to skill documentation through a conversational LLM interpreter. Includes runtime guardian (--guardian flag in build_sandbox.py) and static preprocessing (apply_static_guardian.py). Results show 50% ASR reduction with neutral guardian prompt on contextual injections. Also adds peer split injection pipeline (generate + improve scripts) and skill extension experiment runner. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Close two validity gaps in the guardian experiment: 1. Embed skill docs in server memory instead of .raw/ directory — agent can no longer bypass guardian by reading files directly. Server self-deletes on startup to prevent extraction from source. 2. Log all guardian interactions to .guardian_interactions.jsonl for post-hoc analysis of what the guardian passed through. Also: fix static guardian to use generic YAML frontmatter and delete auxiliary .md files for fair comparison, update model defaults to non-dated IDs, add PROTOCOL.md and illustrative examples doc. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- static_guardian_v2_illustrative_examples.md: before/after SKILL.md comparisons showing what Sonnet vs Haiku guardians filter vs leak - PR_DESCRIPTION.md: full PR description with results, review guide, and vestigial file callouts for Yoshi's review Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nts) Extended both dynamic and static guardian to support OpenAI-compatible APIs alongside Anthropic, enabling experiments with open models (DeepSeek V4 Flash, Nemotron) via vLLM/ngrok endpoints. Changes: - skill_guardian_server.py: Added GUARDIAN_PROVIDER/BASE_URL/API_KEY_ENV config, branching logic in _call_guardian_llm() for openai vs anthropic - apply_static_guardian.py: Added --provider/--base-url/--api-key-env flags, OpenAI chat/completions support in call_guardian() - build_sandbox.py: Added --guardian-provider/--guardian-base-url/ --guardian-api-key-env flags, template injection for all 3 new placeholders - run_sandbox_container.py: Added DEEPSEEK_API_KEY/NEMOTRON_API_KEY to standard env var forwarding (for guardian MCP server), added deepseek agent type - config.py: Added deepseek agent config (model, parallelism, CLI mapping) Fully backward-compatible: defaults to anthropic provider with existing behavior. DeepSeek V4 Flash results (139 contextual injections): - Vanilla: 46.8% ASR, 60.4% task success - Dynamic guardian (DeepSeek): 2.9% ASR, 57.6% task success - Static guardian (DeepSeek): 3.6% ASR, 56.1% task success Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
| server_content = server_content.replace( | ||
| "__EMBEDDED_SKILL_DOCS__", repr(embedded_docs) | ||
| ) | ||
| guardian_dst.write_text(server_content, encoding="utf-8") |
Collaborator
There was a problem hiding this comment.
Can you patch this?
Note from varun: Tried, but was leading to some regression, so reverted back, given short time might just keep it as it is to preserve experimental validitity etc.
| "response_preview": response[:500], | ||
| } | ||
| with _INTERACTION_LOG.open("a", encoding="utf-8") as f: | ||
| f.write(json.dumps(record) + "\n") |
yoshinari-patronus
approved these changes
May 15, 2026
Comment on lines
+54
to
+58
| p.add_argument("--injection-id", action="append", default=[], | ||
| help="Only build/run these injection IDs (repeatable, comma-separated)") | ||
| p.add_argument("--first-task-only", action="store_true", | ||
| help="Only build the first task per injection") | ||
| p.add_argument("--max-tasks", type=int, default=None, |
Collaborator
There was a problem hiding this comment.
These are nice 👍
| server_content = server_content.replace( | ||
| "__EMBEDDED_SKILL_DOCS__", repr(embedded_docs) | ||
| ) | ||
| guardian_dst.write_text(server_content, encoding="utf-8") |
Collaborator
There was a problem hiding this comment.
Can you patch this?
Note from varun: Tried, but was leading to some regression, so reverted back, given short time might just keep it as it is to preserve experimental validitity etc.
| # Anthropic Messages API (default) | ||
| body = json.dumps({ | ||
| "model": GUARDIAN_MODEL, | ||
| "max_tokens": 8192, |
Collaborator
There was a problem hiding this comment.
Maybe bump this by 2x?
Gives more headroom for verbose skill docs and guardian responses, reducing risk of truncation for complex documentation. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- When agent stdout is empty but stderr has content, use stderr as fallback for agent_stdout.txt (handles OpenCode routing execution output to stderr) - Fix contextual judge to default to task index 0 when sandbox name has no _taskN suffix, enabling task evaluation for single-task sandboxes Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
PR: Contextual Guardian Experiments
Summary
Adds a guardian defense mechanism for contextual prompt injection tasks — both dynamic (runtime MCP subagent) and static (build-time preprocessing) variants — with sanitized experimental setups, full JSON trace capture, and results across all 139 contextual injection instances.
What's in this PR
Core: Guardian Defense (
scripts/build_sandbox.py,scripts/skill_guardian_server.py)Dynamic guardian — an MCP subagent that mediates between the coding agent and its skill documentation at runtime:
consult_skill_guardian(skill_name, question)to get documentationcatthe script to extract embedded docs).guardian_interactions.jsonl.mcp.jsonset to chmod 444This closes the validity gap from the v1 guardian where agents could bypass the guardian by reading
.raw/directly (observed in ~50% of survived-attack cases in v1).Key files:
scripts/skill_guardian_server.py— MCP server template (FastMCP, stdio transport)scripts/build_sandbox.py—apply_guardian()function +--guardian/--guardian-modelCLI flagsCore: Static Guardian (
scripts/apply_static_guardian.py)Build-time preprocessing: sends each SKILL.md through a guardian LLM once, replaces it with the guardian's interpretation. No runtime MCP server.
Sanitized for fairness:
.mdfiles deleted (agent can't read originals)Trace Capture (
scripts/run_sandbox_container.py)--output-format jsonflag — captures full agent conversation trace (every reasoning step, tool call, tool result) asagent_trace.json--output-format textDocumentation & Examples
PROTOCOL.md— operational protocol (virtualenv, API keys, model names, build scripts, parallelism, verification checklist)illustrative_examples/guardian_v2_illustrative_examples.md— dynamic guardian: 4 detailed examples with raw guardian Q&A transcripts, showing how the MCP subagent filters (or leaks) injections in real timeillustrative_examples/static_guardian_v2_illustrative_examples.md— static guardian: 4 examples showing before/after SKILL.md content — what the guardian's one-shot interpretation preserved vs filtered, and how Sonnet and Haiku guardians differMinor fixes
judges/contextual_judge.py— default to task index 0 when no_taskNsuffix in sandbox nameconfig.py— updated model defaults from retired dated IDs to stable names (claude-sonnet-4-5,claude-haiku-4-5)Results
All experiments: Sonnet agent on 139 contextual injection tasks.
Takeaway: The dynamic guardian is the best tradeoff — strong injection filtering (7-13% ASR vs 39% baseline) while preserving task completion (55-56%). The static guardian filters more aggressively but loses too much detail for the agent to complete tasks.
What's NOT integrated yet
Guardian + Escalation/User-Sim
The escalation experiments (
experiments/escalation.py,experiments/escalation_usersim.py) don't yet pass--guardianthrough tobuild_sandboxes(). Integration is straightforward (~50 LOC):--guardian/--guardian-modelargs toExperimentRunnerbuild_sandboxes()CLI construction"guardian"to_extra_results_suffixes()This is a natural next step but out of scope for this PR.
Open LLM experiments
OpenCode agent support is wired up but no runs have been done yet with DeepSeek V4 Flash or Nemotron 120B. The infra is ready (config, container setup, provider routing).
Vestigial / Experimental (not core to this PR)
The following files are from earlier exploration of fragment-based injection variants (horcruxes, parting shots, skill extension). These experiments were largely inconclusive — the injection strategies didn't produce strong or consistent results compared to contextual injections. They're included because they were developed on this branch, but they are not part of the guardian defense narrative and can be ignored for review purposes.
Vestigial files (can be moved or cleaned up separately):
HORCRUXES_DESIGN.md— design doc for fragment injection approachdata/horcruxes_injections.json— fragment injection definitionsdata/parting_shot_injections.json— parting shot injection definitionsdata/skill_extension_injections_peer_split.json— peer-split injection dataexperiments/horcruxes.py— experiment runner for horcruxesexperiments/parting_shot.py— experiment runner for parting shotsexperiments/skill_extension.py— experiment runner for skill extensionsscripts/build_horcrux_sandboxes.py— sandbox builder for fragment injectionsscripts/generate_variant_injections.py— LLM-based injection generationscripts/generate_peer_split_injections.py— peer-split injection generationscripts/improve_peer_split_injections.py— iterative injection refinementscripts/improve_peer_split_injections_v3.py— v3 of abovescripts/compare_contextual_vs_parting.py— comparison analysis scriptTotal: ~6,800 LOC of vestigial code. None of this affects the guardian experiments or existing contextual injection pipeline.
File-by-file review guide
Review carefully (core changes):
scripts/skill_guardian_server.py— new file, MCP serverscripts/build_sandbox.py—apply_guardian()+ CLI changesscripts/apply_static_guardian.py— static guardian preprocessingscripts/run_sandbox_container.py—--output-format json, opencode supportconfig.py— opencode agent, model name fixesjudges/contextual_judge.py— task index default fixSkim (documentation):
PROTOCOL.mdillustrative_examples/guardian_v2_illustrative_examples.md(dynamic guardian examples with raw transcripts)illustrative_examples/static_guardian_v2_illustrative_examples.md(static guardian before/after comparisons)Skip (vestigial):