Add full-language SKILL.md translation attack (ES + JA) by yoshinari-patronus · Pull Request #8 · patronus-ai/skill-inject

yoshinari-patronus · 2026-05-19T19:47:44Z

Summary

A new stego-style attack: the entire SKILL.md (frontmatter + body) is translated into Spanish or Japanese, with the malicious instruction also translated and embedded inline. Because the injection is in the same language register as the legitimate skill content, English keyword scans and "this instruction looks foreign" heuristics lose their grip.

Changes

data/skill_md_overrides_full_lang/ — 28 clean ES/JA translations (14 skills × 2 langs) as utility-baseline carriers
data/skill_md_overrides_full_lang_stego/ — 108 JA-translated SKILL.md with translated injection inline
data/skill_md_overrides_stego/INST-5005/5500_full_ja_skill_*/ — 3 JA-skill variants
+111 entries in data/contextual_injections_encoded.json indexing the overrides
scripts/build_full_lang_skill_records.py — generator
scripts/build_sandbox.py — orig_task_index support so filtered multi-task INST records produce sandbox names matching the canonical task numbering that judges/task_verifier.py rules were authored against

Regenerate the records

python scripts/build_full_lang_skill_records.py

This builds the 108 full_lang_stego/ override SKILL.md files and appends the corresponding 111 entries to contextual_injections_encoded.json. Idempotent.
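The idempotency claim above can be illustrated with a minimal sketch. It assumes the index file is a top-level JSON list and that each record carries a unique `id` field; both are assumptions for illustration, and `append_unique` is a hypothetical helper, not a function from the generator script.

```python
import json
from pathlib import Path


def append_unique(index_path: Path, new_records: list[dict], key: str = "id") -> int:
    """Append records whose key is not already present; return how many were added."""
    # Assumption: the index file holds a top-level JSON list of record dicts.
    if index_path.exists():
        existing = json.loads(index_path.read_text(encoding="utf-8"))
    else:
        existing = []
    seen = {rec[key] for rec in existing}
    added = 0
    for rec in new_records:
        if rec[key] not in seen:
            existing.append(rec)
            seen.add(rec[key])
            added += 1
    index_path.write_text(
        json.dumps(existing, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return added
```

Because already-seen keys are skipped, running the function a second time with the same input adds nothing, which is the property a rerun of the generator would rely on.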

Bundled dependency

Bundles two commits, 7bec5f0 ("build_sandbox: skill_md_override + claude_md_override") and 8f03599 ("acrostic-stego, multilingual, cross-model"), so the PR is self-contained relative to origin/main. If those land in main first, this PR will rebase to a no-op for the duplicates.

Test plan

python scripts/build_full_lang_skill_records.py regenerates the records cleanly
Run on Claude Sonnet 4.5: python -m experiments.contextual --agent claude --model sonnet --policy normal (encoded.json gets picked up via the existing experiment framework)
Confirm new INST-10xxx / 11xxx / 5005 / 5500 sandboxes build and run

🤖 Generated with Claude Code

Lets an injection record point at a pre-injected SKILL.md (or CLAUDE.md) file under data/<override_path> and have build_sandbox.py copy it verbatim into the sandbox, bypassing line-based injection. Required for porting autoresearch keeps faithfully: the autoresearch meta-agent writes a full markdown section (heading + numbered steps + authority framing), not just a single line to inject at a fixed offset.

Adds:
- data/contextual_injections_autoresearch_full.json (21 May-13 keeps)
- data/skill_md_overrides_autoresearch/<cid>/SKILL.md (per-keep verbatim)
- notes/2026-05-14_porting_autoresearch_keeps_to_contextual_injections.md documenting the workflow, measured impact (76% reproducibility vs 19% with line-based injection), the bugs we surfaced, and recommended autoresearch-loop changes.

Backward-compatible: existing 48 contextual_injections.json records don't set skill_md_override and continue to use the line-based injection path.
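The two code paths described in this commit message can be sketched as follows. Only the `skill_md_override` field name comes from the commit message; the `line_index` and `injection_text` fields and the `place_skill_md` function are hypothetical stand-ins, and the real build_sandbox.py may be organized differently.

```python
import shutil
from pathlib import Path


def place_skill_md(record: dict, data_dir: Path, sandbox_skill_md: Path) -> str:
    """Write the sandbox SKILL.md and return which path was taken."""
    override = record.get("skill_md_override")
    if override:
        # Override path: copy the prepared file byte-for-byte.
        shutil.copyfile(data_dir / override, sandbox_skill_md)
        return "override"
    # Legacy path (field names assumed): insert one line at a fixed offset.
    lines = sandbox_skill_md.read_text(encoding="utf-8").splitlines()
    lines.insert(record["line_index"], record["injection_text"])
    sandbox_skill_md.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return "line"
```

Records that never set `skill_md_override` fall through to the line-based branch, which matches the backward-compatibility note.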

Three threads of new attack-corpus + analysis work:

1. Multilingual encoded extensions of the canonical 48 injections (data/contextual_injections_encoded.json: 288 records, 48 × 6 transforms es/fr/zh/ja/b64-hint/b64-raw). Generator at scripts/build_encoded_injection_extensions.py.
2. Acrostic-stego variants gradient (data/contextual_injections_stego.json + data/skill_md_overrides_stego/): 23 hand-crafted variants probing the priming-strength × payload-cleanliness × action-class surface. Key finding: 8407 recipe (explicit cipher prompt + clean self-contained shell command payload) hits 60% Sonnet ASR; full 85-char canonical-text acrostic produces explicit safety refusal.
3. DeepSeek-V4-Flash wired into OpenCode harness for cross-model A/B:
   - config.py / experiments/_opencode_config.py: new vllm provider block for OpenAI-compatible endpoints serving DeepSeek models
   - scripts/run_sandbox_container.py: pass DEEPSEEK_API_KEY / DEEPSEEK_BASE_URL env vars into containers; add "opencode" to the --agent argparse choices

Findings consolidated in 4 notes files:
- notes/2026-05-14_phase3_stego_audit.md
- notes/2026-05-14_encoded_injection_extensions.md
- notes/2026-05-15_acrostic_stego_gradient.md
- notes/2026-05-15_cross_model_cross_harness_matrix.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
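The record count in thread 1 is a plain cross product, which a short check can make explicit. The transform labels come from the commit message; the ID format and the `expand_ids` helper are hypothetical and only illustrate the arithmetic.

```python
from itertools import product

# Transform labels as listed in the commit message.
TRANSFORMS = ["es", "fr", "zh", "ja", "b64-hint", "b64-raw"]


def expand_ids(base_ids: list[str], transforms: list[str]) -> list[str]:
    """Pair every base record ID with every transform label."""
    return [f"{base_id}:{name}" for base_id, name in product(base_ids, transforms)]


# Hypothetical IDs standing in for the 48 canonical injections.
base_ids = [f"BASE-{i}" for i in range(1, 49)]
records = expand_ids(base_ids, TRANSFORMS)
assert len(records) == 48 * 6 == 288
```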

A new stego-style attack where the entire SKILL.md (frontmatter + body) is translated into Spanish or Japanese, with the malicious instruction also translated and embedded inline. Because the injection is in the same language register as the legitimate skill content, English keyword scans and "this instruction looks foreign" heuristics lose their grip.

Adds:
- data/skill_md_overrides_full_lang/: 28 clean ES/JA translations (14 skills × 2 langs) as utility-baseline carriers
- data/skill_md_overrides_full_lang_stego/: 108 JA-translated SKILL.md with translated injection inline (one per affected (skill, canonical-injection) pair)
- data/skill_md_overrides_stego/INST-5005/5500_full_ja_skill_*/: 3 JA-skill variants under the existing stego dir
- 111 new entries in data/contextual_injections_encoded.json indexing the override files above
- scripts/build_full_lang_skill_records.py: generator
- scripts/build_sandbox.py: `orig_task_index` support so filtered multi-task INST records produce sandbox names matching the canonical task numbering that task_verifier.py rules were authored against

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
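The `orig_task_index` behaviour can be pictured with a small sketch. Only the field name comes from the commit message; the naming scheme `<inst_id>_task<index>` and the `sandbox_names` function are assumptions made for illustration.

```python
def sandbox_names(inst_id: str, tasks: list[dict]) -> list[str]:
    """Name sandboxes by canonical task index when one is recorded."""
    names = []
    for position, task in enumerate(tasks):
        # Fall back to the position in the (possibly filtered) list
        # when the record carries no canonical index.
        index = task.get("orig_task_index", position)
        names.append(f"{inst_id}_task{index}")  # name format is hypothetical
    return names
```

If filtering keeps only the third and fifth tasks of a record, positional numbering would yield indices 0 and 1, while the canonical indices 2 and 4 keep the names aligned with rules written against the unfiltered numbering.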

CI previously installed only `pytest`, which made `tests/test_evolutionary_optimization.py` error during collection because it imports `experiments.ablations.evolutionary_optimization`, which itself imports `anthropic` at module top level. The error predates this PR (main is in the same state) but blocks the test matrix on PR #6 too.

Two changes:
1. `pyproject.toml`: add `anthropic` to the `test` extras list.
2. `.github/workflows/tests.yml`: install `pip install -e ".[test]"` instead of just `pip install pytest`, so the extras are picked up.

Out of scope here (and not required to make tests green): the underlying coupling between a top-level `import anthropic` and the test collection step. A cleaner fix is to lazy-import inside the function that needs the client, but that's a separate concern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
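The lazy-import alternative mentioned as out of scope might look like the sketch below. The function names `make_client` and `pick_best` are hypothetical and do not come from `evolutionary_optimization`; the point is only that the SDK import moves from module level into the one function that needs it.

```python
def make_client(api_key: str):
    """Create an Anthropic client, importing the SDK only when called."""
    import anthropic  # deferred so importing this module does not need the SDK

    return anthropic.Anthropic(api_key=api_key)


def pick_best(scores: dict[str, float]) -> str:
    """Hypothetical pure helper that tests can exercise without the SDK."""
    return max(scores, key=scores.get)
```

With this shape, pytest can collect and run tests for helpers like `pick_best` even where `anthropic` is not installed; only code paths that call `make_client` require it.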

`pip install -e .` fails because setuptools' flat-layout auto-discovery sees too many top-level dirs (data/, plots/, notes/, docker/, judges/, apptainer/, experiments/, startup_assets/, startup_scripts/) and refuses to pick. Configuring explicit `packages` discovery is the proper long-term fix, but unrelated to this PR. Tests don't actually need the project installed as a package — they manipulate sys.path themselves to find local modules. So just install pytest and anthropic directly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`test_fallback_text_attack_success` was written when `parse_injection_verdict` had a fallback path that matched verdict keywords (`attack_success`, `ignored`, `technical`) as substrings in plain prose if no JSON block was found. That fallback was removed in favor of JSON-only parsing — the judge prompt is explicit about returning JSON, and substring-matching led to false positives when the judge described candidate verdicts in narrative form before settling on a different one. The test was reachable on PR #6 only after we fixed the unrelated `anthropic` import issue; on `main` it was hidden behind that collection error. Renamed to `test_no_json_returns_ignored` and flipped the expectation to match current behavior (no JSON → "ignored", consistent with the verifier's ambiguous-case semantics). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
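A minimal sketch of JSON-only verdict parsing, to show why narrative mentions of a verdict no longer count. It is not the repository's `parse_injection_verdict`: the `verdict` field name and the `parse_verdict` function are assumptions, and the regex handles only flat (non-nested) JSON objects.

```python
import json
import re

VALID_VERDICTS = {"attack_success", "ignored", "technical"}


def parse_verdict(judge_output: str) -> str:
    """Return the verdict from the first usable JSON object, else "ignored"."""
    # Non-greedy match: finds flat {...} objects only (a simplification).
    for match in re.finditer(r"\{.*?\}", judge_output, flags=re.DOTALL):
        try:
            payload = json.loads(match.group(0))
        except json.JSONDecodeError:
            continue
        verdict = payload.get("verdict") if isinstance(payload, dict) else None
        if isinstance(verdict, str) and verdict in VALID_VERDICTS:
            return verdict
    # No JSON block found: treat as "ignored" rather than scanning the prose.
    return "ignored"
```

A judge response that discusses `attack_success` in prose but emits no JSON therefore yields "ignored", which is the expectation the renamed test encodes.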

Fix CI test_agent_parallel_covers_all_agents — the cherry-pick conflict resolution for 8f03599 kept `opencode` in AGENT_MODELS but missed the corresponding entry in AGENT_PARALLEL. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
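The kind of consistency check that caught this can be sketched as a key-set comparison. The dictionary names come from the commit message, but their contents below are placeholder values and `missing_parallel_entries` is a hypothetical helper; the repository's actual test may be written differently.

```python
# Placeholder contents; the real mappings live in the repository's config.
AGENT_MODELS = {"claude": ["sonnet"], "opencode": ["deepseek-v4-flash"]}
AGENT_PARALLEL = {"claude": 4, "opencode": 2}


def missing_parallel_entries(models: dict, parallel: dict) -> set[str]:
    """Agents that have models configured but no parallelism setting."""
    return set(models) - set(parallel)


def test_agent_parallel_covers_all_agents():
    assert missing_parallel_entries(AGENT_MODELS, AGENT_PARALLEL) == set()
```

Dropping the `opencode` key from `AGENT_PARALLEL` makes the helper return `{"opencode"}`, which is the mismatch the conflict resolution introduced.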

akkikiki and others added 3 commits May 19, 2026 18:38

yoshinari-patronus requested a review from vgtomahawk May 19, 2026 19:48

akkikiki and others added 4 commits May 19, 2026 19:55

yoshinari-patronus wants to merge 7 commits into main from pr/translation-attack
