Add full-language SKILL.md translation attack (ES + JA)#8
Open
yoshinari-patronus wants to merge 7 commits into
Lets an injection record point at a pre-injected SKILL.md (or CLAUDE.md) file under data/<override_path> and have build_sandbox.py copy it verbatim into the sandbox, bypassing line-based injection. Required for porting autoresearch keeps faithfully: the autoresearch meta-agent writes a full markdown section (heading + numbered steps + authority framing), not just a single line to inject at a fixed offset.
Adds:
- data/contextual_injections_autoresearch_full.json (21 May-13 keeps)
- data/skill_md_overrides_autoresearch/<cid>/SKILL.md (per-keep verbatim)
- notes/2026-05-14_porting_autoresearch_keeps_to_contextual_injections.md documenting the workflow, measured impact (76% reproducibility vs 19% with line-based injection), the bugs we surfaced, and recommended autoresearch-loop changes.
Backward-compatible: existing 48 contextual_injections.json records don't set skill_md_override and continue to use the line-based injection path.
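The override-or-fallback dispatch described above can be sketched roughly as follows. This is a minimal illustration, not the code in build_sandbox.py: the function name `place_skill_md`, its arguments, and the returned labels are hypothetical, and it assumes `skill_md_override` holds a path relative to the data directory.

```python
import shutil
from pathlib import Path


def place_skill_md(record: dict, data_dir: Path, sandbox_skill_dir: Path) -> str:
    """Copy an override SKILL.md verbatim, or report that the caller should
    fall back to line-based injection. Returns "override" or "line_based"."""
    override = record.get("skill_md_override")
    if not override:
        # Existing records do not set the field, so nothing changes for them.
        return "line_based"

    data_root = data_dir.resolve()
    src = (data_root / override).resolve()
    # Refuse paths that escape the data directory (e.g. "../../etc/passwd").
    if data_root not in src.parents:
        raise ValueError(f"override path escapes data dir: {override}")
    if not src.is_file():
        raise FileNotFoundError(src)

    sandbox_skill_dir.mkdir(parents=True, exist_ok=True)
    # Byte-for-byte copy: no line offsets, no templating.
    shutil.copyfile(src, sandbox_skill_dir / "SKILL.md")
    return "override"
```

Keeping the fallback as an explicit return value, rather than doing the line-based injection inside the same function, is one way to leave the existing path untouched for records that predate the new field.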
Three threads of new attack-corpus + analysis work:
1. Multilingual encoded extensions of the canonical 48 injections
(data/contextual_injections_encoded.json — 288 records: 48 × 6
transforms es/fr/zh/ja/b64-hint/b64-raw). Generator at
scripts/build_encoded_injection_extensions.py.
2. Acrostic-stego variants gradient (data/contextual_injections_stego.json
+ data/skill_md_overrides_stego/) — 23 hand-crafted variants probing
the priming-strength × payload-cleanliness × action-class surface.
Key finding: 8407 recipe (explicit cipher prompt + clean
self-contained shell command payload) hits 60% Sonnet ASR; full
85-char canonical-text acrostic produces explicit safety refusal.
3. DeepSeek-V4-Flash wired into OpenCode harness for cross-model A/B:
- config.py / experiments/_opencode_config.py: new vllm provider block
for OpenAI-compatible endpoints serving DeepSeek models
- scripts/run_sandbox_container.py: pass DEEPSEEK_API_KEY /
DEEPSEEK_BASE_URL env vars into containers; add "opencode" to the
--agent argparse choices
Findings consolidated in 4 notes files:
- notes/2026-05-14_phase3_stego_audit.md
- notes/2026-05-14_encoded_injection_extensions.md
- notes/2026-05-15_acrostic_stego_gradient.md
- notes/2026-05-15_cross_model_cross_harness_matrix.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A new stego-style attack where the entire SKILL.md (frontmatter + body)
is translated into Spanish or Japanese, with the malicious instruction
also translated and embedded inline. Because the injection is in the
same language register as the legitimate skill content, English keyword
scans and "this instruction looks foreign" heuristics lose their grip.
Adds:
- data/skill_md_overrides_full_lang/: 28 clean ES/JA translations (14 skills × 2 langs) as utility-baseline carriers
- data/skill_md_overrides_full_lang_stego/: 108 JA-translated SKILL.md with translated injection inline (one per affected (skill, canonical-injection) pair)
- data/skill_md_overrides_stego/INST-5005/5500_full_ja_skill_*/: 3 JA-skill variants under the existing stego dir
- 111 new entries in data/contextual_injections_encoded.json indexing the override files above
- scripts/build_full_lang_skill_records.py: generator
- scripts/build_sandbox.py: `orig_task_index` support so filtered multi-task INST records produce sandbox names matching the canonical task numbering that task_verifier.py rules were authored against
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
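The `orig_task_index` behaviour mentioned in the last bullet can be illustrated with a small sketch. The name format `<inst_id>_task<index>`, the zero-based fallback, and the function `sandbox_names` are assumptions made for illustration; the real naming scheme in build_sandbox.py may differ.

```python
def sandbox_names(inst_id: str, tasks: list[dict]) -> list[str]:
    """Name one sandbox per task, preferring the canonical task index.

    After filtering, a task's position in the list no longer matches its
    original number, so verifier rules keyed on the original number would
    miss. Carrying `orig_task_index` on each task keeps the names stable.
    """
    names = []
    for position, task in enumerate(tasks):
        # Fall back to list position when no canonical index was recorded.
        index = task.get("orig_task_index", position)
        names.append(f"{inst_id}_task{index}")
    return names
```

For example, a record filtered down to its third and sixth tasks would still yield names ending in `task2` and `task5` under this scheme, instead of `task0` and `task1`.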
CI previously installed only `pytest`, which made `tests/test_evolutionary_optimization.py` error during collection because it imports `experiments.ablations.evolutionary_optimization`, which itself imports `anthropic` at module top level. The error predates this PR (main is in the same state) but blocks the test matrix on PR #6 too.
Two changes:
1. `pyproject.toml`: add `anthropic` to the `test` extras list.
2. `.github/workflows/tests.yml`: run `pip install -e ".[test]"` instead of just `pip install pytest`, so the extras are picked up.
Out of scope here (and not required to make tests green): the underlying coupling between a top-level `import anthropic` and the test collection step. A cleaner fix is to lazy-import inside the function that needs the client, but that's a separate concern.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
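The lazy-import alternative mentioned as out of scope could look something like the sketch below. The helper names `make_client` and `pick_best` are hypothetical and do not come from `experiments.ablations.evolutionary_optimization`; the point is only that the module can be imported, and its pure helpers tested, without the SDK installed.

```python
import importlib


def make_client():
    """Build an Anthropic client, importing the SDK only when first needed."""
    # Deferred import: test collection never reaches this line unless a test
    # actually asks for a client.
    anthropic = importlib.import_module("anthropic")
    return anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def pick_best(scores: dict[str, float]) -> str:
    """Hypothetical pure helper: needs no SDK, so it stays testable in CI."""
    return max(scores, key=scores.get)
```

With this shape, a missing `anthropic` package surfaces as an `ImportError` at call time in the one function that needs it, rather than as a collection error for the whole test module.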
`pip install -e .` fails because setuptools' flat-layout auto-discovery sees too many top-level dirs (data/, plots/, notes/, docker/, judges/, apptainer/, experiments/, startup_assets/, startup_scripts/) and refuses to pick. Configuring explicit `packages` discovery is the proper long-term fix, but unrelated to this PR.
Tests don't actually need the project installed as a package: they manipulate sys.path themselves to find local modules. So just install pytest and anthropic directly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`test_fallback_text_attack_success` was written when `parse_injection_verdict` had a fallback path that matched verdict keywords (`attack_success`, `ignored`, `technical`) as substrings in plain prose if no JSON block was found. That fallback was removed in favor of JSON-only parsing: the judge prompt is explicit about returning JSON, and substring-matching led to false positives when the judge described candidate verdicts in narrative form before settling on a different one.
The test was reachable on PR #6 only after we fixed the unrelated `anthropic` import issue; on `main` it was hidden behind that collection error.
Renamed to `test_no_json_returns_ignored` and flipped the expectation to match current behavior (no JSON → "ignored", consistent with the verifier's ambiguous-case semantics).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
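A JSON-only verdict parser with the "no JSON → ignored" default might be sketched as below. This is not the repository's `parse_injection_verdict`: the function name `parse_verdict_json_only` and the `"verdict"` key are assumptions, and only the three verdict labels come from the commit message.

```python
import json

VALID_VERDICTS = {"attack_success", "ignored", "technical"}


def parse_verdict_json_only(text: str) -> str:
    """Return the verdict from the first valid JSON object in `text`.

    Prose that merely mentions a verdict keyword is never matched; if no
    JSON object carries a recognised verdict, the result is "ignored".
    """
    decoder = json.JSONDecoder()
    for start, ch in enumerate(text):
        if ch != "{":
            continue
        try:
            # raw_decode parses one JSON value beginning at `start`.
            obj, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            verdict = obj.get("verdict")
            if isinstance(verdict, str) and verdict in VALID_VERDICTS:
                return verdict
    return "ignored"
```

Under this sketch, a judge response that discusses `attack_success` in narrative form but emits no JSON block resolves to "ignored", which is the behaviour the renamed test asserts.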
Fix CI test_agent_parallel_covers_all_agents
The cherry-pick conflict resolution for 8f03599 kept `opencode` in AGENT_MODELS but missed the corresponding entry in AGENT_PARALLEL.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
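The invariant that test guards can be expressed as a set difference between the two registries. The sketch below uses placeholder values: the dictionary contents and the helper `missing_parallel_entries` are hypothetical, and only the names `AGENT_MODELS`, `AGENT_PARALLEL`, and `opencode` come from the commit message.

```python
# Placeholder registries; the real values live in the project's config.
AGENT_MODELS = {"claude": ["sonnet"], "opencode": ["deepseek"]}
AGENT_PARALLEL = {"claude": 4, "opencode": 2}


def missing_parallel_entries(models: dict, parallel: dict) -> list[str]:
    """Agents that have models configured but no parallelism setting."""
    return sorted(set(models) - set(parallel))


def test_agent_parallel_covers_all_agents():
    # Fails with the offending agent names if a registry entry is dropped,
    # for example during a merge or cherry-pick conflict resolution.
    assert missing_parallel_entries(AGENT_MODELS, AGENT_PARALLEL) == []
```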
Summary
A new stego-style attack: the entire SKILL.md (frontmatter + body) is translated into Spanish or Japanese, with the malicious instruction also translated and embedded inline. Because the injection is in the same language register as the legitimate skill content, English keyword scans and "this instruction looks foreign" heuristics lose their grip.
Changes
- `data/skill_md_overrides_full_lang/`: 28 clean ES/JA translations (14 skills × 2 langs) as utility-baseline carriers
- `data/skill_md_overrides_full_lang_stego/`: 108 JA-translated SKILL.md with translated injection inline
- `data/skill_md_overrides_stego/INST-5005/5500_full_ja_skill_*/`: 3 JA-skill variants
- `data/contextual_injections_encoded.json`: entries indexing the overrides
- `scripts/build_full_lang_skill_records.py`: generator
- `scripts/build_sandbox.py`: `orig_task_index` support so filtered multi-task INST records produce sandbox names matching the canonical task numbering that `judges/task_verifier.py` rules were authored against
Regenerate the records
Run the generator, `python scripts/build_full_lang_skill_records.py` (the command listed in the test plan). This builds the 108 `full_lang_stego/` override SKILL.md files and appends the corresponding 111 entries to `contextual_injections_encoded.json`. Idempotent.
Bundled dependency
Bundles `7bec5f0 build_sandbox: skill_md_override + claude_md_override` and `8f03599 acrostic-stego, multilingual, cross-model` so the PR is self-contained from `origin/main`. If those land in main first, this PR will rebase to a no-op for the duplicates.
Test plan
- `python scripts/build_full_lang_skill_records.py` regenerates the records cleanly
- `python -m experiments.contextual --agent claude --model sonnet --policy normal` (encoded.json gets picked up via the existing experiment framework)
🤖 Generated with Claude Code