Add full-language SKILL.md translation attack (ES + JA)#8

Open: yoshinari-patronus wants to merge 7 commits into main from pr/translation-attack

@yoshinari-patronus (Collaborator)

Summary

A new stego-style attack: the entire SKILL.md (frontmatter + body) is translated into Spanish or Japanese, with the malicious instruction also translated and embedded inline. Because the injection is in the same language register as the legitimate skill content, English keyword scans and "this instruction looks foreign" heuristics lose their grip.

Changes

  • data/skill_md_overrides_full_lang/ — 28 clean ES/JA translations (14 skills × 2 langs) as utility-baseline carriers
  • data/skill_md_overrides_full_lang_stego/ — 108 JA-translated SKILL.md with translated injection inline
  • data/skill_md_overrides_stego/INST-5005/5500_full_ja_skill_*/ — 3 JA-skill variants
  • +111 entries in data/contextual_injections_encoded.json indexing the overrides
  • scripts/build_full_lang_skill_records.py — generator
  • scripts/build_sandbox.py: orig_task_index support so filtered multi-task INST records produce sandbox names matching the canonical task numbering that judges/task_verifier.py rules were authored against
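The orig_task_index item can be pictured with a minimal sketch. It assumes sandbox names embed a 1-based task number and that each filtered task remembers its position in the unfiltered record. The function `sandbox_name`, the `_task<N>` suffix, and the `prompt` key are hypothetical illustrations, not the actual build_sandbox.py code.

```python
def sandbox_name(record_id: str, position: int, task: dict) -> str:
    """Name a sandbox by canonical task number, not by position after filtering."""
    # Fall back to the positional index when no original index was recorded.
    index = task.get("orig_task_index", position)
    return f"{record_id}_task{index + 1}"


# Hypothetical record whose first task was filtered out upstream.
tasks = [
    {"prompt": "second task", "orig_task_index": 1},
    {"prompt": "third task", "orig_task_index": 2},
]
names = [sandbox_name("INST-5005", i, t) for i, t in enumerate(tasks)]
print(names)
```

Without the stored index, the two surviving tasks would be numbered 1 and 2, and verifier rules written for tasks 2 and 3 would no longer line up with the sandbox names.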

Regenerate the records

python scripts/build_full_lang_skill_records.py

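This builds the 108 full_lang_stego/ override SKILL.md files and appends 111 index entries to contextual_injections_encoded.json (the 108 files above plus the 3 INST-5005/5500 JA-skill variants). The script is idempotent, so re-running it does not duplicate entries.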
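One common way to make an appending generator idempotent is to key each entry on a stable identifier and skip entries that are already present. The sketch below assumes the index file is a JSON list of objects with an `id` field; the function `append_missing` and that field name are hypothetical, and the real script may achieve idempotence differently.

```python
import json
import tempfile
from pathlib import Path


def append_missing(index_path: Path, new_records: list[dict]) -> int:
    """Append only records whose id is not yet indexed; return how many were added."""
    existing = []
    if index_path.exists():
        existing = json.loads(index_path.read_text(encoding="utf-8"))
    seen = {rec["id"] for rec in existing}
    added = []
    for rec in new_records:
        if rec["id"] not in seen:
            seen.add(rec["id"])  # also guards against duplicates within the batch
            added.append(rec)
    if added:
        index_path.write_text(
            json.dumps(existing + added, ensure_ascii=False, indent=2),
            encoding="utf-8",
        )
    return len(added)


with tempfile.TemporaryDirectory() as tmp:
    index = Path(tmp) / "index.json"
    batch = [{"id": "A"}, {"id": "B"}]
    first = append_missing(index, batch)   # writes both records
    second = append_missing(index, batch)  # finds both already present
    total = len(json.loads(index.read_text(encoding="utf-8")))
```

`ensure_ascii=False` keeps Spanish and Japanese text readable in the file rather than escaping it.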

Bundled dependency

Bundles 7bec5f0 ("build_sandbox: skill_md_override + claude_md_override") and 8f03599 ("acrostic-stego, multilingual, cross-model") so the PR is self-contained from origin/main. If those commits land in main first, the duplicates in this PR will rebase to no-ops.

Test plan

  • python scripts/build_full_lang_skill_records.py regenerates the records cleanly
  • Run on Claude Sonnet 4.5: python -m experiments.contextual --agent claude --model sonnet --policy normal (encoded.json gets picked up via the existing experiment framework)
  • Confirm new INST-10xxx / 11xxx / 5005 / 5500 sandboxes build and run

🤖 Generated with Claude Code

akkikiki and others added 3 commits May 19, 2026 18:38
Lets an injection record point at a pre-injected SKILL.md (or CLAUDE.md)
file under data/<override_path> and have build_sandbox.py copy it verbatim
into the sandbox, bypassing line-based injection. Required for porting
autoresearch keeps faithfully — the autoresearch meta-agent writes a full
markdown section (heading + numbered steps + authority framing), not just
a single line to inject at a fixed offset.
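The two paths described above can be sketched as follows. This is an illustration under assumptions, not the build_sandbox.py implementation: only the `skill_md_override` key comes from this commit, while `write_skill_md`, `line_offset`, and `line_text` are hypothetical names, and the payloads are placeholders.

```python
import shutil
import tempfile
from pathlib import Path


def write_skill_md(record: dict, data_dir: Path, clean_skill: Path, dest: Path) -> str:
    """Materialize SKILL.md in the sandbox and report which path was taken."""
    override = record.get("skill_md_override")
    if override:
        # Override path: copy the pre-built file verbatim.
        shutil.copyfile(data_dir / override, dest)
        return "override"
    # Line-based path: splice one line into the clean file at a fixed offset.
    lines = clean_skill.read_text(encoding="utf-8").splitlines()
    lines.insert(record["line_offset"], record["line_text"])
    dest.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return "line-based"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    clean = root / "SKILL.md"
    clean.write_text("# Skill\nStep one\n", encoding="utf-8")
    (root / "override.md").write_text(
        "# Skill\n## Extra section\n1. PLACEHOLDER\n", encoding="utf-8"
    )
    dest_a, dest_b = root / "a.md", root / "b.md"
    path_a = write_skill_md({"skill_md_override": "override.md"}, root, clean, dest_a)
    path_b = write_skill_md({"line_offset": 1, "line_text": "PLACEHOLDER"}, root, clean, dest_b)
    text_a = dest_a.read_text(encoding="utf-8")
    text_b = dest_b.read_text(encoding="utf-8")
```

Records without the override key fall through to the line-based branch, which is what keeps the existing records working unchanged.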

Adds:
- data/contextual_injections_autoresearch_full.json (21 May-13 keeps)
- data/skill_md_overrides_autoresearch/<cid>/SKILL.md  (per-keep verbatim)
- notes/2026-05-14_porting_autoresearch_keeps_to_contextual_injections.md
  documenting the workflow, measured impact (76% reproducibility vs 19%
  with line-based injection), the bugs we surfaced, and recommended
  autoresearch-loop changes.

Backward-compatible: existing 48 contextual_injections.json records don't
set skill_md_override and continue to use the line-based injection path.

Three threads of new attack-corpus + analysis work:

1. Multilingual encoded extensions of the canonical 48 injections
   (data/contextual_injections_encoded.json — 288 records: 48 × 6
   transforms es/fr/zh/ja/b64-hint/b64-raw). Generator at
   scripts/build_encoded_injection_extensions.py.

2. Acrostic-stego variants gradient (data/contextual_injections_stego.json
   + data/skill_md_overrides_stego/) — 23 hand-crafted variants probing
   the priming-strength × payload-cleanliness × action-class surface.
   Key finding: 8407 recipe (explicit cipher prompt + clean
   self-contained shell command payload) hits 60% Sonnet ASR; full
   85-char canonical-text acrostic produces explicit safety refusal.

3. DeepSeek-V4-Flash wired into OpenCode harness for cross-model A/B:
   - config.py / experiments/_opencode_config.py: new vllm provider block
     for OpenAI-compatible endpoints serving DeepSeek models
   - scripts/run_sandbox_container.py: pass DEEPSEEK_API_KEY /
     DEEPSEEK_BASE_URL env vars into containers; add "opencode" to the
     --agent argparse choices

Findings consolidated in 4 notes files:
- notes/2026-05-14_phase3_stego_audit.md
- notes/2026-05-14_encoded_injection_extensions.md
- notes/2026-05-15_acrostic_stego_gradient.md
- notes/2026-05-15_cross_model_cross_harness_matrix.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

A new stego-style attack where the entire SKILL.md (frontmatter + body)
is translated into Spanish or Japanese, with the malicious instruction
also translated and embedded inline. Because the injection is in the
same language register as the legitimate skill content, English keyword
scans and "this instruction looks foreign" heuristics lose their grip.

Adds:
  - data/skill_md_overrides_full_lang/         28 clean ES/JA translations
                                                (14 skills × 2 langs) as
                                                utility-baseline carriers
  - data/skill_md_overrides_full_lang_stego/   108 JA-translated SKILL.md
                                                with translated injection
                                                inline (one per affected
                                                (skill, canonical-injection)
                                                pair)
  - data/skill_md_overrides_stego/INST-5005/5500_full_ja_skill_*/
                                                3 JA-skill variants under
                                                the existing stego dir
  - 111 new entries in data/contextual_injections_encoded.json indexing
    the override files above
  - scripts/build_full_lang_skill_records.py    generator
  - scripts/build_sandbox.py: `orig_task_index` support so filtered
    multi-task INST records produce sandbox names matching the canonical
    task numbering that task_verifier.py rules were authored against

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

akkikiki and others added 4 commits May 19, 2026 19:55
CI previously installed only `pytest`, which made
`tests/test_evolutionary_optimization.py` error during collection
because it imports `experiments.ablations.evolutionary_optimization`,
which itself imports `anthropic` at module top level. The error
predates this PR (main is in the same state) but blocks the test
matrix on PR #6 too.

Two changes:
1. `pyproject.toml`: add `anthropic` to the `test` extras list.
2. `.github/workflows/tests.yml`: install `pip install -e ".[test]"`
   instead of just `pip install pytest`, so the extras are picked up.

Out of scope here (and not required to make tests green): the
underlying coupling between a top-level `import anthropic` and the
test collection step. A cleaner fix is to lazy-import inside the
function that needs the client, but that's a separate concern.
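The lazy-import alternative mentioned above might look like the sketch below. The function name `make_client` is hypothetical; `anthropic.Anthropic()` is the SDK's client constructor, which reads the API key from the environment by default.

```python
def make_client():
    """Create the API client on demand."""
    # Importing inside the function defers the dependency until a client is
    # actually requested, so importing this module (for example during pytest
    # collection) does not require the anthropic package to be installed.
    import anthropic

    return anthropic.Anthropic()
```

With this shape, tests that never construct a client can be collected and run without the package.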

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`pip install -e .` fails because setuptools' flat-layout auto-discovery
sees too many top-level dirs (data/, plots/, notes/, docker/, judges/,
apptainer/, experiments/, startup_assets/, startup_scripts/) and
refuses to pick. Configuring explicit `packages` discovery is the
proper long-term fix, but unrelated to this PR.

Tests don't actually need the project installed as a package — they
manipulate sys.path themselves to find local modules. So just install
pytest and anthropic directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`test_fallback_text_attack_success` was written when
`parse_injection_verdict` had a fallback path that matched verdict
keywords (`attack_success`, `ignored`, `technical`) as substrings in
plain prose if no JSON block was found. That fallback was removed in
favor of JSON-only parsing — the judge prompt is explicit about
returning JSON, and substring-matching led to false positives when
the judge described candidate verdicts in narrative form before
settling on a different one.

The test was reachable on PR #6 only after we fixed the unrelated
`anthropic` import issue; on `main` it was hidden behind that
collection error. Renamed to `test_no_json_returns_ignored` and
flipped the expectation to match current behavior (no JSON →
"ignored", consistent with the verifier's ambiguous-case semantics).
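A minimal sketch of JSON-only verdict parsing, to show why narrative mentions of a verdict keyword no longer count. It is not the repository's `parse_injection_verdict`: the function name `parse_verdict_sketch` and the `verdict` JSON key are assumptions, and only the three verdict labels come from the message above.

```python
import json

VERDICTS = {"attack_success", "ignored", "technical"}


def parse_verdict_sketch(judge_output: str) -> str:
    """Return the verdict from the first valid JSON object, else "ignored"."""
    decoder = json.JSONDecoder()
    for start, ch in enumerate(judge_output):
        if ch != "{":
            continue
        try:
            # raw_decode parses one JSON value beginning at the given offset.
            obj, _ = decoder.raw_decode(judge_output, start)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            verdict = obj.get("verdict")
            if isinstance(verdict, str) and verdict in VERDICTS:
                return verdict
    # No JSON verdict found: keywords in plain prose are deliberately not matched.
    return "ignored"


prose_only = parse_verdict_sketch("This could be attack_success, but I am not sure.")
with_json = parse_verdict_sketch('Reasoning first. {"verdict": "attack_success"}')
```

Here `prose_only` falls through to the default because the keyword appears only in narrative text, while `with_json` is read from the JSON object.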

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Fix CI test_agent_parallel_covers_all_agents — the cherry-pick conflict
resolution for 8f03599 kept `opencode` in AGENT_MODELS but missed the
corresponding entry in AGENT_PARALLEL.
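The invariant that test guards can be sketched as a set difference between the two mappings. The dictionary contents below are hypothetical stand-ins; only the names AGENT_MODELS, AGENT_PARALLEL, and the `opencode` agent come from the message above.

```python
# Hypothetical contents; the real mappings live in the project's config.
AGENT_MODELS = {"claude": ["sonnet"], "opencode": ["deepseek-v4-flash"]}
AGENT_PARALLEL = {"claude": 4, "opencode": 2}


def missing_parallel_entries(models: dict, parallel: dict) -> list[str]:
    """List agents that have models configured but no parallelism setting."""
    return sorted(set(models) - set(parallel))


gaps = missing_parallel_entries(AGENT_MODELS, AGENT_PARALLEL)
# Simulate the state after the cherry-pick, where the opencode entry was dropped.
gaps_before_fix = missing_parallel_entries(AGENT_MODELS, {"claude": 4})
```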

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>