Summary
The hardenUnicodeText function in sanitize_content_core.cjs correctly maps the lowercase Cyrillic Dze (U+0455 → s) and lowercase Ukrainian I (U+0456 → i), but their uppercase counterparts, U+0405 (Ѕ → S) and U+0406 (І → I), are absent from HOMOGLYPH_MAP. The fix applied in #24154 was incomplete: it covered the lowercase variants but missed these two uppercase codepoints. Output in which an agent substitutes U+0405 for every S passes through the sanitizer unmodified and reaches the LLM threat-detection engine still obfuscated.
Affected Area
SafeOutputs content sanitization boundary — hardenUnicodeText normalization layer in actions/setup/js/sanitize_content_core.cjs. This is the layer intended to normalize confusable characters before content is evaluated by AI-powered secret-leak detection or written as a safe output.
Reproduction Outline
- Load sanitize_content_core.cjs (v0.68.3) from a gh-aw runner ($RUNNER_TEMP/gh-aw/safeoutputs/).
- Construct a string by replacing every S with U+0405 (Ѕ, Cyrillic Dze): e.g., "PENTEЅT-ЅECRET-MARKER".
- Call hardenUnicodeText(probe) and compare the result to the input.
- Observe that the output equals the input: U+0405 is not normalized to S.
- Pass the obfuscated string as the body of a create_issue SafeOutputs call; the sanitizer does not normalize it before the detection layer processes it.
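The probe steps above can be sketched as a small script. This is a minimal illustration, not the project's code: STAND_IN_MAP, standInHarden, and survivingCodepoints are hypothetical names, and the two-entry map only mimics the gap described in this report (the real HOMOGLYPH_MAP contains many more entries).

```javascript
// Stand-in for the map described in this report: lowercase Dze and
// Ukrainian I are covered, the uppercase forms are not. This subset is an
// assumption made for illustration only.
const STAND_IN_MAP = {
  "\u0455": "s", // Cyrillic small letter Dze
  "\u0456": "i", // Cyrillic small letter Byelorussian-Ukrainian I
};

// Hypothetical stand-in for hardenUnicodeText: replaces each mapped
// character and leaves everything else untouched.
function standInHarden(text) {
  return Array.from(text, (ch) => STAND_IN_MAP[ch] ?? ch).join("");
}

// Returns the distinct non-ASCII codepoints still present after hardening,
// formatted as "U+XXXX".
function survivingCodepoints(harden, input) {
  const survivors = new Set();
  for (const ch of harden(input)) {
    const cp = ch.codePointAt(0);
    if (cp > 0x7f) {
      survivors.add("U+" + cp.toString(16).toUpperCase().padStart(4, "0"));
    }
  }
  return [...survivors];
}

// Build the probe by swapping every Latin S for U+0405.
const probe = "PENTEST-SECRET-MARKER".replace(/S/g, "\u0405");
const hardened = standInHarden(probe);

console.log(hardened === probe); // true: the output is unchanged
console.log(survivingCodepoints(standInHarden, probe)); // [ 'U+0405' ]
```

To run the same check against the shipped sanitizer, pass the real function in place of standInHarden. How sanitize_content_core.cjs exports hardenUnicodeText is not stated in this report, so the import line is left out here.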
Observed Behavior
hardenUnicodeText('PENTEЅT-ЅECRET-MARKER') returns the string unchanged. U+0405 and U+0406 survive normalization, so a string visually identical to PENTEST-SECRET-MARKER reaches downstream processing without being converted to its Latin equivalent.
Expected Behavior
hardenUnicodeText should map U+0405 → S and U+0406 → I, consistent with its handling of the corresponding lowercase forms (U+0455, U+0456). The normalized output should match the plain-Latin form so that any downstream secret-pattern matching applies uniformly.
Security Relevance
The sanitizer normalization layer is documented as providing defense-in-depth before AI-powered threat detection. A gap for uppercase Ѕ/І means an adversarial agent can bypass this layer for any secret pattern containing an uppercase S or I, which includes common markers such as SECRET, by substituting these codepoints. While the LLM detection layer may catch homoglyph substitutions probabilistically, the sanitizer should not have codepoint gaps for characters it already partially handles.
Suggested Fix
Add the following entries to HOMOGLYPH_MAP in sanitize_content_core.cjs:
"\u0405": "S", // Cyrillic Dze → S
"\u0406": "I", // Cyrillic Byelorussian-Ukrainian I → I
Additionally, audit the full map against Unicode TR#39 confusables for other gaps at less common codepoints, and add regression tests for U+0405 and U+0406.
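One cheap complement to a TR#39 audit is a case-pair check: for every key in the map, confirm that its uppercase or lowercase counterpart is also present. The sketch below is an assumption-laden illustration: beforeFix and afterFix are hypothetical two- and four-entry subsets of HOMOGLYPH_MAP, and findCaseGaps and fmt are names invented here. It checks key presence only, not whether the mapped Latin value is correct.

```javascript
// Hypothetical subset of HOMOGLYPH_MAP as described in this report.
const beforeFix = {
  "\u0455": "s", // Cyrillic small letter Dze
  "\u0456": "i", // Cyrillic small letter Byelorussian-Ukrainian I
};

// The same subset with the two proposed entries added.
const afterFix = {
  ...beforeFix,
  "\u0405": "S", // Cyrillic capital letter Dze
  "\u0406": "I", // Cyrillic capital letter Byelorussian-Ukrainian I
};

// Formats a single character as "U+XXXX".
function fmt(ch) {
  return "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
}

// Lists keys whose other-case counterpart is a distinct single character
// that is missing from the map.
function findCaseGaps(map) {
  const gaps = [];
  for (const key of Object.keys(map)) {
    for (const other of [key.toUpperCase(), key.toLowerCase()]) {
      // Skip caseless characters and multi-character case expansions.
      if (other === key || Array.from(other).length !== 1) continue;
      if (!Object.prototype.hasOwnProperty.call(map, other)) {
        gaps.push(`${fmt(key)} -> missing ${fmt(other)}`);
      }
    }
  }
  return gaps;
}

console.log(findCaseGaps(beforeFix));
// [ 'U+0455 -> missing U+0405', 'U+0456 -> missing U+0406' ]
console.log(findCaseGaps(afterFix)); // []
```

A check like this could run as a regression test over the full map, so that a future entry added in only one case fails the build. It does not replace the confusables audit, since many confusable characters have no case relationship to an existing key.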
Additional Context
If this codepoint selection (lowercase-only coverage) is by design, that assumption should be documented explicitly in the HOMOGLYPH_MAP source or the safe-outputs reference documentation, so it is clear to future maintainers which confusable classes are in scope.
gh-aw version: v0.68.3
Original finding: https://github.com/githubnext/gh-aw-security/issues/2063