
HOMOGLYPH_MAP missing uppercase U+0405 (Ѕ → S) and U+0406 (І → I) — partial fix from #24154 leaves uppercase Cyrillic Dze and Uk [Content truncated due to length] #28924

@szabta89

Summary

The hardenUnicodeText function in sanitize_content_core.cjs correctly maps the lowercase Cyrillic Dze (U+0455 → s) and lowercase Ukrainian I (U+0456 → i), but their uppercase counterparts, U+0405 (Ѕ → S) and U+0406 (І → I), are absent from HOMOGLYPH_MAP. The fix applied in #24154 was incomplete: it covered the lowercase variants but missed these two uppercase codepoints. As a result, agent output that substitutes U+0405 for every S passes through the sanitizer layer unmodified and reaches the LLM threat-detection engine still obfuscated.

Affected Area

SafeOutputs content sanitization boundary — hardenUnicodeText normalization layer in actions/setup/js/sanitize_content_core.cjs. This is the layer intended to normalize confusable characters before content is evaluated by AI-powered secret-leak detection or written as a safe output.

Reproduction Outline

  1. Load sanitize_content_core.cjs (v0.68.3) from a gh-aw runner ($RUNNER_TEMP/gh-aw/safeoutputs/).
  2. Construct a string by replacing every S with U+0405 (Ѕ, Cyrillic Dze): e.g., "PENTEЅT-ЅECRET-MARKER".
  3. Call hardenUnicodeText(probe) and compare the result to the input.
  4. Observe that the output equals the input — U+0405 is not normalized to S.
  5. Pass the obfuscated string as the body of a create_issue SafeOutputs call; the sanitizer does not normalize it before the detection layer processes it.
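Steps 2 to 4 can be sketched as a small probe script. This is a minimal sketch, not the project's test harness: `survivingNonAscii` is a hypothetical helper, and the `require` path and named export shown in the comment are assumptions about how the module is loaded. An identity function stands in for the affected behavior so the snippet runs on its own.

```javascript
// Hypothetical helper: runs a normalizer over a probe string and reports
// which non-ASCII code points are still present in the output.
function survivingNonAscii(harden, probe) {
  const output = harden(probe);
  const survivors = [];
  for (const ch of output) {
    const cp = ch.codePointAt(0);
    if (cp > 0x7f) {
      survivors.push("U+" + cp.toString(16).toUpperCase().padStart(4, "0"));
    }
  }
  return { unchanged: output === probe, survivors };
}

// Step 2: replace every Latin S with U+0405 (Cyrillic capital Dze).
const probe = "PENTEST-SECRET-MARKER".replace(/S/g, "\u0405");

// Stand-in that mimics the reported behavior (input returned as-is).
// To test the real function, swap in something like the following
// (path and export shape are assumptions):
//   const { hardenUnicodeText } = require("./sanitize_content_core.cjs");
const identity = (s) => s;

// Steps 3 and 4: compare output to input and list surviving code points.
console.log(survivingNonAscii(identity, probe));
```

With the real function substituted for `identity`, an `unchanged` value of `true` and a non-empty `survivors` list would confirm the gap described in this report.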

Observed Behavior

hardenUnicodeText('PENTEЅT-ЅECRET-MARKER') returns the string unchanged. U+0405 and U+0406 survive normalization, so a string visually identical to PENTEST-SECRET-MARKER reaches downstream processing without being converted to its Latin equivalent.

Expected Behavior

hardenUnicodeText should map U+0405 → S and U+0406 → I, consistent with its handling of the corresponding lowercase forms (U+0455, U+0456). The normalized output should match the plain-Latin form so that any downstream secret-pattern matching applies uniformly.
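The expected mapping can be illustrated with a stand-in normalizer. This is a sketch under stated assumptions: `EXAMPLE_MAP` is an illustrative four-entry subset and `normalizeWithMap` is a hypothetical function, so neither reflects the actual contents of HOMOGLYPH_MAP or the internals of hardenUnicodeText.

```javascript
// Illustrative subset only; the real HOMOGLYPH_MAP is assumed to hold more entries.
const EXAMPLE_MAP = new Map([
  ["\u0455", "s"], // Cyrillic small Dze (already handled, per this report)
  ["\u0456", "i"], // Cyrillic small Byelorussian-Ukrainian I (already handled)
  ["\u0405", "S"], // Cyrillic capital Dze (proposed)
  ["\u0406", "I"], // Cyrillic capital Byelorussian-Ukrainian I (proposed)
]);

// Hypothetical stand-in: replaces each mapped code point with its Latin
// equivalent and leaves every other character untouched.
function normalizeWithMap(text, map) {
  let out = "";
  for (const ch of text) {
    out += map.has(ch) ? map.get(ch) : ch;
  }
  return out;
}

const obfuscated = "PENTE\u0405T-\u0405ECRET-MARKER";
console.log(normalizeWithMap(obfuscated, EXAMPLE_MAP) === "PENTEST-SECRET-MARKER");
```

Once both uppercase entries are present, the normalized probe is identical to the plain-Latin string, which is the property downstream pattern matching relies on.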

Security Relevance

The sanitizer normalization layer is documented as providing defense-in-depth before AI-powered threat detection. A gap for uppercase Ѕ/І means an adversarial agent can bypass this layer entirely for any secret pattern containing an uppercase S or I (for example, markers such as SECRET) by substituting these codepoints. While the LLM detection layer may catch homoglyph substitutions probabilistically, the sanitizer should not have codepoint gaps for characters it already partially handles.

Suggested Fix

Add the following entries to HOMOGLYPH_MAP in sanitize_content_core.cjs:

"\u0405": "S",  // Cyrillic Dze → S
"\u0406": "I",  // Cyrillic Byelorussian-Ukrainian I → I

Additionally, audit the full map against the confusables data in Unicode Technical Standard #39 (UTS #39) for other gaps at less common codepoints, and add regression tests for U+0405 and U+0406.
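One inexpensive check that could complement a confusables audit is a case-pair scan: for every lowercase key in the map, verify that its uppercase counterpart is also present. The sketch below assumes the map is a plain object keyed by single characters, as the entries above suggest; `findMissingUppercase` and `formatCodePoint` are hypothetical helpers, and this check does not replace a comparison against the UTS #39 data.

```javascript
// Returns the uppercase counterparts of mapped keys that are themselves
// missing from the map. Keys without a distinct single-character uppercase
// form are skipped.
function findMissingUppercase(map) {
  const missing = [];
  for (const key of Object.keys(map)) {
    const upper = key.toUpperCase();
    if (upper === key || [...upper].length !== 1) continue;
    if (!Object.prototype.hasOwnProperty.call(map, upper)) {
      missing.push(upper);
    }
  }
  return missing;
}

// Formats a single character as U+XXXX for readable reporting.
function formatCodePoint(ch) {
  return "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
}

// Illustrative lowercase-only subset mirroring the state described in this issue.
const lowercaseOnly = { "\u0455": "s", "\u0456": "i" };
console.log(findMissingUppercase(lowercaseOnly).map(formatCodePoint));
```

Run against the lowercase-only subset, the scan reports the two codepoints named in this issue; run as a regression test against the full map, it would flag any future lowercase-only addition.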

Additional Context

If this codepoint selection (lowercase-only coverage) is by design, that assumption should be documented explicitly in the HOMOGLYPH_MAP source or the safe-outputs reference documentation, so it is clear to future maintainers which confusable classes are in scope.

gh-aw version: v0.68.3

Original finding: https://github.com/githubnext/gh-aw-security/issues/2063

Generated by File Issue
