HOMOGLYPH_MAP missing uppercase U+0405 (Ѕ → S) and U+0406 (І → I) — partial fix from #24154 leaves uppercase Cyrillic Dze and Uk
[Content truncated due to length]

## Summary

The `hardenUnicodeText` function in `sanitize_content_core.cjs` correctly maps the **lowercase** Cyrillic Dze (U+0455 → `s`) and lowercase Ukrainian I (U+0456 → `i`), but their **uppercase** counterparts — U+0405 (Ѕ → `S`) and U+0406 (І → `I`) — are absent from `HOMOGLYPH_MAP`. The fix applied in #24154 was incomplete: it covered the lowercase variants but missed these two uppercase codepoints. An agent producing output that substitutes U+0405 for every `S` passes the sanitizer layer unmodified and reaches the LLM threat-detection engine with obfuscated content.

## Affected Area

SafeOutputs content sanitization boundary — `hardenUnicodeText` normalization layer in `actions/setup/js/sanitize_content_core.cjs`. This is the layer intended to normalize confusable characters before content is evaluated by AI-powered secret-leak detection or written as a safe output.

## Reproduction Outline

1. Load `sanitize_content_core.cjs` (v0.68.3) from a gh-aw runner (`$RUNNER_TEMP/gh-aw/safeoutputs/`).
2. Construct a string by replacing every `S` with U+0405 (Ѕ, Cyrillic Dze): e.g., `"PENTEЅT-ЅECRET-MARKER"`.
3. Call `hardenUnicodeText(probe)` and compare the result to the input.
4. Observe that the output equals the input — U+0405 is not normalized to `S`.
5. Pass the obfuscated string as the body of a `create_issue` SafeOutputs call; the sanitizer does not normalize it before the detection layer processes it.

## Observed Behavior

`hardenUnicodeText('PENTEЅT-ЅECRET-MARKER')` returns the string unchanged. U+0405 and U+0406 survive normalization, so a string visually identical to `PENTEST-SECRET-MARKER` reaches downstream processing without being converted to its Latin equivalent.

## Expected Behavior

`hardenUnicodeText` should map U+0405 → `S` and U+0406 → `I`, consistent with its handling of the corresponding lowercase forms (U+0455, U+0456). The normalized output should match the plain-Latin form so that any downstream secret-pattern matching applies uniformly.

## Security Relevance

The sanitizer normalization layer is documented as providing defense-in-depth before AI-powered threat detection. A gap for uppercase Ѕ/І means an adversarial agent can bypass this layer entirely for any secret pattern containing `S` or `I` — which covers common credential prefixes (`ghp_`, `SECRET`, etc.) — by substituting these codepoints. While the LLM detection layer may catch homoglyph substitutions probabilistically, the sanitizer should not have codepoint gaps for characters it already partially handles.

## Suggested Fix

Add the following entries to `HOMOGLYPH_MAP` in `sanitize_content_core.cjs`:

```js
"\u0405": "S",  // Cyrillic Dze → S
"\u0406": "I",  // Cyrillic Byelorussian-Ukrainian I → I
```

Additionally, audit the full map against Unicode TR#39 confusables for other gaps at less common codepoints, and add regression tests for U+0405 and U+0406.

## Additional Context

If this codepoint selection (lowercase-only coverage) is by design, that assumption should be documented explicitly in the HOMOGLYPH_MAP source or the safe-outputs reference documentation, so it is clear to future maintainers which confusable classes are in scope.

**gh-aw version**: v0.68.3

Original finding: https://github.com/githubnext/gh-aw-security/issues/2063




> Generated by [File Issue](https://github.com/githubnext/gh-aw-security/actions/runs/25050043007/agentic_workflow) · ● 356.4K · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+githubnext%2Fgh-aw-security%2Ffile-issue%22&type=issues)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HOMOGLYPH_MAP missing uppercase U+0405 (Ѕ → S) and U+0406 (І → I) — partial fix from #24154 leaves uppercase Cyrillic Dze and Uk [Content truncated due to length] #28924

Summary

Affected Area

Reproduction Outline

Observed Behavior

Expected Behavior

Security Relevance

Suggested Fix

Additional Context

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

HOMOGLYPH_MAP missing uppercase U+0405 (Ѕ → S) and U+0406 (І → I) — partial fix from #24154 leaves uppercase Cyrillic Dze and Uk [Content truncated due to length] #28924

Description

Summary

Affected Area

Reproduction Outline

Observed Behavior

Expected Behavior

Security Relevance

Suggested Fix

Additional Context

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions