Open
Description
When mixing CJK (Japanese/Chinese/Korean) characters with ASCII text, ctrl+w / deleteWordBackward deletes both together instead of treating them as separate words.
Steps to Reproduce
- Type mixed text like 日本語abc or テストtest
- Place cursor at the end
- Press ctrl+w
Expected Behavior
Only abc (or test) should be deleted, stopping at the CJK-ASCII boundary.
This is the expected behavior per Unicode UAX #29 (Text Segmentation): under its default word-boundary rules, Han ideographs and Hiragana characters each form their own word unit and Katakana runs stay together, so a boundary always falls between these scripts and adjacent Latin letters.
Actual Behavior
The entire string 日本語abc is deleted at once.
Root Cause
Looking at utf8.zig, the findWrapBreaks function only recognizes these as word boundaries:
- ASCII: spaces, tabs, punctuation (-, /, ., etc.), brackets
- Unicode: various space characters (NBSP, ideographic space, etc.)
CJK characters and script transitions are not considered word boundaries.
Suggested Fix
Consider implementing UAX #29 compliant word boundary detection, or at minimum:
- Treat each CJK character as a word boundary (per UAX #29 default)
- Treat script transitions (e.g., Han → Latin, Hiragana → Latin) as word boundaries
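
To make the second option concrete, here is a rough sketch of script-transition detection. It is written in TypeScript only for illustration; the real change would live in utf8.zig. Every name in it (`CharClass`, `classify`, `deleteWordBackward` as a standalone function) is hypothetical and not part of opentui's API. It also simplifies: Hiragana and Katakana are merged into one "kana" class, and a run of same-class characters is deleted as a unit instead of one CJK character at a time.

```typescript
// Illustrative sketch only: these names are hypothetical, not opentui's API.
type CharClass = "space" | "han" | "kana" | "hangul" | "word" | "punct";

// Built with the RegExp constructor so the patterns do not depend on the
// compiler's regex-literal checks. "sc" is the Unicode Script property.
const RE_SPACE = new RegExp("\\s", "u");
const RE_HAN = new RegExp("\\p{sc=Han}", "u");
// U+30FC (prolonged sound mark) has Script=Common, so it is listed explicitly.
const RE_KANA = new RegExp("[\\p{sc=Hiragana}\\p{sc=Katakana}\\u30FC]", "u");
const RE_HANGUL = new RegExp("\\p{sc=Hangul}", "u");
const RE_WORD = new RegExp("[\\p{L}\\p{N}\\p{M}_]", "u");

// Map one code point to a coarse class; a change of class marks a boundary.
function classify(ch: string): CharClass {
  if (RE_SPACE.test(ch)) return "space";
  if (RE_HAN.test(ch)) return "han";
  if (RE_KANA.test(ch)) return "kana";
  if (RE_HANGUL.test(ch)) return "hangul";
  if (RE_WORD.test(ch)) return "word";
  return "punct";
}

// Find where the "word" ending at `cursor` starts (indices are code points).
function wordStartBefore(chars: string[], cursor: number): number {
  let i = cursor;
  // Skip whitespace directly left of the cursor, as ctrl+w usually does.
  while (i > 0 && classify(chars[i - 1] ?? "") === "space") i--;
  if (i === 0) return 0;
  // Then consume the run of characters that share the same class.
  const cls = classify(chars[i - 1] ?? "");
  while (i > 0 && classify(chars[i - 1] ?? "") === cls) i--;
  return i;
}

// Delete the word left of the cursor (defaults to end of text).
function deleteWordBackward(text: string, cursor?: number): string {
  const chars = Array.from(text); // split by code point, not UTF-16 unit
  const end = cursor ?? chars.length;
  const start = wordStartBefore(chars, end);
  return chars.slice(0, start).concat(chars.slice(end)).join("");
}
```

With this sketch, `deleteWordBackward("日本語abc")` leaves `日本語` and `deleteWordBackward("テストtest")` leaves `テスト`, which matches the expected behavior above. Stopping after every Han character instead (the first option) would only need the run loop to stop after one character when the class is "han".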
References
- Unicode UAX #29 (Text Segmentation): https://www.unicode.org/reports/tr29/#Word_Boundaries
- Related code: https://github.com/anomalyco/opentui/blob/main/packages/core/src/zig/utf8.zig
Environment
- OS: macOS
- Terminal: iTerm2
- opencode version: 1.1.36 (discovered while using opencode)
- opentui version: 0.1.75 (@opentui/core, @opentui/solid)