
Optimize char deserialization with manual UTF-8 decoder #33

Open

tanmay4l wants to merge 3 commits into anza-xyz:master from tanmay4l:optimize-char-decode

Conversation

@tanmay4l (Contributor)

Addresses the TODO comment at lines 247-250, which noted: "Could implement a manual decoder that avoids UTF-8 validate + chars() and instead performs the UTF-8 validity checks and produces a char directly. Some quick micro-benchmarking revealed a roughly 2x speedup is possible."

Changes

Before:

let str = core::str::from_utf8(buf).map_err(invalid_utf8_encoding)?;
let c = str.chars().next().unwrap();

After:
- Manual UTF-8 decoding for 2-4 byte characters using bit masks
- Inline validation of continuation bytes (must be 10xxxxxx)
- Overlong encoding validation (3-byte: >= U+0800, 4-byte: >= U+10000)
- Surrogate validation (rejects U+D800..U+DFFF)
- Out of range validation (rejects > U+10FFFF)
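The validation steps above can be sketched as a standalone function. This is an illustrative reconstruction of the technique the PR describes, not the PR's actual code; the function name and error handling (returning `Option` rather than the crate's error type) are assumptions:

```rust
/// Decode the first Unicode scalar value from `buf`, performing UTF-8
/// validation by hand instead of calling `core::str::from_utf8`.
/// Illustrative sketch only; the PR's real implementation may differ.
fn decode_char(buf: &[u8]) -> Option<char> {
    let b0 = *buf.first()?;
    // Leading byte determines sequence length and contributes the high bits.
    // The 0xC2 lower bound already rules out overlong 2-byte encodings.
    let (len, init) = match b0 {
        0x00..=0x7F => return Some(b0 as char), // 1-byte ASCII fast path
        0xC2..=0xDF => (2, (b0 & 0x1F) as u32), // 2-byte sequence
        0xE0..=0xEF => (3, (b0 & 0x0F) as u32), // 3-byte sequence
        0xF0..=0xF4 => (4, (b0 & 0x07) as u32), // 4-byte sequence
        _ => return None,                       // invalid leading byte
    };
    if buf.len() < len {
        return None;
    }
    let mut cp = init;
    for &b in &buf[1..len] {
        // Continuation bytes must match 10xxxxxx.
        if b & 0xC0 != 0x80 {
            return None;
        }
        cp = (cp << 6) | (b & 0x3F) as u32;
    }
    // Reject overlong encodings (3-byte must be >= U+0800, 4-byte >= U+10000),
    // surrogates (U+D800..U+DFFF), and out-of-range values (> U+10FFFF).
    match (len, cp) {
        (2, 0x80..=0x7FF)
        | (3, 0x800..=0xD7FF)
        | (3, 0xE000..=0xFFFF)
        | (4, 0x1_0000..=0x10_FFFF) => char::from_u32(cp),
        _ => None,
    }
}
```

The speedup comes from doing a single pass over at most four bytes, rather than running full `str` validation and then constructing a `Chars` iterator just to pull out one scalar.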

@tanmay4l tanmay4l closed this Jan 22, 2026
@tanmay4l tanmay4l deleted the optimize-char-decode branch January 22, 2026 18:58
@tanmay4l tanmay4l restored the optimize-char-decode branch January 22, 2026 19:25
@tanmay4l tanmay4l reopened this Jan 22, 2026
