Unicode decoding and encoding bugs for codepoints greater than 0xFFFF 

`MD_DecodeCodepointFromUtf16` incorrectly calculates codepoints greater than 0xFFFF because it does not offset by 0x10000.

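Adding 0x10000 to the result of the codepoint calculation should fix the issue. The bitwise OR needs its own parentheses, because `+` binds more tightly than `|` in C and the offset would otherwise be added to the low surrogate's bits before the OR:
```
if (1 < max && 0xD800 <= out[0] && out[0] < 0xDC00 && 0xDC00 <= out[1] && out[1] < 0xE000)
{
    result.codepoint = (((out[0] - 0xD800) << 10) | (out[1] - 0xDC00)) + 0x10000;
    result.advance = 2;
}
```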
Reference:  [Step 5 for Decoding UTF-16](https://datatracker.ietf.org/doc/html/rfc2781.html#section-2.2)
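
For checking the arithmetic in isolation, here is a minimal standalone sketch of the surrogate-pair combination described in RFC 2781. The helper name `decode_surrogate_pair` is hypothetical and not part of the library; it assumes the caller has already verified that `hi` is in 0xD800..0xDBFF and `lo` is in 0xDC00..0xDFFF.

```c
#include <stdint.h>

/* Hypothetical helper (not part of the library): combines a valid UTF-16
   surrogate pair into a codepoint, following RFC 2781 section 2.2.
   Assumes hi is in [0xD800, 0xDBFF] and lo is in [0xDC00, 0xDFFF]. */
static uint32_t decode_surrogate_pair(uint16_t hi, uint16_t lo)
{
    uint32_t high10 = (uint32_t)hi - 0xD800u; /* upper 10 bits of the offset */
    uint32_t low10  = (uint32_t)lo - 0xDC00u; /* lower 10 bits of the offset */

    /* Combine the 20-bit offset first, then add 0x10000.
       The parentheses around the OR matter: + binds tighter than |. */
    return ((high10 << 10) | low10) + 0x10000u;
}
```

With this sketch, the pair 0xD83D 0xDE00 decodes to U+1F600, and 0xD840 0xDC00 decodes to U+20000. The second pair is a case where adding the offset before the OR would give 0x10000 instead.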

---

`MD_Utf8FromCodepoint` sets the first byte incorrectly when the codepoint requires four bytes because it left-shifts `MD_bitmask4` by 3 rather than 4.
`MD_bitmask4` is the value 0x0F (binary 1111), so shifting it by 3 yields 0x78 (binary 01111000). The first byte of a four-byte UTF-8 sequence must instead have the form 11110xxx: the prefix 11110 followed by 3 bits of codepoint data. Shifting `MD_bitmask4` by 4 yields 0xF0 (binary 11110000), which is that prefix with the low 3 bits free for the codepoint.

Bitshifting by 4 instead of 3 should fix the issue:
```
else if (codepoint <= 0x10FFFF)
{
    out[0] = (MD_bitmask4 << 4) | ((codepoint >> 18) & MD_bitmask3);
    out[1] = MD_bit8 | ((codepoint >> 12) & MD_bitmask6);
    out[2] = MD_bit8 | ((codepoint >>  6) & MD_bitmask6);
    out[3] = MD_bit8 | ( codepoint        & MD_bitmask6);
    advance = 4;
}
```
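
As a cross-check, here is a minimal standalone sketch of the four-byte case with the masks written out as literals. The helper name `encode_utf8_4byte` is hypothetical and not part of the library; it assumes the codepoint is already known to be in the range 0x10000..0x10FFFF.

```c
#include <stdint.h>

/* Hypothetical helper (not part of the library): encodes a codepoint in
   [0x10000, 0x10FFFF] as a four-byte UTF-8 sequence.
   The caller is assumed to have checked the range. */
static void encode_utf8_4byte(uint32_t codepoint, uint8_t out[4])
{
    out[0] = (uint8_t)(0xF0u | ((codepoint >> 18) & 0x07u)); /* 11110xxx */
    out[1] = (uint8_t)(0x80u | ((codepoint >> 12) & 0x3Fu)); /* 10xxxxxx */
    out[2] = (uint8_t)(0x80u | ((codepoint >>  6) & 0x3Fu)); /* 10xxxxxx */
    out[3] = (uint8_t)(0x80u | ( codepoint        & 0x3Fu)); /* 10xxxxxx */
}
```

For U+1F600 this produces the bytes F0 9F 98 80. With the lead byte built from `0x0F << 3` instead, the first byte would be 0x78, which is not a valid four-byte lead byte.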
