Unicode decoding and encoding bugs for codepoints greater than 0xFFFF  #12

Description

@DelleVelleD

MD_DecodeCodepointFromUtf16 incorrectly calculates codepoints greater than 0xFFFF because it does not offset by 0x10000.

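Adding 0x10000 to the combined surrogate value should fix the issue. The bitwise OR must be parenthesized, because + binds tighter than | in C and the offset would otherwise be added to the low surrogate only:

if (1 < max && 0xD800 <= out[0] && out[0] < 0xDC00 && 0xDC00 <= out[1] && out[1] < 0xE000)
{
    result.codepoint = (((out[0] - 0xD800) << 10) | (out[1] - 0xDC00)) + 0x10000;
    result.advance = 2;
}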

Reference: Step 5 for Decoding UTF-16
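
As a sanity check on the arithmetic above, here is a minimal standalone sketch. The `sketch_` function names are hypothetical and not part of the library; they only mirror the expression in the proposed fix. The second function shows how C groups the expression when the outer parentheses are left out, which gives a wrong result whenever bit 16 of the shifted high half is already set.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a library function): combine a valid UTF-16
   surrogate pair into a codepoint. The caller must have range-checked
   hi (0xD800..0xDBFF) and lo (0xDC00..0xDFFF) already. */
uint32_t sketch_decode_surrogates(uint16_t hi, uint16_t lo)
{
    /* Each surrogate carries 10 payload bits. */
    uint32_t payload = ((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00);
    /* The payload encodes (codepoint - 0x10000), so add the offset back. */
    return payload + 0x10000;
}

/* How C groups "a | b + 0x10000": the addition happens first, so the
   offset is OR-ed in rather than added to the combined value. */
uint32_t sketch_decode_misgrouped(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)(hi - 0xD800) << 10) | ((uint32_t)(lo - 0xDC00) + 0x10000);
}

void sketch_check_surrogates(void)
{
    assert(sketch_decode_surrogates(0xD83D, 0xDE00) == 0x1F600);  /* U+1F600 */
    assert(sketch_decode_surrogates(0xDBFF, 0xDFFF) == 0x10FFFF); /* highest codepoint */
    assert(sketch_decode_surrogates(0xD840, 0xDC00) == 0x20000);  /* U+20000 */
    assert(sketch_decode_misgrouped(0xD840, 0xDC00) == 0x10000);  /* wrong: bit 16 collides */
}
```

For the pair D840 DC00 the shifted high half is already 0x10000, so OR-ing in the offset changes nothing and the result is off by 0x10000. The parenthesized form returns 0x20000.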


MD_Utf8FromCodepoint sets the first byte incorrectly when the codepoint requires four bytes because it left-shifts MD_bitmask4 by 3 rather than 4.
MD_bitmask4 is the value 0x0F (binary 1111). The first byte of a four-byte UTF-8 sequence must have the form 11110xxx, where the low 3 bits hold the top bits of the codepoint. Shifting 0x0F left by 3 yields 0x78 (binary 01111000), which is missing the high bit; shifting by 4 yields the required prefix 0xF0 (binary 11110000).

Bitshifting by 4 instead of 3 should fix the issue:

else if (codepoint <= 0x10FFFF)
{
    out[0] = (MD_bitmask4 << 4) | ((codepoint >> 18) & MD_bitmask3);
    out[1] = MD_bit8 | ((codepoint >> 12) & MD_bitmask6);
    out[2] = MD_bit8 | ((codepoint >>  6) & MD_bitmask6);
    out[3] = MD_bit8 | ( codepoint        & MD_bitmask6);
    advance = 4;
}
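
The four-byte branch can be checked in isolation with a small sketch like the one below. Only the 0x0F value of MD_bitmask4 comes from this report. The other `SKETCH_` constants are assumed from the names of the library's masks, and `sketch_utf8_encode4` is a hypothetical stand-in, not the library function.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the library's masks. Only 0x0F is stated in this report;
   the other three values are assumptions inferred from the names. */
#define SKETCH_BITMASK3 0x07u
#define SKETCH_BITMASK4 0x0Fu
#define SKETCH_BITMASK6 0x3Fu
#define SKETCH_BIT8     0x80u

/* Hypothetical helper: encode a codepoint in 0x10000..0x10FFFF as four
   UTF-8 bytes. Returns 4 on success, 0 if the codepoint is out of range. */
int sketch_utf8_encode4(uint32_t codepoint, uint8_t out[4])
{
    if (codepoint < 0x10000 || codepoint > 0x10FFFF) return 0;
    /* 0x0F << 4 == 0xF0, the 11110xxx lead byte; << 3 would give 0x78. */
    out[0] = (uint8_t)((SKETCH_BITMASK4 << 4) | ((codepoint >> 18) & SKETCH_BITMASK3));
    /* Continuation bytes are 10xxxxxx, six payload bits each. */
    out[1] = (uint8_t)(SKETCH_BIT8 | ((codepoint >> 12) & SKETCH_BITMASK6));
    out[2] = (uint8_t)(SKETCH_BIT8 | ((codepoint >>  6) & SKETCH_BITMASK6));
    out[3] = (uint8_t)(SKETCH_BIT8 | ( codepoint        & SKETCH_BITMASK6));
    return 4;
}

void sketch_check_utf8(void)
{
    uint8_t b[4];
    assert(sketch_utf8_encode4(0x1F600, b) == 4);   /* U+1F600 */
    assert(b[0] == 0xF0 && b[1] == 0x9F && b[2] == 0x98 && b[3] == 0x80);
    assert(sketch_utf8_encode4(0x10FFFF, b) == 4);  /* highest codepoint */
    assert(b[0] == 0xF4 && b[1] == 0x8F && b[2] == 0xBF && b[3] == 0xBF);
    assert(sketch_utf8_encode4(0xFFFF, b) == 0);    /* fits in three bytes */
}
```

With a shift of 3, the lead byte for U+1F600 would come out as 0x78, which a decoder reads as the ASCII letter 'x' followed by three stray continuation bytes.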
