[WIP] Identifier_Type updates for Unicode 17.0 from UTC183 #1113

Closed
wants to merge 6 commits into from
383 changes: 133 additions & 250 deletions unicodetools/data/security/dev/IdentifierStatus.txt

Large diffs are not rendered by default.

785 changes: 486 additions & 299 deletions unicodetools/data/security/dev/IdentifierType.txt

Large diffs are not rendered by default.

47 changes: 37 additions & 10 deletions unicodetools/data/security/dev/confusables.txt
@@ -1,5 +1,5 @@
# confusables.txt
# Date: 2025-03-27, 16:29:14 GMT
# Date: 2025-05-01, 03:29:03 GMT
# © 2025 Unicode®, Inc.
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
# For terms of use and license, see https://www.unicode.org/terms_of_use.html
@@ -78,6 +78,7 @@ A6F0 ; 0302 ; MA # ( ꛰ → ̂ ) BAMUM COMBINING MARK KOQNDON → COMBINING CIR

08EB ; 0308 ; MA # ( ࣫ → ̈ ) ARABIC TONE TWO DOTS ABOVE → COMBINING DIAERESIS #
07F3 ; 0308 ; MA # ( ߳ → ̈ ) NKO COMBINING DOUBLE DOT ABOVE → COMBINING DIAERESIS #
0B54 ; 0308 ; MA # ( ୔ → ̈ ) ORIYA SIGN DOUBLE DOT ABOVE → COMBINING DIAERESIS #

064B ; 030B ; MA # ( ً → ̋ ) ARABIC FATHATAN → COMBINING DOUBLE ACUTE ACCENT #
08F0 ; 030B ; MA # ( ࣰ → ̋ ) ARABIC OPEN FATHATAN → COMBINING DOUBLE ACUTE ACCENT # →ً→
@@ -100,6 +101,7 @@ A6F0 ; 0302 ; MA # ( ꛰ → ̂ ) BAMUM COMBINING MARK KOQNDON → COMBINING CIR
0A02 ; 0307 ; MA # ( ਂ → ̇ ) GURMUKHI SIGN BINDI → COMBINING DOT ABOVE #
0A82 ; 0307 ; MA # ( ં → ̇ ) GUJARATI SIGN ANUSVARA → COMBINING DOT ABOVE #
0BCD ; 0307 ; MA # ( ் → ̇ ) TAMIL SIGN VIRAMA → COMBINING DOT ABOVE #
0B53 ; 0307 ; MA # ( ୓ → ̇ ) ORIYA SIGN DOT ABOVE → COMBINING DOT ABOVE #

0337 ; 0338 ; MA # ( ̷ → ̸ ) COMBINING SHORT SOLIDUS OVERLAY → COMBINING LONG SOLIDUS OVERLAY #

@@ -112,6 +114,9 @@ A6F0 ; 0302 ; MA # ( ꛰ → ̂ ) BAMUM COMBINING MARK KOQNDON → COMBINING CIR
0659 ; 0304 ; MA # ( ٙ → ̄ ) ARABIC ZWARAKAY → COMBINING MACRON #
07EB ; 0304 ; MA # ( ߫ → ̄ ) NKO COMBINING SHORT HIGH TONE → COMBINING MACRON #
A6F1 ; 0304 ; MA # ( ꛱ → ̄ ) BAMUM COMBINING MARK TUKWENTIS → COMBINING MACRON #
1AE2 ; 0304 ; MA # ( ᫢ → ̄ ) COMBINING MINUS SIGN ABOVE → COMBINING MACRON #

1AE8 ; 0304 0304 ; MA # ( ᫨ → ̄̄ ) COMBINING EQUALS SIGN ABOVE → COMBINING MACRON, COMBINING MACRON #

1CDA ; 030E ; MA # ( ᳚ → ̎ ) VEDIC TONE DOUBLE SVARITA → COMBINING DOUBLE VERTICAL LINE ABOVE #

@@ -123,6 +128,10 @@ A6F1 ; 0304 ; MA # ( ꛱ → ̄ ) BAMUM COMBINING MARK TUKWENTIS → COMBINING M

0900 ; 0352 ; MA # ( ऀ → ͒ ) DEVANAGARI SIGN INVERTED CANDRABINDU → COMBINING FERMATA #

1AD9 ; 1AC6 ; MA # ( ᫙ → ᫆ ) COMBINING SHARP SIGN → COMBINING NUMBER SIGN ABOVE #

1E6EE ; 1AC8 ; MA # ( 𞛮 → ᫈ ) TAI YO SIGN AY → COMBINING PLUS SIGN ABOVE #

1CED ; 0316 ; MA # ( ᳭ → ̖ ) VEDIC SIGN TIRYAK → COMBINING GRAVE ACCENT BELOW #

1CDC ; 0329 ; MA # ( ᳜ → ̩ ) VEDIC TONE KATHAKA ANUDATTA → COMBINING VERTICAL LINE BELOW #
@@ -4413,7 +4422,7 @@ FB28 ; 05EA ; MA # ( ‎ﬨ‎ → ‎ת‎ ) HEBREW LETTER WIDE TAV → HEBREW

FE80 ; 0621 ; MA # ( ‎ﺀ‎ → ‎ء‎ ) ARABIC LETTER HAMZA ISOLATED FORM → ARABIC LETTER HAMZA #

06FD ; 0621 0348 ; MA #* ( ‎۽‎ → ‎ء͈‎ ) ARABIC SIGN SINDHI AMPERSAND → ARABIC LETTER HAMZA, COMBINING DOUBLE VERTICAL LINE BELOW #
06FD ; 0621 10EFA ; MA #* ( ‎۽‎ → ‎ء𐻺‎ ) ARABIC SIGN SINDHI AMPERSAND → ARABIC LETTER HAMZA, ARABIC DOUBLE VERTICAL BAR BELOW #

FE82 ; 0622 ; MA # ( ‎ﺂ‎ → ‎آ‎ ) ARABIC LETTER ALEF WITH MADDA ABOVE FINAL FORM → ARABIC LETTER ALEF WITH MADDA ABOVE #
FE81 ; 0622 ; MA # ( ‎ﺁ‎ → ‎آ‎ ) ARABIC LETTER ALEF WITH MADDA ABOVE ISOLATED FORM → ARABIC LETTER ALEF WITH MADDA ABOVE #
@@ -5292,8 +5301,6 @@ FEE1 ; 0645 ; MA # ( ‎ﻡ‎ → ‎م‎ ) ARABIC LETTER MEEM ISOLATED FORM

08A7 ; 0645 06DB ; MA # ( ‎ࢧ‎ → ‎مۛ‎ ) ARABIC LETTER MEEM WITH THREE DOTS ABOVE → ARABIC LETTER MEEM, ARABIC SMALL HIGH THREE DOTS #

06FE ; 0645 0348 ; MA #* ( ‎۾‎ → ‎م͈‎ ) ARABIC SIGN SINDHI POSTPOSITION MEN → ARABIC LETTER MEEM, COMBINING DOUBLE VERTICAL LINE BELOW #

FC88 ; 0645 006C ; MA # ( ‎ﲈ‎ → ‎مl‎ ) ARABIC LIGATURE MEEM WITH ALEF FINAL FORM → ARABIC LETTER MEEM, LATIN SMALL LETTER L # →‎ما‎→

FCCE ; 0645 062C ; MA # ( ‎ﳎ‎ → ‎مج‎ ) ARABIC LIGATURE MEEM WITH JEEM INITIAL FORM → ARABIC LETTER MEEM, ARABIC LETTER JEEM #
@@ -5336,6 +5343,8 @@ FDB1 ; 0645 0645 0649 ; MA # ( ‎ﶱ‎ → ‎ممى‎ ) ARABIC LIGATURE MEEM
FC49 ; 0645 0649 ; MA # ( ‎ﱉ‎ → ‎مى‎ ) ARABIC LIGATURE MEEM WITH ALEF MAKSURA ISOLATED FORM → ARABIC LETTER MEEM, ARABIC LETTER ALEF MAKSURA #
FC4A ; 0645 0649 ; MA # ( ‎ﱊ‎ → ‎مى‎ ) ARABIC LIGATURE MEEM WITH YEH ISOLATED FORM → ARABIC LETTER MEEM, ARABIC LETTER ALEF MAKSURA # →‎مي‎→

06FE ; 0645 10EFA ; MA #* ( ‎۾‎ → ‎م𐻺‎ ) ARABIC SIGN SINDHI POSTPOSITION MEN → ARABIC LETTER MEEM, ARABIC DOUBLE VERTICAL BAR BELOW #

1EE0D ; 0646 ; MA # ( ‎𞸍‎ → ‎ن‎ ) ARABIC MATHEMATICAL NOON → ARABIC LETTER NOON #
1EE2D ; 0646 ; MA # ( ‎𞸭‎ → ‎ن‎ ) ARABIC MATHEMATICAL INITIAL NOON → ARABIC LETTER NOON #
1EE4D ; 0646 ; MA # ( ‎𞹍‎ → ‎ن‎ ) ARABIC MATHEMATICAL TAILED NOON → ARABIC LETTER NOON #
@@ -5659,19 +5668,19 @@ FE19 ; 2D57 ; MA #* ( ︙ → ⵗ ) PRESENTATION FORM FOR VERTICAL HORIZONTAL EL

0912 ; 0905 093E 0946 ; MA # ( ऒ → अाॆ ) DEVANAGARI LETTER SHORT O → DEVANAGARI LETTER A, DEVANAGARI VOWEL SIGN AA, DEVANAGARI VOWEL SIGN SHORT E # →अॊ→→आॆ→

0913 ; 0905 093E 0947 ; MA # ( ओ → अाे ) DEVANAGARI LETTER O → DEVANAGARI LETTER A, DEVANAGARI VOWEL SIGN AA, DEVANAGARI VOWEL SIGN E # →अो→→आे→

0914 ; 0905 093E 0948 ; MA # ( औ → अाै ) DEVANAGARI LETTER AU → DEVANAGARI LETTER A, DEVANAGARI VOWEL SIGN AA, DEVANAGARI VOWEL SIGN AI # →अौ→→आै→

0913 ; 0905 093E 11B64 ; MA # ( ओ → अा𑭤 ) DEVANAGARI LETTER O → DEVANAGARI LETTER A, DEVANAGARI VOWEL SIGN AA, SHARADA VOWEL SIGN SHORT E # →अो→→आे→

0904 ; 0905 0946 ; MA # ( ऄ → अॆ ) DEVANAGARI LETTER SHORT A → DEVANAGARI LETTER A, DEVANAGARI VOWEL SIGN SHORT E #

0911 ; 0905 0949 ; MA # ( ऑ → अॉ ) DEVANAGARI LETTER CANDRA O → DEVANAGARI LETTER A, DEVANAGARI VOWEL SIGN CANDRA O #

090D ; 090F 0945 ; MA # ( ऍ → एॅ ) DEVANAGARI LETTER CANDRA E → DEVANAGARI LETTER E, DEVANAGARI VOWEL SIGN CANDRA E #

090E ; 090F 0946 ; MA # ( ऎ → एॆ ) DEVANAGARI LETTER SHORT E → DEVANAGARI LETTER E, DEVANAGARI VOWEL SIGN SHORT E #

0910 ; 090F 0947 ; MA # ( ऐ → एे ) DEVANAGARI LETTER AI → DEVANAGARI LETTER E, DEVANAGARI VOWEL SIGN E #
0910 ; 090F 11B64 ; MA # ( ऐ → ए𑭤 ) DEVANAGARI LETTER AI → DEVANAGARI LETTER E, SHARADA VOWEL SIGN SHORT E # →एे→

090D ; 090F 11B66 ; MA # ( ऍ → ए𑭦 ) DEVANAGARI LETTER CANDRA E → DEVANAGARI LETTER E, SHARADA VOWEL SIGN CANDRA E # →एॅ→

0908 ; 0930 094D 0907 ; MA # ( ई → र्इ ) DEVANAGARI LETTER II → DEVANAGARI LETTER RA, DEVANAGARI SIGN VIRAMA, DEVANAGARI LETTER I #

@@ -5680,6 +5689,7 @@ FE19 ; 2D57 ; MA #* ( ︙ → ⵗ ) PRESENTATION FORM FOR VERTICAL HORIZONTAL EL
111DC ; A8FB ; MA # ( 𑇜 → ꣻ ) SHARADA HEADSTROKE → DEVANAGARI HEADSTROKE #

111CB ; 093A ; MA # ( 𑇋 → ऺ ) SHARADA VOWEL MODIFIER MARK → DEVANAGARI VOWEL SIGN OE #
11B60 ; 093A ; MA # ( 𑭠 → ऺ ) SHARADA VOWEL SIGN OE → DEVANAGARI VOWEL SIGN OE #

0AC1 ; 0941 ; MA # ( ુ → ु ) GUJARATI VOWEL SIGN U → DEVANAGARI VOWEL SIGN U #

@@ -5748,6 +5758,7 @@ FE19 ; 2D57 ; MA #* ( ︙ → ⵗ ) PRESENTATION FORM FOR VERTICAL HORIZONTAL EL
114BE ; 09CC ; MA # ( 𑒾 → ৌ ) TIRHUTA VOWEL SIGN AU → BENGALI VOWEL SIGN AU #

114C2 ; 09CD ; MA # ( 𑓂 → ্ ) TIRHUTA SIGN VIRAMA → BENGALI SIGN VIRAMA #
16D9D ; 09CD ; MA # ( 𖶝 → ্ ) CHISOI SIGN SISO → BENGALI SIGN VIRAMA #

114BD ; 09D7 ; MA # ( 𑒽 → ৗ ) TIRHUTA VOWEL SIGN SHORT O → BENGALI AU LENGTH MARK #

@@ -9723,11 +9734,27 @@ FACE ; 9F9C ; MA # ( 龜 → 龜 ) CJK COMPATIBILITY IDEOGRAPH-FACE → CJK UNIF

0CDC ; 0C5C ; MA # ( ೜ → ౜ ) KANNADA ARCHAIC SHRII → TELUGU ARCHAIC SHRII #

1DE8 ; 1ADA ; MA # ( ᷨ → ᫚ ) COMBINING LATIN SMALL LETTER B → COMBINING FLAT SIGN #

2DEE ; 1ADB ; MA # ( ⷮ → ᫛ ) COMBINING CYRILLIC LETTER TE → COMBINING DOWN TACK ABOVE #

1AE7 ; 1AE5 ; MA # ( ᫧ → ᫥ ) COMBINING DOUBLE ARCH ABOVE → COMBINING SEAGULL ABOVE #

031A ; 1AE9 ; MA # ( ̚ → ᫩ ) COMBINING LEFT ANGLE ABOVE → COMBINING LEFT ANGLE CENTRED ABOVE #

0295 ; A7CE ; MA # ( ʕ → ꟎ ) LATIN LETTER PHARYNGEAL VOICED FRICATIVE → LATIN CAPITAL LETTER PHARYNGEAL VOICED FRICATIVE #
A7CF ; A7CE ; MA # ( ꟏ → ꟎ ) LATIN SMALL LETTER PHARYNGEAL VOICED FRICATIVE → LATIN CAPITAL LETTER PHARYNGEAL VOICED FRICATIVE # →ʕ→

0348 ; 10EFA ; MA # ( ͈ → 𐻺 ) COMBINING DOUBLE VERTICAL LINE BELOW → ARABIC DOUBLE VERTICAL BAR BELOW #

0956 ; 11B62 ; MA # ( ॖ → 𑭢 ) DEVANAGARI VOWEL SIGN UE → SHARADA VOWEL SIGN UE #

0957 ; 11B63 ; MA # ( ॗ → 𑭣 ) DEVANAGARI VOWEL SIGN UUE → SHARADA VOWEL SIGN UUE #

0947 ; 11B64 ; MA # ( े → 𑭤 ) DEVANAGARI VOWEL SIGN E → SHARADA VOWEL SIGN SHORT E #

0945 ; 11B66 ; MA # ( ॅ → 𑭦 ) DEVANAGARI VOWEL SIGN CANDRA E → SHARADA VOWEL SIGN CANDRA E #

0427 ; 16D8C ; MA # ( Ч → 𖶌 ) CYRILLIC CAPITAL LETTER CHE → CHISOI LETTER MA #

09E9 ; 16DA3 ; MA # ( ৩ → 𖶣 ) BENGALI DIGIT THREE → CHISOI DIGIT THREE #
@@ -9765,5 +9792,5 @@ A7CF ; A7CE ; MA # ( ꟏ → ꟎ ) LATIN SMALL LETTER PHARYNGEAL VOICED FRICATIV

1F514 ; 1FBFA ; MA #* ( 🔔 → 🯺 ) BELL → ALARM BELL SYMBOL #

# total: 6412
# total: 6428
