Define our own code point ranges or defer to Unicode #117
Comments
We could always weasel-word it with "as of the time of this writing".
https://en.wikipedia.org/wiki/Template:General_Category_(Unicode) lists some categories, but it's not clear what to use for "noncharacter" other than the term itself (since the Cn category potentially includes other code points according to that page). In http://ftp.unicode.org/Public/UNIDATA/PropList.txt I found mention of ASCII_Hex_Digit, which could be useful, but none of the other groupings we need... This would probably require quite a bit of research.
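To make the Cn concern concrete, here is a minimal sketch using Python's standard `unicodedata` module. The helper `is_noncharacter` is a hypothetical name introduced for illustration; it encodes the usual rule (U+FDD0..U+FDEF plus the last two code points of each of the 17 planes) and is not taken from this thread. The Cn count depends on the Unicode version bundled with the interpreter, so only the comparison is shown, not an absolute number.

```python
import sys
import unicodedata


def is_noncharacter(cp: int) -> bool:
    """Hypothetical helper: U+FDD0..U+FDEF, or the last two code points of a plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


all_code_points = range(sys.maxunicode + 1)

# Code points matched by the fixed noncharacter rule.
noncharacters = [cp for cp in all_code_points if is_noncharacter(cp)]

# Code points whose General_Category is Cn in this interpreter's Unicode data.
cn = [cp for cp in all_code_points if unicodedata.category(chr(cp)) == "Cn"]

# Every noncharacter is Cn, but Cn also covers unassigned (reserved) code points.
noncharacters_are_cn = set(noncharacters) <= set(cn)
cn_is_larger = len(cn) > len(noncharacters)

print(len(noncharacters), noncharacters_are_cn, cn_is_larger)
```

So Cn alone cannot stand in for "noncharacter": it is a strict superset, and it shrinks as new characters are assigned.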
Probably of use: http://www.unicode.org/versions/Unicode9.0.0/ch04.pdf and http://unicode.org/reports/tr44/.
http://www.unicode.org/versions/Unicode9.0.0/ch22.pdf makes mention of ASCII digits and various other terms for the same range, but I'm not really convinced that it helps to refer to that over our own definition.
I guess we could defer to Unicode more explicitly for control/noncharacter/surrogate per http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf (section 3.4 and references), but they're all fixed so I'm not sure it's worth it. Can't really find much on all the other ranges we have.
So, I was mainly concerned about general categories that are large and/or could change over time. The biggest concern is "noncharacter", since the current definition is pretty impenetrable. I have no sense of the historical or technical motivation for that giant list, whereas for things like "ASCII upper alpha" it's obvious why they're named what they're named. I think it could also be nice for control, but even that's in the ASCII range, so it's not going to change over time.
Control is outside ASCII too, but it's also defined as fixed by Unicode. I don't think anything we list is something that can change over time in Unicode. This would change if we imported HTML's White_Space, but that only has a single dependency and it's unclear whether it's implemented as such.
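As a rough check of the "fixed by Unicode" point, the sketch below compares hand-written ranges against the General_Category values exposed by Python's `unicodedata` module. The ranges are assumptions for illustration (control as U+0000..U+001F plus U+007F..U+009F, surrogate as U+D800..U+DFFF), and the function names are hypothetical; they are not copied from the specification under discussion.

```python
import sys
import unicodedata


def is_control(cp: int) -> bool:
    """Assumed range: C0 controls plus U+007F..U+009F (extends past ASCII)."""
    return cp <= 0x1F or 0x7F <= cp <= 0x9F


def is_surrogate(cp: int) -> bool:
    """Assumed range: U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


all_code_points = range(sys.maxunicode + 1)

# Map each hand-written definition to the General_Category it should match.
checks = {"control": (is_control, "Cc"), "surrogate": (is_surrogate, "Cs")}

mismatches = {}
for name, (predicate, category) in checks.items():
    ours = {cp for cp in all_code_points if predicate(cp)}
    unicode_set = {
        cp for cp in all_code_points if unicodedata.category(chr(cp)) == category
    }
    # Symmetric difference: code points on which the two definitions disagree.
    mismatches[name] = ours ^ unicode_set

print(mismatches)
```

An empty set for both names means the local ranges and Unicode's categories agree, which is what one would expect for sets that Unicode treats as stable.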
It would probably be better to reference Unicode terms when available, rather than specify the lists of code points twice. We could still have the list in a note, given that Unicode still isn't available in HTML.
However, for lists that might mutate over time we need to be careful and perhaps not deduplicate.
(This seems like a good idea since there should be a single source of truth. Duplicating the list in a note dilutes that a bit, but the note is only non-normative.)