Define our own code point ranges or defer to Unicode #117

Open
annevk opened this issue Mar 30, 2017 · 7 comments

Comments

@annevk
Member

annevk commented Mar 30, 2017

It would probably be better to reference Unicode terms when available, rather than specifying the lists of code points twice. We could still include the list in a note, given that the Unicode Standard still isn't available as HTML.

However, for lists that might mutate over time we need to be careful and perhaps not deduplicate.

(This seems like a good idea since there should be a single source of truth. Duplicating the list in a note dilutes that a bit, but the note would only be non-normative.)

@domenic
Member

domenic commented Apr 3, 2017

We could always weasel word it with "as of the time of this writing".

@annevk
Member Author

annevk commented Apr 3, 2017

https://en.wikipedia.org/wiki/Template:General_Category_(Unicode) lists some categories, but it's not clear what to use for "noncharacter" other than the term (since the Cn category potentially includes other code points according to that page).

In http://ftp.unicode.org/Public/UNIDATA/PropList.txt I found mention of ASCII_Hex_Digit, which could be useful, but none of the other groupings we need. This would probably require quite a bit of research.
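To illustrate the Cn concern above, here is a minimal sketch using Python's standard `unicodedata` module, which reports the general category of a code point. Noncharacters come back as Cn, but so do code points that are merely unassigned in the Unicode version bundled with the interpreter, so Cn alone cannot stand in for "noncharacter". The helper name `general_category` is made up for this example, and U+0378 is picked on the assumption that it is unassigned in the bundled Unicode data.

```python
import unicodedata


def general_category(code_point: int) -> str:
    """Return the Unicode general category of a code point, e.g. 'Lu' or 'Cn'."""
    return unicodedata.category(chr(code_point))


# U+FDD0 and U+FFFE are noncharacters; both are reported as Cn.
print(f"U+FDD0: {general_category(0xFDD0)}")
print(f"U+FFFE: {general_category(0xFFFE)}")

# U+0378 is assumed to be unassigned (not a noncharacter) in the bundled
# Unicode data; if so, it is also reported as Cn.
print(f"U+0378: {general_category(0x0378)}")
```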

@annevk
Member Author

annevk commented Apr 3, 2017

@annevk
Member Author

annevk commented Apr 3, 2017

http://www.unicode.org/versions/Unicode9.0.0/ch22.pdf makes mention of ASCII digits and various other terms for the same range, but I'm not really convinced that it helps to refer to that over our own definition.

@annevk
Member Author

annevk commented Apr 3, 2017

I guess we could defer to Unicode more explicitly for control/noncharacter/surrogate per http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf (section 3.4 and references), but they're all fixed so I'm not sure it's worth it.

Can't really find much on all the other ranges we have.
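As a rough sketch of why deferring may not buy much for the fixed sets: surrogates and controls can each be written as one or two ranges, and the result can be cross-checked against Unicode's general categories Cs and Cc through Python's `unicodedata` module. The function names below are invented for this example and are not taken from any specification.

```python
import unicodedata


def is_surrogate(code_point: int) -> bool:
    """True for U+D800 to U+DFFF inclusive (general category Cs)."""
    return 0xD800 <= code_point <= 0xDFFF


def is_control(code_point: int) -> bool:
    """True for U+0000 to U+001F and U+007F to U+009F (general category Cc)."""
    return 0x0000 <= code_point <= 0x001F or 0x007F <= code_point <= 0x009F


def agrees_with_unicodedata() -> bool:
    """Compare both predicates with unicodedata over every code point."""
    for cp in range(0x110000):
        category = unicodedata.category(chr(cp))
        if is_surrogate(cp) != (category == "Cs"):
            return False
        if is_control(cp) != (category == "Cc"):
            return False
    return True


print(agrees_with_unicodedata())
```

Note that `is_control` covers U+0080 to U+009F, so the set is not confined to ASCII.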

@domenic
Member

domenic commented Apr 4, 2017

So, I was mainly concerned about general categories that are large and/or could change over time. The biggest concern is "noncharacter", since the current definition is pretty impenetrable. I have no sense of the historical or technical motivation for that giant list. By contrast, it is obvious why things like "ASCII upper alpha" are named the way they are.

I think it could also be nice for control, but even that's in the ASCII range, so it's not going to change over time.
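For what it's worth, the noncharacter list is less arbitrary than it looks. Unicode defines 66 noncharacters: the contiguous range U+FDD0 to U+FDEF, plus the last two code points of each of the 17 planes (those whose low 16 bits are FFFE or FFFF). A minimal sketch of that pattern, with `is_noncharacter` being a name chosen for this example:

```python
def is_noncharacter(code_point: int) -> bool:
    """True for the 66 code points Unicode designates as noncharacters."""
    # Contiguous block of 32 noncharacters in the Basic Multilingual Plane.
    if 0xFDD0 <= code_point <= 0xFDEF:
        return True
    # Last two code points of every plane: U+nFFFE and U+nFFFF for n = 0..0x10.
    return 0 <= code_point <= 0x10FFFF and (code_point & 0xFFFE) == 0xFFFE


# Enumerate them to confirm the total across the whole code space.
noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]
print(len(noncharacters))
```

Because the set is fixed by Unicode's stability policy, a predicate like this would not need updating for new Unicode versions.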

@annevk
Member Author

annevk commented Apr 4, 2017

Control is outside ASCII too, but it's also defined as fixed by Unicode. I don't think anything we list is something that can change over time in Unicode. This would change if we imported HTML's White_Space, but that only has a single dependency and it's unclear whether it's implemented as such.
