Define our own code point ranges or defer to Unicode #117

annevk · 2017-03-30T10:32:01Z

It would probably be better to reference Unicode terms when available, rather than specify the lists of code points twice. We could still have the list in a note, given that Unicode still isn't available in HTML.

However, for lists that might mutate over time we need to be careful and perhaps not deduplicate.

(This seems like a good idea since there should be a single source of truth. Duplicating the list in a note dilutes that a bit, but the note would only be non-normative.)

domenic · 2017-04-03T03:05:40Z

We could always weasel word it with "as of the time of this writing".

annevk · 2017-04-03T15:23:15Z

https://en.wikipedia.org/wiki/Template:General_Category_(Unicode) lists some categories, but it's not clear what to use for "noncharacter" other than the term (since the Cn category potentially includes other code points according to that page).

In http://ftp.unicode.org/Public/UNIDATA/PropList.txt I found mention of ASCII_Hex_Digit, which could be useful, but none of the other groupings we need... This would probably require quite a bit of research.

annevk · 2017-04-03T15:24:41Z

Probably of use: http://www.unicode.org/versions/Unicode9.0.0/ch04.pdf and http://unicode.org/reports/tr44/.

annevk · 2017-04-03T15:31:08Z

http://www.unicode.org/versions/Unicode9.0.0/ch22.pdf makes mention of ASCII digits and various other terms for the same range, but I'm not really convinced that it helps to refer to that over our own definition.

annevk · 2017-04-03T15:44:04Z

I guess we could defer to Unicode more explicitly for control/noncharacter/surrogate per http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf (section 3.4 and references), but they're all fixed so I'm not sure it's worth it.

Can't really find much on all the other ranges we have.

domenic · 2017-04-04T00:59:25Z

So, I was mainly concerned about general categories that are large and/or could change over time. The biggest concern is "noncharacter", since the current definition is pretty impenetrable: I have no sense of the historical or technical motivation for that giant list. For things like "ASCII upper alpha", by contrast, it's obvious why they're named what they're named.

I think it could also be nice for control, but even that's in the ASCII range, so it's not going to change over time.
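For illustration, the "giant list" of noncharacters collapses into two rules: the contiguous block U+FDD0 to U+FDEF, plus the last two code points of each of the 17 planes. Below is a minimal Python sketch of that rule. The names `is_noncharacter` and `NONCHARACTERS` are hypothetical helpers made up for this example, not terms from any spec.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if the code point is a Unicode noncharacter.

    Hypothetical helper for illustration only.
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a code point: {cp!r}")
    # Rule 1: the contiguous block U+FDD0..U+FDEF (32 code points).
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # Rule 2: the last two code points of every plane, i.e. any code
    # point whose low 16 bits are FFFE or FFFF (2 x 17 planes = 34).
    return (cp & 0xFFFE) == 0xFFFE


# Expanding the rule gives the explicit list a spec would have to spell out.
NONCHARACTERS = [cp for cp in range(0x110000) if is_noncharacter(cp)]
```

Expanding the rule yields 66 code points (32 + 34), which is the full set.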

annevk · 2017-04-04T05:41:36Z

Control is outside ASCII too, but it's also defined as fixed by Unicode. I don't think anything we list is something that can change over time in Unicode. This would change if we imported HTML's White_Space, but that only has a single dependency and it's unclear whether it's implemented as such.
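As a rough sanity check on the "fixed ranges" point, hard-coded ranges for control and surrogate can be compared with the Unicode data that ships with Python. This is only a sketch: the ranges below (U+0000 to U+001F and U+007F to U+009F for control, U+D800 to U+DFFF for surrogate) are my assumption of what the local definitions would be, the helper names are hypothetical, and the general categories `Cc` and `Cs` are assumed to be the matching Unicode terms.

```python
import unicodedata


def is_control(cp: int) -> bool:
    # Assumed local definition: C0 controls, DELETE, and C1 controls.
    return 0x0000 <= cp <= 0x001F or 0x007F <= cp <= 0x009F


def is_surrogate(cp: int) -> bool:
    # Assumed local definition: U+D800..U+DFFF.
    return 0xD800 <= cp <= 0xDFFF


def code_points_in_category(category: str) -> list[int]:
    """All code points whose General_Category equals `category`."""
    return [
        cp for cp in range(0x110000)
        if unicodedata.category(chr(cp)) == category
    ]


local_controls = [cp for cp in range(0x110000) if is_control(cp)]
local_surrogates = [cp for cp in range(0x110000) if is_surrogate(cp)]

# True when the hard-coded ranges agree with the bundled Unicode data.
controls_match = local_controls == code_points_in_category("Cc")
surrogates_match = local_surrogates == code_points_in_category("Cs")
```

Note that the control range extends past U+007F, so it is not ASCII-only. The comparison depends on the Unicode version bundled with the Python build, so it demonstrates agreement for that version rather than proving stability.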

annevk mentioned this issue Mar 30, 2017

Annevk/noncharacter #114

Merged
