-
Notifications
You must be signed in to change notification settings - Fork 2.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Assign unique IDs to tokenization errors (#1339 part1) #2701
Conversation
- Added unexpected-eof-character - Added unexepceted-null-character
This doesn't have an error code. |
@zcorpan 0_0 no idea how we've missed this one. Will fix in a moment. |
OK I'm done fiddling here for now, I think it's looking pretty good! Just that error code missing. |
@zcorpan I've added missing error code. |
OK let's wait for @domenic if he likes to check the recent changes. Then we'll need to squash and rebase. |
I've updated the preview: https://htmlparseerrorwg.github.io/html-build/output/multipage/syntax.html |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made some final tweaks. Two minor questions/suggestions, but ready to merge.
checking for parse errors, conformance checkers will also verify that the document obeys all the | ||
other conformance requirements described in this specification.</p> | ||
|
||
<p>Some parse errors have dedicated codes outlined in the table below that should be used by |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is "some" accurate still, or is it "all" now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's accurate. Tree construction errors still doesn't have error codes.
source
Outdated
resolves such character references to a U+FFFD REPLACEMENT CHARACTER.</p> | ||
|
||
<tr> | ||
<td><dfn data-x="parse-error-undefined-character-in-input-stream">undefined-character-in-input-stream</dfn> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rename these to "noncharacter-..." instead of talking about "undefined character"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've had concerns about this rename (you've suggested it earlier): we already have non-unicode-character-in-input-stream
error, adding noncharacter-in-input-stream
will be confusing IMHO. I don't like how it named in infra, to be honest: isolated surrogates and code points more than 0x10ffff are non-characters as well but they are not in the same category as "noncharacters" in Infra. In fact, what Infra defines as "noncharacters" is "permanently undefined characters".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've never understood the name "noncharacters" and have asked for clarification on it before: see whatwg/infra#114. I think it's best to be consistent across the ecosystem though...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it'd be good to rename non-unicode-character-in-... to surrogate-in-..., actually. That seems like a clear improvement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've had the same idea at first, but the problem is that we have a non-unicode-character-reference
error that occurs if parser encounters a numeric character reference that resolve to surrogate or code point that is more than 0x10ffff. Currently it's nice and consistent. As possible solution we can split it into two errors: surrogate-character-reference
and non-unicode-range-character-reference
, thus making error codes consistent again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The latter solution sounds great. I'm also OK with just leaving them inconsistent; I don't think they need to be inconsistent since they are about separate cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, let's rename it. I'll update the PR within an hour.
…e-range + surrogate-character-reference
@domenic Fixed. I've updated the preview as well. |
Awesome, thanks so much! Suggestions on a suitably-epic commit message? |
@domenic "How I spent nights this May" sounds epic enough? =) |
Jokes apart, something like the title of the PR will be fine. |
How's this?
|
@domenic lgtm |
🎉 🎉🎉 |
- Use new initial states in tests according to: html5lib/html5lib-tests#101 - Implement tokenization errors introduced in: whatwg/html#2701 html5lib/html5lib-tests#92
PR for tests: html5lib/html5lib-tests#92
Reference implementation: https://github.com/HTMLParseErrorWG/parse5
Preview: https://htmlparseerrorwg.github.io/html-build/output/multipage/syntax.html