Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assign unique IDs to tokenization errors (#1339 part1) #2701

Merged
merged 111 commits into from
May 31, 2017

Conversation

inikulin
Copy link
Member

@inikulin inikulin commented May 21, 2017

inikulin and others added 30 commits May 21, 2017 23:22
- Added unexpected-eof-character
- Added unexepceted-null-character
@zcorpan
Copy link
Member

zcorpan commented May 30, 2017

When the user agent leaves the attribute name state (and before emitting the tag token, if appropriate), the complete attribute's name must be compared to the other attributes on the same token; if there is already an attribute on the token with the exact same name, then this is a parse error and the new attribute must be removed from the token.

This doesn't have an error code.

@inikulin
Copy link
Member Author

@zcorpan 0_0 no idea how we've missed this one. Will fix in a moment.

@zcorpan
Copy link
Member

zcorpan commented May 30, 2017

OK I'm done fiddling here for now, I think it's looking pretty good! Just that error code missing.

@inikulin
Copy link
Member Author

@zcorpan I've added missing error code.

@zcorpan
Copy link
Member

zcorpan commented May 30, 2017

OK let's wait for @domenic if he likes to check the recent changes. Then we'll need to squash and rebase.

@inikulin
Copy link
Member Author

Copy link
Member

@domenic domenic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made some final tweaks. Two minor questions/suggestions, but ready to merge.

checking for parse errors, conformance checkers will also verify that the document obeys all the
other conformance requirements described in this specification.</p>

<p>Some parse errors have dedicated codes outlined in the table below that should be used by
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "some" accurate still, or is it "all" now?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's accurate. Tree construction errors still doesn't have error codes.

source Outdated
resolves such character references to a U+FFFD REPLACEMENT CHARACTER.</p>

<tr>
<td><dfn data-x="parse-error-undefined-character-in-input-stream">undefined-character-in-input-stream</dfn>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename these to "noncharacter-..." instead of talking about "undefined character"?

Copy link
Member Author

@inikulin inikulin May 31, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've had concerns about this rename (you've suggested it earlier): we already have non-unicode-character-in-input-stream error, adding noncharacter-in-input-stream will be confusing IMHO. I don't like how it named in infra, to be honest: isolated surrogates and code points more than 0x10ffff are non-characters as well but they are not in the same category as "noncharacters" in Infra. In fact, what Infra defines as "noncharacters" is "permanently undefined characters".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've never understood the name "noncharacters" and have asked for clarification on it before: see whatwg/infra#114. I think it's best to be consistent across the ecosystem though...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be good to rename non-unicode-character-in-... to surrogate-in-..., actually. That seems like a clear improvement.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've had the same idea at first, but the problem is that we have a non-unicode-character-reference error that occurs if parser encounters a numeric character reference that resolve to surrogate or code point that is more than 0x10ffff. Currently it's nice and consistent. As possible solution we can split it into two errors: surrogate-character-reference and non-unicode-range-character-reference, thus making error codes consistent again.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latter solution sounds great. I'm also OK with just leaving them inconsistent; I don't think they need to be inconsistent since they are about separate cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, let's rename it. I'll update the PR within an hour.

@inikulin
Copy link
Member Author

inikulin commented May 31, 2017

@domenic Fixed. I've updated the preview as well.

@domenic
Copy link
Member

domenic commented May 31, 2017

Awesome, thanks so much!

Suggestions on a suitably-epic commit message?

@inikulin
Copy link
Member Author

@domenic "How I spent nights this May" sounds epic enough? =)

@inikulin
Copy link
Member Author

Jokes apart, something like the title of the PR will be fine.

@domenic
Copy link
Member

domenic commented May 31, 2017

How's this?

Assign IDs to and explain all tokenization parse errors

This gives every parse error that occurs during tokenization a unique
ID, and adds non-normative text explaining and exemplifying when they
occur in an overview table.

Part of #1339; tree construction parse errors remain before that issue
is finished.

@inikulin
Copy link
Member Author

@domenic lgtm

@domenic domenic merged commit 32dbd7d into whatwg:master May 31, 2017
@inikulin
Copy link
Member Author

🎉 🎉🎉

@inikulin inikulin deleted the to-upstream branch May 31, 2017 22:37
iabudiab added a commit to iabudiab/HTMLKit that referenced this pull request Sep 10, 2017
- Use new initial states in tests according to:
html5lib/html5lib-tests#101

- Implement tokenization errors introduced in:
whatwg/html#2701
html5lib/html5lib-tests#92
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

5 participants