fixes for more edge cases in raw tokenizer by mdelah · Pull Request #43 · muktihari/xmltokenizer

mdelah · 2025-03-24T11:50:02Z

This PR closes out the remaining issues from #35.

As discussed:

Left and right angle brackets within <!-- ... --> are considered part of the comment, and
Right angle brackets are considered valid in attribute values (and no longer create a split in RawToken)

I also realized there was a similar issue with <? ... ?> tags, so I made some fixes there too.

I had to rearrange RawToken a bit to get this to work efficiently. It now calls out to a new function, findTokenEnd, to locate the closing >; this returns -1 if the closing > is not within the current buffer. findTokenEnd has logic to skip past any > that occurs within comments or quoted values. It calls itself to skip past nested tags (such as may appear inside <!DOCTYPE [ ... ] >).
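
For readers who want a feel for the approach, here is a minimal sketch of a findTokenEnd-style scan. It is not the code from this PR: the signature, the assumption that the buffer starts at the opening <, and the handling of each case are illustrative guesses based only on the description above.

```go
package main

import (
	"bytes"
	"fmt"
)

// findTokenEnd is a hypothetical, simplified stand-in for the function
// described in this PR. It assumes buf starts at the '<' that opens a token
// and returns the index of the '>' that closes it, or -1 if that '>' is not
// in buf (the caller would then read more data and retry).
func findTokenEnd(buf []byte) int {
	switch {
	case bytes.HasPrefix(buf, []byte("<!--")):
		// Comment: any '<' or '>' before "-->" is part of the comment.
		if i := bytes.Index(buf[4:], []byte("-->")); i >= 0 {
			return 4 + i + 2
		}
		return -1
	case bytes.HasPrefix(buf, []byte("<?")):
		// Processing instruction: ends at "?>", not at the first '>'.
		if i := bytes.Index(buf[2:], []byte("?>")); i >= 0 {
			return 2 + i + 1
		}
		return -1
	}
	for i := 1; i < len(buf); i++ {
		switch buf[i] {
		case '"', '\'':
			// Quoted value: jump to the matching quote so a '>' inside it
			// does not end the token.
			j := bytes.IndexByte(buf[i+1:], buf[i])
			if j < 0 {
				return -1
			}
			i += 1 + j
		case '<':
			// Nested tag (for example inside a DOCTYPE internal subset):
			// recurse and continue after its closing '>'.
			end := findTokenEnd(buf[i:])
			if end < 0 {
				return -1
			}
			i += end
		case '>':
			return i
		}
	}
	return -1
}

func main() {
	inputs := []string{
		`<!-- a > b -->`,
		`<a href="x>y">text`,
		`<!DOCTYPE doc [ <!ELEMENT doc (#PCDATA)> ]>`,
		`<a href="unterminated`,
	}
	for _, in := range inputs {
		fmt.Printf("%q -> %d\n", in, findTokenEnd([]byte(in)))
	}
}
```

Returning -1 rather than an error keeps the "need more input" case cheap for a streaming reader, which matches the behaviour described above.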

I took the opportunity to replace some of the byte-by-byte iteration with bytes.IndexByte, which gives a moderate performance boost in some of the benchmarks.
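
To illustrate the kind of substitution meant here (a generic sketch, not code taken from this PR), the two functions below find the same byte. indexByteLoop is a hypothetical name for the byte-by-byte style; bytes.IndexByte is the standard library call, which can use optimized routines on many platforms.

```go
package main

import (
	"bytes"
	"fmt"
)

// indexByteLoop is a hypothetical example of byte-by-byte iteration:
// it inspects each byte in turn until it finds c.
func indexByteLoop(b []byte, c byte) int {
	for i := range b {
		if b[i] == c {
			return i
		}
	}
	return -1
}

func main() {
	buf := []byte(`<item id="1">text</item>`)

	// Both calls return the index of the first '>' in buf.
	fmt.Println(indexByteLoop(buf, '>'))
	fmt.Println(bytes.IndexByte(buf, '>'))
}
```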

codecov-commenter · 2025-03-24T11:50:47Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.54%. Comparing base (5d45a12) to head (e9c4ca0).

Additional details and impacted files

@@            Coverage Diff             @@
##           master      #43      +/-   ##
==========================================
- Coverage   99.00%   98.54%   -0.47%     
==========================================
  Files           2        2              
  Lines         302      343      +41     
==========================================
+ Hits          299      338      +39     
- Misses          2        4       +2     
  Partials        1        1


fixes for more edge cases in raw tokenizer

e9c4ca0

mdelah wants to merge 1 commit into muktihari:master from mdelah:raw-token-edge-cases
