Hi, while upgrading Rails we have found an unexpected difference in behavior between using Loofah directly, HTML4::FullSanitizer and HTML5::FullSanitizer.
The problem is that rails-html-sanitizer is decoding HTML entities to their Unicode representation even when the HTML entity does not have the final semicolon, which is causing false positives.
For example, with the &times; entity, the correct behaviour is:
> Rails::HTML4::FullSanitizer.new.sanitize('&times')
=> "&times"
> Rails::HTML4::FullSanitizer.new.sanitize('&times;')
=> "×"
> Loofah.fragment('&times').to_s
=> "&times"
> Loofah.fragment('&times;').to_s
=> "×"
But the HTML5 sanitizer does this:
> Rails::HTML5::FullSanitizer.new.sanitize('&times')
=> "×"
> Rails::HTML5::FullSanitizer.new.sanitize('&times;')
=> "×"
This is specifically causing an issue where we have a URL which has two URL parameters and one is called "timestamp".
> Rails::HTML5::FullSanitizer.new.sanitize('https://example.org?foo=bar&timestamp=123456')
=> "https://example.org?foo=bar×tamp=123456"
Is this a bug, or expected behavior? If it's expected behavior, is there a way to configure the new HTML5 sanitizer to ensure that it only decodes full HTML entities with the final semicolon present?
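In the meantime, one stopgap we are considering is to escape bare ampersands before the text reaches the sanitizer, so that only entities with a final semicolon are left to be decoded. This is only a rough sketch: escape_bare_ampersands and BARE_AMPERSAND are hypothetical names of our own, not part of rails-html-sanitizer or Loofah, and we have not verified what the HTML5 sanitizer returns for the escaped string.

```ruby
# Hypothetical pre-processing helper (not part of rails-html-sanitizer or Loofah).
# Matches an "&" that is NOT the start of a semicolon-terminated reference,
# i.e. not a named (&times;), decimal (&#215;) or hex (&#xD7;) reference.
BARE_AMPERSAND = /&(?![A-Za-z][A-Za-z0-9]*;|#[0-9]+;|#[xX][0-9A-Fa-f]+;)/

# Replace every bare "&" with "&amp;" so a parser cannot read it as the
# start of a legacy entity such as "&times" inside "&timestamp".
def escape_bare_ampersands(text)
  text.gsub(BARE_AMPERSAND, "&amp;")
end

escaped = escape_bare_ampersands("https://example.org?foo=bar&timestamp=123456")
puts escaped
```

The escaped string would then be passed to the sanitizer in place of the raw input. Note that the check is purely syntactic: it does not confirm that a semicolon-terminated name is a real HTML entity.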
Thanks! 💜