Consider switching from lxml's clean_html for enhanced security (and possibly performance) #30
Comments
Hi, thanks a lot for a very nice summary. As extra feedback: for this library and a few other internal uses, the blacklist approach works fine, because we don't assume that HTML produced by the lxml.html cleaner is safe. Rather, we use it to clean HTML for machine learning input, conversion to text, etc. Also, regarding potential replacement libraries: as I understand it, they don't operate on an lxml.html tree but have their own HTML parsers, which is a pretty big no performance-wise. We'd want to keep using the lxml.html tree we already obtained instead of re-parsing the HTML. Internally we use HTML5-compatible parsers but still convert the tree to lxml.html format since it's so popular.
Thanks for the quick and detailed reply. We plan to move the functionality for cleaning HTML to a separate project so it'll still be possible to use it. It'll just require a small change in the project dependencies.
Just an update on this. The latest version of If you want to continue using it, you can either depend on
I'd like to bring to your attention that we are discussing the possibility of removing lxml's clean_html functionality from the lxml library. Over the past years, several concerning security vulnerabilities have been discovered in lxml's clean_html functionality: CVE-2021-43818, CVE-2021-28957, CVE-2020-27783, CVE-2018-19787 and CVE-2014-3146.
The main problem is in the design. Because lxml's clean_html functionality is based on a blocklist, it's hard to keep it up to date with all the new possibilities in HTML and JS.
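To make the design point concrete, here is a minimal sketch of the difference between a blocklist and an allowlist. It is not how lxml's clean_html or any replacement library is implemented. It only uses Python's standard `html.parser`, and the tag sets, `TagCollector`, and `surviving_tags` are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser

# Hypothetical policies, for illustration only.
BLOCKED_TAGS = {"script", "iframe", "object"}   # blocklist: drop what is known to be bad
ALLOWED_TAGS = {"p", "b", "i", "em", "strong"}  # allowlist: keep only what is known to be fine


class TagCollector(HTMLParser):
    """Collect the names of all start tags seen in a document."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def surviving_tags(html, keep):
    """Return the start tags in `html` for which `keep(tag)` is true."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return [tag for tag in collector.tags if keep(tag)]


def blocklist_keep(tag):
    return tag not in BLOCKED_TAGS


def allowlist_keep(tag):
    return tag in ALLOWED_TAGS


# "futuretag" stands in for any element the blocklist author did not anticipate.
SAMPLE = "<p>hi</p><script>x()</script><futuretag>y</futuretag>"

print(surviving_tags(SAMPLE, blocklist_keep))  # ['p', 'futuretag']
print(surviving_tags(SAMPLE, allowlist_keep))  # ['p']
```

The blocklist lets the unanticipated element through until someone updates the list, while the allowlist rejects it by default. This sketch only filters tag names; a real sanitizer also has to deal with attributes, URLs, and parser quirks.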
Two viable alternatives worth considering are bleach and nh3. Here's why:
bleach:
nh3:
We'll probably first move the cleaning part of lxml to a separate project so it will still be possible to use it, but it would be better to find a suitable alternative sooner rather than later.
Let me know if we can help you with this transition in any way, and have a nice day.