Escape HTML Entities #763
3, 3,11, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, // 0xE2 is the start of \u2028 and \u2029
// First byte of a 4+ byte code point
4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 9, 9,
};
This table is copy/pasted from `script_safe`, with the exception of the characters `&`, `<`, and `>`.
Co-Authored-By: Ian Ker-Seymer <[email protected]>
So this is a feature I considered in the past, to help speed up I suppose that's what If that's the goal, then this shouldn't escape forward slashes ( Now the reason I haven't added this feature is that I wanted to avoid an explosion of settings, and also because when I paired with @etiennebarrie, we managed to optimize the So I don't know how useful that extra escaping mode really is.
Also, if we go with this, the JRuby and TruffleRuby implementations will need the feature for parity.
Correct! For our dataset I was happy to also escape
Do you know top of mind which versions that would have landed in? I only tried with
It's in
Looking at the performance improvements of using If that's the case, and if you're ready to accept adding something like Using the current way was easy because it's a superset of both
Totally, I can try to add these if these changes are accepted.
I don't know, would be worth benchmarking.
Yeah, that's probably the biggest blocker for me. One solution could be: I'm almost tempted to do a more generic API where you essentially provide the escape table, but that might be too much. I think I need to sleep on this, I don't want to add a feature I'd regret.
Funny you should mention that, while we were building the thing this is exactly what we wanted, but also... how do we even do that 😅
Totally yeah, I also wouldn't want that
It's not that hard. You can pass an array of characters, or even just a string, and build the table from there. Of course it would need to be allocated, and the building process would be significant overhead, so it would only make sense to do it via the new It's certainly a bit of work, but totally doable.
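As a rough illustration of the idea above, building an escape table from a string of characters could look like the following Ruby sketch. The names `build_escape_table` and `escape_with_table` are hypothetical and not part of the json gem, and the sketch only accepts ASCII characters. A real implementation would live in the C extension and would also have to handle multi-byte sequences such as U+2028.

```ruby
# Hypothetical helper: build a 256-entry lookup table indexed by byte value.
# Each slot is nil (copy the byte as-is) or the \uXXXX replacement string.
def build_escape_table(chars)
  raise ArgumentError, "this sketch only supports ASCII characters" unless chars.ascii_only?

  table = Array.new(256)
  chars.each_char { |c| table[c.ord] = format('\u%04x', c.ord) }
  table.freeze
end

# Hypothetical helper: walk the already generated JSON byte by byte and
# substitute every byte that has an entry in the table.
def escape_with_table(json, table)
  out = String.new(encoding: Encoding::BINARY)
  json.each_byte do |byte|
    replacement = table[byte]
    if replacement
      out << replacement
    else
      out << byte
    end
  end
  # Only ASCII bytes were replaced, so the original encoding is still valid.
  out.force_encoding(json.encoding)
end

table = build_escape_table("&<>")
puts escape_with_table('{"a":"<b>&"}', table)
```

Building the table allocates and takes time, which is why it would only pay off if the table is built once and reused across many calls.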
I benchmarked with this script (subset of our test corpus) with ruby-head. Using
Removing the regexp, to mimic using only an escape table makes JSON (this patch) much faster
Your benchmark is missing the most important optimization we found with @etiennebarrie, which is to gsub in binary mode: https://gist.github.com/byroot/e5dc39f0ce15ae1869bc7511ebc522e0
$ ruby --yjit /tmp/je.rb
/opt/rubies/head/lib/ruby/3.5.0+0/bundler/runtime.rb:58: warning: benchmark/ips is found in benchmark, which is not part of the default gems since Ruby 3.5.0.
You can add benchmark to your Gemfile or gemspec to fix this error.
ruby 3.5.0dev (2025-03-12T09:20:40Z master 2782cc75a9) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
Oj 5.615k i/100ms
JSON 8.115k i/100ms
Calculating -------------------------------------
Oj 53.352k (± 1.2%) i/s (18.74 μs/i) - 269.520k in 5.052415s
JSON 80.977k (± 1.3%) i/s (12.35 μs/i) - 405.750k in 5.011526s
Comparison:
Oj: 53352.2 i/s
JSON: 80977.0 i/s - 1.52x faster
oooooh I completely missed that from the patch! Incredible. I guess apples to apples, the Oj benchmark would also need to use the binary mode, but this is something I can work with. I looked at how this transfers to our project and it does improve our baseline, so we don't actually need this patch to replace Oj, which gives us (you!) plenty of time to decide whether
Thanks a lot for the help
It feels like it's a bug in CRuby it's much faster at
To me too, but the regexp engine kinda flies over my head, so I haven't attempted to figure out a solution. I've also seen that trick used in other gems such as CGI: https://github.com/ruby/cgi/blob/ab84b7fe6624faeba21fb52acac33ea678366e11/lib/cgi/util.rb#L43-L47, so I assume it's a known performance characteristic.
One concern with it is it potentially discards the computed coderange, if it was known (unless it's CR_7BIT, that may be preserved across force_encoding's). |
That specifically isn't a concern, because the coderange of But yes, I generally agree it would be a good idea to investigate it. However, I suspect the answer is that using a binary string allows the regexp engine to use a single-byte encoding fast path.
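For readers who have not seen the binary-mode trick discussed above, here is a minimal Ruby sketch of it, assuming the escape set `&`, `<`, `>`. The names `BINARY_ESCAPES`, `BINARY_ESCAPE_PATTERN` and `escape_in_binary_mode` are made up for illustration; this is not the code from the gist or from the benchmark.

```ruby
# Illustrative escape set; the replacements are literal \uXXXX sequences.
BINARY_ESCAPES = { "&" => '\u0026', "<" => '\u003c', ">" => '\u003e' }.freeze
BINARY_ESCAPE_PATTERN = /[&<>]/

def escape_in_binary_mode(json)
  original_encoding = json.encoding
  buffer = json.b                                    # binary (ASCII-8BIT) copy of the input
  buffer.gsub!(BINARY_ESCAPE_PATTERN, BINARY_ESCAPES)
  # Only ASCII bytes were replaced by ASCII text, so any multi-byte
  # sequences are untouched and the original encoding is still valid.
  buffer.force_encoding(original_encoding)
end

puts escape_in_binary_mode('{"name":"<é> & co"}')
```

The copy plus `force_encoding` means the resulting string has to have its coderange recomputed the next time it is needed, which is the trade-off raised in the comments above.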
I'm trying to replace `Oj` with `JSON`, but hit a snag, which leads to this PR.

`Oj.dump(value, mode: :rails)` will perform the equivalent substitutions as `ERB::Util.json_escape`. It escapes more than `JSON.generate(value, script_safe: true)` performs, specifically: `&`, `>`, and `<`.

It does not escape the forward-slash character (`/`), which is escaped by `script_safe`. However, conveniently, we perform a `gsub` on top of that to escape it.

This PR adds an `escape_html_entities` option (please bikeshed the name), which is the union of the characters escaped by `script_safe` and `Oj.dump(mode: :rails)`.

I shouldn't be trusted to write production C code, so please review carefully.