Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Escape HTML Entities #763

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Conversation

gmalette
Copy link

@gmalette gmalette commented Mar 12, 2025

I'm trying to replace Oj with JSON, but hit a snag, which leads to this PR.

Oj.dump(value, mode: :rails) will perform the equivalent substitutions as ERB::Util.json_escape. It escapes more than JSON.generate(value, script_safe: true) performs, specifically:

  • &
  • >
  • <

It does not escape the forward-slash character (/), which is escaped by script_safe. However, conveniently, we perform a gsub on top of that to escape it.

This PR adds a escape_html_entities (please bikeshed the name), which is a the union of escaped characters of script_safe and Oj.dump(mode: :rails).

I shouldn't be trusted to write production C code so please review carefully.

3, 3,11, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, // 0xE2 is the start of \u2028 and \u2029
//First byte of a 4+ byte code point
4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 9, 9,
};
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This table is copy/pasted from script_safe with the exception of characters &, <, >.

Co-Authored-By: Ian Ker-Seymer <[email protected]>
@ianks ianks force-pushed the gm/html-entities branch from a9b70d1 to fd25da6 Compare March 13, 2025 01:49
@byroot
Copy link
Member

byroot commented Mar 13, 2025

So this is a feature I considered in the past, to help speedup ActiveSupport::JSON::Encoding, which currently does a big gsub: https://github.com/rails/rails/blob/f575fca24a72cf0631c59ed797c575392fbbc527/activesupport/lib/active_support/json/encoding.rb#L54-L60

I suppose that's what Oj.dump(value, mode: :rails) tries to emulate?

If that's the goal, then this shouldn't escape forward slashes (/).

Now the reason I haven't added this feature is that I wanted to avoid an explosion of settings, and also because when I paired with @etiennebarrie, we managed to optimize the gsub escaping pretty well via doing the gsub on a binary string: rails/rails#54484 and also optimizing gsub to skip string and MatchData allocations: ruby/ruby#12730 / ruby/ruby#12801

So I don't know how useful that extra escaping mode really is.

@byroot
Copy link
Member

byroot commented Mar 13, 2025

Also if we go with this, the JRuby and TruffleRuby implementations will need the feature for parity.

@gmalette
Copy link
Author

I suppose that's what Oj.dump(value, mode: :rails) tries to emulate?

Correct! For our dataset Oj.dump(value, mode: :rails) produces the exact same outcome as ERB::Util.json_escape(JSON.dump(value)), which I believe is identical to ActiveSupport::JSON::Encoding, at least for our dataset.

I was happy to also escape / because, for reasons that escape (lol) me, our project decided to use Oj.dump(value, mode: :rails).gsub!("/", "\\/").

we managed to optimize the gsub escaping pretty well

Do you know top of mind which versions that would have landed in? I only tried with activesupport = 7.2.1.2, but the microbenchmarks were ~60% slower than Oj.

@byroot
Copy link
Member

byroot commented Mar 13, 2025

Do you know top of mind which versions that would have landed in?

It's in 8.1.0.alpha and will be released with Rails 8.1.0. As for the String#gsub optimizations, they'll be in Ruby 3.5.0.

@gmalette
Copy link
Author

Looking at the performance improvements of using gsub, it looks like they won't be enough to bring us in line with Oj, right?

If that's the case, and if you're ready to accept adding something like mode: :rails, can you give me some guidance in how you'd like this API to shape up?

Using the current way was easy because its a superset of both ascii_only and script_safe, but if we add a mode that isn't, I'm not sure what a good API would look like. Using booleans that can each be set but can't each have effect is... challenging

Also if we go with this, the JRuby and TruffleRuby implementations will need the feature for parity.

Totally, I can try to add these if these changes are accepted

@byroot
Copy link
Member

byroot commented Mar 13, 2025

Looking at the performance improvements of using gsub, it looks like they won't be enough to bring us in line with Oj, right?

I don't know, would be worth benchmarking.

Using booleans that can each be set but can't each have effect is... challenging

Yeah, that's probably the biggest blocker for me. One solution could be: script_safe: :<something>, instead of a new option.

I'm almost tempted to do a more generic API where you essentially provide the escape table, but that might be too much.

I think I need to sleep on this, I don't want to add a feature I'd regret.

@gmalette
Copy link
Author

gmalette commented Mar 13, 2025

I'm almost tempted to do a more generic API where you essentially provide the escape table, but that might be too much.

Funny you should mention that, while we were building the thing this is exactly what we wanted, but also... how do we even do that 😅

One solution could be: script_safe: :<something>

script_safe: STRING_OF_ASCII_ONLY_CHARS_THAT_MAP_TO_ESCAPE_TABLE :homer-hides-in-bushes:

I think I need to sleep on this, I don't want to add a feature I'd regret.

Totally yeah, I also wouldn't want that

@byroot
Copy link
Member

byroot commented Mar 13, 2025

Funny you should mention that, while we were building the thing this is exactly what we wanted, but also... how do we even do that 😅

It's not that hard. You can pass an array of characters, or even just a string, and build the table from there.

Of course it would need to be allocated, and the building process would be significant overhead, so it would only make sense to do it via the new JSON::Coder API, so that you only ever do it once.

It's certainly a bit of work, but totally doable.

@gmalette
Copy link
Author

gmalette commented Mar 13, 2025

I benchmarked with this script (subset of our test corpus) with ruby-head.

Using JSON.dump(value, script_safe: true) + gsub is significantly slower than Oj

❯ ruby --version
ruby 3.5.0dev (2025-03-13T13:08:01Z master ee1f39ef88) +PRISM [arm64-darwin24]
❯ ruby --yjit bench.rb
Rehearsal ----------------------------------------
Oj     2.279588   0.011212   2.290800 (  2.293570)
JSON   3.051194   0.012087   3.063281 (  3.065510)
------------------------------- total: 5.354081sec

           user     system      total        real
Oj     2.276474   0.007051   2.283525 (  2.301843)
JSON   3.080154   0.013410   3.093564 (  3.106633)

Removing the regexp, to mimic using only an escape table makes JSON (this patch) much faster

           user     system      total        real
Oj     2.216610   0.012343   2.228953 (  2.232122)
JSON   0.979577   0.004345   0.983922 (  0.984977)

@byroot
Copy link
Member

byroot commented Mar 13, 2025

Your benchmark is missing the most important important optimization we found with @etiennebarrie which is to gsub binary mode: https://gist.github.com/byroot/e5dc39f0ce15ae1869bc7511ebc522e0

$ ruby --yjit /tmp/je.rb 
/opt/rubies/head/lib/ruby/3.5.0+0/bundler/runtime.rb:58: warning: benchmark/ips is found in benchmark, which is not part of the default gems since Ruby 3.5.0.
You can add benchmark to your Gemfile or gemspec to fix this error.
ruby 3.5.0dev (2025-03-12T09:20:40Z master 2782cc75a9) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
                  Oj     5.615k i/100ms
                JSON     8.115k i/100ms
Calculating -------------------------------------
                  Oj     53.352k (± 1.2%) i/s   (18.74 μs/i) -    269.520k in   5.052415s
                JSON     80.977k (± 1.3%) i/s   (12.35 μs/i) -    405.750k in   5.011526s

Comparison:
                  Oj:    53352.2 i/s
                JSON:    80977.0 i/s - 1.52x  faster

@gmalette
Copy link
Author

oooooh I completely missed that from the patch! Incredible. I guess apples to apples, the Oj benchmark would also need to use the binary mode, but this is something I can work with.

I looked at how this transfers to our project and it does improve our baseline, so we don't actually need this patch to replace Oj, which gives us (you!) plenty of time to decide whether

  1. You want to introduce this
  2. What API you're comfortable with

Thanks a lot for the help

@eregon
Copy link
Member

eregon commented Mar 13, 2025

It feels like it's a bug in CRuby it's much faster at gsub(/<>&/, MAP) on BINARY than UTF-8 strings.
In either case it's valid to just do a byte search with these regexp characters (and for any character < 128, maybe also for characters over that but not sure there), IOW to treat the input as if it was in a single-byte encoding.

@byroot
Copy link
Member

byroot commented Mar 13, 2025

It feels like it's a bug in CRuby

To be too, but the regexp engine kinda flies over my head, so I haven't attempted to figure out a solution.

I've also seen that trick used in other gems such as CGI: https://github.com/ruby/cgi/blob/ab84b7fe6624faeba21fb52acac33ea678366e11/lib/cgi/util.rb#L43-L47, so I assume it's a known performance characteristic.

@eregon
Copy link
Member

eregon commented Mar 13, 2025

One concern with it is it potentially discards the computed coderange, if it was known (unless it's CR_7BIT, that may be preserved across force_encoding's).
I think it'd be good to report it as an issue on the CRuby tracker.

@byroot
Copy link
Member

byroot commented Mar 13, 2025

it potentially discards the computed coderange

That specifically isn't a concern, because the coderange of JSON.dump isn't known.

But yes, I generally agree it would be a good idea to investigate it. However I suspect the answer is that using a binary string allow to use a singlebyte encoding fastpath.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants