Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize #add_referenced #238

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

pkmiec
Copy link

@pkmiec pkmiec commented Feb 2, 2024

Summary

The #add_referenced kept track of existing object with a hash when the keys were the objects. This seemed to have been ok with Ruby 2.7, but became significantly slower in Ruby 3.2 (and possibly earlier).

Profiling showed that many of those objects are instances of Hash and Ruby uses #eql? method to compare Hash keys. It is not clear what actually changed in Ruby, but we can work around the issue by using #hash so that the key is an integer. We just need to recompute that integer if the object changes.

Performance

The benchmark was done with the following script,

require 'benchmark'
require 'combine_pdf'

puts "Ruby: #{RUBY_VERSION}"
puts "CombinePDF: #{CombinePDF::VERSION}"

files = []
68.times { |i| files << "/tmp/pdfs/#{i}.pdf" }
files = files * 10 # to exacerbate the effect

result_pdf = CombinePDF.new
files.each { |file| result_pdf << CombinePDF.parse(IO.read(file)) }
puts(Benchmark.measure do
    result_pdf.save('/tmp/combined.pdf')
end)

FYI ... we end up with 32053 pdf objects in @objects array.

Before

Ruby: 2.7.7
CombinePDF: 1.0.26
2.598427 0.011980 2.610407 ( 2.617881)

Ruby: 3.2.2
CombinePDF: 1.0.26
15.067833 0.026986 15.094819 ( 15.139298)

After

Ruby: 2.7.7
CombinePDF: 1.0.26 (with this PR)
2.768545 0.006937 2.775482 ( 2.786386)

Ruby: 3.2.2
CombinePDF: 1.0.26 (with this PR)
1.997242 0.016295 2.013537 ( 2.021782)

The #add_referenced kept track of existing object with a hash when the keys
were the objects. This seemed to have been ok with Ruby 2.7, but became
significantly slower in Ruby 3.2.

Profiling showed that many of those objects are instances of Hash and Ruby
uses #eql? method to compare Hash keys. It is not clear what acutally changed
in Ruby, but we can work around the issue by using #hash so that the key
is an integer. We just need to recompute that integer if the object changes.

Co-authored-by: Jeremy Kirchhoff <[email protected]>
@BenMorganMY
Copy link

@pkmiec our use of combine_pdf still experiences the slowdown on ruby 3.1. I'm wondering if you could explain the use .hash so that I can find other areas for improvement.

@boazsegev
Copy link
Owner

boazsegev commented Jul 13, 2024

Using .hash instead of .eql? adds a new risk of Hash collisions – rare but definitely possible when squashing variable multi-byte objects into 64 bits (or, I believe that in Ruby, it would actually be limited to 62 bits).

This risk could silently corrupt PDF data, which could be a significant error and make CombinePDF unsuitable for some applications.

Is there another viable approach? Or perhaps it would be better to drop duplication detection instead? How would that affect performance (memory usage will be higher, but other than that...)?

@pkmiec
Copy link
Author

pkmiec commented Jul 15, 2024

Hi!

@BenMorganMY Have you already profiled your code and identified that it is still slow in this #add_reference method? I can imagine that given a different set of PDFs you may be hitting a slowness in a different part of the code. But to answer your question, the point of the method is to try to de-dup references to objects. I imagine that #eql needs to "walk" the object to compute whether it is the same as another object. Since the objects are Hashes then this walking becomes a recursive operation and thus more expensive. Perhaps the result of this "walking" was cached in earlier versions of Ruby and now it is not. Note, I could not find anything in Ruby's changelog, so I do not know for sure. Using #hash was a way to force that computation to produce a number and then compare numbers instead of Hashes.

@boazsegev That's a good question. I do not know the PDF spec well enough to say. When I was looking at it, I wanted to avoid changing the behavior of the method in order to avoid introducing some incompatibility with subsequent code or PDF readers.

@amomchilov
Copy link
Contributor

I found a related perf improvement in this method, see #241

@boazsegev
Copy link
Owner

I assume this was fixed in #241 and should be closed?

@amomchilov
Copy link
Contributor

@boazsegev Not quite. I happened to touch the same line, but a different statement.

I changed the definition and usage of the resolved hash, as in resolved[obj.object_id] = obj, but this PR is changing the existing Hash as in existing[obj.hash] = obj.

As an outsider, I did find this several;statements;per;line style a bit hard to grok, and I think this suggests I'm not the only one 😅

@boazsegev
Copy link
Owner

@amomchilov ,

I love how specific and targeted PR #241 had been. However, I assumed it was meant to deal with the hash issue as well. My apologies for that.

However, to restate my previous comment – the existing[obj.hash] approach appears incomplete (unless I'm missing something).

SipHash, which is the underlying algorithm for obj.hash, isn't cryptographic and is more likely to produce hash collision than a cryptographic hash. Moreover, Ruby uses 62 bits from the 64 bit SipHash result, slightly increasing the risk.

Not that I believe that hash collisions are likely, but this could still happen and this means that obj.hash requires additional validation. We need to verify that a hash collision doesn't lead to an error (i.e., validating a correct result by manually checking for equality up to a certain depth).

@pkmiec ,

I think we need to rebase this PR if we are going to give it a chance. I would also like to see how we protect against possible hash collisions before I merge.

Thanks.

@amomchilov
Copy link
Contributor

I love how specific and targeted PR #241 had been.

Glad to hear :)

However, to restate my previous comment – the existing[obj.hash] approach appears incomplete (unless I'm missing something).

I believe you're correct. Apparent from a perfect hash table (which have restricted applicability, and Hash definitely is not), hash collisions are always possible.

Thus, testing for equality will always necessary to ensure the wrong value isn't returned via collision. I don't see any way around that.

I'd be happy to chip in some alternative ideas for speeding this up, but candidly, I had a very hard time following exactly what's going on in this file, and what exact objects are being used as keys.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants