Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow for override of PDF metadata. #206

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

FriscoTony
Copy link

@FriscoTony FriscoTony commented Jan 29, 2022

#194

Hoping to allow a caller to override all the metadata fields in the @Info hash.
Hopefully this is relatively uncontroversial!

It's a bit verbose handling CreationDate and ModDate -- if the caller passes a string, it uses that string and assumes it's properly formatted. If they pass a Ruby Date or Time, then it will format it correctly. I'd be happy to move this logic into a helper method if that would be preferred.

I may open a separate small request around CreationDate. My use case is opening one PDF and then appending a few others. In this case, I had expected the CreationDate of the original PDF would be preserved, but the ModDate would be updated. I see from the code that the dates are deleted on initialization, and set to the same thing on save. So I might submit and additional pull request to preserve the CreationDate on initialization if it exists. But at least this pull requests allows me to set it manually myself.

Thanks for a great library, and hope this helps a little bit.

@boazsegev
Copy link
Owner

Hi @FriscoTony ,

Thanks for the PR

. I agree that the handling of PDF info / metadata could be vastly improved and I am thankful for your work and your desire to improve CombinePDF.

As it stands now, the fact that editing the metadata requires that the user be aware of the possible metadata "names" (Symbols) in the pdf.info object and the fact that they are CamelCase makes the info data object inaccessible and non-intuitive.

...

However, I do not believe that the to_pdf is the place for this change. Why should rendering the data affect its content? That seems to me as both counter-intuitive and surprising (IMHO).

I believe it would be much better to update the metadata while editing the CombinePDF object, perhaps by adding a mix-in to the pdf.info object (not the Hash class), by moving the info object to a sub-class (perhaps inheriting from Hash) or by adding helper methods.

Thanks again.
Bo.

Added explicit helper methods for all PDF metadata symbols/keys. Holding helper methods in separate file to reduce clutter in PDF class. Added a helper date method so ModDate and CreationDate can be set either with a String or Date object. Allow library user to modify CreationDate if desired.
@FriscoTony
Copy link
Author

Hi @boazsegev !

Thanks for your thoughtful response, and apologies for taking so long to circle. I wrote my patch under a deadline and have been swamped, but now may have some cycles to try and improve this.

I think you've made a number of valid points.

Don't do this in to_pdf
Agree! I only added my suggested code here because this was the closest spot I could find in the code base where at least some metadata values could already be passed (Subject and Producer can currently be passed via the options hash and they will update the @info object).

You've proposed that allowing the user to update these metadata values while editing the CombinePDF object makes the most sense, and I wholeheartedly agree.

Allowed symbols
I also agree that it's counter-intuitive and prone to errors to have users update the info hash directly. I’m proposing explicitly named methods for the metadata keys/symbols that are in the spec PDF Spec 1.7 (Section 10.2):

  • Title
  • Author
  • Subject
  • Keywords
  • Creator
  • Producer
  • CreationDate
  • ModDate
  • Trapped

Update via info object or helper methods
I've tried both, and am proposing to have helper methods on the PDF object directly, but defining all those methods in a separate file to reduce clutter. I'm also introducing a PDFInfo class to hold a single class method to handle date formatting so that CreationDate and ModDate can be passed either as a String or a Date. This new file also holds all the new helper methods

I'd originally had PDFInfo inherit from Hash. What I didn't like about that is it required me to change more places in the code where the info object is initialized. I'm proposing to let @info continue to be a hash everywhere in the codebase, but to have the helper methods be the preferred way to update.

Another reason I've opted for this approach is because helper methods already exist for title and author, and I think those would need to stick around anyway for backwards-compatibility. I've moved these methods inside the pdf_info file, again to reduce clutter. I’m not sure I love that these methods are “hidden” so I could move them all back into the PDF class directly, or am open to other suggestions.

Backwards Compatibility
to_pdf currently allows for two metadata values to be passed via the options hash (subject and producer). Again, it doesn't seem appropriate to break those, so that's still allowed, though the code now uses the new explicit helper methods to update the @info hash.

One small change I'm proposing which does change behavior slightly. ModDate is set to Time.now at the start of to_pdf which makes perfect sense. But as I'd mentioned previously, for my own use case (load a PDF and append some others), I'd like to keep the CreationDate as the creation date of my original file. So in to_pdf, I'm only setting CreationDate to the ModDate if it doesn't already exist. This gives calls the opportunity to set/preserve the creation data of the resulting PDF.

I'm tempted to remove this line:

@info.delete :CreationDate

in initialize as well which would automatically preserve the CreationDate of a loaded file if it exists, though I realize that might surprise existing users of the library, so I'd be happy even just to be able to explicitly set the CreationDate myself and have it respected.

What's not to like?
Here are the criticisms I can anticipate (though of course there may be more!)

  • As mentioned above, it’s arguably non-intuitive that all these helper methods are defined inside a “class PDF” block in the new pdf_info.rb file.

  • You might prefer that callers use pdf.info.title = "Magnum Opus" instead of pdf.title = "Magnum Opus" Keeping the info there might make it more obvious that these particular values are all connected. That said, pdf.title and pdf.author currently exist and will need to exist for backwards compatibility anyway, which is why I’ve opted to hang all of these helper methods on PDF directly.

  • Namespace clutter. This introduces 7 new methods onto the PDF object directly. If you could ever imagine wanting to reserve any of these (subject, keywords, creator, producer, creation_date, mod_date, trapped) for future use on PDF, that would be another reason to move all these methods onto the info object

I'll close by saying that I am a solo developer, working on my own Rails app for roughly 10 years now, but have not previously contributed to open source projects, so I appreciate your patience if there's anything about this process I could do better.

Thanks again for a great library, and I welcome your feedback. I do have more time on my hands in the coming weeks, so I'd be happy to make any modifications you might suggest.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants