Combined PDF size is bigger than the originals #314

Open
wildhart opened this issue Sep 14, 2023 · 2 comments

Comments

@wildhart

I'm using pdfjs to combine two PDF files. Each file is about 3 MB, but the combined PDF is 40 MB.

One of the files is a scanned document where each page is actually an image. Are the images getting upscaled as they are added to the combined PDF? If so, is there a way I can prevent this?

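    const pdfjs = require('pdfjs');

    const outputDoc = new pdfjs.Document();
    for (const url of urls) {
        // downloadUrlAsBuffer is my own code; it resolves to the file's bytes
        const srcBuffer = await downloadUrlAsBuffer(url);
        const doc = new pdfjs.ExternalDocument(srcBuffer);
        outputDoc.addPagesOf(doc); // append every page of the source document
    }
    // asBuffer() returns a Promise when called without a callback
    const buffer = await outputDoc.asBuffer();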


Ideally I would be able to specify a maximum DPI: an image above that resolution would be downscaled, and any other image would be left unmodified.
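To make the maximum-DPI idea concrete, here is a minimal sketch of the decision itself. It is not part of pdfjs; `effectiveDpi` and `targetPixelSize` are hypothetical helper names, and it assumes the default PDF user space of 72 points per inch and that the image's placed width on the page is known.

```javascript
// Hypothetical helpers: none of these names come from pdfjs.
const POINTS_PER_INCH = 72; // default PDF user space: 1 point = 1/72 inch

// Resolution of an image once it is placed on the page.
function effectiveDpi(pixelWidth, placedWidthPt) {
  return pixelWidth / (placedWidthPt / POINTS_PER_INCH);
}

// Returns the pixel size the image should have under a maximum-DPI cap.
// Images at or below the cap are reported unchanged (never upscaled).
function targetPixelSize(pixelWidth, pixelHeight, placedWidthPt, maxDpi) {
  const dpi = effectiveDpi(pixelWidth, placedWidthPt);
  if (dpi <= maxDpi) {
    return { width: pixelWidth, height: pixelHeight, resized: false };
  }
  const scale = maxDpi / dpi;
  return {
    width: Math.round(pixelWidth * scale),
    height: Math.round(pixelHeight * scale),
    resized: true,
  };
}
```

For example, a 2550 x 3300 pixel scan placed 612 pt (8.5 in) wide works out to 300 DPI, so a cap of 150 DPI would halve both dimensions, while a cap of 300 DPI would leave it alone.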

I'll send the two files to you by email...

@wildhart

@rkusa Have you been able to look into this yet?

I've done some more experiments with different files. The source file is one I generated with another library, jsPDF, and it contains text, lines and images.

The full file contains 5 pages with a header image on each page plus two pages with photos.

Here's a summary of my findings:

| Description | Original | Original size | Converted | Converted size |
|---|---|---|---|---|
| Full | orig.pdf | 2.15 MB | converted.pdf | 10.7 MB |
| No headers | orig no headers.pdf | 2.09 MB | converted no headers.pdf | 10.4 MB |
| No images | orig no images.pdf | 71.8 kB | converted no images.pdf | 341 kB |

Why is pdfjs increasing the output file size so much?

  • Are the fonts getting duplicated for each page?
  • Is the image resolution getting upscaled?
  • Is compression being removed?
  • In the source file, the same header image data is reused on each page. Is this getting duplicated by pdfjs?
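As a rough way to narrow down these questions, the growth factor for each row of the table can be computed. The sketch below only redoes the arithmetic on the sizes reported above (assuming 1 MB = 1000 kB); the variable and function names are illustrative.

```javascript
// Sizes taken from the table above, converted to kB (assuming 1 MB = 1000 kB).
const rows = [
  { name: 'Full', originalKb: 2150, convertedKb: 10700 },
  { name: 'No headers', originalKb: 2090, convertedKb: 10400 },
  { name: 'No images', originalKb: 71.8, convertedKb: 341 },
];

// How many times larger the converted file is than the original.
function growthFactor(row) {
  return row.convertedKb / row.originalKb;
}

const factors = rows.map((row) => ({
  name: row.name,
  factor: growthFactor(row).toFixed(2),
}));

console.log(factors);
// Full: 4.98, No headers: 4.98, No images: 4.75
```

All three files grow by a factor close to 5, including the one without images. That makes image upscaling alone an unlikely explanation. The factor also happens to match the 5 pages of the full file, which would be consistent with shared resources being duplicated per page, although these numbers alone do not prove it.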

@wildhart

Just FYI, I've moved away from using pdfjs for merging PDFs, due to this issue with excessive file sizes and also #312, where certain PDF files throw errors when they are merged.

Instead I'm using pdf-lib, which makes it really easy to copy pages from one PDF to another. It has no problems with the files we provided in #312 that throw errors in pdfjs, and the output file size is never bigger than the original files. It also seems a bit faster.

I'm still using pdfjs to generate PDF from html, but then I use pdf-lib to combine that with other PDF files.
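For anyone wanting to try the same approach, the merge step might look roughly like the sketch below. It is not my exact code: `mergePdfs` is an illustrative name, and the second parameter exists only so the pdf-lib `PDFDocument` class can be swapped out in tests. It assumes pdf-lib is installed and that each input is a Buffer or Uint8Array holding one PDF.

```javascript
// Sketch: combine several PDFs into one using pdf-lib.
// `mergePdfs` is a hypothetical helper name, not part of pdf-lib.
async function mergePdfs(buffers, PDFDocument = require('pdf-lib').PDFDocument) {
  const merged = await PDFDocument.create();
  for (const bytes of buffers) {
    const source = await PDFDocument.load(bytes);
    // copyPages brings the pages into the target document;
    // they are only appended once addPage is called.
    const pages = await merged.copyPages(source, source.getPageIndices());
    for (const page of pages) {
      merged.addPage(page);
    }
  }
  return merged.save(); // resolves to a Uint8Array
}
```

Usage would be along the lines of `const bytes = await mergePdfs([generatedBuffer, otherBuffer]);`, where the first buffer is the PDF generated by pdfjs from HTML.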
