Show progress during postprocessing #1313

user1823 · 2024-05-16T14:18:23Z

For large files, postprocessing takes a lot of time. Showing some progress here would make the UX better.

The main motivation behind this request was that ocrmypdf was stuck on this step (postprocessing) for about 30 minutes.

And now, it is stuck on this step:

jbarlow83 · 2024-05-16T22:43:44Z

That's when we ask Ghostscript to do PDF/A. Unfortunately, it doesn't give much feedback, so there's not much I can work with. At least I'm not aware of any behavior I can monitor. It's also single-threaded. Color space conversion of large images can be quite expensive in Ghostscript and is often responsible for long delays.

user1823 · 2024-05-17T17:02:21Z

That's when we ask Ghostscript to do PDF/A.

But, in the above case, I used --output-type pdf. So, there would be no PDF/A conversion.

In the above case, I guess that most of the time was consumed for doing the equivalent of the following (obtained by running with -v1 on a different file):

Postprocessing...                                                                                             ocr.py:145
Running: ['C:\\Program Files\\Tesseract-OCR\\tesseract.EXE', '--version']                                __init__.py:133
xref 13: treating as an optimization candidate                                                           optimize.py:279
xref 12: treating as an optimization candidate                                                           optimize.py:279
XrefExt(xref=12, ext='.png')                                                                             optimize.py:344
XrefExt(xref=13, ext='.png')                                                                             optimize.py:344
Optimizable images: JPEGs: 0 PNGs: 2                                                                     optimize.py:349
Recompressing JPEGs   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--

Unfortunately, it doesn't give much feedback, so there's not much I can work with it. At least I'm not aware of any behavior I can monitor.

If I run this:

gswin64c.exe -sDEVICE=pdfwrite -dBATCH -dNOPAUSE -sOutputFile=out.pdf test.pdf

I get:

Processing pages 1 through 2.
Page 1
Page 2

So, you can probably monitor the number of pages processed, which you can use to show the progress.
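A minimal sketch of that idea, assuming Ghostscript prints the "Processing pages ..." and "Page N" lines exactly as shown above. The function name `gs_page_progress` is hypothetical and is not part of ocrmypdf; it only parses lines that have already been read from the process output.

```python
import re
from typing import Iterable, Iterator, Optional, Tuple

# Patterns based on the Ghostscript output quoted above (assumed format).
_RANGE = re.compile(r"Processing pages (\d+) through (\d+)\.")
_PAGE = re.compile(r"Page (\d+)\s*$")


def gs_page_progress(lines: Iterable[str]) -> Iterator[Tuple[int, Optional[int]]]:
    """Yield (pages_done, total_pages) each time a 'Page N' line is seen.

    total_pages is None until a 'Processing pages A through B.' line appears.
    """
    first, total = 1, None
    for line in lines:
        line = line.strip()
        m = _RANGE.match(line)
        if m:
            first = int(m.group(1))
            total = int(m.group(2)) - first + 1
            continue
        m = _PAGE.match(line)
        if m:
            yield int(m.group(1)) - first + 1, total


# Example with the output shown above:
for done, total in gs_page_progress(
    ["Processing pages 1 through 2.", "Page 1", "Page 2"]
):
    print(f"{done}/{total}")
```

In practice the lines would presumably come from the Ghostscript process's output stream (for example, iterating over `proc.stdout` of a `subprocess.Popen(..., stdout=subprocess.PIPE, text=True)`), and each yielded pair could drive a progress bar.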

user1823 · 2024-05-20T15:18:43Z

I am now using v16.3.0 and it seems that the changes made in 950c700 or 9a3c5a3 have resulted in a bug.

The progress bar in "OCR" says 1182 out of 591.

Also, the following step takes too much time:

What is ocrmypdf doing at this stage? Can we have a progress indicator for this too?

jbarlow83 · 2024-05-21T08:38:20Z

Thanks for the "OCR" progress bar issue report. It's fixed.

After "Total file size..." nothing is happening except copying the finished file from temporary storage to its final output location. Unless you're dealing with very large PDFs (GBs), this suggests network issues or file system contention. How long is "too much time?"
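For illustration only, a copy step like this could report progress by copying in chunks. This is a sketch under assumptions, not how ocrmypdf actually writes its output; `copy_with_progress` and its `on_progress` callback are hypothetical names.

```python
import os
from typing import Callable


def copy_with_progress(
    src: str,
    dst: str,
    on_progress: Callable[[int, int], None],
    chunk_size: int = 1024 * 1024,
) -> int:
    """Copy src to dst in chunks, calling on_progress(bytes_copied, total_bytes)."""
    total = os.path.getsize(src)
    copied = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)
            copied += len(chunk)
            on_progress(copied, total)
    return copied
```

The callback receives the running byte count and the total, so a caller can update a progress bar without the copy routine knowing anything about how it is displayed.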

user1823 · 2024-05-22T13:10:46Z

except copying the finished file from temporary storage to its final output location.

Probably also cleaning up all the temp files generated (e.g., the images).

When ocrmypdf is at this step, I can see the output file in the target directory (with the correct filesize, which means that it is likely not just a placeholder). So, I think that cleaning the temp files is actually what is taking the time.

How long is "too much time?"

Maybe 2-3 minutes. That is not much compared to the total time taken, but it feels like too much when you don't know what is happening or how long it is going to last. So, adding a progress indicator here would also be nice.
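If temp file cleanup really is the slow part, one way to surface progress would be to delete the files one at a time and report after each deletion. The sketch below is hypothetical and does not reflect ocrmypdf's actual cleanup code; `remove_tree_with_progress` is an illustrative name.

```python
import os
from pathlib import Path
from typing import Callable, Union


def remove_tree_with_progress(
    root: Union[str, Path],
    on_progress: Callable[[int, int], None],
) -> int:
    """Delete every file under root, calling on_progress(done, total) per file.

    Afterwards, remove the now-empty directories (including root itself).
    Returns the number of files deleted.
    """
    root = Path(root)
    files = [p for p in root.rglob("*") if p.is_file() or p.is_symlink()]
    total = len(files)
    for done, path in enumerate(files, start=1):
        path.unlink()
        on_progress(done, total)
    # Walk bottom-up so child directories are removed before their parents.
    for dirpath, _dirnames, _filenames in os.walk(root, topdown=False):
        os.rmdir(dirpath)
    return total
```

Counting the files up front costs one extra directory scan, but it gives the progress display a known total.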

user1823 added the enhancement label May 16, 2024

user1823 assigned jbarlow83 May 16, 2024

jbarlow83 added a commit that referenced this issue May 19, 2024

Add progressbar for metadata_fixup

9a3c5a3

Might take time for big files. Pdf.open() potentially is expensive as well, but QPDF doesn't give us progress feedback for that. Closes Show progress during postprocessing #1313
