[Bug]: Scan time increases quadratically with page count #1378
Comments
Note that `get_page_analysis` runs in a worker thread/process context, so a fix that passes all tests may not be straightforward.
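For context, a minimal sketch (not from the thread; the `fill` helper and pool size are illustrative) of why the worker-process context matters: under a process pool, each worker holds its own copy of any module-level cache, so state built in one worker never reaches the others or the parent.

```python
# Sketch: module-level state is per-process, not shared, under a
# process pool -- the situation get_page_analysis runs in.
import os
from concurrent.futures import ProcessPoolExecutor

cache = {}  # stand-in for a module-level page cache

def fill(key):
    cache[key] = True  # mutates this worker's copy only
    return os.getpid(), len(cache)

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as pool:
        print(list(pool.map(fill, range(4))))
    # The parent's cache is untouched: the workers' copies never
    # propagate back.
    print(len(cache))  # -> 0
```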
Here is my patch; it passed all tests locally.

```diff
diff --git a/src/ocrmypdf/pdfinfo/layout.py b/src/ocrmypdf/pdfinfo/layout.py
index 83d5d32b..fbef6a0d 100644
--- a/src/ocrmypdf/pdfinfo/layout.py
+++ b/src/ocrmypdf/pdfinfo/layout.py
@@ -288,6 +288,23 @@ def patch_pdfminer(pscript5_mode: bool):
     else:
         yield
 
+class PageCache:
+    cache = {}
+
+    @classmethod
+    def get_cached(cls, infile: PathLike, pageno: int):
+        if infile in cls.cache:
+            for cache_pageno, cache_page in cls.cache[infile]:
+                if cache_pageno == pageno:
+                    return cache_page
+                elif cache_pageno > pageno:
+                    break
+            else:
+                return None
+
+        f = Path(infile).open('rb')
+        cls.cache[infile] = enumerate(PDFPage.get_pages(f))
+        return cls.get_cached(infile, pageno)
 
 def get_page_analysis(
     infile: PathLike, pageno: int, pscript5_mode: bool
@@ -306,8 +323,7 @@ def get_page_analysis(
     with patch_pdfminer(pscript5_mode):
         try:
             with Path(infile).open('rb') as f:
-                page_iter = PDFPage.get_pages(f, pagenos=[pageno], maxpages=0)
-                page = next(page_iter, None)
+                page = PageCache.get_cached(infile, pageno)
                 if page is None:
                     raise InputFileError(
                         f"pdfminer could not process page {pageno} (counting from 0)."
```
For testing, this is a file with 10k blank pages: real-world files require far fewer pages for this bug to be noticeable. The test was done under
I see a few issues with the approach taken in the patch. In particular, the opened file handle is not explicitly closed, which could be problematic for PyPy.
Can we add a new method, `PageCache.clear`, which closes all open files and is called somewhere in the pipeline upon completion of the scan? I think this is the only way to fix it without disturbing the public API.
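A minimal sketch of what that could look like (names and structure are illustrative, not the project's actual code): the cache keeps each file handle next to its page iterator, so `clear` can close everything deterministically.

```python
# Sketch only: a PageCache variant that tracks its file handles so a
# clear() hook can close them once the scan pipeline finishes.
from pathlib import Path

from pdfminer.pdfpage import PDFPage

class PageCache:
    _handles = {}  # infile -> open file object
    _pages = {}    # infile -> enumerate(PDFPage.get_pages(...))

    @classmethod
    def get_cached(cls, infile, pageno):
        if infile not in cls._pages:
            f = Path(infile).open('rb')
            cls._handles[infile] = f
            cls._pages[infile] = enumerate(PDFPage.get_pages(f))
        # Assumes pages are requested in increasing order per file.
        for cached_pageno, page in cls._pages[infile]:
            if cached_pageno == pageno:
                return page
            if cached_pageno > pageno:
                break
        return None

    @classmethod
    def clear(cls):
        # Close handles explicitly rather than waiting for garbage
        # collection -- on PyPy, finalizers may run arbitrarily late.
        for f in cls._handles.values():
            f.close()
        cls._handles.clear()
        cls._pages.clear()
```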
I think you mean PyPI, the repository, as the issue is cross-platform compatibility with Windows?
I think that `PageCache` needs to be attached to `PdfInfo` and created as a distinct object rather than relying on global state. `PdfInfo` should create a `PageCache` and pass it on to each `PageInfo`; after page information is gathered, the `PageCache` can be cleared.
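A rough sketch of that shape, using hypothetical, simplified stand-ins for the real `PdfInfo`/`PageInfo` classes:

```python
# Sketch: the cache as a per-document object owned by PdfInfo,
# instead of class-level global state. Names are illustrative.
from pathlib import Path

from pdfminer.pdfpage import PDFPage

class PageCache:
    def __init__(self, infile):
        self._f = Path(infile).open('rb')
        self._pages = enumerate(PDFPage.get_pages(self._f))

    def get(self, pageno):
        # Assumes pages are requested in increasing order.
        for n, page in self._pages:
            if n == pageno:
                return page
            if n > pageno:
                break
        return None

    def close(self):
        self._f.close()

class PageInfo:
    def __init__(self, page):
        self.page = page  # the pdfminer page object for this index

class PdfInfo:
    def __init__(self, infile, npages):
        cache = PageCache(infile)
        try:
            # Every PageInfo shares one sequential pass over the file.
            self.pages = [PageInfo(cache.get(n)) for n in range(npages)]
        finally:
            cache.close()  # cleared once page info is gathered
```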
No, PyPy, the alternative Python implementation, which ocrmypdf supports.
Hmm, this is a bug in pdfminer.six. As far as I know, one does have to traverse the page tree in order to get to a page with a particular index (will have to go check the spec), but the problem is that the implementation of
Oh, well, it turns out that parsing the pages after traversing the page tree isn't any more expensive than just traversing the tree itself, as initializing a
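To make the cost concrete, here is a small benchmark sketch (not from the thread; the file name is a placeholder) that times single-page lookups at increasing indices. Each lookup walks the document from the start, so per-page cost grows roughly linearly with the index, and scanning all pages this way becomes quadratic.

```python
# Sketch: measure pdfminer.six's per-page lookup cost. get_pages() is
# a generator, so next() stops as soon as the requested page is
# yielded -- but it still walks every page before it.
import time
from pathlib import Path

from pdfminer.pdfpage import PDFPage

infile = Path("blank-10k.pdf")  # placeholder: any file with many pages
for pageno in (0, 2500, 5000, 7500, 9999):
    with infile.open('rb') as f:
        t0 = time.perf_counter()
        page = next(PDFPage.get_pages(f, pagenos=[pageno]), None)
        elapsed = time.perf_counter() - t0
    print(f"page {pageno}: {elapsed:.3f}s")
```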
Fixed in f77f701
When using OCRmyPDF with the `--redo-ocr` option, I noticed that the "Scanning contents" phase can become excessively long for large input files. Surprisingly, the total processing time seems to scale non-linearly with the page count, making the estimated time-to-complete feature useless.

Here are the progress bar timings when scanning a 1000-page file on a Core i5-12400 under v16.4.2:
Upon investigation, I traced the issue back to `src/ocrmypdf/pdfinfo/layout.py`, line 309 (commit 5e1e249). The inefficient implementation of `PDFPage.get_pages` seems to iterate over all previous pages before getting to `pageno`. Fetching page k therefore costs O(k), and summing over all n pages gives n(n+1)/2 page visits — quadratic time overall. The following quick-and-dirty patch reduced the scan time from 300s to 10s for the same file:
# TODO add the patch