How to tackle such error : PdfReadWarning: Object 16920 0 not defined #1846

arcontechnologies · 2022-08-01T10:33:08Z

Hi,
From time to time I run into unreadable PDFs, so I put the check below in place to guard against them:

import os
import fitz  # PyMuPDF

def load_pdf(path):
    # Open the file only if it is non-empty and readable; otherwise return None.
    if os.stat(path).st_size > 0 and os.access(path, os.R_OK):
        return fitz.open(path)
    return None

But this does not seem to be enough to prevent such warnings/errors.

What are the best practices for detecting or handling "corrupted" PDFs?

Thanks for your feedback.

PS: I also got this warning:

UserWarning: Unable to resolve [IndirectObject: IndirectObject(876, 0)], returning NullObject instead [_writer.py:660]

JorjMcKie · 2022-08-01T11:07:40Z

Maintainer

Your code is more or less already part of PyMuPDF as of the latest 1.19.x version. For every file (not only PDFs!), it checks whether the file exists and has a length > 0.
For some non-PDF file types, a few additional checks are also performed.

For PDFs, MuPDF performs additional checks at open time and automatically starts repair algorithms to ensure, for example, that a usable PDF trailer exists. If it determines that the trailer is missing (which often happens because of incomplete downloads), a complete scan of all xref objects is made to rebuild the xref table.
But it never walks through all of the PDF's internal structure unnecessarily, that is, without a reason to be suspicious. Which is good.
If, however, a repair was necessary (and successful), you will find doc.is_repaired to be True. More detail can be seen in the messages concatenated in the result of fitz.TOOLS.mupdf_warnings().
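
As a rough illustration of the repair check described above, the following sketch opens a file and returns the document together with doc.is_repaired and the messages MuPDF collected while opening it. The function name open_with_diagnostics is made up for this example, and the call to fitz.TOOLS.reset_mupdf_warnings() assumes you only want the messages produced by this particular open.

```python
def open_with_diagnostics(path):
    """Open a document and report whether MuPDF had to repair it.

    Hypothetical helper, not part of PyMuPDF.
    Returns a tuple (doc, was_repaired, warnings_text).
    """
    import fitz  # PyMuPDF; imported here so the sketch loads even without it

    # Assumption: discard messages left over from earlier operations.
    fitz.TOOLS.reset_mupdf_warnings()

    # Raises an exception if the file cannot be opened at all.
    doc = fitz.open(path)

    # Messages MuPDF stored while opening (an empty string if there were none).
    warnings_text = fitz.TOOLS.mupdf_warnings()

    return doc, doc.is_repaired, warnings_text
```

A caller could log warnings_text whenever was_repaired is True and decide per file whether the repaired document is good enough to continue with.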

So, because of the lack of any internal consistency guarantee in PDFs, previously undetected consistency errors may pop up after an apparently successful, harmless-looking open.

In fact, all sorts of things can be wrong in a PDF: the page tree, the name tree, any single object (images, fonts, whatever).
So your message above cannot be avoided beforehand. But if such cases happen often enough in your environment, you could force MuPDF to walk through the whole file:
Save it temporarily to memory using garbage=4 and, e.g., linear=True. Then open that in-memory copy of the document.
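
The save-and-reopen step could look roughly like the sketch below. The helper name reopen_after_full_walk is invented for illustration, and the sketch assumes the input is a PDF document. It passes only garbage=4; linear=True is left out because, as far as I know, newer PyMuPDF releases no longer support linearization, so add it only if your version accepts it.

```python
def reopen_after_full_walk(doc):
    """Re-save a PDF to memory so MuPDF walks the whole file, then reopen it.

    Hypothetical helper, not part of PyMuPDF. Problems that went unnoticed
    at open time may surface here as exceptions or as MuPDF warnings.
    """
    import fitz  # PyMuPDF; imported here so the sketch loads even without it

    # garbage=4 requests the most thorough garbage collection level,
    # which requires MuPDF to look at the objects of the whole file.
    data = doc.tobytes(garbage=4)

    # Open the in-memory copy; this is the document to continue working with.
    return fitz.open(stream=data, filetype="pdf")
```

This costs time and memory proportional to the file size, so it is probably only worth doing where damaged files turn up regularly.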

arcontechnologies Aug 1, 2022
Author

Thanks a lot for your explanation. It's clearer to me now.
