Searchable PDFs Not Working #551

Open
winman3000 opened this issue Feb 3, 2025 · 8 comments

@winman3000

Searchable PDFs Not Working

Description

NAPS2 does not create searchable PDF files, even when the option to create searchable PDFs is selected in the settings.

Steps to Reproduce

  1. Scan a document using NAPS2.
  2. Select "Save PDF".
  3. Open the saved PDF in Adobe Acrobat.

Expected Behavior

The PDF should contain selectable and searchable text.

Actual Behavior

The PDF is marked as searchable but only contains an image, making text selection and search impossible.

Environment

  • NAPS2 Version: 8.0.3+e7cf25fa120c30decd76c41030dfb65b1ae5c032
  • Operating System: Windows 11 Professional Version 24H2 (OS Build 26100.3037)
  • OCR Language: German

Additional Notes

  • The issue persists across different scanned documents.
  • The problem occurs regardless of the OCR language selection.
@cyanfish
Owner

cyanfish commented Feb 6, 2025

Can you see if there are any error logs? And attach a sample PDF with the issue here?

@winman3000
Author

Unfortunately, there are no error logs; the file is simply exported. I have attached a file that I created with NAPS2. The document is marked as “searchable” for screen readers, but it contains only a single graphic. If you try to export it as text with Adobe Acrobat, no text comes out. This happens regardless of which document you scan.

TestFile.pdf

@cyanfish
Owner

cyanfish commented Feb 8, 2025

That PDF is searchable for me, works fine with Adobe when I Ctrl+A and Ctrl+C.

@winman3000
Author

Strange, but you can't read the PDF with a screen reader. It looks like a graphic. If I go to “Export as text” in Adobe Acrobat, no text is exported either. In other searchable PDF files, the text is exported.

@winman3000
Author

I have now tested the problem with three different screen readers: JAWS, NVDA and Narrator, the screen reader from Microsoft that comes with Windows.

The quickest and easiest way to test it is with Narrator. It is important to follow these steps exactly, in this order.

  1. Start Narrator with CTRL+Windows+Enter.
  2. Open the attached PDF file with Adobe Acrobat.
  3. If necessary, confirm the accessibility settings.
  4. Now read the document using the arrow keys.

Unfortunately, the screen reader cannot read the file because it is presented as a graphic.

@winman3000
Author

NAPS2 appears to generate PDF files that are not properly tagged. In PDF documents, tags are essential metadata elements that define the logical structure of the document. They help organize content hierarchically, specifying headings, paragraphs, lists, tables, and other structural elements. These tags are crucial for accessibility, as they enable screen readers and other assistive technologies to interpret and present the document's content correctly.

Without proper tagging, a PDF is essentially just a visual representation of the content rather than a structured document. This means that screen readers cannot navigate the text logically, making it difficult or impossible for visually impaired users to access the information. Even if OCR is applied to make the text selectable and searchable, the absence of proper tags prevents screen readers from reading the content in a meaningful way.

The issue with NAPS2’s PDFs suggests that while the software may perform OCR, it does not add the necessary structural tags to the document. As a result, these PDFs do not meet accessibility standards, such as those outlined in the PDF/UA (Universal Accessibility) specification. Ensuring that PDFs are correctly tagged is important not only for accessibility but also for improved document indexing and searchability.

Addressing this issue would significantly enhance the usability of PDFs generated by NAPS2, making them more accessible to all users, including those who rely on assistive technology.
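
One way to test the tagging claim on a sample file is to look for the two markers the PDF specification uses for Tagged PDF: a `/StructTreeRoot` entry in the document catalog and a `/MarkInfo` dictionary containing `/Marked true`. The following is a minimal sketch, not part of NAPS2; the function name `tagging_markers` is hypothetical, and the check is only a byte-level heuristic that assumes an unencrypted file and inflates Flate-compressed streams where it can. A proper PDF parser would be more reliable.

```python
import re
import sys
import zlib
from pathlib import Path

# Matches the raw bytes between the "stream" and "endstream" keywords.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)


def tagging_markers(data: bytes) -> dict:
    """Heuristically report which Tagged PDF markers appear in a PDF's bytes.

    Hypothetical helper for illustration; it does not parse the PDF object
    graph, it only searches the raw and inflated bytes for marker names.
    """
    chunks = [data]
    for match in STREAM_RE.finditer(data):
        try:
            # The catalog may live inside a Flate-compressed object stream,
            # so inflate whatever we can and search that as well.
            chunks.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            continue  # not Flate data (for example a JPEG image); skip it
    blob = b"\n".join(chunks)
    return {
        "struct_tree_root": b"/StructTreeRoot" in blob,
        "marked_true": re.search(rb"/Marked\s*true", blob) is not None,
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_tags.py TestFile.pdf
    print(tagging_markers(Path(sys.argv[1]).read_bytes()))
```

If both values come back `False`, the file most likely has a text layer at best but no structure tree, which would be consistent with text being selectable in a viewer while a screen reader still treats the page as a graphic.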

@cyanfish
Owner

Alas, ChatGPT can only make uninformed guesses. But I do have an idea of how it could maybe be fixed.

@winman3000
Author

@cyanfish: Thank you!

This is actually the information I received from the JAWS developers and the user community. I only had ChatGPT generate the text because I have trouble writing in English.
