@AaronCMuller
Contributor

Using StreamPdfBoxScanDetector as a base, this PR implements a 'strict' variant of scan detection that aims to reduce false positives.

This improves on the original in two specific areas:

  • Based on my testing, scanned PDFs can and will often have more images than pages
  • Scans and children's books share many attributes, and this method attempts to tell them apart

One of the few ways to differentiate between a scan and a children's book is that children's books often have a colophon with the publishing date, author, etc., and it's rare for 'native' children's books to have large images on the colophon page. Scanned PDFs, on the other hand, should in the majority of circumstances have a large image on every page. The 'strict' method therefore considers a PDF to be native if it has any pages without large images.
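As a rough illustration of the rule described above (not the code in this PR), here is a minimal, library-free sketch. It assumes each page is modelled as a list of image coverage ratios, meaning the fraction of the page area each image occupies. The class name `StrictScanSketch`, the `LARGE_IMAGE_RATIO` threshold of 0.8, and the handling of empty documents are all hypothetical choices made for the example.

```java
import java.util.List;

public class StrictScanSketch {
    // Hypothetical threshold: an image is "large" if it covers at least 80% of the page.
    static final double LARGE_IMAGE_RATIO = 0.8;

    /** True if at least one image on the page meets the (assumed) size threshold. */
    static boolean hasLargeImage(List<Double> imageCoverage) {
        return imageCoverage.stream().anyMatch(c -> c >= LARGE_IMAGE_RATIO);
    }

    /**
     * Strict rule: a document is a scan only if every page has a large image.
     * A single page without one (e.g. a colophon) marks the document as native.
     * Treating an empty document as "not a scan" is an assumption of this sketch.
     */
    static boolean isScan(List<List<Double>> pages) {
        return !pages.isEmpty() && pages.stream().allMatch(StrictScanSketch::hasLargeImage);
    }

    public static void main(String[] args) {
        // Scan: every page carries a full-page image (one page also has a small extra image).
        List<List<Double>> scanned = List.of(List.of(0.95), List.of(0.9, 0.1), List.of(1.0));
        // Children's book: illustrated pages plus a colophon page with only a small logo.
        List<List<Double>> childrensBook = List.of(List.of(0.95), List.of(0.05), List.of(0.9));

        System.out.println(isScan(scanned));        // true
        System.out.println(isScan(childrensBook));  // false
    }
}
```

Note that the rule checks pages rather than comparing the image count to the page count, so a scan with more images than pages is still classified as a scan.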

As a personal note, I'm extremely grateful for this project, as it has formed the basis of my own work in PDF scan detection, so thank you.

AaronCMuller and others added 3 commits September 15, 2025 08:03
… with children's books (amongst other things).
Introduce 'Strict' detector, which contains optimisations for dealing with children's books (amongst other things).
@tledoux
Owner

tledoux commented Sep 15, 2025

Hi Aaron, thanks for your pull request, and I'm very glad this project helps you. Cheers

@tledoux tledoux merged commit 8af1e45 into tledoux:main Sep 15, 2025
1 check failed

2 participants