Add an additional 'strict' method of detecting whether pages are scanned. #2

AaronCMuller · 2025-09-15T00:43:51Z

Using StreamPdfBoxScanDetector as a base, implement a 'strict' variant of scan detection to try to reduce false positives.

This improves on the original in two specific areas:

Based on my testing, scanned PDFs can, and often do, have more images than pages.
Scans and children's books share many attributes, and this method attempts to tell them apart.

One of the few ways to differentiate between a scan and a children's book is that children's books often have a colophon with publishing date, author, etc., and it's rare for 'native' children's books to have large images on the colophon page. Scanned PDFs, on the other hand, should in most cases have a large image on every page. Therefore the 'strict' method considers a PDF to be native if it has any pages without large images.
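To illustrate the rule described above, here is a minimal sketch rather than the code in this PR. It assumes the per-page image analysis has already been done and reduced to one number per page: the fraction of the page covered by its largest image. The class name `StrictScanSketch`, the `LARGE_IMAGE_RATIO` threshold, and the choice to treat an empty document as native are all hypothetical.

```java
import java.util.List;

public class StrictScanSketch {
    /** Hypothetical threshold: an image is "large" if it covers at least this fraction of the page. */
    static final double LARGE_IMAGE_RATIO = 0.8;

    /**
     * Strict rule: a PDF is considered scanned only if every page has a large image.
     * A single page without one (for example a text-only colophon) marks it as native.
     *
     * @param largestImageCoveragePerPage for each page, the fraction (0.0 to 1.0) of the page
     *                                    covered by its largest image; 0.0 if the page has no image
     */
    static boolean isScanned(List<Double> largestImageCoveragePerPage) {
        // Assumption: a document with no pages is not treated as a scan.
        if (largestImageCoveragePerPage.isEmpty()) {
            return false;
        }
        for (double coverage : largestImageCoveragePerPage) {
            if (coverage < LARGE_IMAGE_RATIO) {
                return false; // one page without a large image => native
            }
        }
        return true;
    }
}
```

With this rule, a picture book whose pages are nearly all full-page illustrations is still classified as native as long as one page, such as the colophon, has no large image.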

As a personal note, I'm extremely grateful for this project, as it has formed the basis of my own work in PDF scan detection, so thank you.

Introduce 'Strict' detector, which contains optimisations for dealing with children's books (amongst other things).

tledoux · 2025-09-15T07:47:42Z

Hi Aaron, thanks for your pull request; very glad this project helps you. Cheers

AaronCMuller and others added 3 commits September 15, 2025 08:03

Introduce 'Strict' detector, which contains optimisations for dealing…

918c948

… with children's books (amongst other things).

Remove assert.

e739416

Merge pull request #1 from nla/am/kidsbooks

c94c17a

tledoux merged commit 8af1e45 into tledoux:main Sep 15, 2025
1 check failed
