Is there a way to detect header and footer coordinates? #1804

ghost · 2022-07-07T18:14:50Z

In the script I'm working on, I need to redact headers and footers before extracting text from the document, because they break up the text. I can't regex them out, since the script has to work with different documents. Is there a way to find the header/footer coordinates so I can put them in a Rect and redact them?

Answered by JorjMcKie

JorjMcKie (Maintainer) · 2022-07-07T18:33:34Z

PDF knows nothing about such things as "header" or "footer".
You have to rely on "outside" knowledge.
But once you have the respective bboxes, you can of course redact that text away.
So either you know beforehand that doc type x has headers/footers at positions y and z, or you must extract the full text first and apply whatever heuristics you like to find and eliminate those parts from the extraction output.
In the latter case you obviously no longer need to redact anything.
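
For illustration, here is a minimal sketch of one such heuristic; it is not something PyMuPDF provides. It treats a text block as a header or footer candidate if it sits in the top or bottom band of the page and its text (with digits masked, so "Page 3" matches "Page 4") recurs on most pages. The function names, the 12% margin and the 60% share are all hypothetical choices, and the sketch assumes every page has the same height. The only PyMuPDF calls it relies on are `page.get_text("blocks")`, `page.add_redact_annot()` and `page.apply_redactions()`.

```python
import re
from collections import Counter


def normalize(text):
    """Collapse whitespace and mask digit runs so 'Page 3' matches 'Page 4'."""
    return re.sub(r"\d+", "#", " ".join(text.split()))


def find_repeated_margin_blocks(pages, page_height, margin=0.12, min_share=0.6):
    """Guess header/footer bboxes (hypothetical heuristic, arbitrary thresholds).

    pages: one list per page of (x0, y0, x1, y1, text) tuples, with y growing
    downward from the top of the page (as in PyMuPDF text extraction).
    Returns, per page, the bboxes of blocks that lie fully inside the top or
    bottom margin band and whose normalized text recurs on enough pages.
    """
    top = page_height * margin
    bottom = page_height * (1 - margin)

    def in_margin(block):
        _, y0, _, y1, _ = block
        return y1 <= top or y0 >= bottom

    counts = Counter()
    for blocks in pages:
        # A set, so each distinct text counts once per page.
        counts.update({normalize(b[4]) for b in blocks if in_margin(b)})

    threshold = max(2, min_share * len(pages))
    repeated = {text for text, n in counts.items() if n >= threshold}
    return [
        [b[:4] for b in blocks if in_margin(b) and normalize(b[4]) in repeated]
        for blocks in pages
    ]


def redact_headers_footers(src, dst):
    """Redact the guessed bboxes with PyMuPDF and save the result to dst."""
    import fitz  # PyMuPDF, imported here so the heuristic above needs no install

    doc = fitz.open(src)
    # Keep text blocks only (block type 0): (x0, y0, x1, y1, text).
    pages = [
        [b[:5] for b in page.get_text("blocks") if b[6] == 0] for page in doc
    ]
    # Assumption: all pages share the height of the first page.
    hits = find_repeated_margin_blocks(pages, doc[0].rect.height)
    for page, rects in zip(doc, hits):
        for r in rects:
            page.add_redact_annot(fitz.Rect(r))
        if rects:
            page.apply_redactions()
    doc.save(dst)
```

If only the extracted text matters, the redaction step can be skipped: drop the blocks whose bboxes are returned by `find_repeated_margin_blocks` and join the rest. Expect to tune the thresholds per document family; a heuristic like this will miss headers that change on every page and can wrongly flag body text that happens to repeat near a page edge.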
