This repository was archived by the owner on Nov 29, 2019. It is now read-only.

Section Breakdown #7

Open
netconstructor opened this issue May 3, 2013 · 1 comment

Comments

@netconstructor

I am trying to wrap my head around this utility, but it seems unable to determine sections within a PDF. For example, let's say one wants to extract information from documents that contain blocks of text/paragraphs, where each of these content blocks has a title. The titles could be set in larger text, in upper case, in italics, underlined... or any combination of those elements.

So, what I am looking for is a way to get this utility to detect such a pattern, return the content of the document, and annotate each of these sections with corresponding tree pattern markers.

How would one go about this?

@kjw
Contributor

kjw commented Feb 26, 2014

This tool attempts to detect the flow of text along a page. To do this, it tries to understand the columns within a page and discards any blocks of text that appear not to follow the bounds of the detected columns (to avoid including text from cut-outs, figures, figure descriptions, etc. in the page flow).
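To make the column idea concrete, here is a rough Python sketch. It is not the tool's actual code: the `Block` type, the `tolerance` and `min_blocks` parameters, and the rule of matching on the left edge only are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical text block: left edge in page units, plus its text."""
    x: float
    text: str


def detect_column_edges(blocks, tolerance=5.0, min_blocks=2):
    """Return left edges shared by at least `min_blocks` blocks.

    Left edges are bucketed to the nearest multiple of `tolerance`;
    a bucket that holds enough blocks is treated as a column.
    """
    buckets = Counter(round(b.x / tolerance) for b in blocks)
    return [key * tolerance for key, count in buckets.items() if count >= min_blocks]


def page_flow(blocks, tolerance=5.0, min_blocks=2):
    """Keep only blocks whose left edge sits on a detected column edge."""
    edges = detect_column_edges(blocks, tolerance, min_blocks)
    return [
        b for b in blocks
        if any(abs(b.x - edge) <= tolerance for edge in edges)
    ]


blocks = [
    Block(50, "left column, paragraph 1"),
    Block(50, "left column, paragraph 2"),
    Block(51, "left column, paragraph 3"),
    Block(300, "right column, paragraph 1"),
    Block(300, "right column, paragraph 2"),
    Block(120, "Figure 1: a caption floating between columns"),
]
flow = page_flow(blocks)
```

In this toy input the caption is the only block whose left edge is not shared with any other block, so it is the only one dropped. A real page would also need the right edge and vertical position taken into account.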

The tool then tries to pick out what look like headers and uses these to delineate the page flow into sections. This part of the tool is not well developed; it was still quite error-prone when I stopped work on this tool a few years ago.
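As an illustration of the kind of heuristic involved, the sketch below splits a list of lines into sections using two of the cues mentioned in the question: larger text and upper case. It is a minimal sketch and not what the tool does internally; the `Line` and `Section` types, the `ratio` threshold, and the 80-character limit are hypothetical.

```python
from dataclasses import dataclass, field
from statistics import median


@dataclass
class Line:
    """Hypothetical extracted line: its text and font size."""
    text: str
    font_size: float


@dataclass
class Section:
    title: str
    body: list = field(default_factory=list)


def looks_like_header(line, body_size, ratio=1.15):
    """Guess whether a line is a header: larger than body text, or short and all caps."""
    text = line.text.strip()
    if not text:
        return False
    larger = line.font_size >= body_size * ratio
    has_letters = any(c.isalpha() for c in text)
    all_caps = has_letters and text == text.upper() and len(text) <= 80
    return larger or all_caps


def split_sections(lines):
    """Group lines into sections, starting a new one at each header-like line."""
    if not lines:
        return []
    # Assume the median font size on the page is the body text size.
    body_size = median(line.font_size for line in lines)
    sections = [Section(title="")]  # holds any text before the first header
    for line in lines:
        if looks_like_header(line, body_size):
            sections.append(Section(title=line.text.strip()))
        else:
            sections[-1].body.append(line.text)
    # Drop the leading placeholder if nothing preceded the first header.
    return [s for s in sections if s.title or s.body]


lines = [
    Line("Introduction", 14),
    Line("First body line.", 10),
    Line("Second body line.", 10),
    Line("METHODS", 10),
    Line("Third body line.", 10),
    Line("Fourth body line.", 10),
]
sections = split_sections(lines)
```

Heuristics like these misfire easily, for example on all-caps acronyms that sit on a line of their own or on large-type pull quotes.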

To extract a certain type of object (say, sections), you will need to include the relevant command-line option (--sections, etc.).
