This repository was archived by the owner on Nov 29, 2019. It is now read-only.
I am trying to wrap my head around this utility, but it seems unable to determine sections within a PDF. For example, let's say one wants to extract information from documents that contain blocks of text or paragraphs, where each content block has a title. These sections could be marked by a title in larger text, in upper case, in italics, underlined, or any combination of those.
So what I am looking for is a way to get this utility to recognise such a pattern, return the content of the document, and annotate each of these sections with corresponding tree pattern markers.
How would one go about this?
This tool attempts to detect the flow of text along a page. To do this it tries to understand the columns within a page, and discards any blocks of text that appear not to follow the bounds of the detected columns (to avoid including text from cutouts, figures, figure descriptions, etc. in the page flow).
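To make the idea concrete, here is a rough sketch of column-based filtering. It is not the tool's actual code: the `Block` type, the `detect_columns` and `page_flow` functions, and the tolerance values are all hypothetical, and it assumes text blocks with bounding-box coordinates have already been extracted from the page.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Block:
    # Hypothetical text block: horizontal extent, top edge, and content.
    x0: float
    x1: float
    y0: float  # top edge; smaller means higher on the page
    text: str


def detect_columns(blocks, tolerance=5.0, min_support=2):
    """Guess column bounds as (left, right) pairs shared by several blocks."""
    counts = Counter()
    for b in blocks:
        # Bin the edges so that small misalignments fall into the same bucket.
        key = (round(b.x0 / tolerance), round(b.x1 / tolerance))
        counts[key] += 1
    return sorted(
        (left * tolerance, right * tolerance)
        for (left, right), n in counts.items()
        if n >= min_support
    )


def page_flow(blocks, tolerance=5.0):
    """Return blocks in reading order, dropping those outside any column."""
    flow = []
    for left, right in detect_columns(blocks, tolerance):
        in_column = [
            b for b in blocks
            if abs(b.x0 - left) <= tolerance and abs(b.x1 - right) <= tolerance
        ]
        # Within a column, read top to bottom.
        flow.extend(sorted(in_column, key=lambda b: b.y0))
    return flow


blocks = [
    Block(300, 530, 250, "d"),
    Block(50, 280, 300, "b"),
    Block(120, 450, 200, "figure caption"),  # straddles both columns
    Block(50, 280, 100, "a"),
    Block(300, 530, 100, "c"),
]
flow_texts = [b.text for b in page_flow(blocks)]
```

In this example the caption block is dropped because no other block shares its bounds, which is the kind of filtering described above. A real page would need more care (full-width titles, ragged right edges, and so on).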
The tool will then try to pick out what look like headers, and use these to delineate the page flow into sections. This part of the tool is not well developed: it was still quite error-prone when I stopped work on this tool a few years ago.
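For the kind of titles described in the question (larger text, upper case, italic, underlined), a header heuristic might look roughly like the sketch below. This is only an illustration under assumptions, not how the tool works internally: the `Line` and `Section` types, the 1.15 size ratio, and the 80-character limit are made up, and it assumes font size and style flags are available for each line.

```python
from dataclasses import dataclass, field
from statistics import median


@dataclass
class Line:
    # Hypothetical line of text with the style attributes the heuristic needs.
    text: str
    font_size: float
    italic: bool = False
    underlined: bool = False


@dataclass
class Section:
    title: str
    body: list = field(default_factory=list)


def looks_like_header(line, body_size):
    """Heuristic: short line that is larger, all upper case, or styled."""
    text = line.text.strip()
    if not text or len(text) > 80:
        return False
    larger = line.font_size >= body_size * 1.15  # assumed threshold
    upper = text.isupper()
    styled = line.italic or line.underlined
    return larger or upper or styled


def split_into_sections(lines):
    """Group lines into sections, starting a new one at each header."""
    if not lines:
        return []
    # Treat the median font size as the body text size.
    body_size = median(line.font_size for line in lines)
    sections = [Section(title="")]  # holds any text before the first header
    for line in lines:
        if looks_like_header(line, body_size):
            sections.append(Section(title=line.text.strip()))
        else:
            sections[-1].body.append(line.text)
    return [s for s in sections if s.title or s.body]


lines = [
    Line("INTRODUCTION", 10),
    Line("Some body text here.", 10),
    Line("Methods", 14),
    Line("More body text.", 10),
    Line("and more.", 10),
]
sections = split_into_sections(lines)
```

A heuristic like this will misfire on things such as short italic sentences or upper-case acronyms on their own line, so expect to tune it for each document type.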
To extract certain types of objects (say, sections), you will need to include the relevant command line option (--sections, etc.).