Add functionality to extract PDF text from specific regions #62
Open
PavlosMelissinos wants to merge 9 commits into dotemacs:master from PavlosMelissinos:extract-text-by-areas
Changes from 8 commits
Commits
5e2a817 Extract pdf text by areas
2516b55 Appease the linter monster
05a146e Update changelog
a66aff3 Reorder changelog
133eee2 Update function docstring to reflect reality
af01aa0 Make area-text function more robust
46b6aee Add documentation for extracting text from regions
bb25fc0 Merge branch 'master' of github.com:dotemacs/pdfboxing into extract-t…
5d76933 Make pdf area extraction eager with reduce
@@ -1,10 +1,27 @@
 (ns pdfboxing.text
   (:require [pdfboxing.common :as common])
-  (:import org.apache.pdfbox.text.PDFTextStripper))
+  (:import (org.apache.pdfbox.text PDFTextStripper
+                                   PDFTextStripperByArea)
+           (java.awt Rectangle)))

 (defn extract
   "get text from a PDF document"
   [pdfdoc]
   (with-open [doc (common/obtain-document pdfdoc)]
     (-> (PDFTextStripper.)
         (.getText doc))))
+
+(defn- area-text [doc {:keys [x y w h page-number]
+                       :or {x 0 y 0 w 0 h 0 page-number 0}}]
+  (let [rectangle (Rectangle. x y w h)
+        pdpage (.getPage doc page-number)
+        textstripper (doto (PDFTextStripperByArea.)
+                       (.addRegion "region" rectangle)
+                       (.extractRegions pdpage))]
+    (.getTextForRegion textstripper "region")))
+
+(defn extract-by-areas
+  "get text from specified areas of a PDF document"
+  [pdfdoc areas]
+  (with-open [doc (common/obtain-document pdfdoc)]
+    (doall (map #(area-text doc %) areas))))
dotemacs marked this conversation as resolved.
@@ -1,9 +1,47 @@
 (ns pdfboxing.text-test
-  (:require [clojure.test :refer [deftest is]]
-            [pdfboxing.text :refer [extract]]))
+  (:require [clojure.test :refer [deftest is testing]]
+            [pdfboxing.text :refer [extract extract-by-areas]]))

 (def line-separator (System/getProperty "line.separator"))

 (deftest text-extraction
   (is (= (str "Hello, this is pdfboxing.text" line-separator)
          (extract "test/pdfs/hello.pdf"))))
+
+(deftest text-extract-by-areas
+  (let [areas [{:x 150
+                :y 100
+                :w 260
+                :h 40
+                :page-number 0}
+               {:x 380
+                :y 500
+                :w 27
+                :h 23
+                :page-number 4}]]
+    (is (= ["Clojure 1.6 Cheat Sheet (v21)\n"
+            "*ns*\n"]
+           (extract-by-areas "test/pdfs/multi-page.pdf" areas))))
+
+  (testing "default coordinate value is 0"
+    (let [areas [{:x 150
+                  :y 100
+                  :w 260
+                  :h 40}
+                 {:x 150
+                  :y 100
+                  :w 260
+                  :h 40
+                  :page-number 0}
+                 {:x 0
+                  :y 0
+                  :w 280
+                  :h 100
+                  :page-number 0}
+                 {:w 280
+                  :h 100}]]
+      (is (= ["Clojure 1.6 Cheat Sheet (v21)\n"
+              "Clojure 1.6 Cheat Sheet (v21)\n"
+              "5/23/2015\nClojure\n"
+              "5/23/2015\nClojure\n"]
+             (extract-by-areas "test/pdfs/multi-page.pdf" areas))))))
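The "default coordinate value is 0" test relies on the `:or` clause in the destructuring of `area-text`. As a minimal sketch of that behaviour alone, the snippet below uses a hypothetical helper, `area-defaults`, which is not part of the PR; it copies only the destructuring form and returns the values the function body would see, so it runs without PDFBox.

```clojure
;; Hypothetical helper (not in the PR): same destructuring as area-text,
;; but it returns the bound values instead of touching a PDF.
(defn area-defaults
  [{:keys [x y w h page-number]
    :or {x 0 y 0 w 0 h 0 page-number 0}}]
  {:x x :y y :w w :h h :page-number page-number})

;; Keys missing from the area map fall back to 0.
(area-defaults {:w 280 :h 100})
;; => {:x 0, :y 0, :w 280, :h 100, :page-number 0}

;; Omitting :page-number is the same as passing page 0.
(= (area-defaults {:x 150 :y 100 :w 260 :h 40})
   (area-defaults {:x 150 :y 100 :w 260 :h 40 :page-number 0}))
;; => true
```

This is why the first and second areas in the test, and likewise the third and fourth, are expected to yield the same text.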
Hey @PavlosMelissinos
Can you tell me what your thinking was here?
Why is `pdfdoc` an argument on its own while `areas` is a map? Why can't it all go into a map?
My thinking is that if you're passing a map around, where all the arguments are in the map, you don't have to think about the position of your arguments.
Thanks
I think it's clearer this way. `extract-by-areas` is an operation on a PDF document and the coordinates are just parameters. Sure, they're crucial, but they don't have the same weight as the actual document.
I don't have very strong feelings about this though, it's your library 🙂
I started off using mostly rest arguments for the functions in the library.
Then I accepted some PRs which used strict arity.
Let me think about this for a bit and see what option/approach to take, because once this is merged it'll be good to provide the least amount of surprise.
Thanks
Oh I see. Yeah I could make it variadic if you'd prefer that. That would be consistent with split-pdf and other functions!
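To make the three call shapes discussed in this thread concrete, here is a rough sketch that puts them side by side. Everything in it is hypothetical: `stub-area-text` stands in for the real PDFBox work so the snippet runs on its own, the names `extract-positional`, `extract-from-map`, and `extract-variadic` are invented for illustration, and the `:input` key is an assumption rather than something the PR defines.

```clojure
;; Hypothetical stand-in for area-text, so the sketch runs without PDFBox.
(defn- stub-area-text
  [pdfdoc {:keys [page-number] :or {page-number 0}}]
  (str pdfdoc " page " page-number))

;; Shape used in the PR: document first, areas second.
(defn extract-positional [pdfdoc areas]
  (mapv #(stub-area-text pdfdoc %) areas))

;; Shape raised in review: every argument in a single map.
(defn extract-from-map [{:keys [input areas]}]
  (extract-positional input areas))

;; Variadic shape: keyword rest arguments.
(defn extract-variadic [& {:keys [input areas]}]
  (extract-positional input areas))

(def areas [{:x 150 :y 100 :w 260 :h 40}
            {:x 380 :y 500 :w 27 :h 23 :page-number 4}])

(extract-positional "multi-page.pdf" areas)
;; => ["multi-page.pdf page 0" "multi-page.pdf page 4"]
(extract-from-map {:input "multi-page.pdf" :areas areas})
(extract-variadic :input "multi-page.pdf" :areas areas)
```

All three calls return the same vector; the difference is only in whether the caller has to remember argument order or key names.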