Add functionality to extract PDF text from specific regions #62

Open
wants to merge 9 commits into
base: master
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -6,6 +6,7 @@
- Added the ability to merge multiple images into a single PDF
- Added the ability to load PDFs from byte arrays
- Added the ability to run tests automatically using GitHub actions [#64](https://github.com/dotemacs/pdfboxing/pull/64)
- Added the ability to partially parse PDF content based on a vector of regions

### Changed
- Using lists for :imports
30 changes: 30 additions & 0 deletions README.md
@@ -18,6 +18,36 @@ Clojure PDF manipulation library & wrapper for [PDFBox](http://pdfbox.apache.org
(text/extract "test/pdfs/hello.pdf")
```

### Extract text from specific regions

```clojure
(require '[pdfboxing.text :as text])
(let [areas [{:x 0
              :y 100
              :w 350
              :h 50
              :page-number 0}
             {:x 0
              :y 580
              :w 540
              :h 100
              :page-number 0}]]
  (text/extract-by-areas "test/pdfs/clojure-1.pdf" areas))
```

results in
```clojure
=> ("Clojure is a dynamic programming language\n" "Rationale\nFeatures\nDownload\nGetting Started\nDocumentation\nClojureScript\nClojureCLR\n")
```

You can then turn the result into a map with `zipmap`:

```clojure
;; Result of (zipmap [:description :links] text-extract)

{:description "Clojure is a dynamic programming language\n", :links "Rationale\nFeatures\nDownload\nGetting Started\nDocumentation\nClojureScript\nClojureCLR\n"}
```

### Merge multiple PDFs

```clojure
19 changes: 18 additions & 1 deletion src/pdfboxing/text.clj
@@ -1,10 +1,27 @@
(ns pdfboxing.text
  (:require [pdfboxing.common :as common])
-  (:import org.apache.pdfbox.text.PDFTextStripper))
+  (:import (org.apache.pdfbox.text PDFTextStripper
+                                   PDFTextStripperByArea)
+           (java.awt Rectangle)))

(defn extract
  "get text from a PDF document"
  [pdfdoc]
  (with-open [doc (common/obtain-document pdfdoc)]
    (-> (PDFTextStripper.)
        (.getText doc))))

(defn- area-text [doc {:keys [x y w h page-number]
                       :or {x 0 y 0 w 0 h 0 page-number 0}}]
  (let [rectangle (Rectangle. x y w h)
        pdpage (.getPage doc page-number)
        textstripper (doto (PDFTextStripperByArea.)
                       (.addRegion "region" rectangle)
                       (.extractRegions pdpage))]
    (.getTextForRegion textstripper "region")))

(defn extract-by-areas
  "get text from specified areas of a PDF document"
  [pdfdoc areas]
Owner

Hey @PavlosMelissinos

Can you tell me what your thinking was here?

Why is pdfdoc an argument on its own and areas is a map?

Why can't it all go into a map?

My thinking is that if you're passing a map around, where all the arguments are in the map, you don't have to think about the position of your arguments.

Thanks

Author

I think it's clearer this way. extract-by-areas is an operation on a PDF document and the coordinates are just parameters. Sure, they're crucial, but they don't have the same weight as the actual document.

I don't have very strong feelings about this though, it's your library 🙂

Owner

I started off using mostly rest arguments for the functions in the library.

Then I accepted some PRs which used strict arity.

Let me think about this for a bit and see what option/approach to take, because once this is merged it'll be good to provide the least amount of surprise.

Thanks

Author

Oh I see. Yeah I could make it variadic if you'd prefer that. That would be consistent with split-pdf and other functions!
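
For reference, the three calling conventions discussed in this thread could look roughly like the sketch below. It is not code from this PR: `fake-area-text` is a stub that only echoes its inputs so the example runs without PDFBox, and the `:input` and `:areas` key names are assumptions chosen for illustration.

```clojure
;; Stub standing in for the PR's private `area-text`, so the sketch runs
;; without PDFBox. It only echoes its inputs.
(defn- fake-area-text
  [pdfdoc {:keys [x y w h page-number]
           :or {x 0 y 0 w 0 h 0 page-number 0}}]
  (str pdfdoc "@" page-number ":" x "," y "," w "," h))

;; 1. Strict arity, as currently written in this PR.
(defn extract-positional
  [pdfdoc areas]
  (mapv #(fake-area-text pdfdoc %) areas))

;; 2. Everything in one map (key names :input and :areas are hypothetical).
(defn extract-from-map
  [{:keys [input areas]}]
  (extract-positional input areas))

;; 3. Rest arguments read as keyword/value pairs (same hypothetical keys).
(defn extract-variadic
  [& {:keys [input areas]}]
  (extract-positional input areas))

(def sample-areas [{:x 0 :y 100 :w 350 :h 50}])

(extract-positional "doc.pdf" sample-areas)
(extract-from-map {:input "doc.pdf" :areas sample-areas})
(extract-variadic :input "doc.pdf" :areas sample-areas)
;; all three return ["doc.pdf@0:0,100,350,50"]
```

With options 2 and 3 the caller never has to remember argument order, at the cost of the document no longer standing out as the primary argument.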

  (with-open [doc (common/obtain-document pdfdoc)]
    (reduce (fn [v area] (conj v (area-text doc area))) [] areas)))
42 changes: 40 additions & 2 deletions test/pdfboxing/text_test.clj
@@ -1,9 +1,47 @@
(ns pdfboxing.text-test
-  (:require [clojure.test :refer [deftest is]]
-            [pdfboxing.text :refer [extract]]))
+  (:require [clojure.test :refer [deftest is testing]]
+            [pdfboxing.text :refer [extract extract-by-areas]]))

(def line-separator (System/getProperty "line.separator"))

(deftest text-extraction
  (is (= (str "Hello, this is pdfboxing.text" line-separator)
         (extract "test/pdfs/hello.pdf"))))

(deftest text-extract-by-areas
  (let [areas [{:x 150
                :y 100
                :w 260
                :h 40
                :page-number 0}
               {:x 380
                :y 500
                :w 27
                :h 23
                :page-number 4}]]
    (is (= ["Clojure 1.6 Cheat Sheet (v21)\n"
            "*ns*\n"]
           (extract-by-areas "test/pdfs/multi-page.pdf" areas))))

  (testing "default coordinate value is 0"
    (let [areas [{:x 150
                  :y 100
                  :w 260
                  :h 40}
                 {:x 150
                  :y 100
                  :w 260
                  :h 40
                  :page-number 0}
                 {:x 0
                  :y 0
                  :w 280
                  :h 100
                  :page-number 0}
                 {:w 280
                  :h 100}]]
      (is (= ["Clojure 1.6 Cheat Sheet (v21)\n"
              "Clojure 1.6 Cheat Sheet (v21)\n"
              "5/23/2015\nClojure\n"
              "5/23/2015\nClojure\n"]
             (extract-by-areas "test/pdfs/multi-page.pdf" areas))))))
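
The defaults exercised by the last test come from the `:or` clause in the map destructuring of `area-text`. The following is a minimal sketch of that behavior on its own, independent of PDFBox; `region` is a hypothetical helper written for illustration and is not part of the library.

```clojure
;; Hypothetical helper: applies the same destructuring defaults as the
;; PR's `area-text`, but just returns the resolved values as a map.
(defn region
  [{:keys [x y w h page-number]
    :or {x 0 y 0 w 0 h 0 page-number 0}}]
  {:x x :y y :w w :h h :page-number page-number})

;; Missing keys fall back to the :or defaults.
(region {:w 280 :h 100})
;; => {:x 0, :y 0, :w 280, :h 100, :page-number 0}

;; A key that is present with a nil value does NOT get the default:
;; :or only applies when the key is absent from the map.
(region {:x nil :w 280 :h 100})
;; => {:x nil, :y 0, :w 280, :h 100, :page-number 0}
```

The second call suggests that callers should leave a coordinate out rather than pass `nil` for it.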