amazon_textract

This repo uses Amazon Textract to extract text! It is one of the repos for the restraint and seclusion project.

Here we use Amazon Textract to scrape handwritten numbers and words from records of more than 30,000 restraint and seclusion incidents in New York State, most of which are PDFs.
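As a rough illustration of this step, here is a minimal sketch of pulling line text out of a Textract response. It is not the code from this repo: the helper names `extract_lines` and `detect_text_in_image`, the `min_confidence` parameter, and the sample response are assumptions made up for the example. The boto3 call shown handles a single local image; multi-page PDFs usually go through Textract's asynchronous S3-based operations instead.

```python
def extract_lines(response, min_confidence=0.0):
    """Return the text of every LINE block in a Textract response dict."""
    lines = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE and WORD blocks
        if block.get("Confidence", 0.0) >= min_confidence:
            lines.append(block.get("Text", ""))
    return lines


def detect_text_in_image(image_path):
    """Call Textract on one local image file. This call is billed by AWS."""
    import boto3  # imported here so extract_lines works without AWS set up

    client = boto3.client("textract")
    with open(image_path, "rb") as f:
        return client.detect_document_text(Document={"Bytes": f.read()})


# Hand-made response in the shape Textract returns, for illustration only.
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Incident Date: 03/14/2019", "Confidence": 98.2},
        {"BlockType": "WORD", "Text": "Incident", "Confidence": 99.0},
        {"BlockType": "LINE", "Text": "Duration 12 min", "Confidence": 61.5},
    ]
}

# Keep only lines Textract is fairly sure about; handwriting often scores low.
print(extract_lines(sample_response, min_confidence=90))
```

Filtering on confidence is a design choice worth checking against real scans, since a strict threshold can silently drop handwritten values.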

Each school district documents incidents in its own way, so we wrote a custom script for each district. (Sounds like a lot, right? But we did it!)

text_config.py extracts the data we need from the Textract outputs. It uses keywords to identify which words or numbers we need for the story. Before extracting data, check whether any of the keywords in text_config.py need to change.
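To give a feel for keyword-based extraction, here is a small sketch. It is an assumption about the general approach, not the contents of text_config.py: the `KEYWORDS` map, the field names, and the `extract_fields` helper are all hypothetical.

```python
import re

# Hypothetical keyword map: output field -> label text expected on the form.
KEYWORDS = {
    "incident_date": "Date of Incident",
    "duration_minutes": "Duration",
    "staff": "Staff Involved",
}


def extract_fields(lines, keywords=KEYWORDS):
    """Find each keyword in the OCR lines and keep the text that follows it."""
    fields = {name: None for name in keywords}
    for line in lines:
        for name, keyword in keywords.items():
            if fields[name] is not None:
                continue  # keep the first match for each field
            # Keyword, then an optional ":" or "-", then the value.
            pattern = re.escape(keyword) + r"\s*[:\-]?\s*(.*)"
            match = re.search(pattern, line, re.IGNORECASE)
            if match:
                fields[name] = match.group(1).strip() or None
    return fields


# Made-up OCR lines, for illustration only.
ocr_lines = ["Date of Incident: 03/14/2019", "duration - 12", "Student was escorted"]
print(extract_fields(ocr_lines))
```

Because each district words its forms differently, the keyword map is the part you would expect to change from one district to the next.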

Some PDFs contain image-based tables. Tabula could read only a small share of them, which didn't make our lives any easier, so we turned to Textract again. table_parser.py scrapes data from these tables and creates CSV files from the outputs.
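The sketch below shows one way table output from Textract can be turned into CSV. It is not the code in table_parser.py: the helpers `table_rows` and `rows_to_csv` and the sample blocks are assumptions for illustration, and the sketch assumes the blocks describe a single table. It relies on the CELL blocks that Textract's table analysis returns, each with a `RowIndex`, a `ColumnIndex`, and CHILD relationships pointing at WORD blocks.

```python
import csv
import io


def table_rows(blocks):
    """Rebuild table rows from CELL and WORD blocks (assumes a single table)."""
    by_id = {b["Id"]: b for b in blocks}
    cells = {}
    for block in blocks:
        if block["BlockType"] != "CELL":
            continue
        words = []
        # Empty cells have no Relationships key, hence the .get default.
        for rel in block.get("Relationships", []):
            if rel["Type"] != "CHILD":
                continue
            for child_id in rel["Ids"]:
                child = by_id.get(child_id)
                if child and child["BlockType"] == "WORD":
                    words.append(child["Text"])
        cells[(block["RowIndex"], block["ColumnIndex"])] = " ".join(words)
    if not cells:
        return []
    n_rows = max(r for r, _ in cells)
    n_cols = max(c for _, c in cells)
    # Textract row and column indexes start at 1.
    return [[cells.get((r, c), "") for c in range(1, n_cols + 1)]
            for r in range(1, n_rows + 1)]


def rows_to_csv(rows):
    """Serialize rows to CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def _cell(cell_id, row, col, word_ids):
    return {"Id": cell_id, "BlockType": "CELL", "RowIndex": row, "ColumnIndex": col,
            "Relationships": [{"Type": "CHILD", "Ids": word_ids}]}


# Hand-made blocks for a 2x2 table, for illustration only.
sample_blocks = [
    {"Id": "w1", "BlockType": "WORD", "Text": "Date"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Minutes"},
    {"Id": "w3", "BlockType": "WORD", "Text": "3/14"},
    {"Id": "w4", "BlockType": "WORD", "Text": "12"},
    _cell("c1", 1, 1, ["w1"]), _cell("c2", 1, 2, ["w2"]),
    _cell("c3", 2, 1, ["w3"]), _cell("c4", 2, 2, ["w4"]),
]
print(rows_to_csv(table_rows(sample_blocks)))
```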

Most important of all: running Textract costs money, so make sure your manager knows about it before you run any scripts.

Folders and files

__pycache__
csv
csv_from_image
image
json
pdf
.DS_Store
Pipfile
Pipfile.lock
README.md
Textract_PostProcessing.ipynb
all_df.csv
asss.py
ball.py
beth.py
bri.py
camelot.py
coh.py
csv_clean.py
glen.py
noco.py
pdf_tables.py
s3_test.py
sch.py
script.py
sul.py
super_messy_MON1_RISI_01.csv
syr.py
table.py
table_parser.py
text.docx
text.txt
text_config.py
troy.py

sfchronicle/amazon_textract
