aws-pdf-textract-pipeline

🔍 Data pipeline for crawling PDFs from the Web and transforming their contents into structured data using AWS Textract. Built with AWS CDK + TypeScript.

This is an example data pipeline that illustrates one possible approach for large-scale serverless PDF processing - it should serve as a good foundation to modify for your own purposes.

Getting Started

Run the following commands to install dependencies, build the CDK stack, bootstrap your AWS environment, and deploy the stack to AWS.

yarn install
yarn build
cdk bootstrap
cdk deploy

Overview

The following is an overview of each process performed by this CDK stack.

Scrape PDF download URLs from a website

The example scrapes PDF download URLs from the COGCC website.
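As a rough illustration of the scraping step, the sketch below collects PDF links from a page's HTML. The function name `extractPdfUrls` and the regex-based approach are assumptions made for this example, not the code used in this repository; a real scraper would more likely use an HTML parser.

```typescript
// Hypothetical helper (not from this repository): collect unique PDF links
// found in href attributes, resolving relative links against the page URL.
function extractPdfUrls(html: string, pageUrl: string): string[] {
  const hrefPattern = /href\s*=\s*["']([^"']+)["']/gi;
  const urls: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = hrefPattern.exec(html)) !== null) {
    let resolved: URL;
    try {
      resolved = new URL(match[1], pageUrl);
    } catch {
      continue; // skip hrefs that cannot be parsed as URLs
    }
    const isPdf = resolved.pathname.toLowerCase().endsWith(".pdf");
    const href = resolved.toString();
    if (isPdf && urls.indexOf(href) === -1) {
      urls.push(href);
    }
  }
  return urls;
}
```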
Store PDF download URL in DynamoDB

Download the PDF to S3

A lambda fires off when a new PDF download URL has been created in DynamoDB.
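One way to picture this trigger, assuming the lambda is subscribed to a DynamoDB Stream, is a function that keeps only newly inserted items. The types below are trimmed-down local stand-ins for the stream event shape, and the attribute name `url` is a hypothetical choice for this sketch.

```typescript
// Minimal local types covering only the parts of a DynamoDB Streams event used here.
interface StreamRecord {
  eventName?: "INSERT" | "MODIFY" | "REMOVE";
  dynamodb?: { NewImage?: { [attribute: string]: { S?: string } } };
}
interface StreamEvent {
  Records: StreamRecord[];
}

// Assumption: each item stores its download URL in a string attribute named "url".
function newPdfUrls(event: StreamEvent): string[] {
  const urls: string[] = [];
  for (const record of event.Records) {
    if (record.eventName !== "INSERT") continue; // ignore updates and deletes
    const url = record.dynamodb?.NewImage?.["url"]?.S;
    if (url) urls.push(url);
  }
  return urls;
}
```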
Process the PDF with AWS Textract

Another lambda fires off when a PDF has been downloaded to the S3 bucket.
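A sketch of what this lambda might prepare, assuming it uses Textract's asynchronous StartDocumentAnalysis operation with an SNS notification channel. The function name, the `FORMS` feature type, and the ARN parameters are illustrative assumptions; the returned objects only mirror the request shape and are not sent anywhere.

```typescript
// Minimal local types for the parts of an S3 event notification used here.
interface S3EventRecord {
  s3: { bucket: { name: string }; object: { key: string } };
}
interface S3Event {
  Records: S3EventRecord[];
}

// Mirrors the request shape of Textract's StartDocumentAnalysis operation.
interface StartAnalysisParams {
  DocumentLocation: { S3Object: { Bucket: string; Name: string } };
  FeatureTypes: string[];
  NotificationChannel: { SNSTopicArn: string; RoleArn: string };
}

// Hypothetical helper: build one analysis request per uploaded PDF.
function buildAnalysisRequests(
  event: S3Event,
  snsTopicArn: string,
  roleArn: string
): StartAnalysisParams[] {
  return event.Records.map((record) => ({
    DocumentLocation: {
      S3Object: {
        Bucket: record.s3.bucket.name,
        // Object keys in S3 events are URL-encoded, with "+" standing in for spaces.
        Name: decodeURIComponent(record.s3.object.key.replace(/\+/g, " ")),
      },
    },
    FeatureTypes: ["FORMS"], // assumption: form key/value extraction is wanted
    NotificationChannel: { SNSTopicArn: snsTopicArn, RoleArn: roleArn },
  }));
}
```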
Process the AWS Textract results

When an SNS event is detected from AWS Textract, a lambda is fired off to process the result.
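The following sketch shows how such a lambda could pick finished jobs out of the SNS event, assuming the standard Textract completion message (a JSON string carrying `JobId` and `Status`). The types are local stand-ins and `completedJobs` is a hypothetical name.

```typescript
// Minimal local types for an SNS-triggered Lambda event.
interface SnsEvent {
  Records: { Sns: { Message: string } }[];
}

// Fields of the Textract completion message that this sketch relies on.
interface TextractNotification {
  JobId: string;
  Status: string;
  DocumentLocation?: { S3Bucket: string; S3ObjectName: string };
}

// Hypothetical helper: return only the notifications for jobs that succeeded.
// A real handler would then fetch the results for each JobId
// (for example with GetDocumentAnalysis, following NextToken across pages).
function completedJobs(event: SnsEvent): TextractNotification[] {
  const done: TextractNotification[] = [];
  for (const record of event.Records) {
    const message = JSON.parse(record.Sns.Message) as TextractNotification;
    if (message.Status === "SUCCEEDED") done.push(message);
  }
  return done;
}
```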
Save the processed Textract result to DynamoDB

After the full result is pruned down to the desired data structure, the data is saved in DynamoDB.
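The pruning step depends on the data you want to keep. As one possible sketch, assuming the target structure is a flat map of form keys to values, the code below walks Textract's `KEY_VALUE_SET` blocks and joins their child `WORD` blocks. The `Block` type is a trimmed local stand-in and `pruneToKeyValues` is a hypothetical name; selection elements such as checkboxes are ignored.

```typescript
// Trimmed-down stand-in for a Textract Block.
interface Block {
  Id: string;
  BlockType: string;
  Text?: string;
  EntityTypes?: string[];
  Relationships?: { Type: string; Ids: string[] }[];
}

// Collect the IDs a block points to through relationships of the given type.
function relatedIds(block: Block, type: string): string[] {
  const ids: string[] = [];
  for (const rel of block.Relationships ?? []) {
    if (rel.Type === type) ids.push(...rel.Ids);
  }
  return ids;
}

// Join the text of a block's child WORD blocks.
function textOf(block: Block, byId: Map<string, Block>): string {
  return relatedIds(block, "CHILD")
    .map((id) => byId.get(id))
    .filter((b): b is Block => b !== undefined && b.BlockType === "WORD")
    .map((word) => word.Text ?? "")
    .join(" ");
}

// Hypothetical pruning step: reduce the full block list to { key: value } pairs.
function pruneToKeyValues(blocks: Block[]): Record<string, string> {
  const byId = new Map<string, Block>();
  for (const block of blocks) byId.set(block.Id, block);

  const result: Record<string, string> = {};
  for (const block of blocks) {
    const isKey =
      block.BlockType === "KEY_VALUE_SET" &&
      (block.EntityTypes ?? []).indexOf("KEY") !== -1;
    if (!isKey) continue;
    const key = textOf(block, byId);
    const value = relatedIds(block, "VALUE")
      .map((id) => byId.get(id))
      .filter((b): b is Block => b !== undefined)
      .map((valueBlock) => textOf(valueBlock, byId))
      .join(" ");
    if (key) result[key] = value;
  }
  return result;
}
```

The resulting object is small enough to store as a single DynamoDB item, which is the reason for pruning before the save.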

Scripts

yarn install - installs dependencies
yarn build - builds the production-ready CDK Stack
yarn test - runs Jest
cdk bootstrap - bootstraps AWS CloudFormation for your CDK deploy
cdk deploy - deploys the CDK stack to AWS

Notes

Warning: the AnalyzeDocument process from AWS Textract costs $50 per 1,000 PDF pages. Be careful when deploying this CDK stack, as you could quickly run up an expensive AWS bill without noticing.
If a PDF download URL has already been added to the pdfUrlsTable DynamoDB table, the pipeline will not re-execute for the PDF.
Includes tests with Jest.
Recommended to use Visual Studio Code with the Format on Save setting turned on.
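
To put the pricing warning in these notes into concrete terms, the small sketch below estimates the Textract bill for a given page count. It uses the $50 per 1,000 pages figure quoted above as a default; the function name is hypothetical, and the rate should be checked against current AWS pricing before relying on it.

```typescript
// Hypothetical helper: estimate Textract cost in US dollars.
// The default rate is the figure quoted in this README, not a live price.
function estimateTextractCostUsd(pages: number, usdPer1000Pages: number = 50): number {
  if (pages < 0) throw new Error("pages must be non-negative");
  return (pages * usdPer1000Pages) / 1000;
}

// Example: 2,500 pages at $50 per 1,000 pages.
const estimate = estimateTextractCostUsd(2500); // 125
```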

Built with

Additional Resources

License

Open source under the MIT License.

Built with ❤️ by aeksco
