# Document Extraction Pipeline

A flexible, schema-driven pipeline for extracting structured data from any type of document or image using LlamaParse and OpenAI.

## Features

- **Flexible Transformations**: Apply custom transformation functions to extracted data
- **Extensible**: Easy to adapt for receipts, invoices, forms, IDs, or any document type

## File Structure

```
llm/smart_data_extraction_llamaindex/
├── document_extraction_pipeline.py   # Generic pipeline (reusable)
├── extract_receipts_pipeline.py      # Receipt-specific (schema + logic + example)
└── README.md                         # This file
```

## Quick Start

### Option 1: Use the Receipt Pipeline

Run the ready-to-use receipt extraction pipeline:

```bash
uv run extract_receipts_pipeline.py
```

The receipt pipeline includes:
- `Receipt` and `ReceiptItem` Pydantic schemas (sketched below)
- Receipt-specific data transformations
- A pre-configured extraction prompt
- Example usage in its `__main__` block

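The actual definitions live in `extract_receipts_pipeline.py`; as a rough sketch (the field names here are illustrative assumptions, not the file's exact contents), the schemas look something like this:

```python
from typing import List, Optional

from pydantic import BaseModel, Field


# Illustrative only -- see extract_receipts_pipeline.py for the real fields
class ReceiptItem(BaseModel):
    description: str = Field(description="Line-item description")
    quantity: Optional[float] = Field(default=None, description="Quantity purchased")
    price: Optional[float] = Field(default=None, description="Line-item price")


class Receipt(BaseModel):
    merchant_name: str = Field(description="Merchant name")
    total_amount: float = Field(description="Total amount")
    items: List[ReceiptItem] = Field(default_factory=list, description="Line items")
```
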
### Option 2: Create Your Own Pipeline

Import the generic pipeline and create a custom extractor:

```python
from datetime import date
from typing import Optional

import pandas as pd
from pydantic import BaseModel, Field

from document_extraction_pipeline import main


# 1. Define your schema
class Invoice(BaseModel):
    invoice_number: str = Field(description="Invoice number")
    vendor_name: str = Field(description="Vendor name")
    invoice_date: Optional[date] = Field(default=None, description="Invoice date")
    total_amount: float = Field(description="Total amount")


# 2. Optional: define transformations
def transform_invoice_data(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["vendor_name"] = df["vendor_name"].str.upper()
    df["total_amount"] = pd.to_numeric(df["total_amount"], errors="coerce")
    return df


# 3. Define the extraction prompt
INVOICE_PROMPT = """
Extract invoice data from the following document.
If a field is missing, return null.

{context_str}
"""


# 4. Run the extraction
if __name__ == "__main__":
    invoice_paths = ["invoice1.pdf", "invoice2.pdf"]

    result_df = main(
        image_paths=invoice_paths,
        output_cls=Invoice,
        prompt=INVOICE_PROMPT,
        id_column="invoice_id",
        transform_fn=transform_invoice_data,
    )

    print(result_df)
```

## API Reference

### `main()` Function

```python
def main(
    image_paths: List[str],
    output_cls: Type[BaseModel],
    prompt: str,
    id_column: str = "document_id",
    fields: Optional[List[str]] = None,
    preprocess: bool = False,
    output_dir: Optional[Path] = None,
    scale_factor: int = 3,
    transform_fn: Optional[Callable[[pd.DataFrame], pd.DataFrame]] = None,
) -> pd.DataFrame
```

**Required Parameters:**
- `image_paths`: List of document/image paths
- `output_cls`: Pydantic model class for extraction
- `prompt`: Extraction prompt template (must include `{context_str}`)

**Optional Parameters:**
- `id_column`: Document ID column name (default: `"document_id"`)
- `fields`: Fields to extract (default: all model fields)
- `preprocess`: Enable image preprocessing (default: `False`)
- `output_dir`: Directory for preprocessed images
- `scale_factor`: Image scaling factor (default: `3`)
- `transform_fn`: Custom transformation function

**Returns:**
- `pd.DataFrame`: Extracted data

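Since `fields` defaults to all model fields, passing an explicit list should restrict extraction to a subset. A minimal sketch, reusing the `Invoice` schema and `INVOICE_PROMPT` from Option 2 above (the exact interaction of `fields` with the schema is an assumption):

```python
# Hypothetical subset extraction: only pull two of Invoice's fields
result_df = main(
    image_paths=["invoice1.pdf"],
    output_cls=Invoice,
    prompt=INVOICE_PROMPT,
    fields=["vendor_name", "total_amount"],
)
```
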
## Usage Examples

### Basic Extraction

```python
from pydantic import BaseModel, Field

from document_extraction_pipeline import main


class BusinessCard(BaseModel):
    name: str = Field(description="Person's name")
    company: str = Field(description="Company name")
    email: str = Field(description="Email address")


result = main(
    image_paths=["card.jpg"],
    output_cls=BusinessCard,
    prompt="Extract business card info: {context_str}",
)
```

### With Image Preprocessing

```python
from pathlib import Path

from extract_receipts_pipeline import Receipt

result = main(
    image_paths=["low_res.jpg"],
    output_cls=Receipt,
    prompt="Extract receipt: {context_str}",
    preprocess=True,
    output_dir=Path("processed_images"),
    scale_factor=3,
)
```

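The preprocessing step itself lives inside `document_extraction_pipeline.py`; given that Pillow is among the dependencies and `scale_factor` defaults to 3, it plausibly amounts to upscaling each image before parsing. A sketch of that kind of operation (an assumption, not the pipeline's actual code):

```python
from pathlib import Path

from PIL import Image


def upscale_image(path: str, output_dir: Path, scale_factor: int = 3) -> Path:
    """Resize an image by scale_factor and save it to output_dir."""
    img = Image.open(path)
    resized = img.resize((img.width * scale_factor, img.height * scale_factor))
    output_dir.mkdir(parents=True, exist_ok=True)
    out_path = output_dir / Path(path).name
    resized.save(out_path)
    return out_path
```
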
### With Custom Transformations

```python
import pandas as pd
from pydantic import BaseModel, Field


# Minimal schema matching the columns clean_data expects
class FormData(BaseModel):
    name: str = Field(description="Full name")
    email: str = Field(description="Email address")


def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    df["name"] = df["name"].str.title()
    df["email"] = df["email"].str.lower()
    return df


result = main(
    image_paths=["form.pdf"],
    output_cls=FormData,
    prompt="Extract: {context_str}",
    transform_fn=clean_data,
)
```

## Creating New Document Extractors

To create a new document extractor (like the receipt pipeline):

1. Import the generic `main` function from `document_extraction_pipeline`
2. Define your Pydantic schema(s)
3. (Optional) Create a transformation function
4. Define an extraction prompt
5. Add a `__main__` block with example usage

See [extract_receipts_pipeline.py](extract_receipts_pipeline.py) for a complete example.

## Dependencies

Both scripts declare their dependencies as uv inline script metadata. Required packages:

- llama-index
- llama-index-program-openai
- llama-parse
- python-dotenv
- pandas
- pillow

Run either script with `uv run <script_name>.py`; the dependencies are installed automatically.
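
For reference, the inline metadata uv reads is a PEP 723 script block at the top of each file; abbreviated, and assuming the scripts list exactly the packages above, it looks like this:

```python
# /// script
# dependencies = [
#     "llama-index",
#     "llama-index-program-openai",
#     "llama-parse",
#     "python-dotenv",
#     "pandas",
#     "pillow",
# ]
# ///
```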