
Commit 9c02123

refactor(llm): separate general pipeline from receipt-specific logic
- Create document_extraction_pipeline.py with generic reusable functions
- Move Receipt schemas into extract_receipts_pipeline.py
- Keep all receipt logic (schema + transformations + example) in one file
- Keep all general logic in separate reusable pipeline
- Delete schemas/ directory and example_invoice.py
- Update README with new structure and usage examples
1 parent 966bf3f commit 9c02123

File tree

6 files changed: +375 −344 lines changed

README.md: 134 additions & 54 deletions
@@ -1,4 +1,4 @@
-# Generic Document Extraction Pipeline
+# Document Extraction Pipeline
 
 A flexible, schema-driven pipeline for extracting structured data from any type of document or image using LlamaParse and OpenAI.
 
@@ -9,58 +9,149 @@ A flexible, schema-driven pipeline for extracting structured data from any type
 - **Flexible Transformations**: Apply custom transformation functions to extracted data
 - **Extensible**: Easy to adapt for receipts, invoices, forms, IDs, or any document type
 
+## File Structure
+
+```
+llm/smart_data_extraction_llamaindex/
+├── document_extraction_pipeline.py   # Generic pipeline (reusable)
+├── extract_receipts_pipeline.py      # Receipt-specific (schema + logic + example)
+└── README.md                         # This file
+```
+
 ## Quick Start
 
-### 1. Define Your Schema
+### Option 1: Use the Receipt Pipeline
 
-Create a Pydantic model for your document type:
+Run the ready-to-use receipt extraction pipeline:
+
+```bash
+uv run extract_receipts_pipeline.py
+```
+
+The receipt pipeline includes:
+- `Receipt` and `ReceiptItem` Pydantic schemas
+- Receipt-specific data transformations
+- Pre-configured extraction prompt
+- Example usage in `__main__` block
+
+### Option 2: Create Your Own Pipeline
+
+Import the generic pipeline and create a custom extractor:
 
 ```python
+from datetime import date
+from pathlib import Path
+from typing import Optional
+
+import pandas as pd
 from pydantic import BaseModel, Field
+from document_extraction_pipeline import main
+
 
+# 1. Define your schema
 class Invoice(BaseModel):
     invoice_number: str = Field(description="Invoice number")
     vendor_name: str = Field(description="Vendor name")
+    invoice_date: Optional[date] = Field(default=None)
     total_amount: float = Field(description="Total amount")
+
+
+# 2. Optional: Define transformations
+def transform_invoice_data(df: pd.DataFrame) -> pd.DataFrame:
+    df = df.copy()
+    df["vendor_name"] = df["vendor_name"].str.upper()
+    df["total_amount"] = pd.to_numeric(df["total_amount"], errors="coerce")
+    return df
+
+
+# 3. Define extraction prompt
+INVOICE_PROMPT = """
+Extract invoice data from the following document.
+If a field is missing, return null.
+
+{context_str}
+"""
+
+
+# 4. Run extraction
+if __name__ == "__main__":
+    invoice_paths = ["invoice1.pdf", "invoice2.pdf"]
+
+    result_df = main(
+        image_paths=invoice_paths,
+        output_cls=Invoice,
+        prompt=INVOICE_PROMPT,
+        id_column="invoice_id",
+        transform_fn=transform_invoice_data,
+    )
+
+    print(result_df)
 ```
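The `{context_str}` placeholder in the prompt above is where the parsed document text is injected. A minimal, dependency-free sketch of that substitution (the `fill_prompt` helper is hypothetical, not part of the pipeline; it just shows the standard `str.format` semantics the template relies on):

```python
# Hypothetical helper illustrating how an extraction prompt template is
# filled with parsed document text via standard str.format semantics.
INVOICE_PROMPT = """
Extract invoice data from the following document.
If a field is missing, return null.

{context_str}
"""

def fill_prompt(template: str, context_str: str) -> str:
    # Fail fast if the template is missing the required placeholder.
    if "{context_str}" not in template:
        raise ValueError("prompt template must include {context_str}")
    return template.format(context_str=context_str)

filled = fill_prompt(INVOICE_PROMPT, "INVOICE #1001\nVendor: ACME\nTotal: 42.00")
```

This is also why the docs insist the prompt "must include `{context_str}`": without the placeholder, the parsed document never reaches the LLM.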

-### 2. Run Extraction
+## API Reference
 
-```python
-from extract_receipts_pipeline import main
+### `main()` Function
 
-result_df = main(
-    image_paths=["invoice1.pdf", "invoice2.pdf"],
-    output_cls=Invoice,
-    prompt="Extract invoice data from: {context_str}",
-    id_column="invoice_id",
-)
+```python
+def main(
+    image_paths: List[str],
+    output_cls: Type[BaseModel],
+    prompt: str,
+    id_column: str = "document_id",
+    fields: Optional[List[str]] = None,
+    preprocess: bool = False,
+    output_dir: Optional[Path] = None,
+    scale_factor: int = 3,
+    transform_fn: Optional[Callable[[pd.DataFrame], pd.DataFrame]] = None,
+) -> pd.DataFrame
 ```
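A rough sketch of the control flow this signature suggests: parse and extract each document, tag every row with the ID column, then apply the optional transform. The names here (`run_pipeline`, `extract_fn`) are hypothetical, and plain dicts stand in for the pandas DataFrame the real `main()` returns, to keep the sketch dependency-free:

```python
from typing import Callable, Dict, List, Optional

def run_pipeline(
    image_paths: List[str],
    extract_fn: Callable[[str], Dict],  # stand-in for the LlamaParse + OpenAI call
    prompt: str,
    id_column: str = "document_id",
    transform_fn: Optional[Callable[[List[Dict]], List[Dict]]] = None,
) -> List[Dict]:
    # The prompt template must carry the placeholder the parser output fills.
    if "{context_str}" not in prompt:
        raise ValueError("prompt must include {context_str}")
    rows = []
    for path in image_paths:
        row = extract_fn(path)   # one structured-extraction call per document
        row[id_column] = path    # tag the row with its source document
        rows.append(row)
    if transform_fn is not None:
        rows = transform_fn(rows)  # optional post-processing hook
    return rows

rows = run_pipeline(
    ["a.jpg", "b.jpg"],
    extract_fn=lambda p: {"total": 1.0},  # stubbed extraction for illustration
    prompt="Extract: {context_str}",
)
```

The real pipeline presumably does more (preprocessing, field selection), but the per-document loop plus a trailing transform hook is the shape the parameters imply.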

+**Required Parameters:**
+- `image_paths`: List of document/image paths
+- `output_cls`: Pydantic model class for extraction
+- `prompt`: Extraction prompt template (must include `{context_str}`)
+
+**Optional Parameters:**
+- `id_column`: Document ID column name (default: "document_id")
+- `fields`: Fields to extract (default: all model fields)
+- `preprocess`: Enable image preprocessing (default: False)
+- `output_dir`: Directory for preprocessed images
+- `scale_factor`: Image scaling factor (default: 3)
+- `transform_fn`: Custom transformation function
+
+**Returns:**
+- `pd.DataFrame`: Extracted data
+
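The `fields` default ("all model fields") can be pictured with a small stand-in. Here a dataclass replaces the Pydantic model, and `fields_to_extract` is a hypothetical helper sketching the documented behaviour, not the pipeline's actual API:

```python
from dataclasses import dataclass, fields as dataclass_fields
from typing import List, Optional

@dataclass
class BusinessCard:
    name: str
    company: str
    email: str

def fields_to_extract(cls, requested: Optional[List[str]] = None) -> List[str]:
    # Default to every field declared on the model, mirroring the
    # documented behaviour of the `fields` parameter.
    all_fields = [f.name for f in dataclass_fields(cls)]
    if requested is None:
        return all_fields
    unknown = set(requested) - set(all_fields)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    return list(requested)
```

With Pydantic models the field names would come from the model class itself; the point is only that omitting `fields` extracts everything the schema declares.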
 ## Usage Examples
 
-### Basic Extraction (No Ground Truth)
+### Basic Extraction
 
 ```python
-from schemas.receipt_schema import Receipt
-from extract_receipts_pipeline import main
+from document_extraction_pipeline import main
+from pydantic import BaseModel, Field
+
+class BusinessCard(BaseModel):
+    name: str = Field(description="Person's name")
+    company: str = Field(description="Company name")
+    email: str = Field(description="Email address")
 
 result = main(
-    image_paths=["receipt1.jpg"],
-    output_cls=Receipt,
-    prompt="Extract receipt data: {context_str}",
+    image_paths=["card.jpg"],
+    output_cls=BusinessCard,
+    prompt="Extract business card info: {context_str}",
 )
 ```

-### With Preprocessing
+### With Image Preprocessing
 
 ```python
 from pathlib import Path
+from extract_receipts_pipeline import Receipt
 
 result = main(
     image_paths=["low_res.jpg"],
     output_cls=Receipt,
-    prompt="Extract data: {context_str}",
+    prompt="Extract receipt: {context_str}",
     preprocess=True,
     output_dir=Path("processed_images"),
     scale_factor=3,
@@ -72,50 +163,39 @@ result = main(
 ```python
 import pandas as pd
 
-def transform_data(df: pd.DataFrame) -> pd.DataFrame:
-    df["vendor"] = df["vendor"].str.upper()
-    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
+def clean_data(df: pd.DataFrame) -> pd.DataFrame:
+    df["name"] = df["name"].str.title()
+    df["email"] = df["email"].str.lower()
     return df
 
 result = main(
-    image_paths=["invoice.pdf"],
-    output_cls=Invoice,
+    image_paths=["form.pdf"],
+    output_cls=FormData,
     prompt="Extract: {context_str}",
-    transform_fn=transform_data,
+    transform_fn=clean_data,
 )
 ```
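The same cleanup can be expressed without pandas. A dependency-free analogue operating on row dicts (`clean_rows` is a hypothetical name, shown only to make the transform's effect concrete):

```python
def clean_rows(rows: list) -> list:
    # Mirror of the pandas transform above:
    # title-case names, lower-case email addresses.
    cleaned = []
    for row in rows:
        row = dict(row)  # copy so the input rows are not mutated
        row["name"] = row["name"].title()
        row["email"] = row["email"].lower()
        cleaned.append(row)
    return cleaned

sample = clean_rows([{"name": "ada lovelace", "email": "Ada@Example.COM"}])
```

Whatever form the transform takes, the contract is the same: receive the extracted table, return it cleaned.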

-## Parameters
+## Creating New Document Extractors
 
-### Required
-- `image_paths`: List of document/image paths
-- `output_cls`: Pydantic model class for extraction
-- `prompt`: Extraction prompt template (must include `{context_str}`)
-
-### Optional
-- `id_column`: Document ID column name (default: "document_id")
-- `fields`: Fields to extract (default: all model fields)
-- `preprocess`: Enable image preprocessing (default: False)
-- `output_dir`: Directory for preprocessed images
-- `scale_factor`: Image scaling factor (default: 3)
-- `transform_fn`: Custom transformation function
+To create a new document extractor (like the receipt pipeline):
 
-## File Structure
+1. Import the generic `main` function from `document_extraction_pipeline`
+2. Define your Pydantic schema(s)
+3. (Optional) Create a transformation function
+4. Define the extraction prompt
+5. Add a `__main__` block with example usage
 
-```
-llm/smart_data_extraction_llamaindex/
-├── extract_receipts_pipeline.py   # Main pipeline
-├── schemas/
-│   ├── __init__.py
-│   └── receipt_schema.py          # Receipt schema example
-├── example_invoice.py             # Invoice extraction example
-└── README.md                      # This file
-```
+See [extract_receipts_pipeline.py](extract_receipts_pipeline.py) for a complete example.

-## Custom Schema Examples
+## Dependencies
 
-See:
-- `schemas/receipt_schema.py` - Receipt extraction
-- `example_invoice.py` - Invoice extraction example
+Both scripts declare their dependencies inline (uv script metadata). Required packages:
+- llama-index
+- llama-index-program-openai
+- llama-parse
+- python-dotenv
+- pandas
+- pillow
 
-Create your own schemas in the `schemas/` directory!
+Run with `uv run <script_name>.py` - dependencies are installed automatically.
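Inline script metadata follows the PEP 723 header format that uv reads before running a script. A sketch of what the top of such a script looks like (the `requires-python` value and the abridged package list are illustrative, not copied from the actual files):

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "llama-index",
#     "llama-parse",
#     "pandas",
# ]
# ///
```

Because the block is plain comments, the script stays a valid Python file; uv parses it and installs the listed packages into an ephemeral environment on `uv run`.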
