Extract tables from PDF documents using the Azure Document Intelligence (formerly Form Recognizer) prebuilt-layout model.
- Uses the prebuilt-layout model to detect tables
- Exports tables to CSV and/or JSON
- Pretty console rendering of tables
- Environment-based configuration
Prerequisites:
- Azure subscription
- Azure Document Intelligence resource (Cognitive Services / Document Intelligence) with an endpoint + key.
- Python 3.10+
If you have not created a resource yet:
az group create -n my-docint-rg -l eastus
az cognitiveservices account create `
-n my-docint-resource `
-g my-docint-rg `
-l eastus `
--kind FormRecognizer `
--sku S0 `
--yes
az cognitiveservices account keys list -n my-docint-resource -g my-docint-rg
Take the endpoint from:
az cognitiveservices account show -n my-docint-resource -g my-docint-rg --query properties.endpoint -o tsv
Clone / copy this folder then:
python -m venv .venv
. .venv/Scripts/Activate.ps1
pip install -r requirements.txt
Create .env from template:
Copy-Item .env.example .env
Fill in values:
AZURE_DOCUMENTINTELLIGENCE_ENDPOINT=https://<your-resource>.cognitiveservices.azure.com/
AZURE_DOCUMENTINTELLIGENCE_KEY=<key>
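As a rough illustration of the environment-based configuration, the sketch below reads the two variables above and fails early if either is missing. The names `Settings` and `load_settings` are hypothetical and not taken from the script. It assumes the values are already in the process environment (for example, after a .env loader has run).

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the two values the .env file provides."""
    endpoint: str
    key: str


def load_settings(env=None) -> Settings:
    """Read endpoint and key from the environment, failing early if one is missing."""
    env = os.environ if env is None else env
    values = {
        "AZURE_DOCUMENTINTELLIGENCE_ENDPOINT": env.get("AZURE_DOCUMENTINTELLIGENCE_ENDPOINT", "").strip(),
        "AZURE_DOCUMENTINTELLIGENCE_KEY": env.get("AZURE_DOCUMENTINTELLIGENCE_KEY", "").strip(),
    }
    # Collect every missing name so the error message lists them all at once.
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return Settings(
        endpoint=values["AZURE_DOCUMENTINTELLIGENCE_ENDPOINT"],
        key=values["AZURE_DOCUMENTINTELLIGENCE_KEY"],
    )
```

Passing a plain dict as `env` keeps the function easy to test without touching the real environment.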
python .\src\di_extract_tables.py .\sample.pdf --out output --format both
Options:
- --format csv|json|both (default: both)
- --no-pretty: skip console table rendering
Outputs:
- output/<pdfstem>_table{n}.csv for each table
- output/<pdfstem>_tables.json aggregated structure (if json enabled)
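The output naming above could be produced by something like the following sketch. The function `export_tables` is hypothetical, and it assumes each table is a dict with a `table_index` and a `cells` grid, with `{n}` taken from `table_index`. Whether the actual script numbers files from 0 or 1 is not stated here.

```python
import csv
import json
from pathlib import Path


def export_tables(tables, pdf_path, out_dir="output", fmt="both"):
    """Write one CSV per table and/or one aggregated JSON file; return the paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    stem = Path(pdf_path).stem  # "sample.pdf" -> "sample"
    written = []
    if fmt in ("csv", "both"):
        for table in tables:
            csv_path = out / f"{stem}_table{table['table_index']}.csv"
            # newline="" lets the csv module control line endings itself.
            with csv_path.open("w", newline="", encoding="utf-8") as fh:
                csv.writer(fh).writerows(table["cells"])
            written.append(csv_path)
    if fmt in ("json", "both"):
        json_path = out / f"{stem}_tables.json"
        json_path.write_text(
            json.dumps(tables, indent=2, ensure_ascii=False), encoding="utf-8"
        )
        written.append(json_path)
    return written
```

The aggregated JSON file holds the whole list of tables, while each CSV holds only the cell grid of a single table.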
[
{
"table_index": 0,
"row_count": 5,
"column_count": 4,
"cells": [["A1", "B1", ...], ["A2", ...]]
}
]

- For semantic elements (headings, paragraphs) or styles, still use prebuilt-layout.
- For invoices, receipts, IDs: switch to a specific prebuilt model (e.g. prebuilt-invoice).
- Large PDFs: consider paging or splitting before processing to reduce latency.
- Handle rate limiting: catch HttpResponseError with status 429 and back off.
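The rate-limiting note can be sketched as a small retry wrapper. This is a minimal illustration, not the script's own code: `call_with_backoff` and its parameters are hypothetical. To stay self-contained it checks the `status_code` attribute (which `azure.core.exceptions.HttpResponseError` exposes) instead of importing the Azure exception class. Any Retry-After value sent by the service is not consulted.

```python
import random
import time


def call_with_backoff(operation, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run operation(), retrying with exponential backoff when it fails with HTTP 429."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            # Only throttling errors are retried; anything else, or the final
            # failed attempt, propagates to the caller unchanged.
            if getattr(exc, "status_code", None) != 429 or attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)  # 1s, 2s, 4s, ...
            # Jitter spreads out retries from clients that were throttled together.
            sleep(delay + random.uniform(0, delay / 2))
```

In practice `operation` would be a zero-argument callable that wraps the analyze request, for example a lambda that starts the analysis and waits for its result.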
- Add unit tests (mock client) for table normalization
- Add option to merge multi-page tables (requires analyzing cell bounding regions & spans)
- Integrate into a larger pipeline (e.g., Azure Functions or batch job)
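For the unit-test item, the table normalization step might look roughly like the sketch below, which can be exercised with a mock object instead of a live client. `normalize_table` is a hypothetical name. The attributes `row_count`, `column_count`, `cells`, `row_index`, `column_index`, and `content` are assumed to match what the SDK's table result exposes. Merged cells and multi-page tables are not handled.

```python
from types import SimpleNamespace


def normalize_table(table, table_index):
    """Turn a table's flat cell list into a row-major grid of strings."""
    grid = [["" for _ in range(table.column_count)] for _ in range(table.row_count)]
    for cell in table.cells:
        # A spanning cell is written only at its top-left position; the other
        # positions it covers stay empty in this sketch.
        grid[cell.row_index][cell.column_index] = (cell.content or "").strip()
    return {
        "table_index": table_index,
        "row_count": table.row_count,
        "column_count": table.column_count,
        "cells": grid,
    }


# A mock table standing in for a real analysis result.
fake_table = SimpleNamespace(
    row_count=2,
    column_count=2,
    cells=[
        SimpleNamespace(row_index=0, column_index=0, content="A1"),
        SimpleNamespace(row_index=0, column_index=1, content="B1"),
        SimpleNamespace(row_index=1, column_index=0, content=" A2 "),
    ],
)
result = normalize_table(fake_table, 0)
```

Because the function only relies on attribute access, a test can build inputs from `SimpleNamespace` objects and assert on the returned dict directly.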
| Issue | Cause | Fix |
|---|---|---|
| Auth error 401 | Wrong key | Regenerate keys in portal/CLI |
| Empty tables | Document has no detectable grid | Verify visually / adjust source PDF |
| Mixed languages | Layout model returns raw text only | Use OCR language options if/when exposed |
Internal / sample usage. Adapt as needed.