Azure Document Intelligence Table Extraction (Python)

Extract tables from PDF documents using the prebuilt-layout model of Azure Document Intelligence (formerly Form Recognizer).

Features

Uses prebuilt-layout model to detect tables
Exports tables to CSV and/or JSON
Pretty console rendering of tables
Environment-based configuration

Prerequisites

Azure subscription
Azure Document Intelligence resource (Cognitive Services / Document Intelligence) with an endpoint and a key
Python 3.10+

Provision (Azure CLI)

If you have not created a resource yet:

az group create -n my-docint-rg -l eastus
az cognitiveservices account create `
  -n my-docint-resource `
  -g my-docint-rg `
  -l eastus `
  --kind FormRecognizer `
  --sku S0 `
  --yes
az cognitiveservices account keys list -n my-docint-resource -g my-docint-rg

Get the endpoint with:

az cognitiveservices account show -n my-docint-resource -g my-docint-rg --query properties.endpoint -o tsv

Setup

Clone or copy this folder, then (PowerShell):

python -m venv .venv
. .venv/Scripts/Activate.ps1
pip install -r requirements.txt

Create .env from template:

Copy-Item .env.example .env

Fill in values:

AZURE_DOCUMENTINTELLIGENCE_ENDPOINT=https://<your-resource>.cognitiveservices.azure.com/
AZURE_DOCUMENTINTELLIGENCE_KEY=<key>
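A minimal sketch of how a script could read these two settings, assuming the values from .env have already been loaded into the process environment (for example by a dotenv loader; whether this project uses one is not stated here). The names DocIntelConfig and load_config are hypothetical and are not taken from src/di_extract_tables.py.

```python
import os
from typing import Mapping, NamedTuple

ENDPOINT_VAR = "AZURE_DOCUMENTINTELLIGENCE_ENDPOINT"
KEY_VAR = "AZURE_DOCUMENTINTELLIGENCE_KEY"


class DocIntelConfig(NamedTuple):
    """Hypothetical container for the two settings from .env."""
    endpoint: str
    key: str


def load_config(env: Mapping[str, str] = os.environ) -> DocIntelConfig:
    """Read endpoint and key from an environment-like mapping and fail early if unset."""
    endpoint = env.get(ENDPOINT_VAR, "").strip()
    key = env.get(KEY_VAR, "").strip()
    # Collect every missing name so the error message lists all of them at once.
    missing = [name for name, value in ((ENDPOINT_VAR, endpoint), (KEY_VAR, key)) if not value]
    if missing:
        raise RuntimeError("Missing settings: " + ", ".join(missing))
    if not endpoint.startswith("https://"):
        raise RuntimeError(f"{ENDPOINT_VAR} should be an https:// URL")
    return DocIntelConfig(endpoint=endpoint, key=key)
```

Failing before any network call makes a forgotten or misspelled variable easier to spot than a later authentication error.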

Usage

python .\src\di_extract_tables.py .\sample.pdf --out output --format both

Options:

--format csv|json|both (default both)
--no-pretty skip console table rendering
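The command line above could be parsed roughly as follows. This is a sketch using argparse, not the actual parser in src/di_extract_tables.py: build_parser is a hypothetical name, and the default of "output" for --out is an assumption, since only the --format default is documented here.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented options (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Extract tables from a PDF.")
    parser.add_argument("pdf", type=Path, help="path to the input PDF")
    # Assumption: the default output directory is not documented; "output" is a guess.
    parser.add_argument("--out", type=Path, default=Path("output"), help="output directory")
    parser.add_argument("--format", choices=["csv", "json", "both"], default="both",
                        help="export format (default: both)")
    parser.add_argument("--no-pretty", action="store_true",
                        help="skip console table rendering")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.pdf, args.out, args.format, args.no_pretty)
```

Using choices for --format makes argparse reject any value other than csv, json, or both before the document is sent for analysis.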

Outputs:

output/<pdfstem>_table{n}.csv for each table
output/<pdfstem>_tables.json aggregated structure (if json enabled)
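The naming scheme above can be sketched as a small helper. output_paths is a hypothetical function, and whether {n} starts at 0 or 1 is not stated here, so the starting index is a parameter; the default of 0 is an assumption chosen to match table_index in the JSON output.

```python
from pathlib import Path
from typing import List


def output_paths(pdf_path: str, out_dir: str, table_count: int,
                 fmt: str = "both", first_index: int = 0) -> List[Path]:
    """List the files one run would write for a PDF with table_count tables."""
    stem = Path(pdf_path).stem  # "docs/sample.pdf" gives "sample"
    out = Path(out_dir)
    paths: List[Path] = []
    if fmt in ("csv", "both"):
        # One CSV per table; first_index is an assumption (see the note above).
        paths += [out / f"{stem}_table{n}.csv"
                  for n in range(first_index, first_index + table_count)]
    if fmt in ("json", "both"):
        # A single aggregated JSON file for all tables.
        paths.append(out / f"{stem}_tables.json")
    return paths
```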

JSON Structure

[
  {
    "table_index": 0,
    "row_count": 5,
    "column_count": 4,
    "cells": [["A1", "B1", ...], ["A2", ...]]
  }
]
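The service reports each table as a flat list of cells, each carrying a row index, a column index, and its text, so a record like the one above has to be assembled into a row-major grid. The following is a sketch of that normalization, not the project's actual code: Cell is a hypothetical stand-in that mirrors the row_index, column_index, and content attributes of the SDK's table cells, and table_to_record and record_to_csv are illustrative names.

```python
import csv
import io
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Cell:
    """Hypothetical stand-in for an SDK table cell."""
    row_index: int
    column_index: int
    content: str


def table_to_record(table_index: int, row_count: int, column_count: int,
                    cells: Iterable[Cell]) -> Dict[str, object]:
    """Place each cell's text into a row_count x column_count grid."""
    grid: List[List[str]] = [["" for _ in range(column_count)] for _ in range(row_count)]
    for cell in cells:
        # Positions with no cell stay empty; a merged cell fills only its top-left slot.
        grid[cell.row_index][cell.column_index] = cell.content
    return {
        "table_index": table_index,
        "row_count": row_count,
        "column_count": column_count,
        "cells": grid,
    }


def record_to_csv(record: Dict[str, object]) -> str:
    """Render the grid of one record as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(record["cells"])
    return buffer.getvalue()
```

A list of such records can be passed to json.dump to produce the aggregated file, while record_to_csv gives the per-table CSV content.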

Notes / Best Practices

For semantic elements (headings, paragraphs) or styles, prebuilt-layout is still the right model; no switch is needed.
For invoices, receipts, IDs: switch to the matching prebuilt model (e.g. prebuilt-invoice).
Large PDFs: consider paging or splitting before processing to reduce latency.
Handle rate limiting: catch HttpResponseError with status 429 and back off before retrying.
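The rate-limiting advice can be sketched as a retry wrapper with exponential backoff and jitter. To stay self-contained, this version inspects a status_code attribute on the raised exception instead of importing the Azure SDK; azure.core's HttpResponseError exposes that attribute, and real code would catch that class specifically. call_with_backoff and its parameters are hypothetical names.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(operation: Callable[[], T], max_attempts: int = 5,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Run operation, retrying with exponential backoff when it fails with status 429."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as exc:  # in real code: except HttpResponseError
            throttled = getattr(exc, "status_code", None) == 429
            # Re-raise anything that is not throttling, and give up after the last attempt.
            if not throttled or attempt == max_attempts - 1:
                raise
            # Delay doubles per attempt (1s, 2s, 4s, ...) plus jitter to spread out retries.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    raise ValueError("max_attempts must be at least 1")
```

The analysis call would be wrapped in a function or lambda and passed as operation; the sleep parameter exists so the delay can be replaced in tests.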

Next Steps

Add unit tests (mock client) for table normalization
Add option to merge multi-page tables (requires analyzing cell bounding regions & spans)
Integrate into a larger pipeline (e.g., Azure Functions or batch job)

Troubleshooting

Issue	Cause	Fix
Auth error 401	Wrong key	Regenerate keys in portal/CLI
Empty tables	Document has no detectable grid	Verify visually / adjust source PDF
Mixed languages	Layout model returns raw text only	Use OCR language options if/when exposed

License

Internal / sample usage. Adapt as needed.
