Automatically transform text-based PDFs into PDF/UA-1 (ISO 14289-1) compliant, accessible documents that pass PAC 2024 and veraPDF validation out of the box. Scanned and encrypted PDFs need preparation first (see the limitations listed further down).
Organizations are legally required to make their digital documents accessible (ADA, Section 508, EN 301 549, EAA 2025). Manually remediating PDFs for accessibility is expensive, slow, and error-prone - a single document can take hours of specialist work.
VAPT is an automated pipeline that takes ordinary PDFs and applies a four-stage remediation process to produce fully accessible PDF/UA-1 documents. It handles structure tagging, metadata injection, font embedding, link annotation wiring, and validation - with both a CLI and a Streamlit web UI.
| Accessibility Issue | How It's Handled |
|---|---|
| Missing structure tags | Injects /Document, /P, /H1-/H6, /Figure, /Table, /TR, /TD, /L, /LI |
| Untagged images | Adds /Figure elements with /Alt text, /BBox, and /Placement layout attributes |
| Missing XMP metadata | Writes dc:title, dc:language, pdfuaid:part=1, pdf:Producer |
| Unembedded fonts | Locates system TTF files, subsets to used glyphs, embeds without altering existing metrics |
| Missing MarkInfo | Sets /Marked true and /Suspects false |
| Tab order & ViewerPrefs | Sets /Tabs /S on all pages and /DisplayDocTitle true |
| Broken link annotations | Wires /Link structure elements with both MCR (text content ref) and OBJR (annotation ref) |
| Watermarks / headers / footers | Detected and tagged as /Artifact so screen readers skip them |
| Missing ToUnicode CMap | Generates Windows-1252 CMap for proper text extraction |
| CIDFont issues | Adds /CIDToGIDMap /Identity, removes invalid CIDSet streams |
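As an illustration of the ToUnicode row above, here is a minimal sketch of how a Windows-1252 ToUnicode CMap could be generated with the standard library only. The function names are hypothetical and this is not taken from the project's code; it assumes a simple single-byte font whose codes follow Windows-1252.

```python
def cp1252_unicode_map() -> dict:
    """Map each single-byte code to its Unicode code point under Windows-1252."""
    mapping = {}
    for code in range(256):
        try:
            char = bytes([code]).decode("cp1252")
        except UnicodeDecodeError:
            # 0x81, 0x8D, 0x8F, 0x90 and 0x9D are undefined in Python's cp1252 codec.
            continue
        mapping[code] = ord(char)
    return mapping


def build_tounicode_cmap(mapping: dict) -> str:
    """Serialize a code -> Unicode mapping as the text of a ToUnicode CMap stream."""
    entries = [f"<{code:02X}> <{uni:04X}>" for code, uni in sorted(mapping.items())]
    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
        "/CMapName /Adobe-Identity-UCS def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        "<00> <FF>",
        "endcodespacerange",
    ]
    # A single bfchar block may hold at most 100 entries, so emit them in chunks.
    for start in range(0, len(entries), 100):
        chunk = entries[start:start + 100]
        lines.append(f"{len(chunk)} beginbfchar")
        lines.extend(chunk)
        lines.append("endbfchar")
    lines += ["endcmap", "CMapName currentdict /CMap defineresource pop", "end", "end"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_tounicode_cmap(cp1252_unicode_map()))
```

The resulting text would still have to be stored as a stream object and referenced from the font's /ToUnicode entry, which this sketch does not cover.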
     Input PDF
         |
         v
+--------------------+
| pdf_extractor.py   |  Stage 1 - Content Extraction & Classification
|                    |  Parses text, fonts, images, bounding boxes
|                    |  Detects headings, lists, tables, watermarks, headers/footers
+--------+-----------+
         |  DocumentContent (dataclass graph)
         v
+--------------------+
| pdf_tagger.py      |  Stage 2 - Structure Tag Injection
|                    |  Rewrites content streams with BDC/EMC markers
|                    |  Builds StructTreeRoot, ParentTree, RoleMap
|                    |  Handles link annotations (MCR + OBJR)
+--------+-----------+
         |  Tagged PDF
         v
+--------------------+
| pdf_postprocess.py |  Stage 3 - Metadata & Font Post-Processing
|                    |  XMP metadata, MarkInfo, ViewerPreferences
|                    |  Font embedding with metric preservation
|                    |  CIDFont fixes, annotation tagging, cleanup
+--------+-----------+
         |  Remediated PDF
         v
+--------------------+
| validator.py       |  Stage 4 - veraPDF Validation (optional)
|                    |  Runs veraPDF CLI with PDF/UA-1 profile
|                    |  Parses JSON report, surfaces failing clauses
+--------------------+
         |
         v
Output PDF --> <original>_accessible.pdf
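The classification step in Stage 1 can be illustrated with a minimal sketch that applies the thresholds from config.py. The `Block` class, the function names, the top-down y coordinates, and the rule that all three watermark conditions must hold together are assumptions made for illustration; the project's actual logic may differ.

```python
from dataclasses import dataclass
from typing import Optional

# Values copied from config.py.
HEADING_SIZE_RATIO_H1 = 1.8
HEADING_SIZE_RATIO_H2 = 1.5
HEADING_SIZE_RATIO_H3 = 1.25
HEADING_SIZE_RATIO_H4 = 1.1
HEADER_ZONE_FRACTION = 0.08
FOOTER_ZONE_FRACTION = 0.08
WATERMARK_MIN_ROTATION = 15.0
WATERMARK_MAX_ROTATION = 75.0
WATERMARK_MIN_FONT_SIZE = 36.0
WATERMARK_LIGHT_COLOR_THRESHOLD = 0.7


@dataclass
class Block:
    """Hypothetical stand-in for the project's TextBlock dataclass."""
    font_size: float       # points
    y_top: float           # distance of the top edge from the top of the page (assumed)
    y_bottom: float        # distance of the bottom edge from the top of the page (assumed)
    bold: bool = False
    rotation: float = 0.0  # degrees
    gray: float = 0.0      # 0 = black, 1 = white


def heading_level(block: Block, body_size: float) -> Optional[int]:
    """Return 1-4 if the font-size ratio qualifies as a heading, else None."""
    ratio = block.font_size / body_size
    if ratio >= HEADING_SIZE_RATIO_H1:
        return 1
    if ratio >= HEADING_SIZE_RATIO_H2:
        return 2
    if ratio >= HEADING_SIZE_RATIO_H3:
        return 3
    if ratio >= HEADING_SIZE_RATIO_H4 and block.bold:
        return 4  # H4 must also be bold
    return None


def is_watermark(block: Block) -> bool:
    """Assumed rule: rotated, large, and light-colored text is a watermark."""
    return (
        WATERMARK_MIN_ROTATION <= abs(block.rotation) <= WATERMARK_MAX_ROTATION
        and block.font_size >= WATERMARK_MIN_FONT_SIZE
        and block.gray >= WATERMARK_LIGHT_COLOR_THRESHOLD
    )


def in_margin_zone(block: Block, page_height: float) -> bool:
    """True if the block lies entirely in the top or bottom 8% of the page."""
    header_limit = page_height * HEADER_ZONE_FRACTION
    footer_limit = page_height * (1 - FOOTER_ZONE_FRACTION)
    return block.y_bottom <= header_limit or block.y_top >= footer_limit


def classify(block: Block, body_size: float, page_height: float) -> str:
    """Return 'Artifact', 'H1'..'H4', or 'P' for a single text block."""
    if is_watermark(block) or in_margin_zone(block, page_height):
        return "Artifact"
    level = heading_level(block, body_size)
    return f"H{level}" if level else "P"
```

For example, with 12 pt body text on a 792 pt page, `classify(Block(font_size=22, y_top=100, y_bottom=125), 12, 792)` gives `"H1"`, while a 10 pt page number at `y_top=760` is classified as `"Artifact"`.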
| Dependency | Version | Required | Purpose |
|---|---|---|---|
| Python | >= 3.10 | Yes | Runtime |
| pikepdf | >= 8 | Yes | Low-level PDF manipulation |
| pdfminer.six | >= 20221105 | Yes | Text extraction with font metrics |
| Pillow | >= 10 | Yes | Image processing |
| fonttools | >= 4.0 | Yes | TTF subsetting for font embedding |
| langdetect | >= 1.0.9 | Yes | Document language detection |
| streamlit | >= 1.30 | For Web UI | Web interface |
| veraPDF | >= 1.24 | For validation | PDF/UA-1 validation (Java) |
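A quick way to see which of these dependencies are already available is sketched below. It only checks that each package is importable and that a `verapdf` executable is on the PATH; it does not verify the minimum versions from the table. The function name and the assumption that the veraPDF launcher is called `verapdf` are illustrative.

```python
import importlib.util
import shutil
import sys

# PyPI package name -> importable module name.
REQUIRED_MODULES = {
    "pikepdf": "pikepdf",
    "pdfminer.six": "pdfminer",
    "Pillow": "PIL",
    "fonttools": "fontTools",
    "langdetect": "langdetect",
}
OPTIONAL_MODULES = {"streamlit": "streamlit"}


def check_environment() -> dict:
    """Report which dependencies from the table are present (not their versions)."""
    report = {"Python >= 3.10": sys.version_info >= (3, 10)}
    for package, module in {**REQUIRED_MODULES, **OPTIONAL_MODULES}.items():
        report[package] = importlib.util.find_spec(module) is not None
    # Assumes the veraPDF launcher is installed under the name "verapdf".
    report["veraPDF"] = shutil.which("verapdf") is not None
    return report


if __name__ == "__main__":
    for name, present in check_environment().items():
        print(f"{'OK     ' if present else 'MISSING'}  {name}")
```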
# Clone the repo
git clone https://github.com/Mohakgarg5/PdfRemediationTool.git
cd PdfRemediationTool
# Create virtual environment
python3 -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
Optional - Install veraPDF (for validation stage):
# macOS
brew install verapdf
# Linux - download from https://verapdf.org/software/
java -jar verapdf-installer.jar
# Process all PDFs in input/ directory
python main.py
# Process a single file
python main.py --input report.pdf
# Custom directories
python main.py --input-dir docs/ --output-dir accessible_docs/
# Skip veraPDF validation (no Java needed)
python main.py --skip-validation
# Debug logging
python main.py --verbose
Output files are written as <original_name>_accessible.pdf in the output directory.
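The output naming rule can be sketched with `pathlib`. The helper name is hypothetical and only illustrates the `_accessible.pdf` suffix convention described here.

```python
from pathlib import Path


def accessible_output_path(source: Path, output_dir: Path) -> Path:
    """Derive <original_name>_accessible.pdf inside the output directory."""
    return output_dir / f"{source.stem}_accessible.pdf"


if __name__ == "__main__":
    print(accessible_output_path(Path("docs/annual_report.pdf"), Path("accessible_docs")))
```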
streamlit run app.py
Open http://localhost:8501 - drag and drop PDFs, process, and download remediated files.
chmod +x run.sh
./run.sh # Process all PDFs in input/
./run.sh --input file.pdf # Single file
./run.sh --skip-validation # Skip veraPDF
vapt/
|-- main.py # CLI entry point & pipeline orchestrator
|-- app.py # Streamlit web UI
|-- pdf_extractor.py # Stage 1: content extraction & classification (839 lines)
|-- pdf_tagger.py # Stage 2: structure tag injection (1,096 lines)
|-- pdf_postprocess.py # Stage 3: metadata, fonts, annotations (1,002 lines)
|-- validator.py # Stage 4: veraPDF integration (201 lines)
|-- models.py # Shared dataclasses (DocumentContent, TextBlock, BBox, etc.)
|-- config.py # Tunable constants (heading ratios, zone sizes, etc.)
|-- requirements.txt # Python dependencies
|-- packages.txt # System packages for Streamlit Cloud (fonts-liberation)
|-- run.sh # CLI convenience wrapper
|-- input/ # Drop source PDFs here (CLI mode)
|-- output/ # Remediated PDFs appear here
|-- report/ # veraPDF validation reports
All tunable constants live in config.py:
# Heading detection - ratio of font size to body text
HEADING_SIZE_RATIO_H1 = 1.8 # >= 1.8x body = H1
HEADING_SIZE_RATIO_H2 = 1.5 # >= 1.5x body = H2
HEADING_SIZE_RATIO_H3 = 1.25 # >= 1.25x body = H3
HEADING_SIZE_RATIO_H4 = 1.1 # >= 1.1x body = H4 (must also be bold)
# Header/footer zones (fraction of page height)
HEADER_ZONE_FRACTION = 0.08 # Top 8%
FOOTER_ZONE_FRACTION = 0.08 # Bottom 8%
# Watermark detection
WATERMARK_MIN_ROTATION = 15.0 # degrees
WATERMARK_MAX_ROTATION = 75.0 # degrees
WATERMARK_MIN_FONT_SIZE = 36.0 # points
WATERMARK_LIGHT_COLOR_THRESHOLD = 0.7 # 0=black, 1=white
# veraPDF profile
VERAPDF_PROFILE = "ua1"
Target: PDF/UA-1 (ISO 14289-1)
Validated by:
- PAC 2024 (PDF Accessibility Checker) - Swiss PDF Association
- veraPDF - Industry-standard open-source validator
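For readers who want to run veraPDF on a remediated file by hand, the sketch below shows one way to call it from Python. It assumes the launcher is named `verapdf` and that the installed version accepts `--flavour ua1` and `--format json`; check `verapdf --help` for your version. The function names are hypothetical and this is not the code in validator.py.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Optional


def build_verapdf_command(pdf_path: Path, profile: str = "ua1") -> list:
    """Build the argument list for a PDF/UA-1 check with a JSON report (assumed flags)."""
    return ["verapdf", "--flavour", profile, "--format", "json", str(pdf_path)]


def run_verapdf(pdf_path: Path, profile: str = "ua1") -> Optional[str]:
    """Return the raw report text, or None if veraPDF is not installed."""
    if shutil.which("verapdf") is None:
        return None
    # check=False: a non-zero exit code may simply mean the file is not compliant.
    result = subprocess.run(
        build_verapdf_command(pdf_path, profile),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


if __name__ == "__main__":
    report = run_verapdf(Path("output/annual_report_accessible.pdf"))
    print(report if report is not None else "veraPDF not found on PATH")
```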
Matterhorn Protocol checkpoints addressed:
| Clause | Requirement |
|---|---|
| 01-004 | Tagged PDF flag (MarkInfo /Marked true) |
| 01-006 | Link elements contain both MCR and OBJR |
| 06-001 | Document language specified (/Lang) |
| 07-001 | Natural language in metadata |
| 07-010 | OCProperties properly configured |
| 07-18.1 | Widget annotations tagged as /Form |
| 07-18.5 | Link annotations tagged as /Link |
| 07-21.3.2 | CIDFontType2 has CIDToGIDMap |
| 07-21.4.2 | CIDSet streams valid or removed |
| 09-004 | /Figure elements have /Alt text |
| 14-002 | Artifacts correctly marked |
| 28-002 | XMP metadata includes pdfuaid:part = 1 |
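The XMP row above (pdfuaid:part = 1) can be made concrete with a minimal sketch that builds an XMP packet carrying the title, language, producer, and PDF/UA identifier, then reads the identifier back. The namespace URIs are the standard ones; the function names and the exact packet layout are assumptions for illustration, not the project's implementation.

```python
import xml.etree.ElementTree as ET
from typing import Optional
from xml.sax.saxutils import escape

XMP_TEMPLATE = """<?xpacket begin="\ufeff" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
        xmlns:pdfuaid="http://www.aiim.org/pdfua/ns/id/">
      <dc:title><rdf:Alt><rdf:li xml:lang="x-default">{title}</rdf:li></rdf:Alt></dc:title>
      <dc:language><rdf:Bag><rdf:li>{language}</rdf:li></rdf:Bag></dc:language>
      <pdf:Producer>{producer}</pdf:Producer>
      <pdfuaid:part>1</pdfuaid:part>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>"""

NAMESPACES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "pdfuaid": "http://www.aiim.org/pdfua/ns/id/",
}


def build_xmp_packet(title: str, language: str, producer: str) -> str:
    """Fill the template, escaping XML special characters in the values."""
    return XMP_TEMPLATE.format(
        title=escape(title), language=escape(language), producer=escape(producer)
    )


def read_pdfua_part(xmp: str) -> Optional[str]:
    """Return the pdfuaid:part value from an XMP packet, or None if absent."""
    node = ET.fromstring(xmp).find(".//pdfuaid:part", NAMESPACES)
    return node.text if node is not None else None


if __name__ == "__main__":
    packet = build_xmp_packet("Annual Report 2024", "en", "VAPT")
    print(read_pdfua_part(packet))
```

Writing the packet into the PDF's /Metadata stream is a separate step that this sketch leaves out.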
- Scanned PDFs (image-only) are not supported - the pipeline requires selectable text. Run OCR first (e.g., ocrmypdf).
- Right-to-left scripts (Arabic, Hebrew) are detected and language-tagged, but reading-order reversal is not applied.
- Complex multi-column layouts may produce suboptimal reading order; manual review is recommended.
- Encrypted PDFs must be decrypted before processing.
- PostScript Type1 fonts cannot be embedded - only TTF/OTF embedding is supported.
- veraPDF requires Java 11+ and must be installed separately.
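The Java requirement in the last item can be checked before running the validation stage. The sketch below parses the output of `java -version`, including the legacy `1.x` numbering used by Java 8 and earlier. The function names are hypothetical and the check is not part of the pipeline as described.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_java_major(version_output: str) -> Optional[int]:
    """Extract the major Java version from `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if not match:
        return None
    major = int(match.group(1))
    # Java 8 and earlier report themselves as 1.x, e.g. "1.8.0_292".
    if major == 1 and match.group(2):
        return int(match.group(2))
    return major


def java_is_new_enough(minimum: int = 11) -> bool:
    """True if a `java` executable is on the PATH and reports at least `minimum`."""
    if shutil.which("java") is None:
        return False
    result = subprocess.run(
        ["java", "-version"], capture_output=True, text=True, check=False
    )
    # `java -version` normally writes to stderr, so prefer that stream.
    major = parse_java_major(result.stderr or result.stdout)
    return major is not None and major >= minimum


if __name__ == "__main__":
    print("Java 11+ available:", java_is_new_enough())
```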
PDF Accessibility Remediation Pipeline
==================================================
Processing 1 file(s)
[1/1] Processing: annual_report.pdf
[1/4] Extracting content from annual_report.pdf...
Found 142 text blocks, 8 images across 12 pages
Language: en, Title: Annual Report 2024
Structure: 18 headings, 3 tables, 21 list items, 4 artifacts
[2/4] Adding structure tags to original PDF...
Tagged: headings, paragraphs, images, artifacts
[3/4] Post-processing metadata with pikepdf...
Metadata fixed: MarkInfo, Lang, ViewerPreferences, TabOrder, XMP
[4/4] Validating with veraPDF...
Validation: PASS (94 passed, 0 failed)
============================
SUMMARY
PASS annual_report.pdf (3.2s)
Total: 1 | Processed: 1 | Compliant: 1 | Failed: 0
MIT