PDF Extract - Advanced PDF Content Extraction Library

A Rust library for extracting structured content from PDF files with precise positioning data and intelligent text processing for RAG applications.

Features

  • Text Extraction with Layout Analysis - Extracts text with precise positioning, font information, and layout awareness
  • Form XObject Support - Handles text embedded in PDF Form XObjects (common in legal documents)
  • Geometric Heading Detection - Uses visual/geometric features instead of just font properties
  • Smart Line Joining - Joins continuation lines while preserving document structure
  • Token-Aware Chunking - Splits content respecting sentence/paragraph boundaries
  • Location Tracking - Maintains page numbers, bounding boxes, and character ranges for highlighting
  • Header/Footer Filtering - Automatically detects and removes headers and footers that repeat across pages
  • OCR Integration - Built-in support for scanned documents

Installation

[dependencies]
pdf-extract = "0.7.7"

Quick Start

Basic Extraction

use pdf_extract::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let results = parse_pdf(
        "document.pdf",
        1,              // source_id
        "file",         // source_type
        None,           // OCR config
        None,           // OCR cache
        None,           // resume from
        Some(500),      // max tokens per chunk
        None            // LAParams (use default)
    )?;

    for result in results {
        println!("Content: {}", result.content_core.content);
        println!("Tokens: {}", result.content_core.token_count);
    }

    Ok(())
}

With Layout Analysis

Layout analysis enables better text extraction for complex documents:

use pdf_extract::*;

// Enable layout analysis with Form XObject support
let mut laparams = LAParams::default();
laparams.all_texts = true;  // Include text from Form XObjects

let results = parse_pdf(
    "document.pdf",
    1,
    "file",
    None,
    None,
    None,
    Some(500),
    Some(laparams)
)?;

When to use layout analysis:

  • Legal documents (text often in Form XObjects)
  • Multi-column layouts
  • Complex document structures
  • When you need precise line grouping

Architecture

Text Extraction Pipeline

PDF File
  ↓
Layout Analysis (lib.rs process_stream)
  - Glyph collection from content streams
  - Form XObject processing (if all_texts=true)
  - Line grouping by Y-coordinate proximity
  ↓
Segment Processing (document/processing.rs)
  - Line joining (within Form XObjects)
  - Segment merging (across Form XObjects)
  - Title block entity merging
  ↓
Heading Detection (document/analysis.rs)
  - Geometric features (height, width ratios)
  - ALL CAPS detection
  - Standalone line detection
  ↓
Chunking (chunk_accumulator.rs)
  - Token-limited chunks
  - Sentence/paragraph boundary awareness
  - Location metadata tracking

Key Components

Layout Analysis (lib.rs)

  • Extracts glyphs from PDF content streams
  • Groups glyphs into visual lines
  • Processes Form XObjects when LAParams.all_texts = true
  • Joins lines within XObjects based on terminal punctuation

Segment Processing (document/processing.rs)

  • merge_continuation_segments() - Merges segments on same visual line (Y-proximity)
  • merge_title_block_entities() - Joins consecutive short ALL CAPS lines (party names, etc.)
  • Filters headers/footers based on repetition patterns
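The Y-proximity merge can be sketched as follows. This is a minimal illustration of the idea behind merge_continuation_segments(), not the crate's real implementation: the Segment type, its fields, and the tolerance value are assumptions for the example.

```rust
// Illustrative sketch: merge segments whose baselines are within `tol`
// of each other, treating them as parts of one visual line.
// `Segment` and its fields are assumptions, not the library's types.
struct Segment {
    text: String,
    y: f32, // baseline Y-coordinate of the segment
}

fn merge_continuation_segments(segs: Vec<Segment>, tol: f32) -> Vec<Segment> {
    let mut out: Vec<Segment> = Vec::new();
    for seg in segs {
        match out.last_mut() {
            // Same visual line: append text to the previous segment.
            Some(prev) if (prev.y - seg.y).abs() <= tol => {
                prev.text.push(' ');
                prev.text.push_str(&seg.text);
            }
            // New visual line: start a new segment.
            _ => out.push(seg),
        }
    }
    out
}

fn main() {
    let segs = vec![
        Segment { text: "IN THE SUPREME COURT".into(), y: 700.0 },
        Segment { text: "OF INDIA".into(), y: 700.2 },
        Segment { text: "Civil Appeal No. ...".into(), y: 682.0 },
    ];
    for s in merge_continuation_segments(segs, 1.0) {
        println!("{}", s.text);
    }
}
```

The tolerance in the library is derived from the body line height (see Known Limitations); here it is passed in explicitly to keep the sketch self-contained.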

Heading Detection (document/analysis.rs)

  • classify_line() - Uses geometric features:
    • Height ratio vs body text
    • Width ratio (short lines)
    • ALL CAPS detection
    • Standalone detection (next line at margin)
  • Title block heuristic: 3+ consecutive heading-like lines = metadata block
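A geometric classifier of this kind can be sketched as below. The thresholds (1.15× body height, 0.6 width ratio), the LineInfo type, and the function name are illustrative assumptions; the real classify_line() combines more features, including standalone-line detection.

```rust
// Illustrative geometric heading heuristic; thresholds and types are
// assumptions, not the crate's actual classify_line().
struct LineInfo {
    text: String,
    height: f32,      // glyph height of this line
    width_ratio: f32, // line width / typical text width on the page
}

fn is_heading_like(line: &LineInfo, body_height: f32) -> bool {
    // Height ratio vs body text: noticeably taller glyphs suggest a heading.
    let tall = line.height > body_height * 1.15;
    // Short lines that do not fill the text width are heading candidates.
    let short = line.width_ratio < 0.6;
    // ALL CAPS detection over alphabetic characters only.
    let letters: Vec<char> = line.text.chars().filter(|c| c.is_alphabetic()).collect();
    let caps = !letters.is_empty() && letters.iter().all(|c| c.is_uppercase());
    tall || (short && caps)
}

fn main() {
    let h = LineInfo { text: "STATEMENT OF FACTS".into(), height: 10.0, width_ratio: 0.3 };
    println!("heading-like: {}", is_heading_like(&h, 10.0));
}
```

Relying on geometry rather than font names keeps the heuristic robust for PDFs with embedded or subsetted fonts, where font metadata is often unreliable.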

Chunking (chunk_accumulator.rs)

  • Token-limited accumulation with GPT-4 tokenizer
  • Intelligent boundary detection (sentences, paragraphs)
  • Tracks heading hierarchy per chunk
  • Maintains location metadata (pages, bounding boxes, char ranges)
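The accumulation loop can be sketched as below. This uses a whitespace token count in place of the GPT-4 tokenizer and flushes only at sentence boundaries; the real accumulator also carries heading hierarchy and location metadata, so treat this as a shape sketch, not the library's code.

```rust
// Sketch of token-limited accumulation with sentence-boundary flushing.
// Whitespace splitting stands in for the GPT-4 tokenizer.
fn chunk_sentences(sentences: &[&str], max_tokens: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    let mut count = 0;
    for s in sentences {
        let n = s.split_whitespace().count();
        // Adding this sentence would exceed the budget: flush at the boundary.
        if count + n > max_tokens && !current.is_empty() {
            chunks.push(current.trim().to_string());
            current.clear();
            count = 0;
        }
        current.push_str(s);
        current.push(' ');
        count += n;
    }
    if !current.trim().is_empty() {
        chunks.push(current.trim().to_string());
    }
    chunks
}

fn main() {
    for c in chunk_sentences(&["One two three.", "Four five.", "Six seven eight nine."], 5) {
        println!("chunk: {}", c);
    }
}
```

Flushing at sentence boundaries rather than mid-sentence is what keeps retrieved chunks readable and citable in a RAG pipeline.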

Data Schema

ExtractionResult

pub struct ExtractionResult {
    pub content_core: ContentCore,
    pub content_ext: ContentExt,
}

pub struct ContentCore {
    pub chunk_id: String,           // blake3(content)
    pub source_id: i64,
    pub source_type: String,        // "file" | "web" | "api"
    pub content: String,            // extracted text
    pub token_count: i32,
    pub headings_json: Option<String>,  // heading hierarchy
    pub status: String,
    pub schema_version: i32,
    pub created_at: i64,
}

pub struct ContentExt {
    pub chunk_id: String,
    pub ext_json: Vec<u8>,          // zstd compressed location data
}

Location Tracking

pub enum FormatLocation {
    Pdf(PdfLocation),
    // Other formats...
}

pub struct PdfLocation {
    pub fragments: Vec<PageFragment>,
}

pub struct PageFragment {
    pub page: u32,
    pub char_range: CharRange,      // start, end positions
    pub bbox: BoundingBox,          // x, y, width, height
}
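Given this schema, a highlighter can map a character offset in a chunk back to a page region. The sketch below re-declares the structs locally so it is self-contained; the locate() helper is an illustrative consumer of the schema, not part of the library's API.

```rust
// Self-contained sketch of mapping a chunk character offset to a page
// region for highlighting. Struct layouts mirror the schema above;
// `locate` is a hypothetical helper, not a library function.
struct CharRange { start: usize, end: usize }
struct BoundingBox { x: f32, y: f32, width: f32, height: f32 }
struct PageFragment { page: u32, char_range: CharRange, bbox: BoundingBox }

/// Find the page and bounding box containing a character offset,
/// treating `char_range` as a half-open [start, end) interval.
fn locate(fragments: &[PageFragment], offset: usize) -> Option<(u32, &BoundingBox)> {
    fragments
        .iter()
        .find(|f| f.char_range.start <= offset && offset < f.char_range.end)
        .map(|f| (f.page, &f.bbox))
}

fn main() {
    let frags = vec![
        PageFragment {
            page: 1,
            char_range: CharRange { start: 0, end: 50 },
            bbox: BoundingBox { x: 72.0, y: 700.0, width: 450.0, height: 12.0 },
        },
        PageFragment {
            page: 2,
            char_range: CharRange { start: 50, end: 90 },
            bbox: BoundingBox { x: 72.0, y: 80.0, width: 450.0, height: 12.0 },
        },
    ];
    if let Some((page, bbox)) = locate(&frags, 60) {
        println!("offset 60 is on page {} at x={}, y={}", page, bbox.x, bbox.y);
    }
}
```

In practice the fragments live inside the zstd-compressed ext_json payload, so the application decompresses and deserializes them before lookup.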

Configuration

LAParams (Layout Analysis Parameters)

pub struct LAParams {
    pub char_margin: f32,        // Max horizontal gap for word grouping (default: 2.0)
    pub word_margin: f32,        // Space injection threshold (default: 0.10)
    pub line_overlap: f32,       // Min vertical overlap for same line (default: 0.5)
    pub line_margin: f32,        // Max vertical gap for text box grouping (default: 0.5)
    pub boxes_flow: f32,         // Reading order bias (default: 0.5)
    pub detect_vertical: bool,   // Detect vertical text (default: false)
    pub all_texts: bool,         // Include Form XObject text (default: false)
}

Important: Set all_texts = true for documents with text in Form XObjects (common in legal PDFs).

Examples

Extract with Markdown Formatting

cargo run --release --example extract_markdown input.pdf > output.md

This example:

  • Uses layout analysis with all_texts = true
  • Converts headings to markdown format (##)
  • Joins continuation lines intelligently
  • Preserves paragraph structure

Basic Text Extraction

cargo run --release --example extract input.pdf 500

Arguments:

  • input.pdf - PDF file path
  • 500 - max tokens per chunk

Document Type Considerations

Immutable Documents (Court Cases, Published Papers)

  • Use library's built-in chunking
  • Larger chunks acceptable
  • Simpler storage path (no CDC tracking needed)

Editable Documents (Word docs, collaborative documents)

  • Upstream application handles CDC (Change Data Capture)
  • Fine-grained chunk tracking for citation stability
  • Library provides segments + locations, app re-chunks as needed

Architecture Decision: Document type classification and CDC logic belong in the application layer, not the PDF extraction library. This library focuses on quality extraction + location metadata.

Testing

Evaluation System

# Run evaluation on corpus
cd eval && python eval.py

# Generate reference extractions (Gemini, MarkItDown)
python generate_refs.py

# Compare outputs side-by-side
python compare.py "path/to/file.pdf"

Evaluation corpus includes:

  • Legal documents (Indian court cases)
  • Multi-column layouts
  • Documents with embedded fonts
  • Scanned documents (OCR test cases)

Known Limitations

Heading Detection

  • Some edge cases with signature lines (...J.) detected as headings
  • Aggressive merging may lose some intended line breaks in title blocks
  • Fine-tuning available via geometric thresholds in document/analysis.rs

Layout Analysis

  • Y-tolerance for line grouping: body_line_height * 0.25
  • May need adjustment for documents with unusual line spacing

Form XObjects

  • Must set LAParams.all_texts = true to extract text from Form XObjects
  • This is common in legal documents where text is embedded for layout control

Performance Considerations

  • Layout analysis adds overhead but improves quality for complex documents
  • Token counting uses estimation until 50% of chunk capacity, then switches to exact
  • Header/footer detection requires full document pass
  • OCR (when enabled) is the primary performance bottleneck
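The estimate-then-exact token counting strategy can be sketched as below. The 4-characters-per-token ratio and the injected exact counter are illustrative stand-ins; the library uses a GPT-4 tokenizer for the exact path.

```rust
// Sketch of the estimate-then-exact strategy: a cheap character-based
// estimate while the chunk is well under capacity, switching to an
// exact count near the limit. The chars/4 ratio is an assumption.
fn estimate_tokens(text: &str) -> usize {
    (text.len() + 3) / 4 // roughly 4 characters per token, rounded up
}

fn count_tokens(text: &str, max_tokens: usize, exact: impl Fn(&str) -> usize) -> usize {
    let est = estimate_tokens(text);
    if est < max_tokens / 2 {
        est // fast path: chunk is well under half capacity
    } else {
        exact(text) // near the limit: pay for an exact tokenization
    }
}

fn main() {
    // Whitespace counting stands in for a real tokenizer here.
    let exact = |s: &str| s.split_whitespace().count();
    println!("{}", count_tokens("a short chunk", 500, exact));
}
```

This avoids running the expensive tokenizer on every accumulated sentence while still guaranteeing exact counts where the chunk boundary decision is actually made.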

Contributing

The codebase is organized as:

src/
├── lib.rs                      # Core PDF parsing, layout analysis
├── chunk_accumulator.rs        # Token-aware chunking
├── layout_params.rs            # LAParams configuration
└── document/
    ├── processing.rs           # Segment processing, merging
    ├── analysis.rs             # Heading detection
    ├── stats.rs                # Document statistics, visual lines
    └── header_footer.rs        # Header/footer filtering

License

This project is licensed under the MIT License.
