[Feature] docling-parse adapter for benchmark-harness #367

@yfedoseev

Follow-up from #320 / #361.

Context

The benchmark-harness (PR #361) ships with adapters for three engines:

  • pdf_oxide (in-process, workspace member)
  • pdftotext (subprocess, poppler-utils)
  • pdfium (linked, behind --features pdfium)

#320 originally listed docling-parse as a comparison target (Docling has its own C++ text-extraction engine with superior SF1 scores on supported document classes). The adapter was deferred because docling-parse is distributed as a Python package with a heavier dependency chain.

What to add

  1. src/engine.rs: new arm EngineKind::Docling with a DoclingEngine impl that:

    • Runs python -c "from docling_parse.parse import extract; print(extract('<pdf>'))" (or the correct docling-parse API).
    • Or invokes the docling CLI if that is simpler or lower-overhead.
    • Honours DOCLING_BIN env var for custom paths.
    • Probes the binary at DoclingEngine::new() so a missing install fails fast.
  2. tools/benchmark-harness/README.md: add docling to the engine matrix.

  3. scripts/fetch-fixtures.sh: unchanged — docling reads the same Kreuzberg PDFs.

  4. Optional: a Docker image recipe in tools/benchmark-harness/scripts/docling.dockerfile so CI doesn't need to install docling-parse natively.
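
A minimal sketch of the DOCLING_BIN lookup and the fail-fast probe from item 1, for discussion only. The default binary name "docling", the `--version` probe argument, and the helper names `resolve_bin`, `with_bin` and `bin` are assumptions for illustration. The harness's actual engine trait and the extraction call itself are left out because the correct docling-parse API is still open.

```rust
use std::env;
use std::ffi::{OsStr, OsString};
use std::io;
use std::process::{Command, Stdio};

/// Hypothetical adapter struct; the real engine trait lives in src/engine.rs.
pub struct DoclingEngine {
    bin: OsString,
}

/// Pick the binary: a non-empty DOCLING_BIN value wins,
/// otherwise fall back to "docling" on PATH (assumed default).
pub fn resolve_bin(override_value: Option<OsString>) -> OsString {
    match override_value {
        Some(value) if !value.is_empty() => value,
        _ => OsString::from("docling"),
    }
}

impl DoclingEngine {
    /// Resolve the binary from the environment and probe it immediately.
    pub fn new() -> io::Result<Self> {
        Self::with_bin(resolve_bin(env::var_os("DOCLING_BIN")))
    }

    /// Probe by spawning the binary once. Only a spawn failure (for example
    /// NotFound) is treated as a missing install; the exit status is ignored
    /// because the `--version` argument is an assumption.
    pub fn with_bin(bin: OsString) -> io::Result<Self> {
        Command::new(&bin)
            .arg("--version")
            .stdin(Stdio::null())
            .stdout(Stdio::null())
            .stderr(Stdio::null())
            .status()
            .map_err(|e| {
                io::Error::new(e.kind(), format!("docling probe failed for {:?}: {}", bin, e))
            })?;
        Ok(Self { bin })
    }

    /// The binary this engine will invoke.
    pub fn bin(&self) -> &OsStr {
        &self.bin
    }
}
```

With this shape, `DoclingEngine::new()` returns an `Err` before any fixture is processed when docling is not installed, so a run fails at startup rather than on the first PDF.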

Acceptance criteria

  • benchmark-harness run --engine docling --corpus ... --ground-truth ... --output docling.json produces a valid report.
  • A 3-way comparison table in FINAL_RESULTS.md shows pdf_oxide vs pdftotext vs docling on the 78-fixture Kreuzberg corpus.
  • Documentation calls out docling's expected strength (SF1 on DocLayNet-style documents) so results aren't misread.
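
One possible way to produce the comparison table, sketched under assumptions: `EngineScore` and `render_comparison` are hypothetical names, the report schema written by benchmark-harness is not shown in this issue, and SF1 is assumed to be stored as a fraction between 0 and 1. The delta column is expressed in percentage points against the first row.

```rust
/// Hypothetical per-engine summary; not the harness's real report schema.
pub struct EngineScore {
    pub engine: String,
    /// SF1 as a fraction in 0.0..=1.0 (assumed representation).
    pub sf1: f64,
}

/// Render a Markdown table with each engine's SF1 and its difference
/// from the first row in percentage points (pp).
pub fn render_comparison(rows: &[EngineScore]) -> String {
    let mut out = String::from("| Engine | SF1 | SF1 delta vs first row (pp) |\n|---|---|---|\n");
    let baseline = rows.first().map(|row| row.sf1);
    for row in rows {
        // A fraction difference times 100 gives percentage points.
        let delta_pp = (row.sf1 - baseline.unwrap_or(row.sf1)) * 100.0;
        out.push_str(&format!(
            "| {} | {:.3} | {:+.1} |\n",
            row.engine, row.sf1, delta_pp
        ));
    }
    out
}
```

Putting pdf_oxide in the first row makes a negative delta for docling read directly as "pdf_oxide leads" and a positive one as "docling leads".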

Why this matters

docling is the strongest external TF1/SF1 peer for PDFs with complex structure. If pdf_oxide's SF1 lead over pdftotext (+10.8pp) shrinks or inverts against docling, we know where the next wave of quality investment should go.

Labels: enhancement
