Follow-up from #320 / #361.
Context
The benchmark-harness (PR #361) ships with adapters for three engines:
- pdf_oxide (in-process, workspace member)
- pdftotext (subprocess, poppler-utils)
- pdfium (linked, behind --features pdfium)
#320 originally listed docling-parse as a comparison target (Docling has its own C++ text-extraction engine, with superior SF1 scores on supported document classes). The adapter was deferred because docling-parse is distributed as a Python package with a heavier dependency chain.
What to add
- src/engine.rs: new arm EngineKind::Docling with a DoclingEngine impl that:
  - Runs python -c "from docling_parse.parse import extract; print(extract('<pdf>'))" (or the correct docling-parse API).
  - Or invokes docling's docling CLI if it's simpler / lower-overhead.
  - Honours DOCLING_BIN env var for custom paths.
  - Probes the binary at DoclingEngine::new() so a missing install fails fast.
- tools/benchmark-harness/README.md: add docling to the engine matrix.
- scripts/fetch-fixtures.sh: unchanged; docling reads the same Kreuzberg PDFs.
- Optional: a Docker image recipe in tools/benchmark-harness/scripts/docling.dockerfile so CI doesn't need to install docling-parse natively.
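A minimal sketch of the DOCLING_BIN override and the fail-fast probe, to make the DoclingEngine::new() item concrete. Several things here are assumptions, not part of the existing harness: the struct layout, the helper names resolve_bin and with_bin, the fallback to a docling executable on PATH, and the use of --help as a cheap probe argument. The actual engine trait in src/engine.rs is not shown in this issue, so the extraction call itself is left out.

```rust
use std::ffi::OsString;
use std::io;
use std::path::{Path, PathBuf};
use std::process::{Command, Stdio};

/// Hypothetical adapter state: only the resolved executable path.
#[derive(Debug)]
pub struct DoclingEngine {
    bin: PathBuf,
}

/// Choose the executable. A non-empty override (e.g. the value of
/// DOCLING_BIN) wins; otherwise fall back to `docling` on PATH.
/// The fallback name is an assumption.
pub fn resolve_bin(override_path: Option<OsString>) -> PathBuf {
    match override_path {
        Some(p) if !p.is_empty() => PathBuf::from(p),
        _ => PathBuf::from("docling"),
    }
}

impl DoclingEngine {
    /// Read DOCLING_BIN from the environment, then probe the binary.
    pub fn new() -> io::Result<Self> {
        Self::with_bin(resolve_bin(std::env::var_os("DOCLING_BIN")))
    }

    /// Probe by spawning the binary once. A missing executable makes the
    /// spawn itself fail, so the error surfaces before any PDF is processed.
    /// The exit status is ignored here; `--help` is an assumed probe flag.
    pub fn with_bin(bin: PathBuf) -> io::Result<Self> {
        Command::new(&bin)
            .arg("--help")
            .stdin(Stdio::null())
            .stdout(Stdio::null())
            .stderr(Stdio::null())
            .status()
            .map_err(|e| {
                io::Error::new(
                    e.kind(),
                    format!("docling probe failed for {}: {e}", bin.display()),
                )
            })?;
        Ok(Self { bin })
    }

    /// Path that later extraction calls would spawn.
    pub fn bin(&self) -> &Path {
        &self.bin
    }
}
```

Keeping resolve_bin separate from the environment lookup lets the override logic be unit-tested without mutating process-wide environment variables.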
Acceptance criteria
- benchmark-harness run --engine docling --corpus ... --ground-truth ... --output docling.json produces a valid report.
- A 3-way comparison table in FINAL_RESULTS.md shows pdf_oxide vs pdftotext vs docling on the 78-fixture Kreuzberg corpus.
- Documentation calls out docling's expected strength (SF1 on DocLayNet-style documents) so results aren't misread.
Why this matters
docling is the strongest external TF1/SF1 peer for PDFs with complex structure. If pdf_oxide's SF1 lead over pdftotext (+10.8pp) shrinks or inverts against docling, we know where the next wave of quality investment should go.
Related