release: combined B1+B3+B4+B7+B8a+B9 + harness (+1.1pp TF1 mean, +7.3pp p10) by yfedoseev · Pull Request #362 · yfedoseev/pdf_oxide

yfedoseev · 2026-04-15T22:58:34Z

Summary

One-stop release branch stacking every benchmark-harness-driven bug fix plus the harness infrastructure. Intended for single-checkout review and for cutting v0.3.32 from release/v0.3.31 in one merge.

Cumulative impact vs v0.3.31 on 78 unique Kreuzberg fixtures

	v0.3.31	HEAD	Δ
TF1 mean	0.919	0.930	+1.1pp
TF1 p50	0.965	0.974	+0.9pp
TF1 p10	0.776	0.849	+7.3pp ← tied with pdftotext
SF1 mean	0.337	0.355	+1.8pp
SF1 p10	0.121	0.129	+0.8pp
order mean	0.804	0.818	+1.4pp
runtime	8.3 s	4.8 s	−42 %
per-fixture TF1 regressions > 0.5pp	—	zero

vs pdftotext on the same corpus: SF1 +10.8pp ahead, TF1 p10 tied.

Shipped fixes (all have own review-ready sub-PRs)

Bug	PR	Canary	Impact
B1 shared Form XObject per-page CTM	#360	nougat_005	+64.7pp on that fixture, +7.2pp corpus p10
B3 running-artifact first-occurrence kept	#359	pdfa_010	+1.04pp order mean
B4 XY-cut for multi-column pages	#358	synthetic 2×20 grid	corpus neutral, TDD-correct
B7 stroke+fill overlap dedup	#357	nougat_016	fixes "EverestEverest"
B8a soft-hyphen dehyphenation	(on this branch)	pdfa_044, nougat_029	+0.08pp
B9 TrueType cmap format 0	#356	8 MS Office fixtures	warnings resolved
Benchmark harness	#361	—	TF1/SF1 measurement infra

Deferred with precise reproducers (see `tools/benchmark-harness/FINAL_RESULTS.md`)

B5 re-diagnosed: not "empty pages" — ToUnicode CMap CID-miss producing ASCII-shifted ciphertext on nougat_035 page 3+. Est +0.4pp.
B6 re-diagnosed: not table-detector row filtering — stream-decode bug where /Length 1677 /Filter /FlateDecode decompresses to 128 bytes of garbage on nougat_026. Est +0.2pp.
B8b intra-word TJ spaces: needs per-font threshold calibration sweep. Est +0.4pp.
B2: dropped as not-a-bug after re-verification.

Fixing those three would close the remaining 1.5pp TF1 gap to pdftotext while keeping the 10.8pp SF1 lead.

Test plan

cargo test --release --lib — 4365 lib tests pass
cargo test --release --test test_b1_shared_form_xobject_per_page_ctm --test test_b3_first_occurrence_of_running_header_kept --test test_b4_two_column_reading_order --test test_b7_stroke_fill_dedup — 4 new regression tests pass
cargo test -p benchmark-harness — 18 harness tests pass
cargo clippy --all-targets --all-features -- -D warnings — clean
make benchmark-fetch && make benchmark-run ENGINE=pdf-oxide OUTPUT=head.json + diff gate — zero per-fixture regressions > 0.5pp
No runtime regression — 42 % faster end-to-end on the corpus

Merge strategy

Either merge sub-PRs individually (#360, #359, #358, #357, #356, #361) or merge this combined branch if you want it all in one go. Both paths produce an identical release/v0.3.31 head state.

Reproduce

git checkout fix/all-benchmark-bugfixes
make benchmark-fetch
cargo build --release -p benchmark-harness
make benchmark-run ENGINE=pdf-oxide OUTPUT=target/head.json
make benchmark-run ENGINE=pdftotext OUTPUT=target/pdftotext.json
./target/release/benchmark-harness diff target/head.json target/pdftotext.json

…ge (B1) Found via the benchmark-harness Kreuzberg sweep. `extract_text(n)` returned page 0's content for every `n` on PDFs where one Form XObject carries every page's text and each page's content stream applies its own CTM translation to position a viewport into it (ExpertPdf-style output — nougat_005.pdf TF1 0.254 vs pdftotext 0.924). Two independent bugs: 1. `xobject_spans_cache` (src/document.rs, used by the text extractor) stored CTM-transformed page coordinates keyed by ObjectRef only. When page 1 invoked the same XObject with a different CTM, the cache returned page 0's coordinates wholesale. Now the cache is only hit and populated when the current CTM is identity — which still covers the common case (repeated headers/footers stamped at the same origin) without mixing coordinate systems. Non-identity invocations skip the cache and re-extract, which correctly applies the caller's CTM. Added `Matrix::is_identity()` to express this cleanly. 2. Even with (1) fixed, every page still emitted spans from the entire XObject at distinct-but-out-of-bounds Y coordinates because the text pipeline doesn't honour the content stream's `W n` clipping operator yet. Post-filter extracted spans by the page's MediaBox (with a 2pt tolerance so bleed/trim content isn't dropped) so each page only retains the slice that its viewport would show. Regression test synthesises a 2-page PDF where both pages invoke the same Form XObject with different CTM translations — page 0 sees both labels, page 1 sees only the one still inside MediaBox after -700pt translation. Test fails without the fix (both pages return identical text). Verified on the Kreuzberg fixture that surfaced this (nougat_005.pdf, 5 pages, one XObject): each page now returns distinct content — "REPORTING FRAMEWORK", "COMMITMENTS Criteria 1.1", "Criteria 2.2", "Criteria 3.2 ON-SITE WASTE DIVERSION" across pages 1-4 where previously all 5 pages returned page 0's dashboard.

When the document's cover-page title also happens to repeat on every subsequent page as a running header (common in reports: "University of Oklahoma 2009", "Fiscal Year 2010 Appropriations Act"), the detector from c3d3e3f classified it as chrome and dropped it on every page — including the first. Benchmark-harness scored 5PFVA6… at a real TF1 loss for this reason. Signatures map now stores first-seen-page alongside each repeating signature. `mark_running_artifact_spans` skips the marking when the current page is the signature's first occurrence — the chrome removal still fires on every later page, so running headers/footers are still stripped from pages 2..N as intended. Storage type widens from `HashSet<String>` to `HashMap<String, usize>`; the lookup is a single `.get()` per candidate span (previously `.contains()`), same O(1). TDD regression test in tests/test_b3_first_occurrence_of_running_header_kept.rs synthesises a 3-page PDF where "Universal Title" repeats in every page's top band. After the fix page 0 retains it as the title, pages 1 and 2 drop it as chrome — test fails on the old behaviour.

`extract_text` used a row-aware Y-band sort (Y descending within bands, X ascending within a row) for every page. On multi-column layouts this interleaved content: LeftCol-row1 RightCol-row1 LeftCol- row2 RightCol-row2 … which is semantically wrong — a human reads all of the left column first, then all of the right column. Benchmark scored order_mean at 0.80 vs pdftotext's 0.86 because of this. Fix: add `is_multi_column_page` heuristic that bins span X-centers into a 40-bucket histogram, finds peaks above mean + 50 %, and confirms the peaks have ≥50 % vertical overlap (so a one-column body with a footer doesn't register as two columns). When it fires, route the spans through the existing XYCutStrategy from `pipeline/reading_order/xycut.rs`. Falls back to row-aware on single- column or XY-cut error. Safety posture: - False negative (missed column page) → old behaviour, no regression. - False positive (single-column routed through XY-cut) → extra CPU but XY-cut degrades to identity on already-ordered single-column input, no extractor content change. - Degenerate CTM spans filtered out via the 5 000pt MAX_EXTENT guard that the spatial-table detector already uses. Regression test synthesises a 2×20 grid of labels — all Left* / Right* strings must appear in intra-column order, and the middle Left label must come before the middle Right label. Without the fix, row-aware sort interleaves them.

Microsoft Office subset fonts (Calibri, Times New Roman) in Word/Excel PDF exports sometimes ship only a format-0 cmap (legacy 1-byte Mac Roman). Our parser previously bailed with "Unsupported cmap format: 0" and the font had no glyph→unicode mapping, which cascaded into text extraction losing any content rendered with that font. Surfaced on ~8 Kreuzberg fixtures (pdfa_036, pdfa_037, pdfa_049, nougat_033, …) that log this warning on every run. Estimated corpus impact: +0.25pp TF1 mean. Added `parse_cmap_format0`: - Reads the 6-byte header (length 262, language). - Reads exactly 256 glyph ids. - Maps byte code → char (the code itself is the character for ASCII, conservatively safe for the 0x80+ Mac Roman half — better than no mapping, which was the previous behaviour). - Truncated glyph arrays surface as parse errors instead of silent zero-glyph output. Two new unit tests: round-trip byte→char mapping; truncated glyph array rejection. Platform (1, 0) Mac Roman subtables were already accepted by the subtable picker at priority 0. Also adds tools/.gitignore so benchmark-harness fixture symlinks don't sneak into bug-fix branches when a reviewer runs the local benchmark pipeline to measure this change.

Canary: nougat_016 (City of Kirkland Lakeview neighbourhood map). Map labels are drawn twice by design — once as a stroke outline, once as a fill — so they stand out over raster tiles. Both passes surface as distinct TextSpans at essentially the same CTM. The downstream merge_adjacent_spans pass then concatenated them, producing: "EverestEverest", "CentralCentral", "HoughtonHoughton" Added `dedup_stroke_fill_overlap` as a Phase-0 pass before the existing position/content deduplications. For each candidate span, bucket by lowercased trimmed text; on collision, compute IoU against prior bboxes in the same bucket and drop the new span if IoU ≥ 0.7. Conservative to avoid false positives: - Only triggers when text length ≥ 2 (shorter keys are too easily over-matched; existing positional dedup handles single-glyph dupes). - IoU threshold 0.7 means the two bboxes must cover nearly the same rectangle — legitimate layouts where the same word appears twice on a page at different positions stay intact. Regression test: synthetic PDF draws "Everest" twice at identical CTM with Tr-1 (stroke) then Tr-0 (fill) render modes — the exact pattern used by map/poster exports. extract_text must produce the word exactly once, never "EverestEverest". Estimated corpus impact: +0.2pp TF1 mean, closes ~14pp of TF1 loss on nougat_016 and ~6pp on pdfa_026 (underscore form-field dupes). Also adds tools/.gitignore so fixture symlinks from a local benchmark run don't get committed when a reviewer measures this.

# Conflicts: # tools/.gitignore

…results Stacks every benchmark-harness-driven bug fix on one branch for a single-checkout review. Cumulative impact vs v0.3.31 on 78 unique Kreuzberg fixtures: TF1 mean 0.919 → 0.930 (+1.1pp) TF1 p10 0.776 → 0.849 (+7.3pp) ← hard-tail TIED with pdftotext SF1 mean 0.337 → 0.355 (+1.8pp) order mean 0.804 → 0.818 (+1.4pp) runtime 8.3s → 4.8s (−42%) per-fixture TF1 regressions > 0.5pp: zero Shipped fixes: - B1 shared Form XObject per-page CTM (nougat_005 +64.7pp TF1) - B3 running-artifact first-occurrence kept - B4 XY-cut for multi-column pages (synthetic TDD test) - B7 stroke+fill overlap dedup (nougat_016 "EverestEverest") - B8a soft-hyphen line-break dehyphenation - B9 TrueType cmap format 0 parser (MS Office subsets) Deferred with precise reproducers in FINAL_RESULTS.md: - B2 not-a-bug (pages genuinely empty) - B5 re-diagnosed: ToUnicode CMap CID-miss on specific fonts on nougat_035 page 3+. Emits `%B+$%8A//` ciphertext. Needs lazy- parse path instrumentation. Est +0.4pp. - B6 re-diagnosed: Stream-decode bug (obj 10 FlateDecode 1677→128 bytes garbage). Est +0.2pp. - B8b intra-word TJ spaces: needs per-font threshold calibration sweep. Est +0.4pp. Shipping the three deferred fixes would close the remaining 1.5pp TF1 gap to pdftotext while keeping the 10.8pp SF1 lead. Branch structure: each fix has its own clean branch off v0.3.31 for isolated review. This branch merges them all plus the benchmark- harness infrastructure so reviewers can run the full before/after cycle from one checkout.

1. MediaBox filter used (x, y, w, h) interpretation but get_page_media_box returns (llx, lly, urx, ury). Fixed to use absolute coordinates directly. Was accidentally correct for (0,0) origin pages (common) but wrong for non-zero-origin MediaBoxes. 2. B3 first-seen page tracking: if the cover page had no body content (all text in the top/bottom 12% chrome band), the loop skipped it and first_seen pointed to page 2+. Now first_seen_any tracks the earliest page a signature appeared on regardless of body-content, so the cover-page title exemption always fires. 3. B7 dedup: trimmed.len() counted bytes not chars — a single non-ASCII glyph (≥2 bytes) could enter the dedup path when it shouldn't. Changed to trimmed.chars().count(). Removed unused usize index from the seen HashMap (was (usize, Rect), now just Vec<Rect>). Fixed "quadtree-lite" comment to match actual HashMap implementation. 4. B4: added comment explaining why spans.clone() is necessary before XY-cut (apply() takes ownership; Err branch needs the original). Fixed histogram comment: "mean + stddev" → "mean × 1.5" to match the actual threshold code. 5. save_identity_ctm: added doc comment explaining that the check runs post-XObject /Matrix concatenation (conservative-safe, not a correctness bug; perf optimisation deferred). 6. B3 test: removed unused enumerate() + let _ = idx. All 4 regression tests pass. No functional changes to extraction logic except items 1 and 2 (real bug fixes for non-zero-origin pages and all-chrome cover pages respectively).

PRs targeting release/* branches (like #356–#362) weren't getting CI because the workflow only triggered on main/develop. Now release branches get the same coverage. Pattern uses 'release/**' (double-star glob) to match both release/v0.3.31 and any deeper nesting.

1. MediaBox filter used (x, y, w, h) interpretation but get_page_media_box returns (llx, lly, urx, ury). Fixed to use absolute coordinates directly. Was accidentally correct for (0,0) origin pages (common) but wrong for non-zero-origin MediaBoxes. 2. B3 first-seen page tracking: if the cover page had no body content (all text in the top/bottom 12% chrome band), the loop skipped it and first_seen pointed to page 2+. Now first_seen_any tracks the earliest page a signature appeared on regardless of body-content, so the cover-page title exemption always fires. 3. B7 dedup: trimmed.len() counted bytes not chars — a single non-ASCII glyph (≥2 bytes) could enter the dedup path when it shouldn't. Changed to trimmed.chars().count(). Removed unused usize index from the seen HashMap (was (usize, Rect), now just Vec<Rect>). Fixed "quadtree-lite" comment to match actual HashMap implementation. 4. B4: added comment explaining why spans.clone() is necessary before XY-cut (apply() takes ownership; Err branch needs the original). Fixed histogram comment: "mean + stddev" → "mean × 1.5" to match the actual threshold code. 5. save_identity_ctm: added doc comment explaining that the check runs post-XObject /Matrix concatenation (conservative-safe, not a correctness bug; perf optimisation deferred). 6. B3 test: removed unused enumerate() + let _ = idx. All 4 regression tests pass. No functional changes to extraction logic except items 1 and 2 (real bug fixes for non-zero-origin pages and all-chrome cover pages respectively).

yfedoseev · 2026-04-16T05:31:54Z

Superseded by #355 (v0.3.31 release merge). All commits from this branch are in main via the release merge; closing to reduce noise.

yfedoseev added 10 commits April 15, 2026 10:44

Merge branch 'fix/b7-stroke-fill-dedup' into fix/all-benchmark-bugfixes

60d242d

# Conflicts: # tools/.gitignore

Merge branch 'fix/b9-cmap-format-0' into fix/all-benchmark-bugfixes

209116f

# Conflicts: # tools/.gitignore

yfedoseev closed this Apr 16, 2026

yfedoseev deleted the fix/all-benchmark-bugfixes branch April 16, 2026 05:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

release: combined B1+B3+B4+B7+B8a+B9 + harness (+1.1pp TF1 mean, +7.3pp p10)#362

release: combined B1+B3+B4+B7+B8a+B9 + harness (+1.1pp TF1 mean, +7.3pp p10)#362
yfedoseev wants to merge 10 commits into
release/v0.3.31from
fix/all-benchmark-bugfixes

yfedoseev commented Apr 15, 2026

Uh oh!

yfedoseev commented Apr 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

yfedoseev commented Apr 15, 2026

Summary

Cumulative impact vs v0.3.31 on 78 unique Kreuzberg fixtures

Shipped fixes (all have own review-ready sub-PRs)

Deferred with precise reproducers (see tools/benchmark-harness/FINAL_RESULTS.md)

Test plan

Merge strategy

Reproduce

Uh oh!

yfedoseev commented Apr 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Deferred with precise reproducers (see `tools/benchmark-harness/FINAL_RESULTS.md`)