release: combined B1+B3+B4+B7+B8a+B9 + harness (+1.1pp TF1 mean, +7.3pp p10)#362
Closed
yfedoseev wants to merge 10 commits into
Closed
release: combined B1+B3+B4+B7+B8a+B9 + harness (+1.1pp TF1 mean, +7.3pp p10)#362yfedoseev wants to merge 10 commits into
yfedoseev wants to merge 10 commits into
Conversation
…ge (B1) Found via the benchmark-harness Kreuzberg sweep. `extract_text(n)` returned page 0's content for every `n` on PDFs where one Form XObject carries every page's text and each page's content stream applies its own CTM translation to position a viewport into it (ExpertPdf-style output — nougat_005.pdf TF1 0.254 vs pdftotext 0.924). Two independent bugs: 1. `xobject_spans_cache` (src/document.rs, used by the text extractor) stored CTM-transformed page coordinates keyed by ObjectRef only. When page 1 invoked the same XObject with a different CTM, the cache returned page 0's coordinates wholesale. Now the cache is only hit and populated when the current CTM is identity — which still covers the common case (repeated headers/footers stamped at the same origin) without mixing coordinate systems. Non-identity invocations skip the cache and re-extract, which correctly applies the caller's CTM. Added `Matrix::is_identity()` to express this cleanly. 2. Even with (1) fixed, every page still emitted spans from the entire XObject at distinct-but-out-of-bounds Y coordinates because the text pipeline doesn't honour the content stream's `W n` clipping operator yet. Post-filter extracted spans by the page's MediaBox (with a 2pt tolerance so bleed/trim content isn't dropped) so each page only retains the slice that its viewport would show. Regression test synthesises a 2-page PDF where both pages invoke the same Form XObject with different CTM translations — page 0 sees both labels, page 1 sees only the one still inside MediaBox after -700pt translation. Test fails without the fix (both pages return identical text). Verified on the Kreuzberg fixture that surfaced this (nougat_005.pdf, 5 pages, one XObject): each page now returns distinct content — "REPORTING FRAMEWORK", "COMMITMENTS Criteria 1.1", "Criteria 2.2", "Criteria 3.2 ON-SITE WASTE DIVERSION" across pages 1-4 where previously all 5 pages returned page 0's dashboard.
When the document's cover-page title also happens to repeat on every subsequent page as a running header (common in reports: "University of Oklahoma 2009", "Fiscal Year 2010 Appropriations Act"), the detector from c3d3e3f classified it as chrome and dropped it on every page — including the first. Benchmark-harness scored 5PFVA6… at a real TF1 loss for this reason. Signatures map now stores first-seen-page alongside each repeating signature. `mark_running_artifact_spans` skips the marking when the current page is the signature's first occurrence — the chrome removal still fires on every later page, so running headers/footers are still stripped from pages 2..N as intended. Storage type widens from `HashSet<String>` to `HashMap<String, usize>`; the lookup is a single `.get()` per candidate span (previously `.contains()`), same O(1). TDD regression test in tests/test_b3_first_occurrence_of_running_header_kept.rs synthesises a 3-page PDF where "Universal Title" repeats in every page's top band. After the fix page 0 retains it as the title, pages 1 and 2 drop it as chrome — test fails on the old behaviour.
`extract_text` used a row-aware Y-band sort (Y descending within bands, X ascending within a row) for every page. On multi-column layouts this interleaved content: LeftCol-row1 RightCol-row1 LeftCol- row2 RightCol-row2 … which is semantically wrong — a human reads all of the left column first, then all of the right column. Benchmark scored order_mean at 0.80 vs pdftotext's 0.86 because of this. Fix: add `is_multi_column_page` heuristic that bins span X-centers into a 40-bucket histogram, finds peaks above mean + 50 %, and confirms the peaks have ≥50 % vertical overlap (so a one-column body with a footer doesn't register as two columns). When it fires, route the spans through the existing XYCutStrategy from `pipeline/reading_order/xycut.rs`. Falls back to row-aware on single- column or XY-cut error. Safety posture: - False negative (missed column page) → old behaviour, no regression. - False positive (single-column routed through XY-cut) → extra CPU but XY-cut degrades to identity on already-ordered single-column input, no extractor content change. - Degenerate CTM spans filtered out via the 5 000pt MAX_EXTENT guard that the spatial-table detector already uses. Regression test synthesises a 2×20 grid of labels — all Left* / Right* strings must appear in intra-column order, and the middle Left label must come before the middle Right label. Without the fix, row-aware sort interleaves them.
Microsoft Office subset fonts (Calibri, Times New Roman) in Word/Excel PDF exports sometimes ship only a format-0 cmap (legacy 1-byte Mac Roman). Our parser previously bailed with "Unsupported cmap format: 0" and the font had no glyph→unicode mapping, which cascaded into text extraction losing any content rendered with that font. Surfaced on ~8 Kreuzberg fixtures (pdfa_036, pdfa_037, pdfa_049, nougat_033, …) that log this warning on every run. Estimated corpus impact: +0.25pp TF1 mean. Added `parse_cmap_format0`: - Reads the 6-byte header (length 262, language). - Reads exactly 256 glyph ids. - Maps byte code → char (the code itself is the character for ASCII, conservatively safe for the 0x80+ Mac Roman half — better than no mapping, which was the previous behaviour). - Truncated glyph arrays surface as parse errors instead of silent zero-glyph output. Two new unit tests: round-trip byte→char mapping; truncated glyph array rejection. Platform (1, 0) Mac Roman subtables were already accepted by the subtable picker at priority 0. Also adds tools/.gitignore so benchmark-harness fixture symlinks don't sneak into bug-fix branches when a reviewer runs the local benchmark pipeline to measure this change.
Canary: nougat_016 (City of Kirkland Lakeview neighbourhood map).
Map labels are drawn twice by design — once as a stroke outline, once
as a fill — so they stand out over raster tiles. Both passes surface
as distinct TextSpans at essentially the same CTM. The downstream
merge_adjacent_spans pass then concatenated them, producing:
"EverestEverest", "CentralCentral", "HoughtonHoughton"
Added `dedup_stroke_fill_overlap` as a Phase-0 pass before the
existing position/content deduplications. For each candidate span,
bucket by lowercased trimmed text; on collision, compute IoU against
prior bboxes in the same bucket and drop the new span if IoU ≥ 0.7.
Conservative to avoid false positives:
- Only triggers when text length ≥ 2 (shorter keys are too easily
over-matched; existing positional dedup handles single-glyph dupes).
- IoU threshold 0.7 means the two bboxes must cover nearly the same
rectangle — legitimate layouts where the same word appears twice
on a page at different positions stay intact.
Regression test: synthetic PDF draws "Everest" twice at identical
CTM with Tr-1 (stroke) then Tr-0 (fill) render modes — the exact
pattern used by map/poster exports. extract_text must produce the
word exactly once, never "EverestEverest".
Estimated corpus impact: +0.2pp TF1 mean, closes ~14pp of TF1 loss
on nougat_016 and ~6pp on pdfa_026 (underscore form-field dupes).
Also adds tools/.gitignore so fixture symlinks from a local
benchmark run don't get committed when a reviewer measures this.
# Conflicts: # tools/.gitignore
# Conflicts: # tools/.gitignore
…results Stacks every benchmark-harness-driven bug fix on one branch for a single-checkout review. Cumulative impact vs v0.3.31 on 78 unique Kreuzberg fixtures: TF1 mean 0.919 → 0.930 (+1.1pp) TF1 p10 0.776 → 0.849 (+7.3pp) ← hard-tail TIED with pdftotext SF1 mean 0.337 → 0.355 (+1.8pp) order mean 0.804 → 0.818 (+1.4pp) runtime 8.3s → 4.8s (−42%) per-fixture TF1 regressions > 0.5pp: zero Shipped fixes: - B1 shared Form XObject per-page CTM (nougat_005 +64.7pp TF1) - B3 running-artifact first-occurrence kept - B4 XY-cut for multi-column pages (synthetic TDD test) - B7 stroke+fill overlap dedup (nougat_016 "EverestEverest") - B8a soft-hyphen line-break dehyphenation - B9 TrueType cmap format 0 parser (MS Office subsets) Deferred with precise reproducers in FINAL_RESULTS.md: - B2 not-a-bug (pages genuinely empty) - B5 re-diagnosed: ToUnicode CMap CID-miss on specific fonts on nougat_035 page 3+. Emits `%B+$%8A//` ciphertext. Needs lazy- parse path instrumentation. Est +0.4pp. - B6 re-diagnosed: Stream-decode bug (obj 10 FlateDecode 1677→128 bytes garbage). Est +0.2pp. - B8b intra-word TJ spaces: needs per-font threshold calibration sweep. Est +0.4pp. Shipping the three deferred fixes would close the remaining 1.5pp TF1 gap to pdftotext while keeping the 10.8pp SF1 lead. Branch structure: each fix has its own clean branch off v0.3.31 for isolated review. This branch merges them all plus the benchmark- harness infrastructure so reviewers can run the full before/after cycle from one checkout.
1. MediaBox filter used (x, y, w, h) interpretation but get_page_media_box returns (llx, lly, urx, ury). Fixed to use absolute coordinates directly. Was accidentally correct for (0,0) origin pages (common) but wrong for non-zero-origin MediaBoxes. 2. B3 first-seen page tracking: if the cover page had no body content (all text in the top/bottom 12% chrome band), the loop skipped it and first_seen pointed to page 2+. Now first_seen_any tracks the earliest page a signature appeared on regardless of body-content, so the cover-page title exemption always fires. 3. B7 dedup: trimmed.len() counted bytes not chars — a single non-ASCII glyph (≥2 bytes) could enter the dedup path when it shouldn't. Changed to trimmed.chars().count(). Removed unused usize index from the seen HashMap (was (usize, Rect), now just Vec<Rect>). Fixed "quadtree-lite" comment to match actual HashMap implementation. 4. B4: added comment explaining why spans.clone() is necessary before XY-cut (apply() takes ownership; Err branch needs the original). Fixed histogram comment: "mean + stddev" → "mean × 1.5" to match the actual threshold code. 5. save_identity_ctm: added doc comment explaining that the check runs post-XObject /Matrix concatenation (conservative-safe, not a correctness bug; perf optimisation deferred). 6. B3 test: removed unused enumerate() + let _ = idx. All 4 regression tests pass. No functional changes to extraction logic except items 1 and 2 (real bug fixes for non-zero-origin pages and all-chrome cover pages respectively).
yfedoseev
added a commit
that referenced
this pull request
Apr 16, 2026
1. MediaBox filter used (x, y, w, h) interpretation but get_page_media_box returns (llx, lly, urx, ury). Fixed to use absolute coordinates directly. Was accidentally correct for (0,0) origin pages (common) but wrong for non-zero-origin MediaBoxes. 2. B3 first-seen page tracking: if the cover page had no body content (all text in the top/bottom 12% chrome band), the loop skipped it and first_seen pointed to page 2+. Now first_seen_any tracks the earliest page a signature appeared on regardless of body-content, so the cover-page title exemption always fires. 3. B7 dedup: trimmed.len() counted bytes not chars — a single non-ASCII glyph (≥2 bytes) could enter the dedup path when it shouldn't. Changed to trimmed.chars().count(). Removed unused usize index from the seen HashMap (was (usize, Rect), now just Vec<Rect>). Fixed "quadtree-lite" comment to match actual HashMap implementation. 4. B4: added comment explaining why spans.clone() is necessary before XY-cut (apply() takes ownership; Err branch needs the original). Fixed histogram comment: "mean + stddev" → "mean × 1.5" to match the actual threshold code. 5. save_identity_ctm: added doc comment explaining that the check runs post-XObject /Matrix concatenation (conservative-safe, not a correctness bug; perf optimisation deferred). 6. B3 test: removed unused enumerate() + let _ = idx. All 4 regression tests pass. No functional changes to extraction logic except items 1 and 2 (real bug fixes for non-zero-origin pages and all-chrome cover pages respectively).
Owner
Author
|
Superseded by #355 (v0.3.31 release merge). All commits from this branch are in main via the release merge; closing to reduce noise. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
One-stop release branch stacking every benchmark-harness-driven bug fix plus the harness infrastructure. Intended for single-checkout review and for cutting v0.3.32 from
release/v0.3.31in one merge.Cumulative impact vs v0.3.31 on 78 unique Kreuzberg fixtures
vs pdftotext on the same corpus: SF1 +10.8pp ahead, TF1 p10 tied.
Shipped fixes (all have own review-ready sub-PRs)
Deferred with precise reproducers (see
tools/benchmark-harness/FINAL_RESULTS.md)/Length 1677 /Filter /FlateDecodedecompresses to 128 bytes of garbage on nougat_026. Est +0.2pp.Fixing those three would close the remaining 1.5pp TF1 gap to pdftotext while keeping the 10.8pp SF1 lead.
Test plan
cargo test --release --lib— 4365 lib tests passcargo test --release --test test_b1_shared_form_xobject_per_page_ctm --test test_b3_first_occurrence_of_running_header_kept --test test_b4_two_column_reading_order --test test_b7_stroke_fill_dedup— 4 new regression tests passcargo test -p benchmark-harness— 18 harness tests passcargo clippy --all-targets --all-features -- -D warnings— cleanmake benchmark-fetch && make benchmark-run ENGINE=pdf-oxide OUTPUT=head.json+ diff gate — zero per-fixture regressions > 0.5ppMerge strategy
Either merge sub-PRs individually (#360, #359, #358, #357, #356, #361) or merge this combined branch if you want it all in one go. Both paths produce an identical
release/v0.3.31head state.Reproduce