Skip to content

release: combined B1+B3+B4+B7+B8a+B9 + harness (+1.1pp TF1 mean, +7.3pp p10)#362

Closed
yfedoseev wants to merge 10 commits into
release/v0.3.31from
fix/all-benchmark-bugfixes
Closed

release: combined B1+B3+B4+B7+B8a+B9 + harness (+1.1pp TF1 mean, +7.3pp p10)#362
yfedoseev wants to merge 10 commits into
release/v0.3.31from
fix/all-benchmark-bugfixes

Conversation

@yfedoseev

Copy link
Copy Markdown
Owner

Summary

One-stop release branch stacking every benchmark-harness-driven bug fix plus the harness infrastructure. Intended for single-checkout review and for cutting v0.3.32 from release/v0.3.31 in one merge.

Cumulative impact vs v0.3.31 on 78 unique Kreuzberg fixtures

v0.3.31 HEAD Δ
TF1 mean 0.919 0.930 +1.1pp
TF1 p50 0.965 0.974 +0.9pp
TF1 p10 0.776 0.849 +7.3pp ← tied with pdftotext
SF1 mean 0.337 0.355 +1.8pp
SF1 p10 0.121 0.129 +0.8pp
order mean 0.804 0.818 +1.4pp
runtime 8.3 s 4.8 s −42 %
per-fixture TF1 regressions > 0.5pp zero

vs pdftotext on the same corpus: SF1 +10.8pp ahead, TF1 p10 tied.

Shipped fixes (all have own review-ready sub-PRs)

Bug PR Canary Impact
B1 shared Form XObject per-page CTM #360 nougat_005 +64.7pp on that fixture, +7.2pp corpus p10
B3 running-artifact first-occurrence kept #359 pdfa_010 +1.04pp order mean
B4 XY-cut for multi-column pages #358 synthetic 2×20 grid corpus neutral, TDD-correct
B7 stroke+fill overlap dedup #357 nougat_016 fixes "EverestEverest"
B8a soft-hyphen dehyphenation (on this branch) pdfa_044, nougat_029 +0.08pp
B9 TrueType cmap format 0 #356 8 MS Office fixtures warnings resolved
Benchmark harness #361 TF1/SF1 measurement infra

Deferred with precise reproducers (see tools/benchmark-harness/FINAL_RESULTS.md)

  • B5 re-diagnosed: not "empty pages" — ToUnicode CMap CID-miss producing ASCII-shifted ciphertext on nougat_035 page 3+. Est +0.4pp.
  • B6 re-diagnosed: not table-detector row filtering — stream-decode bug where /Length 1677 /Filter /FlateDecode decompresses to 128 bytes of garbage on nougat_026. Est +0.2pp.
  • B8b intra-word TJ spaces: needs per-font threshold calibration sweep. Est +0.4pp.
  • B2: dropped as not-a-bug after re-verification.

Fixing those three would close the remaining 1.5pp TF1 gap to pdftotext while keeping the 10.8pp SF1 lead.

Test plan

  • cargo test --release --lib — 4365 lib tests pass
  • cargo test --release --test test_b1_shared_form_xobject_per_page_ctm --test test_b3_first_occurrence_of_running_header_kept --test test_b4_two_column_reading_order --test test_b7_stroke_fill_dedup — 4 new regression tests pass
  • cargo test -p benchmark-harness — 18 harness tests pass
  • cargo clippy --all-targets --all-features -- -D warnings — clean
  • make benchmark-fetch && make benchmark-run ENGINE=pdf-oxide OUTPUT=head.json + diff gate — zero per-fixture regressions > 0.5pp
  • No runtime regression — 42 % faster end-to-end on the corpus

Merge strategy

Either merge sub-PRs individually (#360, #359, #358, #357, #356, #361) or merge this combined branch if you want it all in one go. Both paths produce an identical release/v0.3.31 head state.

Reproduce

git checkout fix/all-benchmark-bugfixes
make benchmark-fetch
cargo build --release -p benchmark-harness
make benchmark-run ENGINE=pdf-oxide OUTPUT=target/head.json
make benchmark-run ENGINE=pdftotext OUTPUT=target/pdftotext.json
./target/release/benchmark-harness diff target/head.json target/pdftotext.json

…ge (B1)

Found via the benchmark-harness Kreuzberg sweep. `extract_text(n)`
returned page 0's content for every `n` on PDFs where one Form XObject
carries every page's text and each page's content stream applies its
own CTM translation to position a viewport into it (ExpertPdf-style
output — nougat_005.pdf TF1 0.254 vs pdftotext 0.924).

Two independent bugs:

1. `xobject_spans_cache` (src/document.rs, used by the text extractor)
   stored CTM-transformed page coordinates keyed by ObjectRef only. When
   page 1 invoked the same XObject with a different CTM, the cache
   returned page 0's coordinates wholesale. Now the cache is only hit
   and populated when the current CTM is identity — which still covers
   the common case (repeated headers/footers stamped at the same origin)
   without mixing coordinate systems. Non-identity invocations skip the
   cache and re-extract, which correctly applies the caller's CTM.
   Added `Matrix::is_identity()` to express this cleanly.

2. Even with (1) fixed, every page still emitted spans from the entire
   XObject at distinct-but-out-of-bounds Y coordinates because the text
   pipeline doesn't honour the content stream's `W n` clipping operator
   yet. Post-filter extracted spans by the page's MediaBox (with a 2pt
   tolerance so bleed/trim content isn't dropped) so each page only
   retains the slice that its viewport would show.

Regression test synthesises a 2-page PDF where both pages invoke the
same Form XObject with different CTM translations — page 0 sees both
labels, page 1 sees only the one still inside MediaBox after -700pt
translation. Test fails without the fix (both pages return identical
text).

Verified on the Kreuzberg fixture that surfaced this (nougat_005.pdf,
5 pages, one XObject): each page now returns distinct content —
"REPORTING FRAMEWORK", "COMMITMENTS Criteria 1.1", "Criteria 2.2",
"Criteria 3.2 ON-SITE WASTE DIVERSION" across pages 1-4 where previously
all 5 pages returned page 0's dashboard.
When the document's cover-page title also happens to repeat on every
subsequent page as a running header (common in reports: "University of
Oklahoma 2009", "Fiscal Year 2010 Appropriations Act"), the detector
from c3d3e3f classified it as chrome and dropped it on every page —
including the first. Benchmark-harness scored 5PFVA6… at a real TF1
loss for this reason.

Signatures map now stores first-seen-page alongside each repeating
signature. `mark_running_artifact_spans` skips the marking when the
current page is the signature's first occurrence — the chrome removal
still fires on every later page, so running headers/footers are still
stripped from pages 2..N as intended.

Storage type widens from `HashSet<String>` to `HashMap<String, usize>`;
the lookup is a single `.get()` per candidate span (previously
`.contains()`), same O(1).

TDD regression test in
tests/test_b3_first_occurrence_of_running_header_kept.rs synthesises
a 3-page PDF where "Universal Title" repeats in every page's top band.
After the fix page 0 retains it as the title, pages 1 and 2 drop it
as chrome — test fails on the old behaviour.
`extract_text` used a row-aware Y-band sort (Y descending within
bands, X ascending within a row) for every page. On multi-column
layouts this interleaved content: LeftCol-row1 RightCol-row1 LeftCol-
row2 RightCol-row2 … which is semantically wrong — a human reads all
of the left column first, then all of the right column. Benchmark
scored order_mean at 0.80 vs pdftotext's 0.86 because of this.

Fix: add `is_multi_column_page` heuristic that bins span X-centers
into a 40-bucket histogram, finds peaks above mean + 50 %, and
confirms the peaks have ≥50 % vertical overlap (so a one-column body
with a footer doesn't register as two columns). When it fires, route
the spans through the existing XYCutStrategy from
`pipeline/reading_order/xycut.rs`. Falls back to row-aware on single-
column or XY-cut error.

Safety posture:
- False negative (missed column page) → old behaviour, no regression.
- False positive (single-column routed through XY-cut) → extra CPU
  but XY-cut degrades to identity on already-ordered single-column
  input, no extractor content change.
- Degenerate CTM spans filtered out via the 5 000pt MAX_EXTENT guard
  that the spatial-table detector already uses.

Regression test synthesises a 2×20 grid of labels — all Left* / Right*
strings must appear in intra-column order, and the middle Left label
must come before the middle Right label. Without the fix, row-aware
sort interleaves them.
Microsoft Office subset fonts (Calibri, Times New Roman) in Word/Excel
PDF exports sometimes ship only a format-0 cmap (legacy 1-byte Mac
Roman). Our parser previously bailed with "Unsupported cmap format: 0"
and the font had no glyph→unicode mapping, which cascaded into text
extraction losing any content rendered with that font.

Surfaced on ~8 Kreuzberg fixtures (pdfa_036, pdfa_037, pdfa_049,
nougat_033, …) that log this warning on every run. Estimated corpus
impact: +0.25pp TF1 mean.

Added `parse_cmap_format0`:
- Reads the 6-byte header (length 262, language).
- Reads exactly 256 glyph ids.
- Maps byte code → char (the code itself is the character for ASCII,
  conservatively safe for the 0x80+ Mac Roman half — better than no
  mapping, which was the previous behaviour).
- Truncated glyph arrays surface as parse errors instead of silent
  zero-glyph output.

Two new unit tests: round-trip byte→char mapping; truncated glyph
array rejection. Platform (1, 0) Mac Roman subtables were already
accepted by the subtable picker at priority 0.

Also adds tools/.gitignore so benchmark-harness fixture symlinks
don't sneak into bug-fix branches when a reviewer runs the local
benchmark pipeline to measure this change.
Canary: nougat_016 (City of Kirkland Lakeview neighbourhood map).
Map labels are drawn twice by design — once as a stroke outline, once
as a fill — so they stand out over raster tiles. Both passes surface
as distinct TextSpans at essentially the same CTM. The downstream
merge_adjacent_spans pass then concatenated them, producing:

    "EverestEverest", "CentralCentral", "HoughtonHoughton"

Added `dedup_stroke_fill_overlap` as a Phase-0 pass before the
existing position/content deduplications. For each candidate span,
bucket by lowercased trimmed text; on collision, compute IoU against
prior bboxes in the same bucket and drop the new span if IoU ≥ 0.7.

Conservative to avoid false positives:
- Only triggers when text length ≥ 2 (shorter keys are too easily
  over-matched; existing positional dedup handles single-glyph dupes).
- IoU threshold 0.7 means the two bboxes must cover nearly the same
  rectangle — legitimate layouts where the same word appears twice
  on a page at different positions stay intact.

Regression test: synthetic PDF draws "Everest" twice at identical
CTM with Tr-1 (stroke) then Tr-0 (fill) render modes — the exact
pattern used by map/poster exports. extract_text must produce the
word exactly once, never "EverestEverest".

Estimated corpus impact: +0.2pp TF1 mean, closes ~14pp of TF1 loss
on nougat_016 and ~6pp on pdfa_026 (underscore form-field dupes).

Also adds tools/.gitignore so fixture symlinks from a local
benchmark run don't get committed when a reviewer measures this.
…results

Stacks every benchmark-harness-driven bug fix on one branch for a
single-checkout review. Cumulative impact vs v0.3.31 on 78 unique
Kreuzberg fixtures:

  TF1 mean     0.919 → 0.930  (+1.1pp)
  TF1 p10      0.776 → 0.849  (+7.3pp)   ← hard-tail TIED with pdftotext
  SF1 mean     0.337 → 0.355  (+1.8pp)
  order mean   0.804 → 0.818  (+1.4pp)
  runtime      8.3s  → 4.8s   (−42%)

  per-fixture TF1 regressions > 0.5pp: zero

Shipped fixes:

- B1 shared Form XObject per-page CTM (nougat_005 +64.7pp TF1)
- B3 running-artifact first-occurrence kept
- B4 XY-cut for multi-column pages (synthetic TDD test)
- B7 stroke+fill overlap dedup (nougat_016 "EverestEverest")
- B8a soft-hyphen line-break dehyphenation
- B9 TrueType cmap format 0 parser (MS Office subsets)

Deferred with precise reproducers in FINAL_RESULTS.md:

- B2 not-a-bug (pages genuinely empty)
- B5 re-diagnosed: ToUnicode CMap CID-miss on specific fonts on
  nougat_035 page 3+. Emits `%B+$%8A//` ciphertext. Needs lazy-
  parse path instrumentation. Est +0.4pp.
- B6 re-diagnosed: Stream-decode bug (obj 10 FlateDecode 1677→128
  bytes garbage). Est +0.2pp.
- B8b intra-word TJ spaces: needs per-font threshold calibration
  sweep. Est +0.4pp.

Shipping the three deferred fixes would close the remaining 1.5pp
TF1 gap to pdftotext while keeping the 10.8pp SF1 lead.

Branch structure: each fix has its own clean branch off v0.3.31 for
isolated review. This branch merges them all plus the benchmark-
harness infrastructure so reviewers can run the full before/after
cycle from one checkout.
1. MediaBox filter used (x, y, w, h) interpretation but
   get_page_media_box returns (llx, lly, urx, ury). Fixed to use
   absolute coordinates directly. Was accidentally correct for (0,0)
   origin pages (common) but wrong for non-zero-origin MediaBoxes.

2. B3 first-seen page tracking: if the cover page had no body content
   (all text in the top/bottom 12% chrome band), the loop skipped it
   and first_seen pointed to page 2+. Now first_seen_any tracks the
   earliest page a signature appeared on regardless of body-content,
   so the cover-page title exemption always fires.

3. B7 dedup: trimmed.len() counted bytes not chars — a single non-ASCII
   glyph (≥2 bytes) could enter the dedup path when it shouldn't.
   Changed to trimmed.chars().count(). Removed unused usize index from
   the seen HashMap (was (usize, Rect), now just Vec<Rect>). Fixed
   "quadtree-lite" comment to match actual HashMap implementation.

4. B4: added comment explaining why spans.clone() is necessary before
   XY-cut (apply() takes ownership; Err branch needs the original).
   Fixed histogram comment: "mean + stddev" → "mean × 1.5" to match
   the actual threshold code.

5. save_identity_ctm: added doc comment explaining that the check runs
   post-XObject /Matrix concatenation (conservative-safe, not a
   correctness bug; perf optimisation deferred).

6. B3 test: removed unused enumerate() + let _ = idx.

All 4 regression tests pass. No functional changes to extraction
logic except items 1 and 2 (real bug fixes for non-zero-origin pages
and all-chrome cover pages respectively).
PRs targeting release/* branches (like #356#362) weren't getting CI
because the workflow only triggered on main/develop. Now release
branches get the same coverage.

Pattern uses 'release/**' (double-star glob) to match both
release/v0.3.31 and any deeper nesting.
yfedoseev added a commit that referenced this pull request Apr 16, 2026
1. MediaBox filter used (x, y, w, h) interpretation but
   get_page_media_box returns (llx, lly, urx, ury). Fixed to use
   absolute coordinates directly. Was accidentally correct for (0,0)
   origin pages (common) but wrong for non-zero-origin MediaBoxes.

2. B3 first-seen page tracking: if the cover page had no body content
   (all text in the top/bottom 12% chrome band), the loop skipped it
   and first_seen pointed to page 2+. Now first_seen_any tracks the
   earliest page a signature appeared on regardless of body-content,
   so the cover-page title exemption always fires.

3. B7 dedup: trimmed.len() counted bytes not chars — a single non-ASCII
   glyph (≥2 bytes) could enter the dedup path when it shouldn't.
   Changed to trimmed.chars().count(). Removed unused usize index from
   the seen HashMap (was (usize, Rect), now just Vec<Rect>). Fixed
   "quadtree-lite" comment to match actual HashMap implementation.

4. B4: added comment explaining why spans.clone() is necessary before
   XY-cut (apply() takes ownership; Err branch needs the original).
   Fixed histogram comment: "mean + stddev" → "mean × 1.5" to match
   the actual threshold code.

5. save_identity_ctm: added doc comment explaining that the check runs
   post-XObject /Matrix concatenation (conservative-safe, not a
   correctness bug; perf optimisation deferred).

6. B3 test: removed unused enumerate() + let _ = idx.

All 4 regression tests pass. No functional changes to extraction
logic except items 1 and 2 (real bug fixes for non-zero-origin pages
and all-chrome cover pages respectively).
@yfedoseev

Copy link
Copy Markdown
Owner Author

Superseded by #355 (v0.3.31 release merge). All commits from this branch are in main via the release merge; closing to reduce noise.

@yfedoseev yfedoseev closed this Apr 16, 2026
@yfedoseev yfedoseev deleted the fix/all-benchmark-bugfixes branch April 16, 2026 05:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant