perf(render): dirty-rect bound the overprint after-paint pass — 3.3× corpus aggregate, 7–21× on overprint-heavy PDFs, byte-identical#736

Open
RayVR wants to merge 2 commits into yfedoseev:main from RayVR:perf/overprint-dirty-rect

@RayVR RayVR commented Jun 12, 2026

Description

apply_overprint_after_paint (the §11.7.4 CompatibleOverprint after-paint pass) snapshotted the full page pixmap (pixmap.data().to_vec()) and re-scanned every pixel after every paint operator with /OP or /op active. Print-targeted producers commonly set /OP true in their ExtGState defaults, so an ordinary text-heavy form page with thousands of glyph paints did tens of gigabytes of byte traffic. Profiling an 80-document real-world corpus (academic papers, manuals, invoices, government forms, leaflets, commercial artwork) measured the scan alone at 72% of aggregate render CPU, with the snapshot memcpy hiding in another ~6% of memmove time. Documents like a state tax form (NY IT-2104) spent 96% of their entire render inside this one function.

This PR bounds both the snapshot and the scan to a device rect that provably contains the painted geometry, with a full-page fallback whenever no bound can be proven. Output is byte-identical: 511/511 corpus pages render byte-for-byte the same as the parent commit. Corpus aggregate render time drops 109.7 s → 32.9 s (3.33×); the affected document class runs 7–21× faster.

| Workload | user before → after | speedup |
| --- | --- | --- |
| NY IT-2104 tax form | 8.88 s → 0.42 s | 20.9× |
| spot-colour label artwork | 25.68 s → 1.39 s | 18.5× |
| pharma patient leaflet | 11.27 s → 0.72 s | 15.7× |
| DS-82 passport form (2 variants) | 6.0 s → 0.49 s | 12.1× / 12.3× |
| journal article reprint | 9.28 s → 0.76 s | 12.2× |
| corpus geomean (71 docs) | | 1.49× user / 1.45× wall |

53 documents without overprint are flat — a regression probe pins that they do no scan work at all. The one nominal sub-1× entry (0.948×) did not reproduce under interleaved re-measurement (base 1.06–1.09 s vs head 1.06–1.08 s — noise).

samply before/after, apply_overprint_after_paint self time: 95.8% → 1.4% (IT-2104), 93.7% → 3.5% (leaflet), 92.6% → 2.5% (label artwork), 89.3% → 5.6% (DS-82), 85.8% → 3.4% (clinical leaflet).

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Performance improvement
  • Code refactoring
  • Tests
  • CI/CD changes

Related Issues

None tracked upstream — surfaced by a corpus-wide profiling sweep after the §11 transparency surface landed.

Changes Made

  • DeviceRect / RectSnapshot / PaintBounds (page_renderer.rs): a half-open clamped device rect; a row-packed pre-paint snapshot of a rect; and an accumulator for the device-space AABB of what a paint helper actually rasterised. All conversions degrade toward over-coverage (non-finite coordinates → full page; clamping happens only after expansion), never under.
  • Path fill/stroke arms (Fill, Stroke, the four fill+stroke combos): compute a pre-paint rect from the affine-mapped path bbox + a 2 px AA margin. Strokes expand the unclamped AABB (an off-page path can still paint in-page through a fat stroke) by half the line width × the transform's Frobenius norm. That outset is further multiplied by √2 for square caps, and by the miter limit only when the path has joins and the join style is miter, so single-segment rule lines, ubiquitous in forms, never pay the PDF-default miter limit of 10. Zero-width hairlines floor at 1 px. Snapshot and scan are both rect-sized.
  • Text arms (Tj, ', ", TJ): every glyph paint site in the text rasteriser accumulates the exact union of transformed glyph-path bounds into a PaintBounds threaded through render_text / render_tj_array / render_unicode_text / render_cid_direct / render_substituted_cjk / render_text_fallback; the after-paint scan is restricted to that union. The pre-paint snapshot for text stays full-page for now — glyph bounds are only knowable once outlines are built, and guessing from font metadata (FontBBox, ascent/descent) is not provable against broken descriptors.
  • Shadings (sh): paint the whole clip region — no provable bbox, full-page snapshot + scan retained (pinned by a test).
  • apply_overprint_after_paint / apply_overprint_after_paint_with_coverage take the rect snapshot plus an optional scan-narrowing rect; the per-pixel blend math is untouched.
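To illustrate the "degrade toward over-coverage" rule for DeviceRect, here is a minimal sketch, not the code in page_renderer.rs. The field layout, the `from_bbox` and `full_page` names, and the RGBA-agnostic `u32` coordinates are all assumptions made for the example. It expands first, rounds outward, clamps last, and falls back to the full page on any non-finite input.

```rust
/// Hypothetical half-open device rect: covers x in [x0, x1), y in [y0, y1).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct DeviceRect {
    pub x0: u32,
    pub y0: u32,
    pub x1: u32,
    pub y1: u32,
}

impl DeviceRect {
    pub fn full_page(width: u32, height: u32) -> Self {
        DeviceRect { x0: 0, y0: 0, x1: width, y1: height }
    }

    /// Build a rect from a float bbox plus a pixel margin (assumed >= 0).
    /// Any non-finite input yields the full page, so the result can only
    /// over-cover, never under-cover.
    pub fn from_bbox(
        min_x: f32, min_y: f32, max_x: f32, max_y: f32,
        margin: f32, width: u32, height: u32,
    ) -> Self {
        if [min_x, min_y, max_x, max_y, margin].iter().any(|v| !v.is_finite()) {
            return Self::full_page(width, height);
        }
        let margin = margin.max(0.0);
        // Expand outward first (tolerating swapped min/max)...
        let (lo_x, hi_x) = (min_x.min(max_x) - margin, min_x.max(max_x) + margin);
        let (lo_y, hi_y) = (min_y.min(max_y) - margin, min_y.max(max_y) + margin);
        // ...then round outward and clamp to the page as the very last step.
        let clamp = |v: f32, limit: u32| v.max(0.0).min(limit as f32) as u32;
        DeviceRect {
            x0: clamp(lo_x.floor(), width),
            y0: clamp(lo_y.floor(), height),
            x1: clamp(hi_x.ceil(), width),
            y1: clamp(hi_y.ceil(), height),
        }
    }

    /// Number of pixels covered; zero for an empty rect.
    pub fn area(&self) -> u64 {
        u64::from(self.x1.saturating_sub(self.x0)) * u64::from(self.y1.saturating_sub(self.y0))
    }
}
```

Under these assumptions a bbox that lies entirely off-page clamps to an empty rect, while a NaN coordinate produces the full page rather than an arbitrary small rect.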

Byte equivalence is structural: a pixel outside a provable bound cannot differ from the snapshot, so the historical full-page diff skipped it anyway; restricting the walk visits exactly the pixels the diff could ever act on.
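The structural argument can be checked on a toy buffer. The sketch below is illustrative only: `paint_rect` and `diff_in_rect` are hypothetical helpers, and a row-major RGBA8 layout is assumed. It paints a small rect, then diffs the page against its pre-paint copy twice, once over the whole page and once over a rect that contains the paint.

```rust
/// Hypothetical paint: fill the half-open rect (x0, y0, x1, y1) with one colour.
pub fn paint_rect(buf: &mut [u8], width: usize, rect: (usize, usize, usize, usize), rgba: [u8; 4]) {
    let (x0, y0, x1, y1) = rect;
    for y in y0..y1 {
        for x in x0..x1 {
            let i = (y * width + x) * 4; // RGBA8, row-major (assumed layout)
            buf[i..i + 4].copy_from_slice(&rgba);
        }
    }
}

/// Walk only the given rect and return (differing pixels, visited pixels).
pub fn diff_in_rect(
    before: &[u8],
    after: &[u8],
    width: usize,
    rect: (usize, usize, usize, usize),
) -> (usize, usize) {
    let (x0, y0, x1, y1) = rect;
    let (mut differing, mut visited) = (0, 0);
    for y in y0..y1 {
        for x in x0..x1 {
            let i = (y * width + x) * 4;
            visited += 1;
            if before[i..i + 4] != after[i..i + 4] {
                differing += 1;
            }
        }
    }
    (differing, visited)
}
```

On a 16×16 page with a 4×4 paint, both walks find the same 16 differing pixels; the bounded walk simply visits fewer pixels to get there, which is the whole saving.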

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • All new and existing tests pass locally
  • I have run cargo test --all-features
  • I have run cargo clippy -- -D warnings
  • I have run cargo fmt

Bounding contract (tests/test_overprint_dirty_rect.rs, gated on test-support): a PageRenderer::overprint_scanned_pixels counter pins five behaviours — small /OP true DeviceCMYK fill, stroke, and Tj paints must each scan ≤ 25% of the page (they scan ~5–10%; the pre-change behaviour was 100% and the tests were written first and watched fail at exactly scanned == total); sh must remain full-page; and a document without /OP must scan zero pixels. Counters, not wall-clock — exact and machine-independent.
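As a rough picture of what a counter-based contract looks like, here is a self-contained sketch. It does not use the real PageRenderer API: `RectSnapshot::capture`, `ScanCounter`, and the RGBA8 layout are stand-ins invented for the example, and only the idea (count visited pixels, assert on the count) comes from the description above.

```rust
/// Hypothetical row-packed copy of the pixels inside a half-open rect.
pub struct RectSnapshot {
    x0: usize,
    y0: usize,
    w: usize,
    h: usize,
    data: Vec<u8>,
}

impl RectSnapshot {
    pub fn capture(page: &[u8], page_w: usize, x0: usize, y0: usize, x1: usize, y1: usize) -> Self {
        let (w, h) = (x1.saturating_sub(x0), y1.saturating_sub(y0));
        let mut data = Vec::with_capacity(w * h * 4);
        for y in y0..y0 + h {
            let start = (y * page_w + x0) * 4; // RGBA8 assumed
            data.extend_from_slice(&page[start..start + w * 4]);
        }
        RectSnapshot { x0, y0, w, h, data }
    }

    pub fn byte_len(&self) -> usize {
        self.data.len()
    }
}

/// Stand-in for a test-support counter: every visited pixel is counted.
#[derive(Default)]
pub struct ScanCounter {
    pub scanned_pixels: u64,
}

impl ScanCounter {
    /// Compare the snapshot against the current page; returns changed pixels.
    pub fn scan(&mut self, snap: &RectSnapshot, page: &[u8], page_w: usize) -> usize {
        let mut changed = 0;
        for row in 0..snap.h {
            let page_start = ((snap.y0 + row) * page_w + snap.x0) * 4;
            let snap_start = row * snap.w * 4;
            for col in 0..snap.w {
                let (p, s) = (page_start + col * 4, snap_start + col * 4);
                self.scanned_pixels += 1;
                if page[p..p + 4] != snap.data[s..s + 4] {
                    changed += 1;
                }
            }
        }
        changed
    }
}
```

A test in this style asserts on `scanned_pixels` relative to the page's pixel count, so it yields the same result on any machine regardless of load.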

Byte equivalence: 80-document corpus, pages 1–20 per document at 150 DPI, rendered with this branch and its parent commit (e12609e5, the only diff being this PR): 511/511 pages byte-identical (cmp on every PNG).

Semantics: the full suite — 8857 passed, 0 failed with rendering icc test-support — includes the §11.7.4 byte-exact CompatibleOverprint probes (OPM=0/1, DeviceGray/Separation sources, knockout-group interaction), which all pass unchanged.

Timing: medians of 10 warmed runs per (document, binary) via /usr/bin/time -p on an idle M2 Max; geomean and per-class numbers above.

Python Bindings (if applicable)

  • Python bindings updated (if needed)
  • Python tests pass
  • Python code formatted with ruff format
  • Python code linted with ruff check

No binding changes — pure-Rust rendering hot path; bindings inherit the speedup through render_page.

Documentation

  • I have updated the documentation (README, docs/, code comments)
  • I have added/updated examples (if applicable)
  • I have updated CHANGELOG.md

No user-facing docs change. The new types and the bounding invariants carry doc comments, including the obligation that every new glyph paint site must accumulate into PaintBounds (or the text scan under-covers).

Checklist

  • My code follows the project's coding guidelines (see CONTRIBUTING.md)
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings

Additional Notes

After this change, the remaining cost on overprint-heavy documents is memmove (20–78% self in the after-profiles): the surviving text-arm full-page snapshot (one per text-showing op) and the sibling snapshot families (smask_snapshot, cmyk_compose_snapshot, spot_paint_snapshot, cmyk_sidecar_snapshot*) which still clone the full page per gated paint. The DeviceRect/RectSnapshot/PaintBounds machinery introduced here is deliberately reusable for those passes — natural follow-ups, kept out of this PR to keep the blast radius reviewable. A provable pre-paint text bound (outline dry pass or an incrementally-synced shadow buffer) would eliminate the last full-page memcpy; both options have correctness subtleties (metadata-independent bounds, shadow desync on gate transitions) that deserve their own review.

… scan

apply_overprint_after_paint snapshotted the full page pixmap
(pixmap.data().to_vec()) and re-scanned every pixel after every paint
operator with /OP or /op active. Print-targeted producers set /OP true
in their ExtGState defaults, so a text-heavy form page with thousands
of glyph paints did tens of gigabytes of byte traffic: corpus profiling
measured the scan alone at 72% of aggregate render CPU across 78
real-world documents, with the snapshot memcpy hiding in another 6% of
memmove time.

The pass now snapshots and scans only a device rect that provably
bounds the painted geometry:

- Path fills bound by the transformed path bbox (affine corner
  mapping) plus a 2px anti-aliasing margin. Strokes additionally
  expand by half line width x the transform's Frobenius norm, times
  sqrt(2) for square caps, times the miter limit only when the path
  has joins and the join style is miter - so single-segment rule
  lines never pay the PDF default miter limit of 10.
- Text paints accumulate the exact union of transformed glyph-path
  bounds (PaintBounds) at every fill site in the text rasteriser; the
  scan is restricted to that union. The pre-paint snapshot for text
  remains full-page (glyph bounds are only known once outlines are
  built), so text keeps one page-sized memcpy per operator for now.
- Shadings (sh) paint the whole clip region and have no provable
  bbox: they keep the historical full-page snapshot and scan, as does
  any geometry whose mapped coordinates are non-finite. The fallback
  is always toward over-coverage, never under.
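
The text-bounds accumulation described above can be pictured with a small sketch. This is an assumed shape, not the PaintBounds type from the change: the `add`, `rect`, and `needs_full_page` names are hypothetical, and bounds are kept as plain float tuples for brevity.

```rust
/// Hypothetical accumulator for the device-space union of painted bounds.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
pub struct PaintBounds {
    /// (min_x, min_y, max_x, max_y) of everything added so far.
    rect: Option<(f32, f32, f32, f32)>,
    /// Set once any non-finite bound is seen; the caller must then
    /// fall back to a full-page scan.
    unbounded: bool,
}

impl PaintBounds {
    /// Called at each glyph paint site with that glyph's transformed bounds.
    pub fn add(&mut self, min_x: f32, min_y: f32, max_x: f32, max_y: f32) {
        if [min_x, min_y, max_x, max_y].iter().any(|v| !v.is_finite()) {
            self.unbounded = true;
            return;
        }
        self.rect = Some(match self.rect {
            None => (min_x, min_y, max_x, max_y),
            Some((a, b, c, d)) => (a.min(min_x), b.min(min_y), c.max(max_x), d.max(max_y)),
        });
    }

    /// Union of all finite bounds added so far; None if nothing was painted.
    pub fn rect(&self) -> Option<(f32, f32, f32, f32)> {
        self.rect
    }

    pub fn needs_full_page(&self) -> bool {
        self.unbounded
    }
}
```

A paint site that forgets to call `add` would leave its glyph outside the union, so the scan would under-cover; this is why every fill site has to participate.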

Byte equivalence is structural: pixels outside a provable bound
cannot differ from the snapshot, so the historical full-page diff
skipped them anyway. The stroke outset expands the unclamped AABB
because a path lying off-page can still paint in-page through a fat
stroke.
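
A sketch of the stroke outset described above, under stated assumptions: the enum and function names are hypothetical, the transform is passed as its 2x2 linear part [a, b, c, d], and the 1 px hairline floor is applied to the final outset. The real code may order or factor these steps differently.

```rust
use std::f32::consts::SQRT_2;

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum LineCap { Butt, Round, Square }

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum LineJoin { Miter, Round, Bevel }

/// Frobenius norm of the 2x2 linear part [a, b, c, d] of the transform.
pub fn frobenius_norm(m: [f32; 4]) -> f32 {
    m.iter().map(|v| v * v).sum::<f32>().sqrt()
}

/// Conservative device-space outset for a stroke, in pixels.
/// Returns +infinity when the inputs are non-finite, which the caller
/// is expected to treat as "use the full page".
pub fn stroke_outset_px(
    line_width: f32,
    linear: [f32; 4],
    cap: LineCap,
    join: LineJoin,
    miter_limit: f32,
    has_joins: bool,
) -> f32 {
    let mut outset = 0.5 * line_width * frobenius_norm(linear);
    if cap == LineCap::Square {
        outset *= SQRT_2; // a square cap's corner reaches sqrt(2) half-widths
    }
    if has_joins && join == LineJoin::Miter {
        outset *= miter_limit.max(1.0); // only paths with miter joins pay this
    }
    if !outset.is_finite() {
        return f32::INFINITY; // checked before max(), which would swallow NaN
    }
    outset.max(1.0) // zero-width hairlines still cover at least 1 px
}
```

The outset is meant to be added to the unclamped bbox before any clamping to the page. Note that even the identity transform has Frobenius norm √2, so the bound is deliberately loose.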

A test-support counter (overprint_scanned_pixels) pins the bounding
contract: small /OP-true fill, stroke, and text paints must scan a
rect-bounded neighbourhood, sh must remain full-page, and documents
without /OP must not scan at all.

RayVR requested a review from yfedoseev as a code owner on June 12, 2026 13:52
RayVR commented Jun 12, 2026

CI triage — the 11 red checks decompose into two groups, neither caused by this PR's change:

Advisory-driven (3): Security Audit, Dependency Check (cargo-deny), Security audit (bundler-audit + OSV-Scanner). Two new PyO3 advisories published against the existing lockfile — RUSTSEC-2026-0176 (BoundListIterator/BoundTupleIterator unchecked nth/nth_back index arithmetic) and RUSTSEC-2026-0177 (PyCFunction::new_closure missing Sync bound). These fail on main's own latest CI run as well (cargo-deny red there too) and will fail every open PR until a pyo3 bump lands on main. Happy to send that bump as a separate PR if useful.

Network flakes (8): Test (ubuntu-latest, stable) died fetching the base64 crate from crates.io (SSL_read: unexpected eof, job log) before compiling anything; the 5 PHP jobs and 2 wheel builds all failed in setup-stage downloads in the same window. A re-run of failed jobs should clear all eight (I can't trigger re-runs from a fork).

All 165 functional checks — tests on the other platforms, clippy, docs, lib builds, bindings — are green.

@yfedoseev yfedoseev left a comment

Approving. Reviewed the change against ISO 32000-1 §8.6.7 / §11.7.4 and independently re-verified the byte-identity claim with a fresh render sweep.

Spec & design

The correctness argument rests entirely on the dirty rect being a provable superset of the painted pixels — and every margin here is a conservative upper bound: the 2px AA pad, the √2 square-cap outset, the Frobenius-norm scale bound (≥ the spectral norm, so genuinely an upper bound on the transform's stretch), the miter limit applied only when the path actually has miter joins, the 1px hairline floor, and the non-finite → full-page fallback. The §11.7.4 CompatibleOverprint blend math is untouched; this only narrows where the after-paint scan runs, and shadings/text correctly retain the full-page snapshot. The doc comments capture the invariants well.
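
For reference, the norm inequality behind the scale bound can be written out. This is the standard linear-algebra fact, stated here for the 2×2 linear part $A$ of the transform with entries $a, b, c, d$ and singular values $\sigma_1 \ge \sigma_2 \ge 0$ (symbols chosen for this note, not taken from the code):

```latex
\|A\|_2 = \sigma_1
\;\le\; \sqrt{\sigma_1^2 + \sigma_2^2}
= \|A\|_F
= \sqrt{a^2 + b^2 + c^2 + d^2},
\qquad\text{hence}\qquad
\|A v\| \;\le\; \|A\|_2 \,\|v\| \;\le\; \|A\|_F \,\|v\| \quad \text{for every } v .
```

So a user-space offset of half the line width can move a point by at most $\tfrac{1}{2}\,w\,\|A\|_F$ in device space, which is why the Frobenius norm is a safe (if slightly loose) stand-in for the true maximum stretch.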

Independent render regression

Built base (e12609e5) and this branch, rendered 68 diverse PDFs (overprint / CMYK / knockout / shading / smask / tiling-pattern / clip / CJK / forms), pages 1–5 @150 DPI, pixel-diffed:

132/132 pages byte-identical — including the stroke-heavy CAD overprint plans and tiger-as-form-xobject (0 px differ).

Reproduces your 511/511 result on a fresh corpus.

CI heads-up

The red cargo-deny / Security Audit / OSV checks aren't from this change — they're the pyo3 RUSTSEC-2026-0176/0177 advisory-ignores that just landed in deny.toml / .cargo/audit.toml / osv-scanner.toml with v0.3.64, after this branch's base. Merging current main will clear them.

A 3.3× corpus-aggregate win on the render hot path with provably equivalent output is excellent — thank you for the careful work and the thorough write-up. 🙏
