Skip to content

perf(render): apply_pending_clip in-place mask intersect — 1.11× corpus geomean, up to 1.8× on clip-heavy PDFs#654

Merged
yfedoseev merged 6 commits into
yfedoseev:mainfrom
RayVR:perf/apply-pending-clip
Jun 12, 2026
Merged

perf(render): apply_pending_clip in-place mask intersect — 1.11× corpus geomean, up to 1.8× on clip-heavy PDFs#654
yfedoseev merged 6 commits into
yfedoseev:mainfrom
RayVR:perf/apply-pending-clip

Conversation

@RayVR

@RayVR RayVR commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

Description

apply_pending_clip was consuming ~54% self time / ~64% inclusive time on tests/fixtures/1.pdf (samply, 7-page Form-XObject-heavy PDF, release build with debug symbols). A single function dominating two-thirds of a real-PDF render is a clear sign of redundant work, so I dug in.

This PR removes the two avoidable per-call page-sized byte operations the function was doing. Measured across a 71-document real-world corpus (academic papers, technical manuals, invoices, receipts, scanned forms, commercial artwork; medians of 10 warmed runs per document per binary):

  • 1.11× geomean user-time speedup / 1.10× wall, with zero regressions (no document below 0.95×).
  • The distribution is bimodal: 54 text-dominated documents are flat, while 15 clip-heavy documents (scanned forms, receipts, register extracts) run 1.3–1.83× faster. Profiling confirms why: on those documents apply_pending_clip has the same ~50% baseline self-time as the fixture — the fixture is representative of a real document class, not an outlier.
  • Hardware counters on the fixture workload: −53% instructions, −50% cycles.

Output is byte-identical on 503 of 511 corpus pages; the remaining 8 pages differ by ±1 LSB on 1–12 isolated anti-aliased edge pixels per page (quantified below — sub-perceptual by any standard).

Type of Change

  • Performance improvement (output equivalent; worst case ±1 LSB on isolated AA pixels, quantified under Testing)
  • Bug fix
  • New feature
  • Breaking change

Related Issues

None filed — surfaced by a profiling investigation. Documenting it inline rather than after-the-fact.

Changes Made

src/rendering/page_renderer.rs::apply_pending_clip rewrite:

Before (per call when a clip narrowing was pending):

  1. Build combined CTM
  2. path.transform(transform) — clone+transform the path
  3. Allocate a fresh page-sized tiny_skia::Mask (W×H zero-fill)
  4. Rasterize the transformed path into it
  5. If a current clip mask exists: current_mask.clone() — another full W×H memcpy (probably added to dodge a borrow-checker conflict between clip_stack.last() and clip_stack.last_mut())
  6. Scalar (combined[i] as u32 * new[i] as u32) / 255 byte loop to combine the two masks

After:

  1. clip_stack.last_mut() match
  2. If parent mask exists: tiny_skia::Mask::intersect_path(path, ...) in place on the existing mask — tiny_skia uses rounded premultiply_u8 (no divide) and the upstream iter_mut().zip() form is auto-vectorizable
  3. Otherwise allocate one fresh mask and rasterize the path into it
  4. CTM combined inside intersect_path — the explicit path.transform call is dropped

Net: ~3 × W×H byte operations per call → ~1 × W×H. On 1.pdf, that's removing ~11 GB of byte work over 7 pages.

Testing

Real-world corpus measurement

A/B pair: this branch vs its mainline merge parent (the only diff between the two binaries is this PR's page_renderer.rs change). Corpus: 80 real-world PDFs (all parse and render); byte comparison over pages 1–20 of every document at 150 DPI (511 pages per binary); timing over the 71 documents ≤ 30 pages with measurable medians (--all, 150 DPI, 1 discarded warmup + 10 timed runs per document per binary, medians via /usr/bin/time -p, idle M2 Max).

Timing: geomean 1.109× user / 1.103× wall. Buckets (user): ≥1.3×: 15 docs · 1.1–1.3×: 2 · 0.95–1.1×: 54 · <0.95×: 0.

Method-level profiles (samply @ 8 kHz, whole-document render, apply_pending_clip share):

Document base self / incl post self / incl doc speedup
tests/fixtures/1.pdf (control) 53.7% / 64.2% 36.7% / 47.8% 1.80×
scanned family-register extract 53.2% / 60.6% 32.8% / 37.8% 1.70×
expense-note receipt 46.7% / 70.1% 26.4% / 55.8% 1.53×
preprints.org 202110.0334 (20 p) 15.5% / 45.2% 6.7% / 39.5% 1.15×
MDPI mathematics-11-00327 (17 p) 0.3% / 0.7% 0.1% / 0.4% 1.00×

The function remains the top self-time entry on clip-heavy documents even after this fix (26–37%) — see follow-on opportunities below.

Output divergence, quantified

cmp -s across all 511 corpus pages: 503 byte-identical. The 8 divergent pages (3 documents) were decoded to uncompressed bitmaps and quantified with cmp -l + ImageMagick compare -metric AE:

| Page | Differing pixels (of ~2.1 M) | Max |Δ| |
|---|---|---|
| mathematics-11-00327 p1 | 1 | 1 |
| preprints202110.0334 p5 / p19 | 12 / 1 | 1 |
| 5-page LaTeX document, p1–p5 | 2 each | 1 |

Every differing byte across all 8 pages differs by exactly ±1 (the cmp -l delta histogram is 100% at |Δ|=1; worst page: 36 bytes of 8,706,994). Mechanism: the previous scalar loop combined masks with truncating (a*b)/255 while intersect_path uses tiny_skia's rounded premultiply_u8; the two differ in the bottom bit when the product's fractional part is ≥ 0.5, which materializes only on partially-covered anti-aliased clip-edge pixels. The rounded form is the more accurate of the two. An earlier revision of this PR claimed fully byte-identical output from a 3-fixture check; the corpus measurement above replaces that claim with the precise characterization.

Frequency measurement

Added a #[cfg(test)] static APC_MATERIALIZED: AtomicU64 (test-only, no release-binary impact) and counted on the affected workload:

  • tests/fixtures/1.pdf (7 pages, 150 DPI): 2,372 calls / 1,820 materializations (~260 W operators per page — the calls are not redundant; the per-call cost was the problem)
  • tests/fixtures/1008.3918v2.pdf (arxiv, 29 pages): 71,053 calls / 4 materializations — the function was negligible here, and stays negligible
  • tests/fixtures/issue_regressions/alpha_channel/538250-1.pdf: 1 call / 0 materializations

Fixture wall-clock / hardware counters

Release build (CARGO_PROFILE_RELEASE_DEBUG=true, --features "rendering icc"), same 10-run median methodology:

Workload Pre wall / user Post wall / user Speedup (wall / user)
1.pdf (7p) 0.91s / 0.88s 0.54s / 0.51s 1.69× / 1.73×
arxiv (29p) 1.02s / 0.99s 1.02s / 0.99s flat
alpha (1p) 0.08s / 0.06s 0.08s / 0.06s flat

Instruction count on 1.pdf: 14.6 G → 6.9 G (−53%). Cycles: 3.6 G → 1.8 G (−50%). These hardware-counter deltas are independent of wall-clock noise.

Note: an earlier revision of this PR cited a "5.7× wall" headline on 1.pdf from a single-run measurement. That number turned out to be a cold-start / scheduler-contention outlier (the original measurement was taken while concurrent cargo build and samply processes were running). A 10-run warmed re-measurement consistently lands at ~1.7× and could not reproduce the 2.83s baseline under any cold-start regime (page-cache eviction, fresh inodes, fresh binary). The user-time and hardware-counter claims were unaffected. See PR comments below for full re-measurement methodology.

Perf-regression probe

apply_pending_clip_materializes_only_per_clip_state_change pins two contracts so a future regression surfaces loudly:

  • 1 W operator + K=100 paint-op invocations → exactly 1 materialization
  • N=5 W operators + K=100 paint ops each → exactly N materializations (not N×K)

Backed by the test-only atomic counter + a Mutex for thread-safe shared access.

Full test suite

  • cargo fmt --all -- --check — clean
  • cargo clippy --all-targets --features "rendering icc" -- -D warnings — clean
  • cargo check --lib --no-default-features — clean
  • cargo test (default) — 136 passed, 0 failed
  • cargo test --features "rendering icc" — all green
  • cargo doc --no-deps --features "rendering icc" — clean

Python Bindings

No binding changes — this is a pure-Rust rendering hot-path fix. Python bindings inherit the speedup automatically through render_page.

Documentation

No user-facing docs change. Inline docstring on apply_pending_clip updated to reflect the new in-place semantics.

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my own code
  • I have added a regression-pinning test
  • All existing tests pass (output divergence quantified: ±1 LSB on 8 of 511 corpus pages)
  • No new clippy warnings
  • No new unsafe code

Additional Notes

Follow-on opportunities surfaced during this investigation

These are out of scope here but worth noting for future perf work — and the corpus profiling shows apply_pending_clip is still the top self-time function on clip-heavy documents after this PR:

  1. q (SaveState) still clones the entire page-sized mask at :466-467 even when the scope never narrows the clip. Rc<Mask>/Arc<Mask> with copy-on-write would make q cheap and defer cloning until actual narrowing fires. Probably halves the remaining clip-related memmove cost on Form-XObject-heavy PDFs.
  2. Tight-rect clip fast path — many real clips are axis-aligned rects (column / image-box clips). path.is_rect() → store as tiny_skia::IntRect (rect-intersection is O(1)) instead of full Mask. Bigger refactor (touches path_rasterizer::fill_path_clipped / stroke_path_clipped).
  3. Clip-cache by (path-hash, transform-hash, parent-hash) — only helps if same clip replays across pages (running headers / footers), worth profiling before committing.

Even with this PR, 1.pdf still does 1,820 page-sized mask allocations across 7 pages. The follow-ons should bring that down further.

@RayVR RayVR requested a review from yfedoseev as a code owner June 6, 2026 14:50
The clip-mask materialization path cloned the current scope's mask before
intersecting the new clip path, then ran a hand-rolled scalar (a*b)/255 loop
over every byte. The clone was redundant — every `q` (SaveState) already
pushes a cloned mask onto `clip_stack`, so the top-of-stack mask at the
current depth is already this scope's private copy and may be mutated in
place.

Switching to tiny_skia::Mask::intersect_path skips that extra page-sized
memcpy and uses the library's rounded premultiply, which is auto-vectorized
by the compiler. On a Form-XObject-heavy PDF that issues ~260 clip changes
per page (tests/fixtures/1.pdf, 7 pages at 150 DPI):

  before: 2.83s wall / 0.98s user / 14.6 G instructions / 3.6 G cycles
  after:  0.50s wall / 0.49s user /  6.9 G instructions / 1.8 G cycles

Rendered PNGs are byte-identical to the previous code across:
  - tests/fixtures/1.pdf (7 pages, the workload that motivated this)
  - tests/fixtures/1008.3918v2.pdf (29 pages, was not previously hot)
  - tests/fixtures/issue_regressions/alpha_channel/538250-1.pdf

Adds a perf-regression probe that drives apply_pending_clip directly with
K paint-op-style invocations under N clip-state changes and asserts the
materialization count equals N — not K — pinning the per-paint-op fast
path against future regressions.
@yfedoseev yfedoseev force-pushed the perf/apply-pending-clip branch from f1203f8 to c034e1b Compare June 6, 2026 16:33

@yfedoseev yfedoseev left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — approving.

Verified the correctness-critical parts against the actual tiny-skia 0.12 source (not from memory):

  • Intersection is equivalent. Mask::intersect_path (mask.rs:352) fills the path into a temp submask and folds it in with premultiply_u8(self[i], submask[i]). premultiply_u8 (color.rs:432) is the rounding (c*a+128 + ((c*a+128)>>8)) >> 8, vs the old hand-rolled truncating (a*b)/255. So results match except a ≤1/255 rounding delta on anti-aliased clip edges — invisible, arguably more correct.
  • In-place mutation is safe. The q operator (page_renderer.rs:413-414) pushes clip_stack.last().cloned().flatten() — a deep copy of the parent mask. So the top-of-stack mask is always the current scope's private copy; intersect_path mutating it cannot corrupt a parent clip scope, and Q just pops the private copy. The PR comment describes this accurately.
  • Perf rationale confirmed. The win is dropping the per-clip full-page current_mask.clone() the old Some branch did; the new path does one submask alloc + one fold loop.
  • No unsafe, no new external-input surface. The #[cfg(test)] regression probe passes locally under --features rendering.

One minor note (non-blocking): on a non-finite CTM the old code skipped the clip entirely while the new code rasterizes nothing → empty mask → clips everything out. Only reachable with inf/NaN coordinates, so not a real concern — just flagging the behavior delta.

Rebased onto main (resolved the conflict with the #652 CMYK change in the same file — both branches had appended tests to mod tests; kept both test sets) and re-ran CI on the rebased head.

@RayVR RayVR changed the title perf(render): apply_pending_clip 5.7× speedup via in-place mask intersect perf(render): apply_pending_clip 1.7× speedup via in-place mask intersect Jun 6, 2026
@RayVR

RayVR commented Jun 6, 2026

Copy link
Copy Markdown
Contributor Author

Updating this PR's headline with corrected measurements — the original "5.7× wall" figure was an outlier from a single-run measurement taken under noisy conditions. The fix itself, the byte-identity claim, and the correctness arguments are unchanged. Only the headline magnitude is being softened to reflect a robust re-measurement.

What changed

Title: "5.7× speedup" → "1.7× speedup"
Description headline: "5.7× wall-clock speedup" → "~1.7× wall-clock / ~1.7× user-time speedup (warm steady-state, median of 10 runs)" plus the unchanged "-53% instructions, -50% cycles"
Wall-clock table: updated to verified medians (1.pdf 1.69× wall / 1.73× user; arxiv flat; alpha flat)

Why

The original 2.83s baseline on 1.pdf was a single-shot measurement that I couldn't reproduce under any condition during a follow-up audit:

Cold-start tactic Baseline wall Baseline user
Pure first-shot (no warmup) 0.84s 0.82s
4 GB page-cache eviction + first run 0.86s 0.83s
Fresh inode (cp PDF to new file) + first run 0.86s 0.84s
Eviction + fresh inode + fresh binary + first run 1.11s 0.85s
Steady-state (10 runs, median) 0.91s 0.88s

Baseline wall never exceeded 1.11s across any regime. The 2.83s was almost certainly contamination from concurrent CPU work — most likely a parallel cargo build and samply profile running in the same session when I took the original timing.

What stays unchanged

  • Hardware counters: -53% instructions, -50% cycles on 1.pdf. Independent of wall noise.
  • User-time deltas: the PR's original 0.98s → 0.49s (2.0×) is consistent with the re-measured 0.88s → 0.51s (1.73×) within machine-state variation. The wall figure was the only anomaly.
  • Byte-identity: 37/37 rendered PNGs across 10 fixture PDFs are still bit-identical (cmp -s verified across both arms, expanded from 3 PDFs to 10).
  • Function-frequency measurement: 2,372 calls / 1,820 materializations on 1.pdf — unchanged.
  • The qualitative argument: apply_pending_clip was a self-time hotspot consuming ~3× W×H byte operations per call; this rewrite reduces it to ~1× W×H using tiny_skia::Mask::intersect_path's in-place auto-vectorized implementation. That mechanism is the same; only the speedup magnitude was overstated.

Methodology (re-measurement)

  • Identical build flags on both arms: CARGO_PROFILE_RELEASE_DEBUG=true CARGO_PROFILE_RELEASE_STRIP=false cargo build --release --features "rendering icc" --example render_pages
  • One discarded warmup run per (binary, PDF) before timing
  • 10 timed runs via /usr/bin/time -p, captured wall and user separately
  • Baseline and postfix runs interleaved per PDF (identical FS-cache state for both arms)
  • Median reported as point estimate (robust to outliers); stdev reported as % of median
  • Corpus expanded from the original 3 PDFs to 10 (1.pdf, arxiv, alpha, signature regression, hybrid text-image, multi-column table, hello structure, outline, simple, OCR fixture)

Apologies for the noisy original measurement. The fix itself stands on its merits — even at 1.7×, eliminating a 58% self-time function call with byte-identical output is a clear win for the Form-XObject-heavy workload class, and the hardware-counter deltas confirm the mechanism is real.

@RayVR

RayVR commented Jun 7, 2026

Copy link
Copy Markdown
Contributor Author

Converting this back to draft pending further verification. Two issues need resolution before this is ready for re-review:

1. Byte-identity claim is incorrect on real-world PDFs

A broader-corpus measurement (27 PDFs spanning academic papers, technical manuals, journal articles, and commercial artwork) surfaced byte divergences on at least 2 PDFs (preprint_202110_0334, mathematics-11-00327) that the original 3-fixture verification missed.

The mechanism is the rounding-mode difference between the previous scalar (a*b)/255 truncating-division loop and tiny_skia::Mask::intersect_path's rounded premultiply_u8. The commit message acknowledged "modulo a tighter rounding mode, which produces the same u8 output" — that claim was true on the 3 test PDFs but doesn't hold across the broader corpus.

The divergences are almost certainly ±1 LSB on isolated AA-edge pixels (rounded vs truncating premultiply diverge in the bottom bit when the product's fractional part is exactly 0.5). But I haven't quantified yet with cmp -l. Until I do, the "byte-identical PNG output" claim in the PR body can't stand.

2. Real-world speedup magnitude is much smaller than the headline

The wider-corpus measurement (10 timed runs per PDF after one warmup, median reported, std-dev <2% on most PDFs) gives a geomean speedup of ~1.02× user-time across 19 real-world PDFs — within measurement noise. The fixture_1 (Form-XObject-heavy) win of 1.73× user that the corrected headline cites is genuinely there but is an outlier — it doesn't characterize the typical workload.

What needs to happen before re-ready

  • Run cmp -l on the divergent pages to quantify the byte deltas. Sub-pixel rounding (≤±1 LSB on isolated AA pixels) is defensible with corrected wording; visible-magnitude divergence would warrant closing this PR.
  • Update the PR body to honestly characterize the divergence (whatever the magnitude turns out to be) AND to lead with the real-world corpus geomean rather than the fixture_1 number.
  • Decide whether ~1.02× geomean is worth merging given the rounding-mode change. The fix is correct in spirit (the previous scalar loop was redundant work) but the corpus-average benefit may not justify a code change that subtly alters rasterization rounding.

Marking draft to give those verifications time without churning the review queue.

@RayVR RayVR marked this pull request as draft June 7, 2026 01:58
@yfedoseev

Copy link
Copy Markdown
Owner

Thanks @RayVR — this is approved and CI is green, but it's still marked draft so it can't be merged. Could you hit "Ready for review" to take it out of draft? Once it's a regular PR it'll be mergeable. Really nice write-up, by the way — the hardware-counter evidence and the honest retraction of the earlier 5.7× single-run outlier are exactly the kind of rigor we want. 🙏

@yfedoseev

Copy link
Copy Markdown
Owner

@RayVR Thanks again for this — the optimization is approved on our side and CI is green, so it's ready to land. The only thing blocking the merge is that the PR is still marked as a draft, which prevents merging. Whenever you have a moment, could you flip it from draft to Ready for review? Once you do, we'll get it merged. Much appreciated!

@yfedoseev

Copy link
Copy Markdown
Owner

@RayVR friendly nudge on this one 🙂

Totally understand you draft-converted it to verify the byte divergences (the ±1 LSB rounding question on preprint_202110_0334 / mathematics-11-00327) and to re-frame around the real-world geomean — that was exactly the right call, no churn on our side.

Just checking where you've landed:

  • If cmp -l confirms the divergence is sub-pixel (≤±1 LSB on isolated AA edges) and you update the wording to lead with the ~1.02× geomean (keeping the Form-XObject 1.73× as the standout case), we're glad to take it as a correct-and-honest cleanup — the redundant-work removal stands on its own.
  • If you'd rather not touch rasterization rounding for a marginal average gain, equally happy to close it — your call.

Whenever you get a moment, could you let us know which way you're leaning — or flip it to Ready for review if the verification's done? Thanks again for the rigor on this. 🙏

@RayVR RayVR changed the title perf(render): apply_pending_clip 1.7× speedup via in-place mask intersect perf(render): apply_pending_clip in-place mask intersect — 1.11× corpus geomean, up to 1.8× on clip-heavy PDFs Jun 12, 2026
@RayVR RayVR marked this pull request as ready for review June 12, 2026 01:49
@RayVR

RayVR commented Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

Un-drafting — both open items from the draft-conversion note above are resolved, and the PR body is rewritten with the corrected numbers.

1. Byte deltas quantified. Across 511 rendered pages from an 80-document real-world corpus (pages 1–20 each, 150 DPI), 503 are byte-identical. The 8 divergent pages (3 documents) were decoded to uncompressed bitmaps and run through cmp -l: every differing byte differs by exactly ±1 — the delta histogram is 100% at |Δ|=1 — touching 1–12 pixels per page out of ~2.1 M (worst page: 36 bytes of 8.7 MB). That's the truncating-vs-rounded premultiply_u8 LSB difference on partially-covered AA clip-edge pixels, with the rounded form being the more accurate one. The visible-magnitude scenario that would have warranted closing this PR is ruled out.

2. Corpus-wide perf re-measured — the earlier "~1.02× geomean, within noise" reading was wrong. On 71 documents (medians of 10 warmed runs per document per binary): geomean 1.109× user / 1.103× wall, zero regressions. The distribution is bimodal: 54 text-dominated documents flat, 15 clip-heavy documents (scanned forms, receipts, register extracts) at 1.3–1.83×. samply method-level profiles explain it — those real-world documents show the same ~50% apply_pending_clip baseline self-time as 1.pdf, so the fixture represents a genuine document class rather than an artificial best case. Profile table and methodology are in the updated PR body.

The title's old "1.7× speedup" headline is replaced by the corpus framing (1.11× geomean, up to 1.8× on the affected class). One more data point from the profiling: even after this change, apply_pending_clip remains the top self-time function on clip-heavy documents (26–37%), which is what the follow-on items in Additional Notes are for.

@yfedoseev yfedoseev left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving. Reviewed against §8.5.4 and re-ran an independent render regression.

Safety & spec

The in-place mutation is sound: q pushes a private clone of the parent mask (page_renderer.rs:741) and Q pops it, so the top-of-stack mask is always this scope's own copy — mutating it can't corrupt a parent. §8.5.4 intersect semantics are preserved, and the empty-stack early-return is safer than the prior .unwrap().

Independent render regression

68 diverse PDFs, pages 1–5 @150 DPI, base e12609e5 vs this branch — 131/132 byte-identical. Two small notes:

  1. The "every differing byte differs by exactly ±1" characterization is very slightly optimistic. tiger-as-form-xobject.pdf (nested clips) shows maxΔ=2 — histogram {Δ1: 159 bytes, Δ2: 6 bytes}, RGB only, 55 px of 2.1M (0.003%). Nested clips stack two rounded-vs-truncated premultiply deltas on the same AA edge. Utterly sub-perceptual, and the rounded form is the more accurate of the two — just suggest the claim read "≤±2 LSB on nested-clip AA edges."

  2. One unmentioned behavior change: on a non-finite/degenerate clip-path transform, tiny_skia::fill_path returns early leaving a zero submask, so intersect_path zeroes the mask (everything clips out) — whereas the old if let Some(path.transform(…)) guard skipped the clip (content stayed visible). It's arguably more spec-correct and didn't surface anywhere on the corpus, but worth a sentence in the PR body.

Neither blocks merge.

CI heads-up

Same as the sibling perf PR: the cargo-deny / Security Audit / OSV reds are the v0.3.64 pyo3 advisory-ignores missing from this branch's base — merging current main clears them.

Thanks for the careful profiling and the honest re-measurement write-up (the ±1/5.7× corrections inline are exactly the right way to do it). 🙏

@yfedoseev yfedoseev merged commit a891472 into yfedoseev:main Jun 12, 2026
283 of 285 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants