perf(render): apply_pending_clip in-place mask intersect — 1.11× corpus geomean, up to 1.8× on clip-heavy PDFs#654
Conversation
The clip-mask materialization path cloned the current scope's mask before intersecting the new clip path, then ran a hand-rolled scalar (a*b)/255 loop over every byte. The clone was redundant — every `q` (SaveState) already pushes a cloned mask onto `clip_stack`, so the top-of-stack mask at the current depth is already this scope's private copy and may be mutated in place. Switching to tiny_skia::Mask::intersect_path skips that extra page-sized memcpy and uses the library's rounded premultiply, which is auto-vectorized by the compiler. On a Form-XObject-heavy PDF that issues ~260 clip changes per page (tests/fixtures/1.pdf, 7 pages at 150 DPI): before: 2.83s wall / 0.98s user / 14.6 G instructions / 3.6 G cycles after: 0.50s wall / 0.49s user / 6.9 G instructions / 1.8 G cycles Rendered PNGs are byte-identical to the previous code across: - tests/fixtures/1.pdf (7 pages, the workload that motivated this) - tests/fixtures/1008.3918v2.pdf (29 pages, was not previously hot) - tests/fixtures/issue_regressions/alpha_channel/538250-1.pdf Adds a perf-regression probe that drives apply_pending_clip directly with K paint-op-style invocations under N clip-state changes and asserts the materialization count equals N — not K — pinning the per-paint-op fast path against future regressions.
f1203f8 to
c034e1b
Compare
yfedoseev
left a comment
There was a problem hiding this comment.
LGTM — approving.
Verified the correctness-critical parts against the actual tiny-skia 0.12 source (not from memory):
- Intersection is equivalent.
Mask::intersect_path(mask.rs:352) fills the path into a temp submask and folds it in withpremultiply_u8(self[i], submask[i]).premultiply_u8(color.rs:432) is the rounding(c*a+128 + ((c*a+128)>>8)) >> 8, vs the old hand-rolled truncating(a*b)/255. So results match except a ≤1/255 rounding delta on anti-aliased clip edges — invisible, arguably more correct. - In-place mutation is safe. The
qoperator (page_renderer.rs:413-414) pushesclip_stack.last().cloned().flatten()— a deep copy of the parent mask. So the top-of-stack mask is always the current scope's private copy;intersect_pathmutating it cannot corrupt a parent clip scope, andQjust pops the private copy. The PR comment describes this accurately. - Perf rationale confirmed. The win is dropping the per-clip full-page
current_mask.clone()the oldSomebranch did; the new path does one submask alloc + one fold loop. - No
unsafe, no new external-input surface. The#[cfg(test)]regression probe passes locally under--features rendering.
One minor note (non-blocking): on a non-finite CTM the old code skipped the clip entirely while the new code rasterizes nothing → empty mask → clips everything out. Only reachable with inf/NaN coordinates, so not a real concern — just flagging the behavior delta.
Rebased onto main (resolved the conflict with the #652 CMYK change in the same file — both branches had appended tests to mod tests; kept both test sets) and re-ran CI on the rebased head.
|
Updating this PR's headline with corrected measurements — the original "5.7× wall" figure was an outlier from a single-run measurement taken under noisy conditions. The fix itself, the byte-identity claim, and the correctness arguments are unchanged. Only the headline magnitude is being softened to reflect a robust re-measurement. What changedTitle: "5.7× speedup" → "1.7× speedup" WhyThe original 2.83s baseline on
Baseline wall never exceeded 1.11s across any regime. The 2.83s was almost certainly contamination from concurrent CPU work — most likely a parallel What stays unchanged
Methodology (re-measurement)
Apologies for the noisy original measurement. The fix itself stands on its merits — even at 1.7×, eliminating a 58% self-time function call with byte-identical output is a clear win for the Form-XObject-heavy workload class, and the hardware-counter deltas confirm the mechanism is real. |
|
Converting this back to draft pending further verification. Two issues need resolution before this is ready for re-review: 1. Byte-identity claim is incorrect on real-world PDFsA broader-corpus measurement (27 PDFs spanning academic papers, technical manuals, journal articles, and commercial artwork) surfaced byte divergences on at least 2 PDFs ( The mechanism is the rounding-mode difference between the previous scalar The divergences are almost certainly ±1 LSB on isolated AA-edge pixels (rounded vs truncating premultiply diverge in the bottom bit when the product's fractional part is exactly 0.5). But I haven't quantified yet with 2. Real-world speedup magnitude is much smaller than the headlineThe wider-corpus measurement (10 timed runs per PDF after one warmup, median reported, std-dev <2% on most PDFs) gives a geomean speedup of ~1.02× user-time across 19 real-world PDFs — within measurement noise. The fixture_1 (Form-XObject-heavy) win of 1.73× user that the corrected headline cites is genuinely there but is an outlier — it doesn't characterize the typical workload. What needs to happen before re-ready
Marking draft to give those verifications time without churning the review queue. |
|
Thanks @RayVR — this is approved and CI is green, but it's still marked draft so it can't be merged. Could you hit "Ready for review" to take it out of draft? Once it's a regular PR it'll be mergeable. Really nice write-up, by the way — the hardware-counter evidence and the honest retraction of the earlier 5.7× single-run outlier are exactly the kind of rigor we want. 🙏 |
|
@RayVR Thanks again for this — the optimization is approved on our side and CI is green, so it's ready to land. The only thing blocking the merge is that the PR is still marked as a draft, which prevents merging. Whenever you have a moment, could you flip it from draft to Ready for review? Once you do, we'll get it merged. Much appreciated! |
|
@RayVR friendly nudge on this one 🙂 Totally understand you draft-converted it to verify the byte divergences (the ±1 LSB rounding question on Just checking where you've landed:
Whenever you get a moment, could you let us know which way you're leaning — or flip it to Ready for review if the verification's done? Thanks again for the rigor on this. 🙏 |
|
Un-drafting — both open items from the draft-conversion note above are resolved, and the PR body is rewritten with the corrected numbers. 1. Byte deltas quantified. Across 511 rendered pages from an 80-document real-world corpus (pages 1–20 each, 150 DPI), 503 are byte-identical. The 8 divergent pages (3 documents) were decoded to uncompressed bitmaps and run through 2. Corpus-wide perf re-measured — the earlier "~1.02× geomean, within noise" reading was wrong. On 71 documents (medians of 10 warmed runs per document per binary): geomean 1.109× user / 1.103× wall, zero regressions. The distribution is bimodal: 54 text-dominated documents flat, 15 clip-heavy documents (scanned forms, receipts, register extracts) at 1.3–1.83×. samply method-level profiles explain it — those real-world documents show the same ~50% The title's old "1.7× speedup" headline is replaced by the corpus framing (1.11× geomean, up to 1.8× on the affected class). One more data point from the profiling: even after this change, |
yfedoseev
left a comment
There was a problem hiding this comment.
Approving. Reviewed against §8.5.4 and re-ran an independent render regression.
Safety & spec
The in-place mutation is sound: q pushes a private clone of the parent mask (page_renderer.rs:741) and Q pops it, so the top-of-stack mask is always this scope's own copy — mutating it can't corrupt a parent. §8.5.4 intersect semantics are preserved, and the empty-stack early-return is safer than the prior .unwrap().
Independent render regression
68 diverse PDFs, pages 1–5 @150 DPI, base e12609e5 vs this branch — 131/132 byte-identical. Two small notes:
-
The "every differing byte differs by exactly ±1" characterization is very slightly optimistic.
tiger-as-form-xobject.pdf(nested clips) shows maxΔ=2 — histogram{Δ1: 159 bytes, Δ2: 6 bytes}, RGB only, 55 px of 2.1M (0.003%). Nested clips stack two rounded-vs-truncated premultiply deltas on the same AA edge. Utterly sub-perceptual, and the rounded form is the more accurate of the two — just suggest the claim read "≤±2 LSB on nested-clip AA edges." -
One unmentioned behavior change: on a non-finite/degenerate clip-path transform,
tiny_skia::fill_pathreturns early leaving a zero submask, sointersect_pathzeroes the mask (everything clips out) — whereas the oldif let Some(path.transform(…))guard skipped the clip (content stayed visible). It's arguably more spec-correct and didn't surface anywhere on the corpus, but worth a sentence in the PR body.
Neither blocks merge.
CI heads-up
Same as the sibling perf PR: the cargo-deny / Security Audit / OSV reds are the v0.3.64 pyo3 advisory-ignores missing from this branch's base — merging current main clears them.
Thanks for the careful profiling and the honest re-measurement write-up (the ±1/5.7× corrections inline are exactly the right way to do it). 🙏
Description
apply_pending_clipwas consuming ~54% self time / ~64% inclusive time ontests/fixtures/1.pdf(samply, 7-page Form-XObject-heavy PDF, release build with debug symbols). A single function dominating two-thirds of a real-PDF render is a clear sign of redundant work, so I dug in.This PR removes the two avoidable per-call page-sized byte operations the function was doing. Measured across a 71-document real-world corpus (academic papers, technical manuals, invoices, receipts, scanned forms, commercial artwork; medians of 10 warmed runs per document per binary):
apply_pending_cliphas the same ~50% baseline self-time as the fixture — the fixture is representative of a real document class, not an outlier.Output is byte-identical on 503 of 511 corpus pages; the remaining 8 pages differ by ±1 LSB on 1–12 isolated anti-aliased edge pixels per page (quantified below — sub-perceptual by any standard).
Type of Change
Related Issues
None filed — surfaced by a profiling investigation. Documenting it inline rather than after-the-fact.
Changes Made
src/rendering/page_renderer.rs::apply_pending_cliprewrite:Before (per call when a clip narrowing was pending):
path.transform(transform)— clone+transform the pathtiny_skia::Mask(W×H zero-fill)current_mask.clone()— another full W×H memcpy (probably added to dodge a borrow-checker conflict betweenclip_stack.last()andclip_stack.last_mut())(combined[i] as u32 * new[i] as u32) / 255byte loop to combine the two masksAfter:
clip_stack.last_mut()matchtiny_skia::Mask::intersect_path(path, ...)in place on the existing mask — tiny_skia uses roundedpremultiply_u8(no divide) and the upstreamiter_mut().zip()form is auto-vectorizableintersect_path— the explicitpath.transformcall is droppedNet: ~3 × W×H byte operations per call → ~1 × W×H. On
1.pdf, that's removing ~11 GB of byte work over 7 pages.Testing
Real-world corpus measurement
A/B pair: this branch vs its mainline merge parent (the only diff between the two binaries is this PR's
page_renderer.rschange). Corpus: 80 real-world PDFs (all parse and render); byte comparison over pages 1–20 of every document at 150 DPI (511 pages per binary); timing over the 71 documents ≤ 30 pages with measurable medians (--all, 150 DPI, 1 discarded warmup + 10 timed runs per document per binary, medians via/usr/bin/time -p, idle M2 Max).Timing: geomean 1.109× user / 1.103× wall. Buckets (user): ≥1.3×: 15 docs · 1.1–1.3×: 2 · 0.95–1.1×: 54 · <0.95×: 0.
Method-level profiles (samply @ 8 kHz, whole-document render,
apply_pending_clipshare):tests/fixtures/1.pdf(control)The function remains the top self-time entry on clip-heavy documents even after this fix (26–37%) — see follow-on opportunities below.
Output divergence, quantified
cmp -sacross all 511 corpus pages: 503 byte-identical. The 8 divergent pages (3 documents) were decoded to uncompressed bitmaps and quantified withcmp -l+ ImageMagickcompare -metric AE:| Page | Differing pixels (of ~2.1 M) | Max |Δ| |
|---|---|---|
| mathematics-11-00327 p1 | 1 | 1 |
| preprints202110.0334 p5 / p19 | 12 / 1 | 1 |
| 5-page LaTeX document, p1–p5 | 2 each | 1 |
Every differing byte across all 8 pages differs by exactly ±1 (the
cmp -ldelta histogram is 100% at |Δ|=1; worst page: 36 bytes of 8,706,994). Mechanism: the previous scalar loop combined masks with truncating(a*b)/255whileintersect_pathuses tiny_skia's roundedpremultiply_u8; the two differ in the bottom bit when the product's fractional part is ≥ 0.5, which materializes only on partially-covered anti-aliased clip-edge pixels. The rounded form is the more accurate of the two. An earlier revision of this PR claimed fully byte-identical output from a 3-fixture check; the corpus measurement above replaces that claim with the precise characterization.Frequency measurement
Added a
#[cfg(test)] static APC_MATERIALIZED: AtomicU64(test-only, no release-binary impact) and counted on the affected workload:tests/fixtures/1.pdf(7 pages, 150 DPI): 2,372 calls / 1,820 materializations (~260 W operators per page — the calls are not redundant; the per-call cost was the problem)tests/fixtures/1008.3918v2.pdf(arxiv, 29 pages): 71,053 calls / 4 materializations — the function was negligible here, and stays negligibletests/fixtures/issue_regressions/alpha_channel/538250-1.pdf: 1 call / 0 materializationsFixture wall-clock / hardware counters
Release build (
CARGO_PROFILE_RELEASE_DEBUG=true,--features "rendering icc"), same 10-run median methodology:1.pdf(7p)Instruction count on
1.pdf: 14.6 G → 6.9 G (−53%). Cycles: 3.6 G → 1.8 G (−50%). These hardware-counter deltas are independent of wall-clock noise.Note: an earlier revision of this PR cited a "5.7× wall" headline on
1.pdffrom a single-run measurement. That number turned out to be a cold-start / scheduler-contention outlier (the original measurement was taken while concurrentcargo buildandsamplyprocesses were running). A 10-run warmed re-measurement consistently lands at ~1.7× and could not reproduce the 2.83s baseline under any cold-start regime (page-cache eviction, fresh inodes, fresh binary). The user-time and hardware-counter claims were unaffected. See PR comments below for full re-measurement methodology.Perf-regression probe
apply_pending_clip_materializes_only_per_clip_state_changepins two contracts so a future regression surfaces loudly:Backed by the test-only atomic counter + a
Mutexfor thread-safe shared access.Full test suite
cargo fmt --all -- --check— cleancargo clippy --all-targets --features "rendering icc" -- -D warnings— cleancargo check --lib --no-default-features— cleancargo test(default) — 136 passed, 0 failedcargo test --features "rendering icc"— all greencargo doc --no-deps --features "rendering icc"— cleanPython Bindings
No binding changes — this is a pure-Rust rendering hot-path fix. Python bindings inherit the speedup automatically through
render_page.Documentation
No user-facing docs change. Inline docstring on
apply_pending_clipupdated to reflect the new in-place semantics.Checklist
Additional Notes
Follow-on opportunities surfaced during this investigation
These are out of scope here but worth noting for future perf work — and the corpus profiling shows
apply_pending_clipis still the top self-time function on clip-heavy documents after this PR:q(SaveState) still clones the entire page-sized mask at:466-467even when the scope never narrows the clip.Rc<Mask>/Arc<Mask>with copy-on-write would makeqcheap and defer cloning until actual narrowing fires. Probably halves the remaining clip-related memmove cost on Form-XObject-heavy PDFs.path.is_rect()→ store astiny_skia::IntRect(rect-intersection is O(1)) instead of full Mask. Bigger refactor (touchespath_rasterizer::fill_path_clipped/stroke_path_clipped).Even with this PR,
1.pdfstill does 1,820 page-sized mask allocations across 7 pages. The follow-ons should bring that down further.