Release v0.3.61 — separation image rendering + ActualText, Node/form/macOS-OCR fixes, faster table extraction, article-thread foundation, cross-OS + cross-language CI by yfedoseev · Pull Request #653 · yfedoseev/pdf_oxide

yfedoseev · 2026-06-06T05:50:39Z

Release v0.3.61 (target 2026-06-07)

Bindings + manifests bumped 0.3.60 → 0.3.61. Validated by the 419-PDF weighted corpus sweep: zero regressions, google_doc table guard byte-identical (1242 == 1242). Builds clean and unit tests pass.

Added

Vertical writing mode (WMode 1 / tategaki) (#645, @RayVR) — vertical CJK / tategaki across extraction, rendering, and reading order; per-CID /W2·/DW2 metrics, a single axis-swap advance helper, and a vertical-majority reading-order override (§9.4.4).
Path flattening PathContent::to_points(tolerance) (#147, @mbeschastn0v) — flattens extracted vector paths into polylines (adaptive Bézier subdivision, §8.5.2) for chart/ECG/CAD digitisation. Thanks @joelparkerhenderson for the use case.
Separation-plate image rendering (#631, @RayVR) — raster Image XObjects routed to the matching ink plates (§8.9); previously only Form XObjects were handled.
/ActualText extraction for structure-tree spans (#646, @RayVR) — §14.9.4 replacement text on StructElem (drop caps, ligature spans) emitted once, correctly positioned, across extract_text/to_markdown/to_html.
Article-thread (/Threads) parsing foundation (#458, §12.4.6) — parser + reading-order strategy shipped as tested public building blocks; default reading order unchanged (auto-wiring deferred).
Cross-language test-parity suite — one shared functional spec across all nine bindings (Rust/Python/Node/Go/Java/Ruby/PHP/C#/WASM).
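
The adaptive Bézier subdivision named in the path-flattening entry can be sketched as below. This is a minimal illustration, not the pdf_oxide implementation: `Point`, `flatten_cubic`, the recursion cap of 16, and the flatness test (distance of both control points to the chord) are all assumptions chosen for the example.

```rust
/// A 2-D point in user-space units (hypothetical type for this sketch).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Point {
    pub x: f64,
    pub y: f64,
}

fn midpoint(a: Point, b: Point) -> Point {
    Point { x: (a.x + b.x) / 2.0, y: (a.y + b.y) / 2.0 }
}

/// Distance from `p` to the line through `a` and `b`
/// (plain point distance when `a` and `b` coincide).
fn distance_to_chord(p: Point, a: Point, b: Point) -> f64 {
    let (dx, dy) = (b.x - a.x, b.y - a.y);
    let len = (dx * dx + dy * dy).sqrt();
    if len == 0.0 {
        return ((p.x - a.x).powi(2) + (p.y - a.y).powi(2)).sqrt();
    }
    ((p.x - a.x) * dy - (p.y - a.y) * dx).abs() / len
}

fn flatten_rec(p0: Point, p1: Point, p2: Point, p3: Point, tol: f64, depth: u32, out: &mut Vec<Point>) {
    // Flat enough when both control points lie within `tol` of the chord.
    let deviation = distance_to_chord(p1, p0, p3).max(distance_to_chord(p2, p0, p3));
    if deviation <= tol || depth == 0 {
        out.push(p3);
        return;
    }
    // De Casteljau split at t = 0.5 into two half-curves.
    let (p01, p12, p23) = (midpoint(p0, p1), midpoint(p1, p2), midpoint(p2, p3));
    let (p012, p123) = (midpoint(p01, p12), midpoint(p12, p23));
    let mid = midpoint(p012, p123);
    flatten_rec(p0, p01, p012, mid, tol, depth - 1, out);
    flatten_rec(mid, p123, p23, p3, tol, depth - 1, out);
}

/// Flattens one cubic Bézier segment into a polyline that starts at `p0`
/// and ends at `p3`; smaller tolerances yield more points.
pub fn flatten_cubic(p0: Point, p1: Point, p2: Point, p3: Point, tolerance: f64) -> Vec<Point> {
    let mut out = vec![p0];
    flatten_rec(p0, p1, p2, p3, tolerance, 16, &mut out);
    out
}
```

Subdividing only where the curve bends keeps straight runs at two points while tight curves receive more samples, which is the property that matters when digitising chart or ECG traces.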

Changed

Renderer resolution pipeline refactor (#649, @RayVR) — unified paint resolution across page + separation renderers; fixes Type 4 calculator tint transforms for /Separation·/DeviceN spot colours.
Faster table-heavy extraction — output-preserving spatial prune of the O(n²) cell-edge scan; ~30% faster on dense regulatory volumes, byte-identical output.
Cross-OS + FIPS example verification in CI — core scenarios for every binding assert their output on Linux/macOS/Windows (incl. cross-OS Java JNI) — the guard that would have caught #648.
CI reliability hardening (#544, partial) — SHA-pinned the last floating action and added network retries; rust-cache added to the one Rust workflow lacking caching. The reported transient flakes (rustup DNS / apt / network) are addressed; the optional Proposal-4 (pin macos-latest to an explicit image, only if a macOS outage recurs) remains open, so #544 stays referenced rather than auto-closed.
Dependency & CI-action bumps — imageproc 0.26.2 → 0.27.0 (#639), subsetter 0.2.4 → 0.2.6 (#637), log 0.4.30 → 0.4.32 (#636), actions/checkout 6.0.2 → 6.0.3 (#643), taiki-e/install-action 2.81.0 → 2.81.3 (#641), astral-sh/setup-uv 8.1.0 → 8.2.0 (#640).
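
For readers unfamiliar with the idea behind an "output-preserving spatial prune", here is a generic sketch of the technique: sort by coordinate, then stop scanning as soon as the gap exceeds the tolerance. It is not the pdf_oxide table code; `near_pairs_brute`, `near_pairs_pruned`, and the one-dimensional edge model are assumptions made for illustration.

```rust
/// Reference O(n^2) scan: every index pair (i, j), i < j, whose
/// coordinates differ by at most `eps`.
pub fn near_pairs_brute(xs: &[f64], eps: f64) -> Vec<(usize, usize)> {
    let mut pairs = Vec::new();
    for i in 0..xs.len() {
        for j in i + 1..xs.len() {
            if (xs[i] - xs[j]).abs() <= eps {
                pairs.push((i, j));
            }
        }
    }
    pairs
}

/// Pruned scan: visit candidates in coordinate order and break out of the
/// inner loop once the gap exceeds `eps`, since later entries are farther.
pub fn near_pairs_pruned(xs: &[f64], eps: f64) -> Vec<(usize, usize)> {
    let mut order: Vec<usize> = (0..xs.len()).collect();
    order.sort_by(|&a, &b| xs[a].total_cmp(&xs[b]));
    let mut pairs = Vec::new();
    for (k, &i) in order.iter().enumerate() {
        for &j in &order[k + 1..] {
            if xs[j] - xs[i] > eps {
                break;
            }
            pairs.push((i.min(j), i.max(j)));
        }
    }
    // Restore the brute-force emission order so the output is identical.
    pairs.sort_unstable();
    pairs
}
```

The final sort is what makes the prune output-preserving: downstream code sees the same pairs in the same order as before, so only the running time changes.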

Fixed

Node.js quickstart (#648) — ESM import + PdfDocument.open(path); constructor rejects a path with an actionable error. Reported by @abeq; docs fix by @lihouwenbin (#651).
Form fields filled but not displayed for inline-AcroForm PDFs (#647, @mitslabo) — /NeedAppearances survives full-rewrite save (§12.7.3.3).
macOS OCR onnxruntime detection (#632, @paliwalvimal) — version-tolerant libonnxruntime.<ver>.dylib matching.
Deterministic table detection — three HashMap-iteration-order leaks sorted.
CMSY/Symbol decimal point — spaced rendering (1¬ 00 → 1.00) recovered.
Encrypted-text surface parity — to_plain_text returns empty like the other text surfaces for undecryptable PDFs (§7.6).
Text-extraction quality, RTL & Indic (#656, #657, #663) — Hebrew word order in tagged PDFs fixed to poppler parity (benchmark CER 0.71→0.05); Arabic word order + Arabic/Hebrew combining marks corrected (0.94→0.45); Tamil/Bengali/Devanagari spurious matra spaces removed (0.10/0.18/0.07 → 0.04/0.03/0.02).
Number corruption in plain-text table cells — per-glyph Td <hex> Tj cells (0.99, Q1) lost the decimal point (0 99) or gained a space (Q 1) when merged spans desynced char_widths; now resynced on merge.
Java binding native-lib load on macOS/Windows (#653) — NativeLoader hard-failed on the Linux-only .so override path; now falls through to the bundled .dylib/.dll. Surfaced by the new cross-OS Java JNI CI.
Node.js binding ESM barrel exports (#653) — the managers barrel didn't re-export ContentType / ImageFormat / ThumbnailManager / OCR* aliases, breaking the cross-language core-parity test's strict ESM import.
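
As an illustration of what "version-tolerant libonnxruntime.<ver>.dylib matching" could look like, the sketch below accepts both the bare and the versioned macOS file names. The function name `is_onnxruntime_dylib` and the exact acceptance rule (dot-separated numeric version components) are assumptions, not the detection logic shipped in the library.

```rust
/// Returns true for `libonnxruntime.dylib` and for versioned names such as
/// `libonnxruntime.1.17.0.dylib`. Hypothetical helper for illustration.
pub fn is_onnxruntime_dylib(file_name: &str) -> bool {
    let Some(rest) = file_name.strip_prefix("libonnxruntime") else {
        return false;
    };
    let Some(middle) = rest.strip_suffix(".dylib") else {
        return false;
    };
    // Either no version at all, or ".<digits>(.<digits>)*".
    middle.is_empty()
        || middle.strip_prefix('.').map_or(false, |version| {
            version
                .split('.')
                .all(|part| !part.is_empty() && part.chars().all(|c| c.is_ascii_digit()))
        })
}
```

Requiring numeric components keeps sibling libraries with a shared prefix (for example a providers library) from being mistaken for the runtime itself.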

Contributors

Thanks @RayVR, @mbeschastn0v, @joelparkerhenderson, @lihouwenbin, @mitslabo, @paliwalvimal, and @abeq.

Closes #648
Closes #647
Closes #632
Closes #663
Ref #458
Ref #544
Ref #650
Ref #656
Ref #657


…start (#648)/form-display (#647)/macOS-OCR (#632) fixes, faster table extraction, article-thread foundation (#458), cross-OS + cross-language CI

Bindings + manifests bumped 0.3.60 → 0.3.61.

Added
- Separation-plate image rendering (#631, @RayVR) — raster Image XObjects routed to ink plates (§8.9).
- /ActualText extraction for structure-tree spans (#646, @RayVR) — §14.9.4 replacement text emitted once, correctly positioned.
- Article-thread (/Threads) parsing foundation (#458) — parser + reading-order strategy; default order unchanged.
- Cross-language test-parity suite — one shared functional spec across all nine bindings.

Changed
- Renderer resolution pipeline refactor (#649, @RayVR) — unified paint resolution; fixes Type 4 tint transforms for Separation/DeviceN spot colours.
- Faster table-heavy extraction — output-preserving spatial prune of the O(n²) cell-edge scan; ~30% faster on dense regulatory volumes, byte-identical output.
- Cross-OS + FIPS example verification in CI — core scenarios for every binding assert output on Linux/macOS/Windows (incl. cross-OS Java JNI); the #648-class guard.
- CI reliability hardening (#544) + rust-cache on codeql.yml.
- Dependency/CI-action bumps (dependabot #639/#637/#636/#643/#641/#640).

Fixed
- Node.js quickstart (#648, report @abeq; docs fix @lihouwenbin #651).
- Form fields filled but not displayed for inline-AcroForm PDFs (#647, @mitslabo) — /NeedAppearances survives full-rewrite save (§12.7.3.3).
- macOS OCR onnxruntime detection (#632, @paliwalvimal) — version-tolerant dylib matching.
- Deterministic table detection — three HashMap-iteration-order leaks sorted.
- CMSY/Symbol decimal point — spaced rendering (1¬ 00 → 1.00) recovered.
- Encrypted-text surface parity — to_plain_text returns empty like the other surfaces (§7.6).

Validated by the 419-PDF weighted corpus sweep: zero regressions, google_doc table guard byte-identical (1242==1242).
Builds clean and unit tests pass after rebase onto the renderer refactor (#649) and ActualText (#646).

Copilot

Copilot encountered an error and was unable to review this pull request.

…link, biome, spotless, java smoke-test classpath)
- src/document.rs: reflow logicalnot-decimal doc comment (clippy doc_lazy_continuation / doc-list-indent) → fixes Clippy, WASM, FIPS
- src/structure/article_threads.rs: drop private intra-doc link to MAX_BEADS_PER_THREAD → fixes Test (cargo doc -D warnings)
- cargo fmt: 3 rust examples + tests/core_parity.rs → fixes Format Check
- js: biome import-sort + format in node tests/index.ts → fixes Node.js Bindings
- java: spotless formatting in CoreParityTest → fixes Java Lint
- .github/workflows/ci.yml: resolve runtime classpath (slf4j-api, org.json) for the Java example smoke test so it runs the JAR the way Maven/Gradle consumers do → fixes Java Bindings (NoClassDefFoundError: org/slf4j/LoggerFactory)

…656, #657) extract_text on a tagged PDF assembles from the structure tree and never reaches the untagged reverse_rtl_visual_order_runs pass, so pure-RTL word-spans were emitted in visual (LTR) order — the whole line reversed. Add row_aware_span_cmp_rtl (X-descending within a row) and route pure-RTL MCIDs through it in order_mcid_spans, reconstructing logical reading order from page geometry regardless of how the producer stored the run. Per-span glyph order is still handled by push_span_text_bidi; mixed RTL+Latin runs are left untouched pending full UAX #9 bidi. Hebrew benchmark text CER: worst-tier -> parity with poppler/pdfium. Arabic word order corrected (intra-word spacing + md/HTML tracked in #656/#657).
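
The "X-descending within a row" ordering described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the code of row_aware_span_cmp_rtl: the `Span` struct, the `order_rtl_rows` name, and the row tolerance are hypothetical, and rows are grouped explicitly because a tolerance-based comparator alone is not a strict ordering.

```rust
/// Minimal stand-in for an extracted text span (hypothetical fields).
#[derive(Debug, Clone, PartialEq)]
pub struct Span {
    pub text: String,
    pub x: f64, // left edge, PDF user space
    pub y: f64, // baseline, PDF user space (grows upward)
}

/// Orders pure-RTL spans into logical reading order: rows top to bottom,
/// and within each row from the largest x to the smallest.
pub fn order_rtl_rows(mut spans: Vec<Span>, row_tol: f64) -> Vec<Span> {
    // Top of the page first (larger y).
    spans.sort_by(|a, b| b.y.total_cmp(&a.y));

    let mut out = Vec::with_capacity(spans.len());
    let mut row: Vec<Span> = Vec::new();
    for span in spans {
        // Start a new row when the baseline drifts past the tolerance.
        let starts_new_row = row
            .first()
            .map_or(false, |first| (first.y - span.y).abs() > row_tol);
        if starts_new_row {
            row.sort_by(|a, b| b.x.total_cmp(&a.x));
            out.append(&mut row);
        }
        row.push(span);
    }
    row.sort_by(|a, b| b.x.total_cmp(&a.x));
    out.append(&mut row);
    out
}
```

Because the order is rebuilt from page geometry, it does not depend on whether the producer stored the run in visual or logical order.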

…ed (#656) push_span_text_bidi reversed pure-RTL spans with chars().rev(), which moves combining marks (kasra/shadda U+0650/U+0651, Hebrew points) in front of their base letter so they float off as standalone marks. Replace with reverse_rtl_keeping_marks: group each base char with its trailing diacritics, reverse the group order, preserve each group's internal order. Arabic benchmark text CER 0.477 -> 0.471; diacritics now render correctly (قِطّ, not floating ِقّط). Hebrew unaffected (no combining marks in fixture).
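
The grouping step described above can be sketched like this. It is a simplified illustration rather than the library's reverse_rtl_keeping_marks: the function names are hypothetical, and `is_combining_mark` hard-codes an approximate set of Latin, Hebrew, and Arabic combining ranges instead of consulting full Unicode general-category data.

```rust
/// Approximate combining-mark test covering common Latin, Hebrew, and
/// Arabic marks. A real implementation would use Unicode category data.
fn is_combining_mark(c: char) -> bool {
    matches!(
        c as u32,
        0x0300..=0x036F
            | 0x0591..=0x05BD | 0x05BF | 0x05C1..=0x05C2 | 0x05C4..=0x05C5 | 0x05C7
            | 0x0610..=0x061A | 0x064B..=0x065F | 0x0670
            | 0x06D6..=0x06DC | 0x06DF..=0x06E4 | 0x06E7..=0x06E8 | 0x06EA..=0x06ED
    )
}

/// Reverses the order of base-plus-marks groups while keeping each
/// group's internal order, so diacritics stay attached to their base.
pub fn reverse_keeping_marks(s: &str) -> String {
    let mut groups: Vec<String> = Vec::new();
    for c in s.chars() {
        if is_combining_mark(c) {
            if let Some(group) = groups.last_mut() {
                group.push(c);
                continue;
            }
        }
        // A base character (or a leading mark with no base) opens a group.
        groups.push(c.to_string());
    }
    groups.into_iter().rev().collect()
}
```

For the example in the commit message, a visual-order span of tah+shadda followed by qaf+kasra comes out as qaf+kasra, tah+shadda, whereas a plain `chars().rev()` would place each mark before its base letter.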

get_space_glyph_width returned ~0 for CID subset fonts that omit code 0x20 (shaped Arabic from Chrome/browser print). The geometric word-gap threshold is space_width × ratio, so a zero width collapsed it to 0 and EVERY inter-glyph kerning gap was read as a word boundary — cursive Arabic words shattered into single letters. Fall back to 0.25em (250 font units) when the space glyph is missing, matching the no-font fallback already used by should_insert_space. Arabic benchmark text/md/html CER all improve; intra-word phantom gaps from ink-vs-advance width remain (tracked in #656). 20 space-detection tests pass.
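
A small sketch of why the zero width was so damaging and how the fallback helps. The names `effective_space_width` and `is_word_gap`, the near-zero cutoff, and the ratio used in the usage note are assumptions for illustration; only the 250-unit (0.25em) fallback comes from the commit message.

```rust
/// Fallback space width in 1/1000 em font units (0.25em), used when the
/// font has no usable space glyph.
const FALLBACK_SPACE_WIDTH: f64 = 250.0;

/// Returns the font's space width, or the fallback when it is missing
/// or effectively zero (the cutoff here is an arbitrary small value).
pub fn effective_space_width(space_glyph_width: Option<f64>) -> f64 {
    match space_glyph_width {
        Some(w) if w > 1e-6 => w,
        _ => FALLBACK_SPACE_WIDTH,
    }
}

/// Geometric word-gap test: a gap counts as a word boundary when it
/// exceeds `space_width * ratio`. All widths are in the same font units.
pub fn is_word_gap(gap: f64, space_glyph_width: Option<f64>, ratio: f64) -> bool {
    gap > effective_space_width(space_glyph_width) * ratio
}
```

With an assumed ratio of 0.5, a 20-unit kerning gap compares against a threshold of 125 units and is ignored. Without the fallback the threshold would be 0, and the same gap would be read as a word break.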

…managers barrel (#653) The cross-language core-parity test imports through the package's ESM entry, which re-exports these symbols from `./managers/index.js`. The managers barrel never re-exported the hybrid-ml (`ContentType`) or thumbnail (`ThumbnailManager`/`ThumbnailSize`/`ImageFormat`) modules, and lacked the upper-case `OCRDetectionMode`/`OCRLanguage` aliases. The CJS `require` path tolerated the gaps silently (missing names destructure to undefined); the strict ESM import in the parity test did not, failing with "does not provide an export named 'ContentType'". Re-export the three modules' public values from the managers barrel. Verified: the parity test now resolves all imports (remaining local-only failure is native-module staging, which CI handles).

… corruption (#653 tables) merge_adjacent_spans concatenated span text but never extended char_widths, so a merged multi-glyph span built from per-glyph `Td <hex> Tj` table cells (e.g. "0.99", "Q1") kept char_widths.len()==1. The downstream width-based splitters is_column_spanning_decimal and char_widths_boundary_split (document.rs) fire when char_widths.len() < char_count, so they wrongly split those spans — DROPPING the decimal point ("0.99" -> "0 99", corrupting the value) and inserting a space at the letter->digit boundary ("Q1" -> "Q 1"). Re-sync char_widths to the merged char count at the shared bbox-extend step, padding for any inserted '.'/' ' separator. Benchmark: table-borderless CER 0.117->0.067, table-bordered 0.091->0.061; "1.20"/"0.99"/"Q1" now intact. The 12 column-spanning-decimal / merge tests still pass (genuine sparse-width splits still fire).
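
The invariant this fix restores, one width entry per character after every merge, can be sketched as below. The `Span` struct and `merge_spans` function are simplified stand-ins invented for this example, and the zero placeholder width for an inserted separator is an assumption; the real merge_adjacent_spans works on richer span data.

```rust
/// Simplified span: text plus one advance width per character.
#[derive(Debug, Clone, PartialEq)]
pub struct Span {
    pub text: String,
    pub char_widths: Vec<f32>,
}

/// Appends `right` to `left`, optionally inserting a separator, and keeps
/// `char_widths.len()` equal to the number of characters in `text`.
pub fn merge_spans(left: &mut Span, right: &Span, separator: Option<char>) {
    if let Some(sep) = separator {
        left.text.push(sep);
        // Pad for the inserted separator so the widths stay aligned.
        left.char_widths.push(0.0);
    }
    left.text.push_str(&right.text);
    left.char_widths.extend_from_slice(&right.char_widths);

    // Defensive re-sync in case either input was already out of step.
    let char_count = left.text.chars().count();
    left.char_widths.resize(char_count, 0.0);
}
```

Once the lengths agree, a splitter that triggers on `char_widths.len() < char_count` no longer fires for cells such as "0.99" that were assembled glyph by glyph.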

…anagari)

Brahmic text extracted the correct codepoints but inserted a word space after nearly every dependent vowel sign (matra), e.g. Tamil பாலூட்டி -> "பா லூட்டி". A matra carries its own advance, so the geometric gap test reads matra->consonant as a word boundary; real word breaks carry an explicit space glyph.

Three script-gated changes:
- detect_from_characters: add Bengali/Tamil/Telugu/Kannada/Malayalam blocks so those docs reach the complex-script boundary path (only Devanagari/Thai/Khmer were recognised, which is why Tamil/Bengali were worst).
- handle_indic_boundary / handle_devanagari_boundary: suppress the boundary AFTER a matra/virama/sign when the next glyph is same-script (intra-word).
- should_insert_space: add a complex-script combining-mark guard on the primary path so the strong-geometric / consensus branches (which never consult WordBoundaryDetector) also suppress.

Benchmark text CER: Tamil 0.095->0.035, Bengali 0.175->0.032, Devanagari 0.066->0.016. No regression on Latin/CJK/Thai/Hebrew; 42 complex-script / word-boundary tests pass.
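
The "suppress the boundary after a matra when the next glyph is same-script" rule can be sketched as follows. This is a simplified model, not the handle_indic_boundary code: the function names are hypothetical, and the test relies on the shared layout of the Devanagari-through-Malayalam Unicode blocks, where offsets 0x3E to 0x4D of each 128-code-point block hold the dependent vowel signs and the virama. Nukta, anusvara, visarga, and length marks are deliberately ignored here.

```rust
/// Returns the 128-code-point block index for characters in the
/// Devanagari..Malayalam range (U+0900..=U+0D7F), else None.
fn indic_block(c: char) -> Option<u32> {
    let cp = c as u32;
    if (0x0900..=0x0D7F).contains(&cp) {
        Some(cp / 0x80)
    } else {
        None
    }
}

/// Approximate test for a dependent vowel sign (matra) or virama.
fn is_dependent_sign(c: char) -> bool {
    indic_block(c).is_some() && matches!((c as u32) % 0x80, 0x3E..=0x4D)
}

/// True when a geometric word boundary between `prev` and `next` should
/// be suppressed: `prev` is a matra/virama and `next` is the same script.
pub fn suppress_boundary(prev: char, next: char) -> bool {
    is_dependent_sign(prev) && indic_block(prev) == indic_block(next)
}
```

For the Tamil example above, the gap between the vowel sign U+0BBE and the following consonant U+0BB2 is suppressed, while a gap between a matra and a Latin letter is still treated as a boundary.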

#653) The Maven build sets fyi.oxide.pdf.lib.path to a Linux `.so` default (java/pom.xml), and ci.yml's non-FIPS `mvn test` does not override it per-OS (unlike ci-fips.yml). On macOS (`.dylib`) and Windows (`.dll`) that override path does not exist, so NativeLoader's System.load() hard-failed with UnsatisfiedLinkError — even though the correct platform native is staged into the JAR resources and would load via loadBundled(). This surfaced once v0.3.61 began running the Java JNI tests cross-OS. Guard the override with Files.exists(): when the path is absent, log and fall through to the system-library / bundled-resource paths instead of failing.

Reformat the hand-edited export blocks in js/src/managers/index.ts (Biome export organizer) and the LOG.debug call in NativeLoader.java (Spotless) that drifted in #653's barrel-export fix. Format-only; the exported symbol set and behavior are unchanged. Fixes the Node.js Bindings (Biome ci) and Java Lint (Spotless) checks.

The Node.js Bindings job failed on ubuntu/macOS/windows because the lib-backed tests could not load the compiled library:
- macOS/Windows: lib/ was compiled only on ubuntu (the compile step was gated on matrix.os == 'ubuntu-latest'), so index.js's './lib/*' imports were missing. Build lib/ on every OS via the canonical npm run build:ts (tsc + fix-esm-imports); bare tsc leaves extensionless imports that Node's ESM resolver rejects at runtime.
- ubuntu: the freshly-built native addon was staged into prebuilds/ only AFTER the test step, so lib/native.js could not resolve it. Stage the prebuilt addon BEFORE running the tests.

Also fix two latent test bugs that surface now that native actually loads (both tests previously skipped in CI):
- core-parity: PdfDocument has no bare search(); use the idiomatic searchAll() (mirrors Go/C# SearchAll in the parity spec).
- render-options: renderPageWithOptions returns a plain Uint8Array; wrap it in a Buffer before calling readUInt32BE.

Point the two new tests at the published entry (../lib/index.js) like every other test in the npm-test suite, instead of the unpublished legacy top-level index.js.

The HTML+CSS tests only located a font via the Linux system path (/usr/share/fonts/...), so on macOS/Windows runners loadFont() returned null and the 9 font-dependent tests failed once the native lib actually loaded. Prefer the git-tracked tests/fixtures/fonts/DejaVuSans.ttf, which is present on every OS runner.

yfedoseev requested a review from Copilot June 6, 2026 05:51

Copilot started reviewing on behalf of yfedoseev June 6, 2026 05:51

yfedoseev mentioned this pull request Jun 6, 2026 in ci: harden against transient GitHub Actions runner flakes (rustup DNS, apt cleanup, network) #544 (open)

Copilot AI reviewed Jun 6, 2026

yfedoseev added 17 commits June 6, 2026 09:31

Merge main (#652 press-accurate CMYK→RGB via /OutputIntents ICC) into release/v0.3.61 (conflicts: CHANGELOG.md)

b9ea331

docs(changelog): attribute #652 color limitations to tracking issue #655

e966dc6

fix(ci): ruff lint in test_core_parity.py (lines-after-imports + B017 noqa)

c36cdd2

docs(changelog): note Arabic grapheme/space-gap fixes + Node.js barrel exports (#656, #653)

bd988f7

docs(changelog): note table/Indic extraction fixes + Java cross-OS loader (#653, #656)

bd3c9fe

Merge remote-tracking branch 'origin/main' into release/v0.3.61 (conflicts: CHANGELOG.md)

0f2fb84

yfedoseev merged commit 37825d9 into main Jun 7, 2026
240 checks passed

yfedoseev deleted the release/v0.3.61 branch June 7, 2026 21:40

yfedoseev merged 18 commits into main from release/v0.3.61
