Release v0.3.61 — separation image rendering + ActualText, Node/form/macOS-OCR fixes, faster table extraction, article-thread foundation, cross-OS + cross-language CI #653

Merged: yfedoseev merged 18 commits into main from release/v0.3.61 on Jun 7, 2026

yfedoseev commented Jun 6, 2026

Release v0.3.61 (target 2026-06-07)

Bindings + manifests bumped 0.3.60 → 0.3.61. Validated by the 419-PDF weighted corpus sweep: zero regressions, google_doc table guard byte-identical (1242 == 1242). Builds clean and unit tests pass.

Added

Changed

Fixed

Contributors

Thanks @RayVR, @mbeschastn0v, @joelparkerhenderson, @lihouwenbin, @mitslabo, @paliwalvimal, and @abeq.


Closes #648
Closes #647
Closes #632
Closes #663
Ref #458
Ref #544
Ref #650
Ref #656
Ref #657

…start (#648)/form-display (#647)/macOS-OCR (#632) fixes, faster table extraction, article-thread foundation (#458), cross-OS + cross-language CI

Bindings + manifests bumped 0.3.60 → 0.3.61.

Added
- Separation-plate image rendering (#631, @RayVR) — raster Image XObjects routed to ink plates (§8.9).
- /ActualText extraction for structure-tree spans (#646, @RayVR) — §14.9.4 replacement text emitted once, correctly positioned.
- Article-thread (/Threads) parsing foundation (#458) — parser + reading-order strategy; default order unchanged.
- Cross-language test-parity suite — one shared functional spec across all nine bindings.

Changed
- Renderer resolution pipeline refactor (#649, @RayVR) — unified paint resolution; fixes Type 4 tint transforms for Separation/DeviceN spot colours.
- Faster table-heavy extraction — output-preserving spatial prune of the O(n²) cell-edge scan; ~30% faster on dense regulatory volumes, byte-identical output.
- Cross-OS + FIPS example verification in CI — core scenarios for every binding assert output on Linux/macOS/Windows (incl. cross-OS Java JNI); the #648-class guard.
- CI reliability hardening (#544) + rust-cache on codeql.yml.
- Dependency/CI-action bumps (dependabot #639/#637/#636/#643/#641/#640).

Fixed
- Node.js quickstart (#648, report @abeq; docs fix @lihouwenbin #651).
- Form fields filled but not displayed for inline-AcroForm PDFs (#647, @mitslabo) — /NeedAppearances survives full-rewrite save (§12.7.3.3).
- macOS OCR onnxruntime detection (#632, @paliwalvimal) — version-tolerant dylib matching.
- Deterministic table detection — three HashMap-iteration-order leaks sorted.
- CMSY/Symbol decimal point — spaced rendering (1¬ 00 → 1.00) recovered.
- Encrypted-text surface parity — to_plain_text returns empty like the other surfaces (§7.6).
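
The determinism fix in the list above rests on a general Rust fact: `HashMap` iteration order is unspecified and can differ between runs, so anything derived from it must be sorted first. A minimal sketch of the pattern follows; the function name and the column-map shape are hypothetical and are not taken from the crate.

```rust
use std::collections::HashMap;

/// Hypothetical shape: quantised column x-position -> cell texts in that column.
/// Returns the entries sorted by key so that callers never observe the
/// map's unspecified iteration order.
pub fn columns_in_stable_order(
    columns: &HashMap<i32, Vec<String>>,
) -> Vec<(i32, &Vec<String>)> {
    let mut entries: Vec<(i32, &Vec<String>)> =
        columns.iter().map(|(k, v)| (*k, v)).collect();
    // Sort by the key; the hash order is discarded here.
    entries.sort_by_key(|(k, _)| *k);
    entries
}
```

Switching the container to `BTreeMap` would give the same guarantee without an explicit sort; which of the two the three fixes used is not stated here.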

Validated by the 419-PDF weighted corpus sweep: zero regressions, google_doc table guard byte-identical (1242==1242). Builds clean and unit tests pass after rebase onto the renderer refactor (#649) and ActualText (#646).

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

yfedoseev added 17 commits June 6, 2026 09:31
… release/v0.3.61

# Conflicts:
#	CHANGELOG.md
…link, biome, spotless, java smoke-test classpath)

- src/document.rs: reflow logicalnot-decimal doc comment (clippy doc_lazy_continuation / doc-list-indent) → fixes Clippy, WASM, FIPS
- src/structure/article_threads.rs: drop private intra-doc link to MAX_BEADS_PER_THREAD → fixes Test (cargo doc -D warnings)
- cargo fmt: 3 rust examples + tests/core_parity.rs → fixes Format Check
- js: biome import-sort + format in node tests/index.ts → fixes Node.js Bindings
- java: spotless formatting in CoreParityTest → fixes Java Lint
- .github/workflows/ci.yml: resolve runtime classpath (slf4j-api, org.json) for the Java example smoke test so it runs the JAR the way Maven/Gradle consumers do → fixes Java Bindings (NoClassDefFoundError: org/slf4j/LoggerFactory)
…656, #657)

extract_text on a tagged PDF assembles from the structure tree and never
reaches the untagged reverse_rtl_visual_order_runs pass, so pure-RTL
word-spans were emitted in visual (LTR) order — the whole line reversed.

Add row_aware_span_cmp_rtl (X-descending within a row) and route pure-RTL
MCIDs through it in order_mcid_spans, reconstructing logical reading order
from page geometry regardless of how the producer stored the run. Per-span
glyph order is still handled by push_span_text_bidi; mixed RTL+Latin runs
are left untouched pending full UAX #9 bidi.

Hebrew benchmark text CER: worst-tier -> parity with poppler/pdfium.
Arabic word order corrected (intra-word spacing + md/HTML tracked in #656/#657).
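
The "X-descending within a row" ordering described in this commit can be sketched as below. This is an illustration under stated assumptions, not the crate's `row_aware_span_cmp_rtl`: the `PlacedSpan` type, the `ROW_HEIGHT` band size, and both function names are hypothetical. Rows are quantised into bands so that the comparator stays a total order.

```rust
use std::cmp::Ordering;

/// Hypothetical minimal span: only the fields the comparator needs.
#[derive(Debug, Clone, PartialEq)]
pub struct PlacedSpan {
    pub text: String,
    pub x: f32, // left edge, page units
    pub y: f32, // baseline, page units (PDF y grows upward)
}

/// Assumed band size: baselines that round to the same band share a row.
const ROW_HEIGHT: f32 = 4.0;

fn row_band(y: f32) -> i64 {
    (y / ROW_HEIGHT).round() as i64
}

/// Rows top-to-bottom (higher band first), then right-to-left within a row.
pub fn row_aware_cmp_rtl(a: &PlacedSpan, b: &PlacedSpan) -> Ordering {
    row_band(b.y)
        .cmp(&row_band(a.y))
        .then_with(|| b.x.total_cmp(&a.x))
}

/// Sort spans of a pure-RTL run into logical reading order.
pub fn order_rtl(spans: &mut [PlacedSpan]) {
    spans.sort_by(row_aware_cmp_rtl);
}
```

Quantising can split two spans that straddle a band edge; a production comparator would more likely cluster baselines first. The sketch only shows the sort key.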
…ed (#656)

push_span_text_bidi reversed pure-RTL spans with chars().rev(), which moves
combining marks (kasra/shadda U+0650/U+0651, Hebrew points) in front of their
base letter so they float off as standalone marks. Replace with
reverse_rtl_keeping_marks: group each base char with its trailing diacritics,
reverse the group order, preserve each group's internal order.

Arabic benchmark text CER 0.477 -> 0.471; diacritics now render correctly
(قِطّ, not floating ِقّط). Hebrew unaffected (no combining marks in fixture).
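
The grouping step described above (base char plus trailing diacritics, reverse the groups, keep each group's internal order) can be sketched with the standard library alone. `is_combining_mark` here is a simplified, hypothetical stand-in that covers only a few Hebrew and Arabic mark ranges; it is not the crate's `reverse_rtl_keeping_marks`.

```rust
/// Simplified stand-in for a full Unicode combining-mark test.
/// Assumption: only these ranges matter for the example.
fn is_combining_mark(c: char) -> bool {
    matches!(
        c as u32,
        0x0300..=0x036F      // Combining Diacritical Marks
            | 0x0591..=0x05BD // Hebrew accents and points
            | 0x05BF
            | 0x05C1..=0x05C2
            | 0x064B..=0x065F // Arabic harakat, incl. kasra U+0650, shadda U+0651
            | 0x0670
    )
}

/// Reverse a pure-RTL string group by group, where a group is one base
/// character followed by its combining marks.
pub fn reverse_keeping_marks(s: &str) -> String {
    let mut groups: Vec<String> = Vec::new();
    for c in s.chars() {
        if is_combining_mark(c) {
            if let Some(group) = groups.last_mut() {
                // Attach the mark to the base character before it.
                group.push(c);
                continue;
            }
        }
        // A base character (or a leading mark with no base) starts a group.
        groups.push(c.to_string());
    }
    groups.into_iter().rev().collect()
}
```

For the input U+0642 U+0650 U+0637 U+0651, a plain `chars().rev()` yields U+0651 U+0637 U+0650 U+0642, with each mark ahead of its base, while the grouped version yields U+0637 U+0651 U+0642 U+0650.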

get_space_glyph_width returned ~0 for CID subset fonts that omit code 0x20
(shaped Arabic from Chrome/browser print). The geometric word-gap threshold
is space_width × ratio, so a zero width collapsed it to 0 and EVERY inter-glyph
kerning gap was read as a word boundary — cursive Arabic words shattered into
single letters. Fall back to 0.25em (250 font units) when the space glyph is
missing, matching the no-font fallback already used by should_insert_space.

Arabic benchmark text/md/html CER all improve; intra-word phantom gaps from
ink-vs-advance width remain (tracked in #656). 20 space-detection tests pass.
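
A small sketch of why a zero space width breaks the threshold and how the fallback restores it. The function names and the "near-zero" cutoff of 1.0 font unit are assumptions for illustration; only the 250-unit (0.25em) fallback comes from the commit message.

```rust
/// Fallback from the commit message: 0.25 em in 1000-unit glyph space.
const FALLBACK_SPACE_WIDTH: f32 = 250.0;

/// Hypothetical helper: treat a missing or near-zero space glyph width as
/// "no usable space glyph" and substitute the fallback.
pub fn effective_space_width(reported: Option<f32>) -> f32 {
    match reported {
        // Assumed cutoff: anything at or below 1 font unit is unusable.
        Some(w) if w > 1.0 => w,
        _ => FALLBACK_SPACE_WIDTH,
    }
}

/// A gap counts as a word boundary when it exceeds space_width * ratio.
pub fn is_word_gap(gap: f32, reported_space_width: Option<f32>, ratio: f32) -> bool {
    gap > effective_space_width(reported_space_width) * ratio
}
```

With a reported width of 0 and no fallback the threshold would be 0, so a 20-unit kerning gap would count as a word break. With the fallback and a ratio of 0.5 the threshold is 125 units, and the same gap stays inside the word.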
…managers barrel (#653)

The cross-language core-parity test imports through the package's ESM entry,
which re-exports these symbols from `./managers/index.js`. The managers barrel
never re-exported the hybrid-ml (`ContentType`) or thumbnail
(`ThumbnailManager`/`ThumbnailSize`/`ImageFormat`) modules, and lacked the
upper-case `OCRDetectionMode`/`OCRLanguage` aliases. The CJS `require` path
tolerated the gaps silently (missing names destructure to undefined); the
strict ESM import in the parity test did not, failing with
"does not provide an export named 'ContentType'".

Re-export the three modules' public values from the managers barrel. Verified:
the parity test now resolves all imports (remaining local-only failure is
native-module staging, which CI handles).
… corruption (#653 tables)

merge_adjacent_spans concatenated span text but never extended char_widths, so a
merged multi-glyph span built from per-glyph `Td <hex> Tj` table cells (e.g.
"0.99", "Q1") kept char_widths.len()==1. The downstream width-based splitters
is_column_spanning_decimal and char_widths_boundary_split (document.rs) fire when
char_widths.len() < char_count, so they wrongly split those spans — DROPPING the
decimal point ("0.99" -> "0 99", corrupting the value) and inserting a space at
the letter->digit boundary ("Q1" -> "Q 1").

Re-sync char_widths to the merged char count at the shared bbox-extend step,
padding for any inserted '.'/' ' separator. Benchmark: table-borderless CER
0.117->0.067, table-bordered 0.091->0.061; "1.20"/"0.99"/"Q1" now intact. The 12
column-spanning-decimal / merge tests still pass (genuine sparse-width splits
still fire).
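
The invariant being restored is "one width entry per character after a merge". A minimal sketch follows, assuming a hypothetical `MergedSpan` type and `merge_spans` function rather than the crate's `merge_adjacent_spans`; padding with 0.0 is also an assumption.

```rust
/// Hypothetical span: text plus one advance width per character.
#[derive(Debug, Clone, PartialEq)]
pub struct MergedSpan {
    pub text: String,
    pub char_widths: Vec<f32>,
}

/// Append `next` to `span`, optionally inserting a separator character,
/// and keep `char_widths` in step with the merged text.
pub fn merge_spans(
    span: &mut MergedSpan,
    next: &MergedSpan,
    separator: Option<char>,
    separator_width: f32,
) {
    if let Some(sep) = separator {
        span.text.push(sep);
        // The inserted separator needs its own width slot.
        span.char_widths.push(separator_width);
    }
    span.text.push_str(&next.text);
    span.char_widths.extend_from_slice(&next.char_widths);

    // Re-sync: pad (or trim) so the length always equals the char count.
    let char_count = span.text.chars().count();
    span.char_widths.resize(char_count, 0.0);
}
```

Merging the per-glyph cells "0", ".", "9", "9" this way leaves four widths for four characters, so a splitter that fires on `char_widths.len() < char_count` no longer triggers on the merged "0.99".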
…anagari)

Brahmic text extracted the correct codepoints but inserted a word space after
nearly every dependent vowel sign (matra), e.g. Tamil பாலூட்டி -> "பா லூட்டி".
A matra carries its own advance, so the geometric gap test reads matra->consonant
as a word boundary; real word breaks carry an explicit space glyph.

Three script-gated changes:
- detect_from_characters: add Bengali/Tamil/Telugu/Kannada/Malayalam blocks so
  those docs reach the complex-script boundary path (only Devanagari/Thai/Khmer
  were recognised, which is why Tamil/Bengali were worst).
- handle_indic_boundary / handle_devanagari_boundary: suppress the boundary
  AFTER a matra/virama/sign when the next glyph is same-script (intra-word).
- should_insert_space: add a complex-script combining-mark guard on the primary
  path so the strong-geometric / consensus branches (which never consult
  WordBoundaryDetector) also suppress.

Benchmark text CER: Tamil 0.095->0.035, Bengali 0.175->0.032, Devanagari
0.066->0.016. No regression on Latin/CJK/Thai/Hebrew; 42 complex-script /
word-boundary tests pass.
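
The script-gated guard can be sketched as follows. The code-point ranges are a partial, hand-picked subset (Devanagari, Bengali, Tamil only), and the type and function names are hypothetical; they are not the crate's `handle_indic_boundary` or `should_insert_space`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Script {
    Devanagari,
    Bengali,
    Tamil,
}

/// Map a character to its Unicode block for the three scripts covered here.
pub fn script_of(c: char) -> Option<Script> {
    match c as u32 {
        0x0900..=0x097F => Some(Script::Devanagari),
        0x0980..=0x09FF => Some(Script::Bengali),
        0x0B80..=0x0BFF => Some(Script::Tamil),
        _ => None,
    }
}

/// Partial check for dependent vowel signs (matras), viramas and
/// anusvara-type signs. Assumption: these ranges are enough for a sketch.
pub fn is_dependent_sign(c: char) -> bool {
    matches!(
        c as u32,
        0x0900..=0x0903 | 0x093E..=0x094D     // Devanagari
            | 0x0981..=0x0983 | 0x09BE..=0x09CD // Bengali
            | 0x0B82 | 0x0BBE..=0x0BCD          // Tamil
    )
}

/// Geometric gap test with the intra-word guard: never break after a
/// dependent sign when the next glyph belongs to the same script.
pub fn should_break(prev: char, next: char, gap: f32, threshold: f32) -> bool {
    let same_script = script_of(prev).is_some() && script_of(prev) == script_of(next);
    let intra_word = is_dependent_sign(prev) && same_script;
    gap > threshold && !intra_word
}
```

In the Tamil example from the commit message, the gap after U+0BBE (the vowel sign aa) before U+0BB2 is suppressed even when it exceeds the threshold. A wide gap between two consonants, or between a matra and a Latin letter, still counts as a boundary.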
#653)

The Maven build sets fyi.oxide.pdf.lib.path to a Linux `.so` default
(java/pom.xml), and ci.yml's non-FIPS `mvn test` does not override it per-OS
(unlike ci-fips.yml). On macOS (`.dylib`) and Windows (`.dll`) that override
path does not exist, so NativeLoader's System.load() hard-failed with
UnsatisfiedLinkError — even though the correct platform native is staged into
the JAR resources and would load via loadBundled(). This surfaced once
v0.3.61 began running the Java JNI tests cross-OS.

Guard the override with Files.exists(): when the path is absent, log and fall
through to the system-library / bundled-resource paths instead of failing.

Reformat the hand-edited export blocks in js/src/managers/index.ts
(Biome export organizer) and the LOG.debug call in NativeLoader.java
(Spotless) that drifted in #653's barrel-export fix. Format-only; the
exported symbol set and behavior are unchanged.

Fixes the Node.js Bindings (Biome ci) and Java Lint (Spotless) checks.

The Node.js Bindings job failed on ubuntu/macOS/windows because the
lib-backed tests could not load the compiled library:

- macOS/Windows: lib/ was compiled only on ubuntu (the compile step was
  gated on matrix.os == 'ubuntu-latest'), so index.js's './lib/*' imports
  were missing. Build lib/ on every OS via the canonical npm run build:ts
  (tsc + fix-esm-imports); bare tsc leaves extensionless imports that
  Node's ESM resolver rejects at runtime.
- ubuntu: the freshly-built native addon was staged into prebuilds/ only
  AFTER the test step, so lib/native.js could not resolve it. Stage the
  prebuilt addon BEFORE running the tests.

Also fix two latent test bugs that surface now that native actually loads
(both tests previously skipped in CI):

- core-parity: PdfDocument has no bare search(); use the idiomatic
  searchAll() (mirrors Go/C# SearchAll in the parity spec).
- render-options: renderPageWithOptions returns a plain Uint8Array; wrap
  it in a Buffer before calling readUInt32BE.

Point the two new tests at the published entry (../lib/index.js) like
every other test in the npm-test suite, instead of the unpublished
legacy top-level index.js.

The HTML+CSS tests only located a font via the Linux system path
(/usr/share/fonts/...), so on macOS/Windows runners loadFont() returned
null and the 9 font-dependent tests failed once the native lib actually
loaded. Prefer the git-tracked tests/fixtures/fonts/DejaVuSans.ttf, which
is present on every OS runner.
yfedoseev merged commit 37825d9 into main on Jun 7, 2026
240 checks passed
yfedoseev deleted the release/v0.3.61 branch on June 7, 2026 21:40