pdf_oxide does not support vertical writing mode (tategaki). Vertical CJK pages extract every glyph correctly but in shredded reading order. On pdf_benches: vertical-cjk-ja CER 0.962, vertical-cjk-zh 0.958 (pypdfium2 ~0.10–0.20). Horizontal CJK (wiki-cat-zh) is fine at 0.059.
All glyphs present; the Y-band sort reads one horizontal slice across all columns as a "row".
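The shredding is easy to reproduce in isolation. The sketch below is not pdf_oxide code: `Glyph`, `horizontal_order`, and `two_column_sample` are hypothetical names, and the comparator is a simplified stand-in for a Y-band-then-X sort. It assumes PDF-style coordinates where Y grows upward, so the top of the page has the largest Y.

```rust
/// Hypothetical glyph record: a character plus its page position.
#[derive(Debug, Clone, Copy)]
pub struct Glyph {
    pub ch: char,
    pub x: f32,
    pub y: f32,
}

/// Simplified horizontal reading order: quantize Y into bands,
/// take bands top to bottom, then X ascending inside each band.
pub fn horizontal_order(glyphs: &[Glyph], band_height: f32) -> String {
    let mut sorted = glyphs.to_vec();
    sorted.sort_by(|a, b| {
        let band_a = (a.y / band_height).floor() as i64;
        let band_b = (b.y / band_height).floor() as i64;
        // Higher band (larger Y) first, then left to right.
        band_b.cmp(&band_a).then(a.x.total_cmp(&b.x))
    });
    sorted.iter().map(|g| g.ch).collect()
}

/// Two vertical columns, the right one (x = 200) read first.
/// Intended reading order: 日本語 then 縦書き.
pub fn two_column_sample() -> Vec<Glyph> {
    let mut glyphs = Vec::new();
    for (x, text) in [(200.0_f32, "日本語"), (100.0_f32, "縦書き")] {
        for (i, ch) in text.chars().enumerate() {
            // Each glyph sits 20 units below the previous one.
            glyphs.push(Glyph { ch, x, y: 700.0 - 20.0 * i as f32 });
        }
    }
    glyphs
}
```

Calling `horizontal_order(&two_column_sample(), 10.0)` interleaves the two columns glyph by glyph: every character survives, but each "row" takes one glyph from each column.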
Root cause: the feature is entirely absent
No vertical reading-order strategy. src/pipeline/reading_order/mod.rs:13-28 ships only StructureTree, Geometric, XYCut, and Simple, all of which assume horizontal text. There is no tategaki/VerticalStrategy.
No writing-mode detection. No WMode/writing_mode/DW2/W2 handling in the text pipeline. Identity-V is treated identically to Identity-H (just a CID encoding) in src/fonts/font_dict.rs:1895,3324 — never a layout signal.
ReadingOrderContext (mod.rs:66-95) has no writing-mode field.
XY-cut is hardwired horizontal (xycut.rs:120-128); row_aware_span_cmp (src/lib.rs:408-424) sorts Y-band then X-ascending.
These specific PDFs carry no Identity-V/WMode marker: the vertical layout is built by explicit per-glyph positioning (fixed X, descending Y, columns right-to-left), so detection must be geometric.
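A geometric check of this kind could look like the sketch below. It is a minimal illustration, not an existing pdf_oxide function: `looks_vertical`, `x_tol`, and `min_ratio` are hypothetical, and it assumes glyph origins arrive in content-stream order. It only counts steps that keep X fixed while Y decreases; it does not verify that columns progress right to left.

```rust
/// Returns true when most consecutive glyph steps stay at the same X
/// (within `x_tol`) while Y strictly decreases, i.e. the run looks
/// like top-to-bottom vertical text.
///
/// `points` are (x, y) glyph origins in emission order (an assumption).
/// `min_ratio` is the fraction of steps that must look vertical.
pub fn looks_vertical(points: &[(f32, f32)], x_tol: f32, min_ratio: f32) -> bool {
    if points.len() < 2 {
        return false; // not enough evidence either way
    }
    let steps = points.len() - 1;
    let vertical_steps = points
        .windows(2)
        .filter(|w| {
            let (x0, y0) = w[0];
            let (x1, y1) = w[1];
            // Same column, moving down the page.
            (x1 - x0).abs() <= x_tol && y1 < y0
        })
        .count();
    vertical_steps as f32 / steps as f32 >= min_ratio
}
```

Column breaks (a jump in X back to the top of the page) count as non-vertical steps, which is why the threshold is a ratio rather than a requirement that every step qualify.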
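Proposed fix (deferred: new feature, ~3–5 days)
Detection (~1d): derive vertical mode from Type0 Identity-V/CMap WMode 1 AND a geometric detector (glyph runs sharing X with monotonically decreasing Y; columns ordered by descending X).
VerticalStrategy: band by X (right→left), then Y descending within each column. row_aware_span_cmp_rtl (src/lib.rs:438, added for the Arabic and Hebrew RTL extraction issues #656 and #657) is a close structural template (swap axes, reverse X).
Wiring (~1d): add a writing-mode field to ReadingOrderContext, auto-select the strategy, and honor it in StructureTreeStrategy's single-MCID fallback (structure_tree.rs:23,146). Add tests, with the two probe docs as fixtures.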
A cheap partial win (geometric "is vertical?" + swapped-axis comparator) would likely drop CER from ~0.96 toward ~0.10–0.20 without font/CMap work.
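The swapped-axis comparator can be sketched as follows. This is an illustration under stated assumptions, not the proposed implementation: `Span`, `vertical_rtl_cmp`, `vertical_order`, and the fixed `column_width` quantization are hypothetical, and Y is assumed to grow upward as in PDF user space.

```rust
use std::cmp::Ordering;

/// Hypothetical positioned span (a single glyph here for brevity).
#[derive(Debug, Clone, Copy)]
pub struct Span {
    pub ch: char,
    pub x: f32,
    pub y: f32,
}

/// Vertical right-to-left ordering: quantize X into column bands,
/// take columns from right to left, then Y descending (top to
/// bottom) inside each column.
pub fn vertical_rtl_cmp(a: &Span, b: &Span, column_width: f32) -> Ordering {
    let col_a = (a.x / column_width).floor() as i64;
    let col_b = (b.x / column_width).floor() as i64;
    // Larger X (rightmost column) first, then larger Y (top) first.
    col_b.cmp(&col_a).then(b.y.total_cmp(&a.y))
}

/// Sorts spans with the comparator above and joins their characters.
pub fn vertical_order(spans: &[Span], column_width: f32) -> String {
    let mut sorted = spans.to_vec();
    sorted.sort_by(|a, b| vertical_rtl_cmp(a, b, column_width));
    sorted.iter().map(|s| s.ch).collect()
}
```

Fixed-width quantization is the simplest possible banding; glyphs that straddle a bin boundary could be split across columns, so a real strategy would more likely cluster spans by X before sorting.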
Found via pdf_benches corpus/A_golden/probes/vertical-cjk-{ja,zh}.