
Vertical CJK writing mode (tategaki) unsupported — reading order shredded (CER 0.96) #659

@yfedoseev


Summary

pdf_oxide does not support vertical writing mode (tategaki). On vertical CJK pages every glyph is extracted correctly, but the reading order is shredded. On pdf_benches the character error rate (CER) is 0.962 for vertical-cjk-ja and 0.958 for vertical-cjk-zh, versus roughly 0.10–0.20 for pypdfium2. Horizontal CJK (wiki-cat-zh) is fine at 0.059.

GOLD : 縦書きの日本語テキストです。右から左へ列が進みます。
PDFOX: すす縦。。書右きかのら日左本へ語列テがキ進スみトまで

All glyphs are present, but the Y-band sort reads one horizontal slice across all columns as a single "row".
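The shredding can be reproduced on a toy layout. The sketch below is illustrative only: `Glyph`, `layout_vertical`, `read_horizontal` and `read_vertical` are hypothetical names, not pdf_oxide types, and it uses exact coordinates rather than the tolerance bands a real comparator would need.

```rust
/// A single positioned glyph. Hypothetical type, not pdf_oxide's span type.
#[derive(Debug, Clone, Copy)]
struct Glyph {
    ch: char,
    x: f32,
    y: f32, // PDF user space: y grows upward
}

/// Lay text out as tategaki: fill each column top to bottom,
/// then start the next column to the left.
fn layout_vertical(text: &str, rows_per_col: usize, pitch: f32) -> Vec<Glyph> {
    text.chars()
        .enumerate()
        .map(|(i, ch)| {
            let col = i / rows_per_col;
            let row = i % rows_per_col;
            Glyph {
                ch,
                x: 500.0 - col as f32 * pitch,
                y: 700.0 - row as f32 * pitch,
            }
        })
        .collect()
}

/// Horizontal reading order: top row first (descending y), then x ascending.
fn read_horizontal(glyphs: &[Glyph]) -> String {
    let mut g = glyphs.to_vec();
    g.sort_by(|a, b| b.y.total_cmp(&a.y).then(a.x.total_cmp(&b.x)));
    g.iter().map(|g| g.ch).collect()
}

/// Vertical reading order: rightmost column first (descending x),
/// then top to bottom (descending y).
fn read_vertical(glyphs: &[Glyph]) -> String {
    let mut g = glyphs.to_vec();
    g.sort_by(|a, b| b.x.total_cmp(&a.x).then(b.y.total_cmp(&a.y)));
    g.iter().map(|g| g.ch).collect()
}
```

With "ABCDEF" laid out as two columns of three glyphs, `read_horizontal` interleaves the columns row by row ("DAEBFC"), while `read_vertical` returns the original string. This is the same interleaving pattern visible in the PDFOX output above.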

Root cause: the feature is entirely absent

  • No vertical reading-order strategy. src/pipeline/reading_order/mod.rs:13-28 ships only StructureTree, Geometric, XYCut, Simple — all horizontal. There is no tategaki/VerticalStrategy.
  • No writing-mode detection. No WMode/writing_mode/DW2/W2 handling in the text pipeline. Identity-V is treated identically to Identity-H (just a CID encoding) in src/fonts/font_dict.rs:1895,3324 — never a layout signal.
  • ReadingOrderContext (mod.rs:66-95) has no writing-mode field.
  • XY-cut is hardwired horizontal (xycut.rs:120-128); row_aware_span_cmp (src/lib.rs:408-424) sorts Y-band then X-ascending.
  • These specific PDFs carry no Identity-V/WMode marker — the vertical layout is built by explicit per-glyph positioning (fixed X, descending Y, columns right-to-left), so detection must be geometric.
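Because these PDFs carry no font-level marker, a geometric check is the only available signal. One possible heuristic, sketched here under assumptions, counts consecutive glyph pairs that step down a column (same X, lower Y) against pairs that step along a row (same Y, higher X). `Pt`, `WritingMode` and `detect_writing_mode` are hypothetical names, and the sketch assumes glyphs arrive in content-stream order and that this order follows the writing order.

```rust
/// Hypothetical writing-mode flag, not an existing pdf_oxide type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum WritingMode {
    Horizontal,
    Vertical,
}

/// Glyph origin in PDF user space (y grows upward). Hypothetical type.
#[derive(Debug, Clone, Copy)]
struct Pt {
    x: f32,
    y: f32,
}

/// Classify a glyph run by comparing how often consecutive glyphs
/// step down a column versus along a row.
/// Assumes `glyphs` is in emission (content-stream) order.
fn detect_writing_mode(glyphs: &[Pt], tol: f32) -> WritingMode {
    let mut vertical_steps = 0usize;
    let mut horizontal_steps = 0usize;

    for pair in glyphs.windows(2) {
        let dx = pair[1].x - pair[0].x;
        let dy = pair[1].y - pair[0].y;
        if dx.abs() <= tol && dy < -tol {
            // Same X, lower Y: next glyph down a vertical column.
            vertical_steps += 1;
        } else if dy.abs() <= tol && dx > tol {
            // Same Y, higher X: next glyph along a horizontal line.
            horizontal_steps += 1;
        }
        // Anything else (column or line breaks, noise) is ignored.
    }

    if vertical_steps > horizontal_steps {
        WritingMode::Vertical
    } else {
        WritingMode::Horizontal
    }
}
```

Ties and empty input fall back to `Horizontal`, so existing behaviour is kept unless the vertical evidence dominates. A production detector would probably also want a minimum step count and a check that columns progress right to left.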

Proposed fix (deferred — new feature, ~3–5 days)

  1. Detection (~1d): detect vertical mode from both the font (Type0 Identity-V, or a CMap with WMode 1) and a geometric detector (glyph runs sharing X with monotonically decreasing Y; columns ordered by descending X).
  2. Strategy (~1–2d): a VerticalStrategy that bands by X (right→left), then orders by descending Y within each column. row_aware_span_cmp_rtl (src/lib.rs:438, added for the Arabic and Hebrew RTL issues #656 and #657) is a close structural template (swap axes, reverse X).
  3. Wiring (~1d): add a writing-mode field to ReadingOrderContext, auto-select the strategy, and honor the mode in StructureTreeStrategy's single-MCID fallback (structure_tree.rs:23,146). Add tests, with the two probe docs as fixtures.
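The banding in step 2 could look roughly like the following. This is a sketch under assumptions rather than the planned implementation: `Span`, `order_vertical` and `read_columns` are hypothetical names, spans are reduced to a single origin point, and `x_tol` stands in for whatever column tolerance the real strategy would derive from font size.

```rust
/// Minimal positioned text span. Hypothetical, not pdf_oxide's span type.
#[derive(Debug, Clone, PartialEq)]
struct Span {
    text: String,
    x: f32,
    y: f32, // PDF user space: y grows upward
}

/// Order spans for vertical right-to-left text: band by X into columns
/// (rightmost first), then sort each column top to bottom.
fn order_vertical(spans: &[Span], x_tol: f32) -> Vec<Span> {
    // Visit spans from right to left so columns are created in reading order.
    let mut sorted = spans.to_vec();
    sorted.sort_by(|a, b| b.x.total_cmp(&a.x));

    let mut columns: Vec<Vec<Span>> = Vec::new();
    for s in sorted {
        // Compare against the column's first (rightmost) span so that
        // small jitter cannot chain neighbouring columns together.
        let starts_new = match columns.last() {
            Some(col) => (col[0].x - s.x).abs() > x_tol,
            None => true,
        };
        if starts_new {
            columns.push(Vec::new());
        }
        columns.last_mut().expect("a column exists").push(s);
    }

    // Within each column, read top to bottom (descending y).
    for col in &mut columns {
        col.sort_by(|a, b| b.y.total_cmp(&a.y));
    }
    columns.into_iter().flatten().collect()
}

/// Concatenate the ordered spans into plain text.
fn read_columns(spans: &[Span], x_tol: f32) -> String {
    order_vertical(spans, x_tol)
        .iter()
        .map(|s| s.text.as_str())
        .collect()
}
```

Banding against the first span of each column, instead of comparing raw X values pairwise, keeps glyphs with slightly different X origins in the same column. For the wiring step, a function like this would presumably be selected when the writing-mode field on ReadingOrderContext indicates vertical text.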

A cheap partial win (geometric "is vertical?" + swapped-axis comparator) would likely drop CER from ~0.96 toward ~0.10–0.20 without font/CMap work.

Found via pdf_benches corpus/A_golden/probes/vertical-cjk-{ja,zh}.
