Summary
pdf_oxide's Markdown output scores well on element detection (heading F1 0.625, emphasis F1 0.755 — both beat pymupdf4llm) but lower on document tree structure (TED similarity 0.390 vs pymupdf4llm 0.534). The gap is concentrated in list block-boundary placement.
Three divergences (from investigation)
In src/pipeline/converters/markdown.rs (break/flush logic ~lines 1378–1476 and the standalone-bullet handler ~1467):
- A. First list item glued onto the preceding heading line: `## Highlights - Revenue grew steadily` is emitted instead of a separate `- Revenue grew steadily` line. The first bullet/ordered marker shares a baseline with the heading (same_line), so the break doesn't fire and the item appends to the heading line.
- B. Blank line inserted between every list item: each tagged /LI has its own block_id, so the main break path emits \n\n between items, splitting one <ul> into single-item lists.
- C. Ordered lists rendered as bullets with the number inline: `- 1. Finalize` instead of `1. Finalize` (the `- ` prefix is added even though the text already carries `1.`).
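The intended behaviour for B and C can be sketched as below. This is a minimal illustration only: `BlockKind`, `separator`, `prefix` and `render` are hypothetical names, not the actual types or functions in markdown.rs, and the real converter has many more guards.

```rust
/// Hypothetical block classification; not pdf_oxide's real type.
#[derive(Clone, Copy, PartialEq, Debug)]
enum BlockKind {
    Paragraph,
    /// List item whose text carries no marker of its own.
    BulletItem,
    /// List item whose text already starts with its marker, e.g. "1. Finalize".
    OrderedItem,
}

fn is_list_item(kind: BlockKind) -> bool {
    matches!(kind, BlockKind::BulletItem | BlockKind::OrderedItem)
}

/// Divergence B: consecutive list items are joined by a single newline,
/// so they stay in one list; every other transition gets a blank line.
fn separator(prev: BlockKind, next: BlockKind) -> &'static str {
    if is_list_item(prev) && is_list_item(next) {
        "\n"
    } else {
        "\n\n"
    }
}

/// Divergence C: only add "- " when the text has no marker of its own.
fn prefix(kind: BlockKind) -> &'static str {
    match kind {
        BlockKind::BulletItem => "- ",
        BlockKind::OrderedItem | BlockKind::Paragraph => "",
    }
}

/// Concatenate blocks using the two rules above.
fn render(blocks: &[(BlockKind, &str)]) -> String {
    let mut out = String::new();
    let mut prev: Option<BlockKind> = None;
    for &(kind, text) in blocks {
        if let Some(p) = prev {
            out.push_str(separator(p, kind));
        }
        out.push_str(prefix(kind));
        out.push_str(text);
        prev = Some(kind);
    }
    out
}
```

With these rules, a paragraph followed by two ordered items renders as one list with the original numbering, and two bullet items are separated by a single newline rather than a blank line.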
Status / why deferred from v0.3.61
A focused attempt fixed B (a single \n between consecutive list items) and C (no - prefix for ordered/bullet-text items), lifting mean TED to ~0.439 with no regressions. A, however, could not be fixed without regressing other docs: forcing a break on any bullet/ordered marker regardless of same_line misfires on numeric table cells, because is_ordered_list_marker matches Q1/1.20-style cell text (table-bordered TED 0.947→0.500, invoice 0.931→0.862).
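To illustrate why text alone is ambiguous here, the sketch below uses a deliberately naive matcher. `looks_like_ordered_marker` is a hypothetical stand-in and is not the real is_ordered_list_marker, whose exact rules may differ; the point is only that a list item and a decimal table cell can look identical to a prefix check.

```rust
/// Hypothetical text-pattern check: leading ASCII digits followed by '.'.
/// This is NOT the real is_ordered_list_marker, just a stand-in.
fn looks_like_ordered_marker(text: &str) -> bool {
    let trimmed = text.trim_start();
    // ASCII digits are one byte each, so the char count is a valid byte offset.
    let digits = trimmed.chars().take_while(|c| c.is_ascii_digit()).count();
    digits > 0 && trimmed[digits..].starts_with('.')
}
```

Both "1. Finalize" (a real list item) and "1.20" (a table cell) satisfy this check, so a break forced purely on the text pattern cannot tell them apart.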
A correct fix needs a list-start signal that distinguishes a heading→list-item transition from a table-row baseline, likely using structure-tree role context for tagged PDFs and a stricter bullet-glyph gate for untagged ones, rather than text-pattern matching. The converter is intricate, with many guarded edge cases (irs_f1040, pdfa_049, newspapers, #377 D3/D5), so this warrants careful, separately validated work against all 12 markdown bench docs.
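One possible shape for such a signal is sketched below. Everything here is an assumption for illustration: `Role`, `SpanCtx`, `is_list_start` and the glyph set are hypothetical and do not reflect the converter's real data model.

```rust
/// Hypothetical structure-tree roles; names are illustrative only.
#[derive(Clone, Copy, PartialEq)]
enum Role {
    Heading,
    ListItem,
    TableCell,
    Other,
}

/// Hypothetical context for the span about to be emitted.
struct SpanCtx<'a> {
    /// Role of this span from the structure tree; None for untagged PDFs.
    role: Option<Role>,
    /// Role of the previously emitted span, if known.
    prev_role: Option<Role>,
    text: &'a str,
}

/// Untagged gate: accept only explicit bullet glyphs (assumed set),
/// never digit patterns such as "1.20".
fn starts_with_bullet_glyph(text: &str) -> bool {
    matches!(
        text.trim_start().chars().next(),
        Some('•' | '◦' | '▪' | '‣')
    )
}

/// True when a break should be forced before this span even if it
/// shares a baseline with the previous one (same_line).
fn is_list_start(ctx: &SpanCtx) -> bool {
    match ctx.role {
        // Tagged: the first list item after any non-list content starts a list.
        Some(Role::ListItem) => ctx.prev_role != Some(Role::ListItem),
        // Tagged but not a list item (heading, table cell, ...): never.
        Some(_) => false,
        // Untagged: fall back to the strict glyph gate.
        None => starts_with_bullet_glyph(ctx.text),
    }
}
```

Under these assumptions a tagged table cell containing "1.20" never triggers a forced break, while a tagged list item following a heading does. The trade-off is that untagged ordered lists sharing a baseline with a heading would still not be split; that case would need a separate, carefully validated signal.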
Acceptance: mean TED ≥ ~0.48 with zero per-doc regressions on the pdf_benches markdown corpus.
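The acceptance rule can be expressed as a small check over per-document scores. This is a sketch under assumptions: `DocScore`, `mean_ted`, `regressions` and `accepts` are hypothetical helpers, not part of pdf_benches, and the scores would come from whatever the benchmark harness reports.

```rust
/// TED similarity for one benchmark document before and after a change.
struct DocScore<'a> {
    name: &'a str,
    baseline: f64,
    candidate: f64,
}

/// Mean candidate TED over the corpus (0.0 for an empty corpus).
fn mean_ted(scores: &[DocScore]) -> f64 {
    if scores.is_empty() {
        return 0.0;
    }
    scores.iter().map(|s| s.candidate).sum::<f64>() / scores.len() as f64
}

/// Names of documents whose candidate score fell below their baseline.
fn regressions<'a>(scores: &[DocScore<'a>]) -> Vec<&'a str> {
    scores
        .iter()
        .filter(|s| s.candidate < s.baseline)
        .map(|s| s.name)
        .collect()
}

/// Accept only if the mean clears the threshold and no document regressed.
fn accepts(scores: &[DocScore], min_mean: f64) -> bool {
    mean_ted(scores) >= min_mean && regressions(scores).is_empty()
}
```

Fed the two regressed documents from the deferred attempt at A (table-bordered 0.947→0.500, invoice 0.931→0.862), `regressions` lists both and `accepts` returns false regardless of the mean.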