Markdown list block boundaries hurt TED structure score (heading-glue / item-splitting / ordered markers) #664

@yfedoseev

Summary

pdf_oxide's Markdown output scores well on element detection (heading F1 0.625, emphasis F1 0.755 — both beat pymupdf4llm) but lower on document tree structure (TED similarity 0.390 vs pymupdf4llm 0.534). The gap is concentrated in list block-boundary placement.

Three divergences (from investigation)

In src/pipeline/converters/markdown.rs (break/flush logic ~lines 1378–1476 and the standalone-bullet handler ~1467):

  • A. First list item glued onto the preceding heading line: `## Highlights - Revenue grew steadily` instead of a separate `- Revenue grew steadily` line. The first bullet/ordered marker shares a baseline with the heading (same_line), so the break doesn't fire and the item appends to the heading line.
  • B. Blank line inserted between every list item: each tagged /LI has its own block_id, so the main break path emits \n\n between items, splitting one <ul> into single-item lists.
  • C. Ordered lists rendered as bullets with the number inline: `- 1. Finalize` instead of `1. Finalize` (the - prefix is added even though the text already carries 1.).
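The output shape that B and C aim for can be sketched as follows. This is a minimal illustration, not the converter's actual code: `Marker`, `ListItem`, `render_list`, and `render_heading_then_list` are hypothetical names, and the real logic in markdown.rs works on extracted spans rather than a pre-built item list.

```rust
/// Hypothetical, simplified view of one extracted list item.
#[derive(Debug, Clone, PartialEq)]
pub enum Marker {
    /// Unordered item whose text carries no marker of its own.
    Bullet,
    /// Item whose text already starts with its own marker, e.g. "1. Finalize".
    InText,
}

/// Hypothetical list item: marker kind plus the extracted text.
pub struct ListItem {
    pub marker: Marker,
    pub text: String,
}

/// Render consecutive items as one tight Markdown list.
/// B: a single '\n' separates items, so they stay in the same list.
/// C: no "- " prefix is added when the text already carries its marker.
pub fn render_list(items: &[ListItem]) -> String {
    let mut out = String::new();
    for item in items {
        if item.marker == Marker::Bullet {
            out.push_str("- ");
        }
        out.push_str(&item.text);
        out.push('\n');
    }
    out
}

/// A: the heading ends its own line, and the list starts as a separate block.
pub fn render_heading_then_list(heading: &str, items: &[ListItem]) -> String {
    format!("## {}\n\n{}", heading, render_list(items))
}
```

With a single bullet item "Revenue grew steadily" under the heading "Highlights", this yields the heading line, a blank line, and then `- Revenue grew steadily` on its own line.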

Status / why deferred from v0.3.61

A focused attempt fixed B (a single \n between consecutive list items) and C (no - prefix for items whose text already carries an ordered or bullet marker), lifting mean TED to ~0.439 with no regressions. A could not be fixed without regressing other docs: forcing a break on any bullet/ordered marker regardless of same_line misfires on table number cells, because is_ordered_list_marker matches Q1/1.20-style cell text (table-bordered TED 0.947→0.500, invoice 0.931→0.862).
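To illustrate the failure mode only: the sketch below contrasts a loose prefix check with a stricter one. Both functions are hypothetical and are not the real is_ordered_list_marker, whose implementation is not shown in this issue. The loose check is an assumed stand-in that accepts any leading digits followed by a dot.

```rust
/// Hypothetical loose check: leading ASCII digits followed by '.'.
/// Accepts list markers, but also decimal cell text such as "1.20".
pub fn naive_is_ordered_marker(text: &str) -> bool {
    let t = text.trim_start();
    // ASCII digits are one byte each, so the count is also a byte offset.
    let digits = t.chars().take_while(|c| c.is_ascii_digit()).count();
    digits > 0 && t[digits..].starts_with('.')
}

/// Hypothetical stricter check: digits, then '.' or ')', then whitespace,
/// then at least one non-whitespace character of item text.
pub fn strict_is_ordered_marker(text: &str) -> bool {
    let t = text.trim_start();
    let digits = t.chars().take_while(|c| c.is_ascii_digit()).count();
    if digits == 0 {
        return false;
    }
    let mut rest = t[digits..].chars();
    let delim_ok = matches!(rest.next(), Some('.') | Some(')'));
    let space_ok = matches!(rest.next(), Some(c) if c.is_whitespace());
    let has_body = rest.any(|c| !c.is_whitespace());
    delim_ok && space_ok && has_body
}
```

The stricter pattern rejects "1.20" and "Q1" but still accepts a table cell that happens to read "1. Widget", so tightening the text pattern narrows the misfire without removing it.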

A correct fix needs a list-start signal that distinguishes a heading→list-item transition from a table-row baseline — likely using structure-tree role context (tagged) and a stricter untagged bullet-glyph gate, rather than text-pattern matching. The converter is intricate (many guarded edge cases: irs_f1040, pdfa_049, newspapers, #377 D3/D5), so this warrants careful, separately-validated work against all 12 markdown bench docs.
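One possible shape for such a list-start signal is sketched below, under stated assumptions: `Role`, `Span`, `BULLET_GLYPHS`, and `is_list_start` are hypothetical names, the role is assumed to come from the structure tree when the PDF is tagged, and `Role::Unknown` stands in for untagged content. It is not a proposal for the final gate, which would still need validating against the guarded edge cases.

```rust
/// Hypothetical structure role attached to a span of extracted text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    Heading,
    ListItem,
    TableCell,
    Paragraph,
    /// No structure-tree information (untagged content).
    Unknown,
}

/// Hypothetical span: role plus its text.
pub struct Span<'a> {
    pub role: Role,
    pub text: &'a str,
}

/// Assumed set of glyphs accepted as bullets in untagged content.
const BULLET_GLYPHS: [char; 4] = ['•', '◦', '▪', '‣'];

/// True when the text starts with a bullet glyph followed by whitespace.
fn starts_with_bullet_glyph(text: &str) -> bool {
    let mut chars = text.trim_start().chars();
    match (chars.next(), chars.next()) {
        (Some(glyph), Some(next)) => BULLET_GLYPHS.contains(&glyph) && next.is_whitespace(),
        _ => false,
    }
}

/// Decide whether `cur` opens a new list and so needs a forced break,
/// even when it shares a baseline with `prev` (`same_line`).
pub fn is_list_start(prev: &Span<'_>, cur: &Span<'_>, same_line: bool) -> bool {
    match cur.role {
        // Table cells never open a list, whatever their text looks like.
        Role::TableCell => false,
        // Tagged: the role transition decides; the baseline is ignored.
        Role::ListItem => prev.role != Role::ListItem,
        // Untagged: require a real bullet glyph, and on a shared baseline
        // only break when the previous span is a heading.
        Role::Unknown => {
            starts_with_bullet_glyph(cur.text) && (!same_line || prev.role == Role::Heading)
        }
        _ => false,
    }
}
```

In this sketch a heading followed by a tagged list item on the same baseline returns true, while two table cells such as "Q1" and "1.20" on one row return false, because no text pattern is consulted for them.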

Acceptance: mean TED ≥ ~0.48 with zero per-doc regressions on the pdf_benches markdown corpus.
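The acceptance rule can be expressed as a small check over per-doc scores. This is a sketch only: `Verdict` and `check_acceptance` are hypothetical names, the tolerance `eps` is an assumption, and how the bench harness actually reports TED per document is not specified here.

```rust
use std::collections::BTreeMap;

/// Hypothetical result of comparing a candidate run against a baseline.
pub struct Verdict {
    pub mean: f64,
    /// Docs whose candidate TED fell below baseline (or are missing).
    pub regressions: Vec<String>,
    pub passed: bool,
}

/// Pass only if mean candidate TED reaches `min_mean` and no doc regresses.
/// `eps` is an assumed tolerance for floating-point noise.
pub fn check_acceptance(
    baseline: &BTreeMap<String, f64>,
    candidate: &BTreeMap<String, f64>,
    min_mean: f64,
    eps: f64,
) -> Verdict {
    let mut regressions = Vec::new();
    let mut sum = 0.0;
    for (doc, &base) in baseline {
        match candidate.get(doc) {
            Some(&score) => {
                sum += score;
                if score + eps < base {
                    regressions.push(doc.clone());
                }
            }
            // A doc missing from the candidate run counts as a regression.
            None => regressions.push(doc.clone()),
        }
    }
    let mean = if baseline.is_empty() {
        0.0
    } else {
        sum / baseline.len() as f64
    };
    Verdict {
        passed: mean >= min_mean && regressions.is_empty(),
        mean,
        regressions,
    }
}
```

Fed the two regressions reported above (table-bordered 0.947→0.500, invoice 0.931→0.862), the check fails on the per-doc condition even though the mean of those two docs alone is above 0.48.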

Labels: enhancement (New feature or request)
