Markdown list block boundaries hurt TED structure score (heading-glue / item-splitting / ordered markers)

## Summary

pdf_oxide's Markdown output scores well on element detection (heading F1 0.625, emphasis F1 0.755 — both beat pymupdf4llm) but lower on document **tree structure** (TED similarity 0.390 vs pymupdf4llm 0.534). The gap is concentrated in list block-boundary placement.

## Three divergences (from investigation)

In `src/pipeline/converters/markdown.rs` (break/flush logic ~lines 1378–1476 and the standalone-bullet handler ~1467):

- **A. First list item glued onto the preceding heading line** — `## Highlights - Revenue grew steadily` instead of a separate `- Revenue grew steadily`. The first bullet/ordered marker shares a baseline with the heading (`same_line`), so the break doesn't fire and the item appends to the heading line.
- **B. Blank line inserted between every list item** — each tagged `/LI` has its own `block_id`, so the main break path emits `\n\n` between items, splitting one `<ul>` into single-item lists.
- **C. Ordered lists rendered as bullets with the number inline** — `- 1. Finalize` instead of `1. Finalize` (the `- ` prefix is added even though the text already carries `1.`).
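The intended output for the three cases can be sketched as a tiny renderer. This is an illustration only, not code from `markdown.rs`: the `Block` enum, `has_ordered_marker`, and `render` are hypothetical names, and the sketch assumes the blocks have already been classified correctly (which is exactly what A gets wrong today).

```rust
/// Hypothetical block model for illustration; not a pdf_oxide type.
#[derive(Debug, Clone, PartialEq)]
pub enum Block {
    Heading { level: usize, text: String },
    ListItem { text: String },
    Paragraph(String),
}

/// True when the text already carries an ordered marker such as `1. ` or `12) `.
fn has_ordered_marker(text: &str) -> bool {
    let digits = text.chars().take_while(|c| c.is_ascii_digit()).count();
    if digits == 0 {
        return false;
    }
    // ASCII digits are one byte each, so `digits` is a valid byte offset.
    let mut rest = text[digits..].chars();
    matches!(rest.next(), Some('.') | Some(')')) && matches!(rest.next(), Some(' '))
}

/// Render blocks to Markdown with the boundaries described in A, B and C.
pub fn render(blocks: &[Block]) -> String {
    let mut out = String::new();
    for (i, block) in blocks.iter().enumerate() {
        if i > 0 {
            // B: consecutive list items stay in one list (single newline).
            // A: any other transition, including heading -> item, gets a blank line.
            let both_items = matches!(blocks[i - 1], Block::ListItem { .. })
                && matches!(block, Block::ListItem { .. });
            out.push_str(if both_items { "\n" } else { "\n\n" });
        }
        match block {
            Block::Heading { level, text } => {
                out.push_str(&"#".repeat(*level));
                out.push(' ');
                out.push_str(text);
            }
            Block::ListItem { text } => {
                // C: do not add `- ` when the text already starts with `1.`.
                if !has_ordered_marker(text) {
                    out.push_str("- ");
                }
                out.push_str(text);
            }
            Block::Paragraph(text) => out.push_str(text),
        }
    }
    out
}
```

With a heading followed by two bullet items, this yields `## Highlights`, a blank line, then `- Revenue grew steadily` and the next item on adjacent lines. Items whose text starts with `1.` are emitted without the extra `- `.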

## Status / why deferred from v0.3.61

A focused attempt fixed **B** (single `\n` between consecutive list items) and **C** (no `- ` prefix for ordered/bullet-text items), lifting mean TED to ~0.439 with no regressions. But **A** could not be fixed without regressing other docs. Forcing a break on any bullet/ordered marker regardless of `same_line` mis-fires on **table number cells**, because `is_ordered_list_marker` matches `Q1`/`1.20`-style cell text: `table-bordered` TED dropped 0.947→0.500 and `invoice` 0.931→0.862.

A correct fix needs a list-start signal that distinguishes a heading→list-item transition from a table-row baseline — likely using structure-tree role context (tagged) and a stricter untagged bullet-glyph gate, rather than text-pattern matching. The converter is intricate (many guarded edge cases: irs_f1040, pdfa_049, newspapers, #377 D3/D5), so this warrants careful, separately-validated work against all 12 markdown bench docs.
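One possible shape for such a list-start signal is sketched below. Everything here is an assumption made for illustration: `Role`, `Span`, `is_list_start`, the set of bullet glyphs, and the three-digit cap are hypothetical and do not reflect the converter's actual types or the real `is_ordered_list_marker`. The idea is that tagged content trusts the structure-tree role, while untagged content must pass a stricter gate than a plain text-pattern match.

```rust
/// Hypothetical structure-tree roles; not pdf_oxide's actual enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    Heading,
    ListItem,
    TableCell,
    Other,
}

/// Hypothetical span: its text plus a role when the PDF is tagged.
pub struct Span<'a> {
    pub text: &'a str,
    pub role: Option<Role>, // None for untagged content
}

/// Untagged gate: a dedicated bullet glyph followed by whitespace.
fn starts_with_bullet_glyph(text: &str) -> bool {
    let mut chars = text.chars();
    match (chars.next(), chars.next()) {
        (Some(g), Some(next)) => matches!(g, '•' | '◦' | '▪' | '‣') && next.is_whitespace(),
        _ => false,
    }
}

/// Untagged gate: 1-3 digits, then `.` or `)`, then whitespace, then some
/// alphabetic text. This rejects cell-like text such as `1.20` and `Q1`.
fn starts_with_strict_ordered_marker(text: &str) -> bool {
    let digits = text.chars().take_while(|c| c.is_ascii_digit()).count();
    if digits == 0 || digits > 3 {
        return false;
    }
    let mut rest = text[digits..].chars();
    matches!(rest.next(), Some('.') | Some(')'))
        && matches!(rest.next(), Some(c) if c.is_whitespace())
        && rest.any(|c| c.is_alphabetic())
}

/// Decide whether `cur` starts a list item, independent of baseline sharing.
pub fn is_list_start(prev: &Span<'_>, cur: &Span<'_>) -> bool {
    match cur.role {
        // Tagged: trust the structure tree, never the text pattern.
        Some(Role::ListItem) => true,
        Some(_) => false,
        // Untagged: strict glyph/marker gate, and never right after a table cell.
        None => {
            prev.role != Some(Role::TableCell)
                && (starts_with_bullet_glyph(cur.text)
                    || starts_with_strict_ordered_marker(cur.text))
        }
    }
}
```

Under these assumptions a tagged table cell containing `1.20` never triggers a break, while an untagged `1. Finalize` after a heading does. Whether the gate is strict enough would still have to be validated against the bench docs.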

Acceptance: mean TED ≥ ~0.48 with zero per-doc regressions on the `pdf_benches` markdown corpus.
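The acceptance rule can be expressed as a small check over per-doc scores. This is a minimal sketch, not part of `pdf_benches`: `check_acceptance` and its signature are hypothetical, and it assumes per-doc TED scores are available as name-to-score maps.

```rust
use std::collections::BTreeMap;

/// Returns the candidate's mean TED when it meets `min_mean` and no doc
/// scores below its baseline; otherwise returns every failure found.
/// Hypothetical helper for illustration only.
pub fn check_acceptance(
    baseline: &BTreeMap<&str, f64>,
    candidate: &BTreeMap<&str, f64>,
    min_mean: f64,
) -> Result<f64, Vec<String>> {
    const EPS: f64 = 1e-9; // tolerate floating-point noise
    let mut failures = Vec::new();

    // Per-doc check: every baseline doc must be present and not regress.
    for (doc, base) in baseline {
        match candidate.get(doc) {
            Some(cand) if *cand + EPS >= *base => {}
            Some(cand) => failures.push(format!("{doc}: {base:.3} -> {cand:.3}")),
            None => failures.push(format!("{doc}: missing from candidate run")),
        }
    }

    // Corpus-level check: mean TED over the candidate run.
    if candidate.is_empty() {
        failures.push("candidate run has no documents".to_string());
        return Err(failures);
    }
    let mean = candidate.values().sum::<f64>() / candidate.len() as f64;
    if mean + EPS < min_mean {
        failures.push(format!("mean TED {mean:.3} below {min_mean:.3}"));
    }

    if failures.is_empty() {
        Ok(mean)
    } else {
        Err(failures)
    }
}
```

Fed the two regressions quoted above (`table-bordered` 0.947→0.500, `invoice` 0.931→0.862), the check reports both docs as failures even if the mean stays above the threshold.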

