2 changes: 2 additions & 0 deletions .gitignore
@@ -16,3 +16,5 @@ phpstan.neon
/phpunit.xml
/.phpunit.cache/
###< phpunit/phpunit ###

/.ralph-tui/
1 change: 1 addition & 0 deletions GEMINI.md
3 changes: 3 additions & 0 deletions phpunit.dist.xml
@@ -24,6 +24,9 @@
<testsuite name="Utils Suite">
<file>tests/UtilsTest.php</file>
</testsuite>
<testsuite name="Third Party Fixtures Suite">
<file>tests/ThirdPartyGoFixturesTest.php</file>
</testsuite>
</testsuites>

<source ignoreSuppressionOfDeprecations="true"
155 changes: 155 additions & 0 deletions plans/fixture_import_step_by_step_plan.md
@@ -0,0 +1,155 @@
# Fixture Import Plan (Step by Step)

Goal: import the best third-party HTML-to-Markdown fixtures into this repo without destabilizing existing behavior.

## Scope

- Keep existing suites green (`PHP Fixtures Suite`, `Rust Fixtures Suite`, `Utils Suite`).
- Add third-party fixtures in phases, with source attribution and predictable normalization.
- Track intentional style differences separately from true conversion bugs.

## Proposed Target Layout

Use dedicated directories under `tests/files`:

- `tests/files/thirdPartyFixtures/go/`
- `tests/files/thirdPartyFixtures/dotnet/`
- `tests/files/thirdPartyFixtures/js/`
- `tests/files/thirdPartyFixtures/ruby/`
- `tests/files/thirdPartyFixtures/java/`
- `tests/files/thirdPartyFixtures/THIRD_PARTY_FIXTURES.md` (source + license + commit SHA)

Use normalized pair naming:

- `<source>_<group>_<index>_<slug>.html`
- `<source>_<group>_<index>_<slug>.md`
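
The naming scheme above can be sketched as a small helper. This is only an illustration: the function name `fixture_pair_names`, the slug rule, and the three-digit zero padding are assumptions, not something this plan prescribes.

```python
import re


def fixture_pair_names(source: str, group: str, index: int, title: str) -> tuple[str, str]:
    """Return (html_name, md_name) following <source>_<group>_<index>_<slug>."""
    # Assumed slug rule: lowercase, with every run of non-alphanumeric
    # characters collapsed into a single underscore.
    slug = re.sub(r"[^a-z0-9]+", "_", title.lower()).strip("_")
    # Assumed padding: three digits, so names sort in import order.
    stem = f"{source}_{group}_{index:03d}_{slug}"
    return f"{stem}.html", f"{stem}.md"


html_name, md_name = fixture_pair_names("go", "table", 1, "Col Row Span")
print(html_name, md_name)
```

Keeping both names derived from one stem means a pair can never drift apart through a typo in only one of the two files.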

## Phase 0: Foundation

1. Add third-party fixture root directories and attribution file.
2. Add a dedicated PHPUnit suite file, for example `tests/ThirdPartyFixturesTest.php`.
3. Reuse the Rust suite style: load only `tests/files/thirdPartyFixtures/**/*.html` and assert against matching `.md`.
4. Add metadata support file (JSON map) for known divergence buckets.

Done criteria:
- New suite can run with zero fixtures and pass.
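
The pairing rule from step 3 and the zero-fixture criterion can be sketched as follows. The real suite is a PHPUnit test; this Python version only illustrates the discovery logic, and the name `discover_pairs` is hypothetical.

```python
from pathlib import Path


def discover_pairs(root: Path) -> tuple[list[tuple[Path, Path]], list[Path]]:
    """Return (pairs, orphans) for every *.html file found under root.

    A pair is (input.html, expected.md). An orphan is an .html file with
    no sibling .md file of the same stem.
    """
    pairs: list[tuple[Path, Path]] = []
    orphans: list[Path] = []
    # A missing or empty root yields no pairs, so a suite built on this
    # function can pass before any fixture has been imported.
    if not root.is_dir():
        return pairs, orphans
    for html in sorted(root.rglob("*.html")):
        md = html.with_suffix(".md")
        if md.is_file():
            pairs.append((html, md))
        else:
            orphans.append(html)
    return pairs, orphans
```

Reporting orphans separately, rather than silently skipping them, makes an incomplete import visible at test time.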

## Phase 1: Go Golden Files (First Import)

Source priority:
- `JohannesKaufmann/html-to-markdown/plugin/commonmark/testdata/GoldenFiles`
- `JohannesKaufmann/html-to-markdown/plugin/table/testdata/GoldenFiles`
- `JohannesKaufmann/html-to-markdown/plugin/strikethrough/testdata/GoldenFiles`

Steps:
1. Copy `*.in.html` as `.html` and matching `*.out.md` as `.md`.
2. Prefix with `go_` and group names (`commonmark`, `table`, `strikethrough`).
3. Run only third-party suite and record failures by category.
4. Mark expected style-only diffs in metadata instead of immediately changing core behavior.

Done criteria:
- At least 50 high-value Go fixture pairs imported.
- A failure report grouped by category is generated.
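
Steps 1 and 2 amount to a rename-and-copy pass over each GoldenFiles directory. A minimal sketch, assuming the upstream files have already been checked out locally; `plan_go_import` and `apply_plan` are hypothetical names, and the numeric index is assigned in sorted order purely for illustration.

```python
import shutil
from pathlib import Path


def plan_go_import(golden_dir: Path, dest_dir: Path, group: str) -> list[tuple[Path, Path]]:
    """Return (source, destination) copies for every complete golden pair."""
    plan: list[tuple[Path, Path]] = []
    index = 0
    for in_html in sorted(golden_dir.glob("*.in.html")):
        name = in_html.name[: -len(".in.html")]
        out_md = golden_dir / f"{name}.out.md"
        if not out_md.is_file():
            continue  # skip inputs that have no matching expected output
        index += 1
        stem = f"go_{group}_{index:03d}_{name}"
        plan.append((in_html, dest_dir / f"{stem}.html"))
        plan.append((out_md, dest_dir / f"{stem}.md"))
    return plan


def apply_plan(plan: list[tuple[Path, Path]]) -> None:
    """Copy each planned file, creating destination directories as needed."""
    for source, destination in plan:
        destination.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(source, destination)
```

Splitting planning from copying lets the rename map be reviewed, or recorded as the "transformations applied" entry in the attribution file, before anything is written.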

## Phase 2: .NET CommonMark + Verified Regressions

Source priority:
- `reversemarkdown-net/src/ReverseMarkdown.Test/TestData/commonmark.json`
- Selected `*.verified.md`/`*.verified.txt` cases with real bug value.

Steps:
1. Convert JSON/snapshot fixtures into HTML/MD pairs in `dotnet/`.
2. Skip or tag cases that rely on framework-specific formatting assumptions.
3. Run suite and categorize mismatches.

Done criteria:
- CommonMark subset imported and runnable.
- High-noise snapshot cases clearly tagged.
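
Step 1 could look roughly like the sketch below. It assumes that `commonmark.json` follows the shape of the CommonMark spec's JSON test dump (a list of objects with `example`, `section`, `html`, and `markdown` keys). That shape has not been verified against the upstream file, and the function name and the section-based skip list are hypothetical.

```python
import json
import re


def commonmark_cases_to_pairs(
    raw_json: str, skip_sections: frozenset[str] = frozenset()
) -> list[tuple[str, str, str]]:
    """Return (stem, html, markdown) triples, one per retained case.

    ASSUMPTION: each entry has "example", "section", "html" and "markdown"
    keys, as in the CommonMark spec's own JSON test dump.
    """
    pairs: list[tuple[str, str, str]] = []
    for case in json.loads(raw_json):
        section = case["section"]
        if section in skip_sections:
            continue  # e.g. sections that depend on framework-specific formatting
        slug = re.sub(r"[^a-z0-9]+", "_", section.lower()).strip("_")
        stem = f"dotnet_commonmark_{int(case['example']):03d}_{slug}"
        # For an HTML-to-Markdown converter, the spec's HTML is the input
        # and the spec's Markdown is the expected output.
        pairs.append((stem, case["html"], case["markdown"]))
    return pairs
```

Many distinct Markdown sources render to the same HTML, so a converter can be correct without reproducing the spec's Markdown verbatim; mismatches from this corpus are therefore likely to need the style-only tagging described in step 2.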

## Phase 3: Turndown (JS) Conversion Corpus

Source priority:
- `mixmark-io/turndown/test/`

Steps:
1. Extract pure conversion cases first (avoid plugin/rule override tests initially).
2. Convert into pair fixtures under `js/`.
3. Compare output and tag differences in link, list, and escaping behavior.

Done criteria:
- Core Turndown conversion subset imported.
- No regression in existing PHP and Rust suites.

## Phase 4: Ruby Real-World Assets

Source priority:
- `xijo/reverse_markdown/spec/assets`

Steps:
1. Import representative assets (start with short-medium documents).
2. Build expected Markdown from upstream tests where possible.
3. Keep large documents in a separate optional suite if runtime grows.

Done criteria:
- Real-world HTML shapes covered (nested sections, docs-like content, media-heavy blocks).

## Phase 5: Flexmark Long-Tail Specs

Source priority:
- `flexmark-java/flexmark-html2md-converter/src/test/resources`

Steps:
1. Select focused subsets (lists, code blocks, links, tables) before full import.
2. Convert spec formats into pair fixtures.
3. Add a slow suite label if fixture volume becomes large.

Done criteria:
- At least one curated subset imported for each major feature area.

## Normalization Rules (Apply Before Assertion)

1. Normalize line endings to LF.
2. Trim trailing whitespace.
3. Collapse runs of repeated blank lines down to a fixed maximum (bounded policy).
4. Keep entity decoding policy explicit (do not silently over-normalize).
5. Keep Markdown style toggles configurable per suite.
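
Rules 1 to 3 are mechanical and can be sketched as one function applied to both the expected and the actual Markdown. The name `normalize_markdown` and the default bound of one blank line are assumptions. Rules 4 and 5 are policy decisions and are deliberately left out: the sketch never touches entities or Markdown markers.

```python
def normalize_markdown(text: str, max_blank_lines: int = 1) -> str:
    """Apply rules 1 to 3: LF endings, no trailing whitespace, bounded blank runs."""
    # Rule 1: normalize CRLF and bare CR to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Rule 2: trim trailing spaces and tabs on every line.
    lines = [line.rstrip(" \t") for line in text.split("\n")]
    # Rule 3: keep at most max_blank_lines consecutive blank lines.
    kept: list[str] = []
    blank_run = 0
    for line in lines:
        if line == "":
            blank_run += 1
            if blank_run > max_blank_lines:
                continue
        else:
            blank_run = 0
        kept.append(line)
    return "\n".join(kept)


print(repr(normalize_markdown("a  \r\n\r\n\r\n\r\nb\r\n")))
```

One caveat with rule 2: two trailing spaces are a hard line break in Markdown, so trimming them hides differences in line-break style. If that distinction matters for a suite, the trim step would need to be made optional.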

## Mismatch Buckets To Track

- `whitespace`
- `list_shape`
- `emphasis_style`
- `autolink_policy`
- `escaping`
- `table_format`
- `entity_handling`
- `parser_bug`

## Quality Gates Per Phase

Run on every phase:

1. `composer run cs-fix`
2. `composer run tests`
3. `vendor/bin/phpunit --testsuite "Rust Fixtures Suite"`
4. `vendor/bin/phpunit --testsuite "PHP Fixtures Suite"`
5. `vendor/bin/phpunit --testsuite "Utils Suite"`
6. `vendor/bin/phpunit --testsuite "Third Party Fixtures Suite"` (new)

## Attribution Checklist

For each imported fixture group, add to `THIRD_PARTY_FIXTURES.md`:

1. Upstream repository URL
2. Upstream commit SHA or release tag
3. Source file path(s)
4. License
5. Import date
6. Any transformations applied
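
One way to keep the six checklist items consistent across fixture groups is to render every entry from a single record type. The sketch below is illustrative only: the class name, field names, and Markdown layout are assumptions, and the values in angle brackets are placeholders rather than real data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FixtureAttribution:
    """One THIRD_PARTY_FIXTURES.md entry; one field per checklist item."""

    repository_url: str
    revision: str  # upstream commit SHA or release tag
    source_paths: tuple[str, ...]
    license: str
    import_date: str
    transformations: tuple[str, ...]

    def to_markdown(self, heading: str) -> str:
        lines = [
            f"## {heading}",
            "",
            f"- Upstream repository: {self.repository_url}",
            f"- Upstream revision: {self.revision}",
            "- Source paths:",
        ]
        lines += [f"  - `{path}`" for path in self.source_paths]
        lines += [
            f"- License: {self.license}",
            f"- Import date: {self.import_date}",
            "- Transformations:",
        ]
        lines += [f"  - {step}" for step in self.transformations]
        return "\n".join(lines) + "\n"


entry = FixtureAttribution(
    repository_url="https://github.com/JohannesKaufmann/html-to-markdown",
    revision="<commit-sha>",  # placeholder, fill in at import time
    source_paths=("plugin/commonmark/testdata/GoldenFiles",),
    license="MIT",
    import_date="<YYYY-MM-DD>",  # placeholder
    transformations=("renamed *.in.html to .html and *.out.md to .md",),
)
print(entry.to_markdown("Go: commonmark"))
```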

## Suggested Milestones

- Milestone A: Foundation + Phase 1 complete
- Milestone B: Phase 2 + Phase 3 complete
- Milestone C: Phase 4 + curated Phase 5 complete
- Milestone D: Stabilization pass (reduce expected diffs and convert to true pass cases)
70 changes: 70 additions & 0 deletions plans/go_fixture_import_mismatch_report.md
@@ -0,0 +1,70 @@
# Go Fixture Import Mismatch Report (Phase 1)

Date: 2026-03-16

## Scope

- Fixture suite: `Third Party Fixtures Suite` (Go only)
- Upstream dataset: `JohannesKaufmann/html-to-markdown` GoldenFiles snapshot
- Imported groups:
  - `plugin/commonmark/testdata/GoldenFiles`
  - `plugin/table/testdata/GoldenFiles`
  - `plugin/strikethrough/testdata/GoldenFiles`

## Execution

- Command run: `vendor/bin/phpunit --testsuite "Third Party Fixtures Suite"`
- Result: suite executed successfully with 15 tests total (14 fixture cases + 1 root-directory smoke test).
- Observation: PHPUnit currently reports all 14 fixture cases as risky. Style-only mismatches are intentionally allowed in Phase 1, so each of these tests returns before reaching a hard assertion and therefore performs no assertions.

## Bucket Summary Counts

| Bucket | Count |
|---|---:|
| `whitespace` | 0 |
| `list_shape` | 0 |
| `emphasis_style` | 0 |
| `autolink_policy` | 0 |
| `escaping` | 0 |
| `table_format` | 0 |
| `entity_handling` | 0 |
| `parser_bug` | 0 |
| `unclassified` | 14 |

## Per-Fixture Mismatch Listing

### `unclassified` (14)

All current mismatches are tagged `style_only: true` in divergence metadata.

- `johanneskaufmann-html-to-markdown/commonmark/blockquote`
- `johanneskaufmann-html-to-markdown/commonmark/bold`
- `johanneskaufmann-html-to-markdown/commonmark/code`
- `johanneskaufmann-html-to-markdown/commonmark/heading`
- `johanneskaufmann-html-to-markdown/commonmark/image`
- `johanneskaufmann-html-to-markdown/commonmark/link`
- `johanneskaufmann-html-to-markdown/commonmark/list`
- `johanneskaufmann-html-to-markdown/commonmark/metadata`
- `johanneskaufmann-html-to-markdown/strikethrough/strikethrough`
- `johanneskaufmann-html-to-markdown/table/basics`
- `johanneskaufmann-html-to-markdown/table/col_row_span`
- `johanneskaufmann-html-to-markdown/table/contents`
- `johanneskaufmann-html-to-markdown/table/email`
- `johanneskaufmann-html-to-markdown/table/parents`

## Parser-Bug Candidate Section

- Current `parser_bug` candidates: none.
- No fixture is presently bucketed as `parser_bug`.

## Style-Only vs Likely Converter Bugs

- Style-only expected diffs: 14
  - All known mismatches in this first-pass import are metadata-marked style-only and intentionally non-blocking.
- Likely parser/conversion bugs: 0
  - No non-style mismatch remains in this report snapshot.

## Notes for Follow-Up

- Next parity pass should re-bucket each `unclassified` style-only fixture into the most specific bucket when policy is clear.
- If a mismatch is reclassified as non-style (`style_only: false`), it should fail the suite and be tracked as parity work.
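
The bucket counts and the `style_only` gate described in this report can be derived from the divergence metadata in a single pass. The sketch assumes a metadata shape (fixture id mapped to an object with optional `bucket` and `style_only` keys) that is not shown in this report; the function name is hypothetical, and a missing `style_only` flag is treated as blocking.

```python
from collections import Counter

BUCKETS = (
    "whitespace", "list_shape", "emphasis_style", "autolink_policy",
    "escaping", "table_format", "entity_handling", "parser_bug",
)


def summarize_divergences(divergences: dict[str, dict]) -> tuple[Counter, list[str]]:
    """Return (count per bucket, fixture ids that must fail the suite)."""
    counts = Counter({bucket: 0 for bucket in BUCKETS + ("unclassified",)})
    blocking: list[str] = []
    for fixture_id, meta in sorted(divergences.items()):
        bucket = meta.get("bucket", "unclassified")
        if bucket not in counts:
            bucket = "unclassified"  # unknown labels are not trusted
        counts[bucket] += 1
        # ASSUMPTION: a mismatch is non-blocking only when explicitly
        # marked style_only: true.
        if not meta.get("style_only", False):
            blocking.append(fixture_id)
    return counts, blocking
```

Generating the summary table from the same metadata that gates the suite keeps the report and the test behaviour from drifting apart as fixtures are re-bucketed.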
112 changes: 112 additions & 0 deletions plans/html_to_markdown_library_research_2026-03-14.md
@@ -0,0 +1,112 @@
# HTML to Markdown Library Research (2026-03-14)

This file records the deep-research subagent's findings on widely used HTML-to-Markdown libraries (excluding Python `html2text` and `kreuzberg-dev/html-to-markdown`).

## Executive Summary (Top 5 Sources To Mine Tests From)

1. `JohannesKaufmann/html-to-markdown` (Go)
   - Best immediate fixture source: clean golden pairs (`*.in.html` -> `*.out.md`) and plugin-scoped coverage.
2. `mysticmind/reversemarkdown-net` (.NET)
   - Strong regression fixture format (`*.verified.*`) plus `commonmark.json` corpus.
3. `mixmark-io/turndown` (JS)
   - Very high adoption and broad HTML conversion behavior coverage.
4. `vsch/flexmark-java` (Java)
   - Large spec resources and deep edge-case coverage.
5. `xijo/reverse_markdown` (Ruby)
   - Mature ecosystem usage and practical real-world HTML assets.

## Candidate Details

| Library | Ecosystem | Popularity/Activity Signals | Test/Fixture Sources | Fixture Shape | License |
|---|---|---|---|---|---|
| `mixmark-io/turndown` | JavaScript/TypeScript | ~10,910 stars, active, npm ~11,895,820 downloads/month, latest v7.2.2 | `test/` | HTML case corpus in test HTML + assertions | MIT |
| `crosstype/node-html-markdown` | JavaScript/TypeScript | ~254 stars, npm ~1,692,106 downloads/month, latest v2.0.0 | `test/` | Unit/integration style fixtures | MIT |
| `thephpleague/html-to-markdown` | PHP | ~1,873 stars, Packagist total ~28,103,235, monthly ~1,028,613 | `tests/` | Unit + conversion expectations | MIT |
| `xijo/reverse_markdown` | Ruby | ~665 stars, RubyGems total ~93,986,433, latest 3.0.2 | `spec/assets` | Real-world input assets with expected outputs in specs | WTFPL |
| `JohannesKaufmann/html-to-markdown` | Go | ~3,488 stars, latest v2.5.0, pkg.go.dev known importers: 60 | `plugin/commonmark/testdata/GoldenFiles`, `plugin/table/testdata/GoldenFiles`, `plugin/strikethrough/testdata/GoldenFiles`, `cli/html2markdown/cmd/testdata/TestExecute` | Golden files (`*.in.html`, `*.out.md`) | MIT |
| `mysticmind/reversemarkdown-net` | .NET/C# | ~372 stars, NuGet total ~4,277,133, latest 5.2.0 | `src/ReverseMarkdown.Test/TestData` | Snapshot/approval (`*.verified.md`, `*.verified.txt`) + `commonmark.json` | MIT |
| `vsch/flexmark-java` | Java/Kotlin | ~2,594 stars, Maven artifact has many releases (188 versions seen) | `flexmark-html2md-converter/src/test/resources` | Spec-style resources (`*_spec.md`) + converter fixtures | BSD-2-Clause |

## Best Fixture Paths To Import First

- Go (`JohannesKaufmann/html-to-markdown`)
  - `plugin/commonmark/testdata/GoldenFiles`
  - `plugin/table/testdata/GoldenFiles`
  - `plugin/strikethrough/testdata/GoldenFiles`
- .NET (`mysticmind/reversemarkdown-net`)
  - `src/ReverseMarkdown.Test/TestData/commonmark.json`
  - Selected `*.verified.md` and `*.verified.txt`
- JS (`mixmark-io/turndown`)
  - `test/` (especially conversion-focused cases)
- Ruby (`xijo/reverse_markdown`)
  - `spec/assets`
- Java (`vsch/flexmark-java`)
  - `flexmark-html2md-converter/src/test/resources`

## Recommended Ranking For Import Work

1. Go golden pairs (highest ROI, lowest transform effort)
2. .NET `commonmark.json` + selected verified snapshots
3. Turndown conversion corpus
4. Ruby real-world assets
5. Flexmark Java long-tail specs

## Expected Mismatch Categories During Import

- Whitespace (blank lines, trailing spaces, line endings)
- List shape (indent levels, ordered index style, tight vs loose)
- Emphasis style (`*` vs `_`, strong marker differences)
- Link rendering (autolink vs explicit Markdown link)
- Escaping policy (special chars and punctuation)
- Table layout (alignment rows, padding, pipe escaping)

## License and Reuse Note (High Level)

- Most shortlisted sources are permissive (MIT/BSD-2-Clause).
- `reverse_markdown` is WTFPL.
- Keep attribution for imported fixtures in a dedicated third-party fixture note.
- Treat this as engineering guidance, not legal advice.

## Research Sources

### mixmark-io/turndown
- https://api.github.com/repos/mixmark-io/turndown
- https://api.github.com/repos/mixmark-io/turndown/releases/latest
- https://api.npmjs.org/downloads/point/last-month/turndown
- https://registry.npmjs.org/turndown/latest

### crosstype/node-html-markdown
- https://api.github.com/repos/crosstype/node-html-markdown
- https://api.github.com/repos/crosstype/node-html-markdown/releases/latest
- https://api.npmjs.org/downloads/point/last-month/node-html-markdown
- https://registry.npmjs.org/node-html-markdown/latest

### thephpleague/html-to-markdown
- https://api.github.com/repos/thephpleague/html-to-markdown
- https://packagist.org/packages/league/html-to-markdown/stats.json
- https://repo.packagist.org/p2/league/html-to-markdown.json

### xijo/reverse_markdown
- https://api.github.com/repos/xijo/reverse_markdown
- https://rubygems.org/api/v1/gems/reverse_markdown.json

### JohannesKaufmann/html-to-markdown
- https://api.github.com/repos/JohannesKaufmann/html-to-markdown
- https://api.github.com/repos/JohannesKaufmann/html-to-markdown/releases/latest
- https://pkg.go.dev/github.com/JohannesKaufmann/html-to-markdown/v2?tab=importedby
- https://api.github.com/repos/JohannesKaufmann/html-to-markdown/contents/plugin/commonmark/testdata/GoldenFiles
- https://api.github.com/repos/JohannesKaufmann/html-to-markdown/contents/plugin/table/testdata/GoldenFiles
- https://api.github.com/repos/JohannesKaufmann/html-to-markdown/contents/plugin/strikethrough/testdata/GoldenFiles
- https://api.github.com/repos/JohannesKaufmann/html-to-markdown/contents/cli/html2markdown/cmd/testdata/TestExecute

### mysticmind/reversemarkdown-net
- https://api.github.com/repos/mysticmind/reversemarkdown-net
- https://api.github.com/repos/mysticmind/reversemarkdown-net/releases/latest
- https://azuresearch-usnc.nuget.org/query?q=packageid:ReverseMarkdown&prerelease=false
- https://api.nuget.org/v3/registration5-semver1/reversemarkdown/5.2.0.json
- https://api.github.com/repos/mysticmind/reversemarkdown-net/contents/src/ReverseMarkdown.Test/TestData

### vsch/flexmark-java
- https://api.github.com/repos/vsch/flexmark-java
- https://search.maven.org/solrsearch/select?q=g:%22com.vladsch.flexmark%22%20AND%20a:%22flexmark-all%22&rows=20&wt=json
- https://api.github.com/repos/vsch/flexmark-java/contents/flexmark-html2md-converter/src/test/resources