CLI: add --source-url flag for site-specific extraction on stdin/file input by kaznak · Pull Request #322 · kepano/defuddle

kaznak · 2026-06-19T00:04:42Z

Problem

The extractor registry uses the document's URL to select site-specific extractors — ClaudeExtractor for claude.ai/chat/*, ChatGPTExtractor for chatgpt.com/c/*, and so on (see extractor-registry.ts:255-261).

When the CLI reads HTML from stdin or a local file, url in parseSource stays undefined and is passed through to Defuddle() unchanged. The registry then calls new URL('').hostname (line 253), which throws. As a result no extractor is selected and the generic scoring path runs instead.
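The failure mode described above can be sketched in isolation. This is a simplified, hypothetical registry written for illustration, not defuddle's actual extractor-registry.ts: the names `mappings`, `tryParse`, and `findExtractor` are assumptions, and only the two URL patterns mentioned in this description are modeled.

```typescript
// Hypothetical, simplified registry; defuddle's real implementation differs.
type ExtractorName = "ClaudeExtractor" | "ChatGPTExtractor";

interface Mapping {
  hostname: string;
  pathPrefix: string;
  extractor: ExtractorName;
}

const mappings: Mapping[] = [
  { hostname: "claude.ai", pathPrefix: "/chat/", extractor: "ClaudeExtractor" },
  { hostname: "chatgpt.com", pathPrefix: "/c/", extractor: "ChatGPTExtractor" },
];

// new URL('') throws a TypeError, so a missing URL cannot be parsed.
function tryParse(url: string | undefined): URL | null {
  try {
    return new URL(url ?? "");
  } catch {
    return null;
  }
}

// Returns the matching extractor name, or null when the caller
// should fall back to the generic scoring path.
function findExtractor(url: string | undefined): ExtractorName | null {
  const parsed = tryParse(url);
  if (parsed === null) {
    return null;
  }
  const match = mappings.find(
    (m) => parsed.hostname === m.hostname && parsed.pathname.startsWith(m.pathPrefix),
  );
  return match ? match.extractor : null;
}
```

With no URL, `findExtractor(undefined)` yields `null` in this sketch, which corresponds to the generic path running on stdin or file input.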

For some sites — most visibly claude.ai chat captures — the generic path extracts no content and the CLI exits with:

Error: No content could be extracted from <source>

This affects anyone post-processing already-saved HTML where the original URL is known but the page is no longer fetched live: SingleFile captures, single-file-cli output, ArchiveBox, Obsidian Web Clipper, local crawls, and so on. The corresponding extractor exists and would work — it just never gets a chance to run.

Reproducer

Any locally saved claude.ai/chat/* page reproduces the problem:

$ defuddle parse -m saved-chat.html
Error: No content could be extracted from saved-chat.html
$ defuddle parse -m --source-url https://claude.ai/chat/abc123 saved-chat.html
# extracts the conversation correctly

Fix

A small, opt-in CLI flag:

New --source-url <url> option on defuddle parse (commander auto-derives options.sourceUrl).
parseSource seeds url from options.sourceUrl at the top.
The positional URL form (defuddle parse https://...) continues to set url itself for live fetches, so --source-url only affects stdin / file input. No behavior change without the flag.
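The precedence described in this list can be sketched as a small pure function. The names `ParseOptions`, `isHttpUrl`, and `resolveUrl` are hypothetical, and detecting a positional URL by its http(s) prefix is an assumption; the real parseSource may decide this differently.

```typescript
// Hypothetical shapes; the actual parseSource in defuddle's CLI may differ.
interface ParseOptions {
  sourceUrl?: string; // populated by commander from --source-url
}

// Assumption: a positional source counts as a live URL if it starts with http(s)://.
function isHttpUrl(source: string): boolean {
  return /^https?:\/\//i.test(source);
}

function resolveUrl(source: string | undefined, options: ParseOptions): string | undefined {
  // Seed the URL from the flag first...
  let url = options.sourceUrl;
  // ...then let a positional URL (live fetch) take precedence.
  if (source !== undefined && isHttpUrl(source)) {
    url = source;
  }
  return url;
}
```

In this sketch, file input without the flag still resolves to `undefined`, which matches the "no behavior change without the flag" claim.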

Tests

Anonymized fixture tests/fixtures/extractor--claude-conversation.html using the actual selectors the ClaudeExtractor looks for (data-testid="user-message", .font-claude-response, .standard-markdown).
Generated tests/expected/extractor--claude-conversation.md shows the expected **You** / **Claude** author labels.
New tests/cli.test.ts case asserts:
- Without --source-url: output has neither **You** nor **Claude** (generic path).
- With --source-url=https://claude.ai/chat/example-fixture: both labels appear and the conversation content is preserved.
All 371 existing tests still pass.

Verified per repo guidance that the new test fails on main before the fix is applied (the **You**/**Claude** assertions fail because the ClaudeExtractor never runs without a URL).
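The label check behind those assertions can be sketched without a test runner. The helper `findAuthorLabels` and the sample strings are hypothetical and only illustrate the shape of the check; the actual case in tests/cli.test.ts runs the CLI against the fixture.

```typescript
// Hypothetical helper: reports which author labels appear in the Markdown output.
function findAuthorLabels(markdown: string): { you: boolean; claude: boolean } {
  return {
    you: markdown.includes("**You**"),
    claude: markdown.includes("**Claude**"),
  };
}

// Made-up sample outputs standing in for the two CLI runs.
const genericOutput = "Some text recovered by the generic path.";
const extractorOutput = "**You**\n\nHello\n\n**Claude**\n\nHi there";

const withoutFlag = findAuthorLabels(genericOutput);
const withFlag = findAuthorLabels(extractorOutput);
```

Checking the two labels separately mirrors the test's wording: neither label without the flag, both labels with it.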

Compatibility

Strictly additive. Existing invocations are unaffected — url only changes from undefined to a non-empty string when the user passes --source-url. No changes to the library API or to other CLI flags.

Related

tests/cli.test.ts discusses why this lets the existing ClaudeExtractor (and analogous registered extractors) reach their intended inputs.
Similar to, but smaller in scope than, #313 (CLI: add --frontmatter flag for YAML metadata output); both are pure CLI feature additions on top of the existing library API.

… input

The extractor registry uses the document's URL to select site-specific extractors (e.g. ClaudeExtractor for claude.ai/chat/*, ChatGPTExtractor for chatgpt.com/c/*). When the CLI reads HTML from stdin or a local file the URL is left undefined, so the registry's `new URL('').hostname` throws, no extractor is selected, and generic scoring runs instead. For claude.ai chat captures the generic path extracts no content and the CLI exits with "No content could be extracted".

Add a `--source-url <url>` option that lets the caller declare the URL the HTML originated from. The positional URL form continues to override it for live fetches.

This is useful when post-processing archived web pages (SingleFile, single-file-cli, ArchiveBox, Obsidian Web Clipper output, etc.) where the original URL is known but the HTML is no longer fetched live.

Includes a minimal anonymized claude.ai conversation fixture and a CLI test that verifies the **You**/**Claude** author labels (only the ClaudeExtractor emits them) appear only when --source-url is supplied.

kaznak mentioned this pull request Jun 25, 2026

CLI: add --extractor flag and defuddle/extractor public entry #328

Open

Merge branch 'main' into feat/cli-url-flag

7615550

kaznak mentioned this pull request Jun 26, 2026

CLI: surface result.debug via --json field or stderr log #331

Open

kaznak wants to merge 2 commits into kepano:main from kaznak:feat/cli-url-flag
