CLI: add --source-url flag for site-specific extraction on stdin/file input #322

Open
kaznak wants to merge 2 commits into kepano:main from kaznak:feat/cli-url-flag

@kaznak kaznak commented Jun 19, 2026

Problem

The extractor registry uses the document's URL to select site-specific extractors — ClaudeExtractor for claude.ai/chat/*, ChatGPTExtractor for chatgpt.com/c/*, and so on (see extractor-registry.ts:255-261).

When the CLI reads HTML from stdin or a local file, url in parseSource stays undefined and is passed to Defuddle() unchanged. The registry then calls new URL('').hostname (line 253), which throws. As a result no extractor is selected and the generic scoring path runs instead.
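
To illustrate the failure mode, here is a minimal sketch of a URL-keyed lookup. It is not the code in extractor-registry.ts: the names findExtractor, tryParse, and rules are hypothetical, and the real registry's matching logic may differ. It only shows that an empty URL string cannot be parsed, so no site-specific extractor can be chosen.

```typescript
// Hypothetical, simplified stand-in for a URL-keyed extractor lookup.
// Names and matching rules are assumptions, not defuddle's actual code.
type ExtractorName = "ClaudeExtractor" | "ChatGPTExtractor";

interface Rule {
  host: string;
  pathPrefix: string;
  name: ExtractorName;
}

const rules: Rule[] = [
  { host: "claude.ai", pathPrefix: "/chat/", name: "ClaudeExtractor" },
  { host: "chatgpt.com", pathPrefix: "/c/", name: "ChatGPTExtractor" },
];

// new URL("") throws a TypeError, so a missing URL ends up here as null.
function tryParse(url: string | undefined): URL | null {
  try {
    return new URL(url ?? "");
  } catch {
    return null;
  }
}

// Returns the matching extractor name, or null when the generic path would run.
function findExtractor(url: string | undefined): ExtractorName | null {
  const parsed = tryParse(url);
  if (parsed === null) {
    return null;
  }
  const match = rules.find(
    (r) => parsed.hostname === r.host && parsed.pathname.startsWith(r.pathPrefix)
  );
  return match ? match.name : null;
}
```

In this sketch, findExtractor(undefined) returns null, while findExtractor("https://claude.ai/chat/abc123") returns "ClaudeExtractor", which mirrors the difference between file input without and with a known source URL.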

For some sites — most visibly claude.ai chat captures — the generic path extracts no content and the CLI exits with:

Error: No content could be extracted from <source>

This affects anyone post-processing already-saved HTML where the original URL is known but the page is no longer fetched live: SingleFile captures, single-file-cli output, ArchiveBox, Obsidian Web Clipper, local crawls, and so on. The corresponding extractor exists and would work — it just never gets a chance to run.

Reproducer

Any locally saved claude.ai/chat/* page reproduces the failure:

$ defuddle parse -m saved-chat.html
Error: No content could be extracted from saved-chat.html
$ defuddle parse -m --source-url https://claude.ai/chat/abc123 saved-chat.html
# extracts the conversation correctly

Fix

A small, opt-in CLI flag:

  • New --source-url <url> option on defuddle parse (commander auto-derives options.sourceUrl).
  • parseSource seeds url from options.sourceUrl at the top.
  • The positional URL form (defuddle parse https://...) continues to set url itself for live fetches, so --source-url only affects stdin / file input. No behavior change without the flag.
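
The precedence described above can be sketched as follows. This is an illustration under assumptions, not the actual parseSource body: resolveUrl and ParseOptions are hypothetical names, and the http(s) prefix check is only an assumed stand-in for however the CLI recognizes the positional URL form.

```typescript
// Hypothetical sketch of the URL precedence; not the actual parseSource code.
interface ParseOptions {
  sourceUrl?: string; // populated from --source-url <url>
}

function resolveUrl(
  source: string | undefined,
  options: ParseOptions
): string | undefined {
  // Seed from the flag first, so stdin and file input can carry a URL.
  let url = options.sourceUrl;

  // Assumed check: a positional http(s) URL is a live fetch and sets url itself.
  if (source !== undefined && /^https?:\/\//i.test(source)) {
    url = source;
  }
  return url;
}
```

With no flag and a file path, the result stays undefined, which matches the existing behavior. With the flag and a file path, the declared URL is used, and a positional URL still wins over the flag.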

Tests

  • Anonymized fixture tests/fixtures/extractor--claude-conversation.html using the actual selectors the ClaudeExtractor looks for (data-testid="user-message", .font-claude-response, .standard-markdown).
  • Generated tests/expected/extractor--claude-conversation.md shows the expected **You** / **Claude** author labels.
  • New tests/cli.test.ts case asserts:
    • Without --source-url: output has neither **You** nor **Claude** (generic path).
    • With --source-url=https://claude.ai/chat/example-fixture: both labels appear and the conversation content is preserved.
  • All 371 existing tests still pass.
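
The label check in the new CLI test case can be pictured with a small helper. This is a sketch only: hasAllLabels and hasNoLabels are hypothetical names, and the real test in tests/cli.test.ts may be structured differently.

```typescript
// Hypothetical helpers mirroring the two assertions; not taken from the test file.
const AUTHOR_LABELS: readonly string[] = ["**You**", "**Claude**"];

// True when every author label appears in the Markdown output.
function hasAllLabels(markdown: string): boolean {
  return AUTHOR_LABELS.every((label) => markdown.includes(label));
}

// True when none of the author labels appear, as expected on the generic path.
function hasNoLabels(markdown: string): boolean {
  return AUTHOR_LABELS.every((label) => !markdown.includes(label));
}
```

The case without --source-url would expect hasNoLabels to hold for the CLI output, and the case with the flag would expect hasAllLabels to hold.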

Verified per repo guidance that the new test fails on main before the fix is applied (the **You**/**Claude** assertions fail because the ClaudeExtractor never runs without a URL).

Compatibility

Strictly additive. Existing invocations are unaffected — url only changes from undefined to a non-empty string when the user passes --source-url. No changes to the library API or to other CLI flags.

Related

  • tests/cli.test.ts discusses why this lets the existing ClaudeExtractor (and analogous registered extractors) reach their intended inputs.
  • Similar to, but smaller in scope than, #313 (CLI: add --frontmatter flag for YAML metadata output); both are pure CLI feature additions on top of the existing library API.

… input

The extractor registry uses the document's URL to select site-specific
extractors (e.g. ClaudeExtractor for claude.ai/chat/*, ChatGPTExtractor
for chatgpt.com/c/*). When the CLI reads HTML from stdin or a local file
the URL is left undefined, so the registry's `new URL('').hostname` throws,
no extractor is selected, and generic scoring runs instead. For claude.ai
chat captures the generic path extracts no content and the CLI exits with
"No content could be extracted".

Add a `--source-url <url>` option that lets the caller declare the URL the
HTML originated from. The positional URL form continues to override it for
live fetches.

This is useful when post-processing archived web pages (SingleFile,
single-file-cli, ArchiveBox, Obsidian Web Clipper output, etc.) where the
original URL is known but the HTML is no longer fetched live.

Includes a minimal anonymized claude.ai conversation fixture and a CLI
test that verifies the **You**/**Claude** author labels (only the
ClaudeExtractor emits them) appear only when --source-url is supplied.