CLI: add --source-url flag for site-specific extraction on stdin/file input #322
Open
kaznak wants to merge 2 commits into
… input
The extractor registry uses the document's URL to select site-specific
extractors (e.g. ClaudeExtractor for claude.ai/chat/*, ChatGPTExtractor
for chatgpt.com/c/*). When the CLI reads HTML from stdin or a local file
the URL is left undefined, so the registry's `new URL('').hostname` throws,
no extractor is selected, and generic scoring runs instead. For claude.ai
chat captures the generic path extracts no content and the CLI exits with
"No content could be extracted".
Add a `--source-url <url>` option that lets the caller declare the URL the
HTML originated from. The positional URL form continues to override it for
live fetches.
This is useful when post-processing archived web pages (SingleFile,
single-file-cli, ArchiveBox, Obsidian Web Clipper output, etc.) where the
original URL is known but the HTML is no longer fetched live.
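The precedence described above (the flag seeds the URL, a positional URL still wins for live fetches) can be sketched as follows. This is an illustration under assumptions, not the code from this change: `ParseOptions`, `isHttpUrl`, and `resolveDocumentUrl` are hypothetical names, and only `sourceUrl` corresponds to something the description mentions.

```typescript
// Hypothetical option shape; `sourceUrl` mirrors the `--source-url` flag.
interface ParseOptions {
  sourceUrl?: string;
}

// Assumption for this sketch: a positional source counts as a live fetch
// when it starts with http:// or https://.
function isHttpUrl(source: string): boolean {
  return /^https?:\/\//i.test(source);
}

// Decide which URL is handed on for extractor selection.
// `source` is undefined when the HTML arrives on stdin.
function resolveDocumentUrl(
  source: string | undefined,
  options: ParseOptions
): string | undefined {
  // Seed from the flag first ...
  let url = options.sourceUrl;
  // ... then let a positional URL (live fetch) override it.
  if (source !== undefined && isHttpUrl(source)) {
    url = source;
  }
  return url;
}

const declared = "https://claude.ai/chat/example-fixture";
console.log(resolveDocumentUrl("saved.html", {})); // undefined
console.log(resolveDocumentUrl("saved.html", { sourceUrl: declared })); // the declared URL
console.log(resolveDocumentUrl("https://example.com/a", { sourceUrl: declared })); // "https://example.com/a"
```

Without the flag the result stays `undefined` for file and stdin input, which matches the claim that existing invocations behave as before.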
Includes a minimal anonymized claude.ai conversation fixture and a CLI
test that verifies the **You**/**Claude** author labels (only the
ClaudeExtractor emits them) appear only when --source-url is supplied.
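The shape of that assertion can be sketched roughly as below. The helper `authorLabels` and both sample strings are made up for illustration; the real test runs the CLI against the fixture and is not reproduced here.

```typescript
// Hypothetical helper: reports which author labels appear in Markdown output.
function authorLabels(markdown: string): { you: boolean; claude: boolean } {
  return {
    you: markdown.includes("**You**"),
    claude: markdown.includes("**Claude**"),
  };
}

// Made-up stand-ins for the two CLI runs (without and with --source-url).
const genericOutput = "Hello there\n\nHi, how can I help?";
const claudeOutput =
  "**You**\n\nHello there\n\n**Claude**\n\nHi, how can I help?";

console.log(authorLabels(genericOutput)); // { you: false, claude: false }
console.log(authorLabels(claudeOutput)); // { you: true, claude: true }
```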
Problem
The extractor registry uses the document's URL to select site-specific extractors: `ClaudeExtractor` for `claude.ai/chat/*`, `ChatGPTExtractor` for `chatgpt.com/c/*`, and so on (see `extractor-registry.ts:255-261`).
When the CLI reads HTML from stdin or a local file, `url` in `parseSource` is left undefined and is passed through to `Defuddle()` as undefined. The registry then calls `new URL('').hostname` (line 253), which throws, so no extractor is selected and the generic scoring path runs.
For some sites, most visibly `claude.ai` chat captures, the generic path extracts no content and the CLI exits with "No content could be extracted".
This affects anyone post-processing already-saved HTML where the original URL is known but the page is no longer fetched live: SingleFile captures, single-file-cli output, ArchiveBox, Obsidian Web Clipper, local crawls, and so on. The corresponding extractor exists and would work; it just never gets a chance to run.
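The failure mode described above can be illustrated with a simplified stand-in for URL-based extractor selection. The `registry` array, its patterns, and `findExtractor` are hypothetical and do not reproduce `extractor-registry.ts`; the sketch only relies on the standard behavior that `new URL('')` throws a `TypeError`.

```typescript
// Hypothetical, simplified registry: names and patterns are illustrative only.
type ExtractorEntry = { name: string; patterns: RegExp[] };

const registry: ExtractorEntry[] = [
  { name: "ClaudeExtractor", patterns: [/^https?:\/\/claude\.ai\/chat\//] },
  { name: "ChatGPTExtractor", patterns: [/^https?:\/\/chatgpt\.com\/c\//] },
];

// Returns the matching extractor name, or null when the URL is unusable
// or no pattern matches (the caller would then fall back to generic scoring).
function findExtractor(url: string | undefined): string | null {
  try {
    // `new URL('')` throws a TypeError, so an undefined or empty URL
    // falls through to the catch block below.
    const parsed = new URL(url ?? "");
    const match = registry.find((entry) =>
      entry.patterns.some((pattern) => pattern.test(parsed.href))
    );
    return match ? match.name : null;
  } catch {
    return null;
  }
}

console.log(findExtractor(undefined)); // null
console.log(findExtractor("https://claude.ai/chat/example-fixture")); // "ClaudeExtractor"
```

With no URL the lookup silently yields nothing, which is why the site-specific extractor never runs for stdin or file input.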
Reproducer
Any locally saved `claude.ai/chat/*` page works as a reproducer.
Fix
A small, opt-in CLI flag:
- `--source-url <url>` option on `defuddle parse` (commander auto-derives `options.sourceUrl`).
- `parseSource` seeds `url` from `options.sourceUrl` at the top.
- The positional URL form (`defuddle parse https://...`) continues to set `url` itself for live fetches, so `--source-url` only affects stdin / file input. No behavior change without the flag.
Tests
- Minimal anonymized fixture `tests/fixtures/extractor--claude-conversation.html` using the actual selectors the `ClaudeExtractor` looks for (`data-testid="user-message"`, `.font-claude-response`, `.standard-markdown`).
- `tests/expected/extractor--claude-conversation.md` shows the expected `**You**` / `**Claude**` author labels.
- A `tests/cli.test.ts` case asserts:
  - Without `--source-url`: output has neither `**You**` nor `**Claude**` (generic path).
  - With `--source-url=https://claude.ai/chat/example-fixture`: both labels appear and the conversation content is preserved.
Verified per repo guidance that the new test fails on `main` before the fix is applied (the `**You**` / `**Claude**` assertions fail because the `ClaudeExtractor` never runs without a URL).
Compatibility
Strictly additive. Existing invocations are unaffected: `url` only changes from `undefined` to a non-empty string when the user passes `--source-url`. No changes to the library API or to other CLI flags.
Related
- `tests/cli.test.ts` discusses why this lets the existing `ClaudeExtractor` (and analogous registered extractors) reach their intended inputs.
- … (`--frontmatter`); both are pure CLI feature additions on top of the existing library API.