CLI: add --extractor flag and defuddle/extractor public entry #328
Open
kaznak wants to merge 3 commits into
Site-specific extractors currently must live under src/extractors/ and be
registered in src/extractor-registry.ts. Every new site therefore requires
a fork-build-distribute loop, which is too heavy for end-user customization
(e.g. a paginated-article extractor for a regional news site).
Add a runtime-configurable extension surface in two parts.
1. New sub-entry `defuddle/extractor` re-exporting:
   - BaseExtractor (the abstract class to subclass)
   - ExtractorRegistry (so runtime callers can register without reaching
     into internal paths)
   - ExtractorMapping, ExtractorConstructor (the registration shape)
   - ExtractorOptions, ExtractorResult, ExtractedContent, ExtractorVariables
   This is the public API for extractor authors. The existing entries
   (`defuddle`, `defuddle/full`, `defuddle/node`) are unchanged, so the
   parser API surface and bundle composition stay exactly as before.
   ExtractorMapping and ExtractorConstructor in src/extractor-registry.ts
   were promoted from file-local types to exported ones; no behavioral
   change.
2. New CLI option `--extractor <path>` on the `parse` command (repeatable).
   Each value is converted to a `file://` URL with pathToFileURL and loaded
   with a dynamic import; the default export must be
   `{ patterns: (string | RegExp)[], extractor: class }`. Each mapping is
   registered with ExtractorRegistry before parsing runs, so the custom
   extractor competes alongside the built-ins for the current URL.
   A descriptive error is thrown when the default export does not match
   the expected shape, so misconfiguration fails fast at the CLI.
Example extractor file loaded via --extractor:
    import { BaseExtractor } from 'defuddle/extractor';

    class MyExtractor extends BaseExtractor {
      canExtract() { return false; }
      canExtractAsync() { return true; }
      prefersAsync() { return true; }
      async extractAsync() { /* fetch additional pages, return ExtractorResult */ }
    }

    export default {
      patterns: [/^https?:\/\/example\.com\/article\//],
      extractor: MyExtractor,
    };
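The fail-fast shape check described above could look roughly like the sketch below. The helper name `assertExtractorMapping` and the error wording are hypothetical, not taken from the PR; the only thing assumed from the text is the expected shape `{ patterns: (string | RegExp)[], extractor: class }`. The validated mapping would then be handed to ExtractorRegistry, whose registration call is not shown in this description.

```javascript
// Hypothetical helper (name and message are assumptions, not the PR's code):
// validate the default export of a module loaded via --extractor.
function assertExtractorMapping(mod, sourcePath) {
  const mapping = mod && mod.default;

  // patterns must be an array whose entries are strings or RegExp instances.
  const patternsOk =
    mapping != null &&
    Array.isArray(mapping.patterns) &&
    mapping.patterns.every((p) => typeof p === 'string' || p instanceof RegExp);

  // Classes are functions at runtime, so a typeof check is the cheapest test.
  const extractorOk = mapping != null && typeof mapping.extractor === 'function';

  if (!patternsOk || !extractorOk) {
    throw new TypeError(
      `${sourcePath}: default export must be ` +
        '{ patterns: (string | RegExp)[], extractor: class }'
    );
  }
  return mapping;
}

// Usage with stand-in module objects (what a dynamic import would resolve to).
const goodModule = {
  default: { patterns: [/^https?:\/\/example\.com\/article\//], extractor: class {} },
};
const badModule = { default: { patterns: 'example.com' } };

const mapping = assertExtractorMapping(goodModule, './my-extractor.mjs');

let rejected = false;
try {
  assertExtractorMapping(badModule, './broken.mjs');
} catch (err) {
  rejected = err instanceof TypeError;
}
```

Checking the shape before registration keeps a bad file from surfacing later as an obscure failure inside the parser.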
Tests added in tests/cli.test.ts:
- option introspection (flag name, repeatable string-array default)
- positive: --extractor registers the mapping with ExtractorRegistry
- negative: malformed default-export rejected with a clear error message
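For context on the "repeatable string-array default" that the first test checks: the text does not say which option parser the CLI uses, so the following is only a sketch of the common reducer-style pattern, where a callback receives each new value plus the previously accumulated array and the default is `[]`. Commander, for example, accepts such a callback and default in `.option()`; whether this CLI wires it that way is an assumption.

```javascript
// Reducer-style collector: each occurrence of the flag appends to a copy of
// the previous array, so the shared default [] is never mutated.
function collect(value, previous) {
  return [...previous, value];
}

// Simulate `--extractor ./a.mjs --extractor ./b.mjs`, starting from the default.
const defaultValue = [];
const argvValues = ['./a.mjs', './b.mjs'];
const extractorPaths = argvValues.reduce((acc, v) => collect(v, acc), defaultValue);

console.log(extractorPaths); // [ './a.mjs', './b.mjs' ]
```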
The previous commit used a bare `await import(specifier)` in cli.ts, which
TypeScript's CommonJS lowering rewrites into a `require(specifier)`
equivalent. `require()` does not accept `file://` URLs and cannot load
`.mjs` modules, so end-to-end use of `--extractor` against the npm-published
bin would fail at runtime with:
Error: Cannot find module 'file:///.../extractor.mjs'
Hide the dynamic import inside a `new Function()` so tsc does not transform
it — the import expression is then parsed at runtime by Node, which honours
native ESM dynamic import for both `file://` URLs and `.mjs` files.
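A minimal sketch of that workaround, written as CommonJS to mirror the compiled bin. The helper names `toFileSpecifier` and `loadExtractorModule` are illustrative and not taken from cli.ts; only `pathToFileURL` and the `new Function()` wrapper come from the description above.

```javascript
const path = require('node:path');
const { pathToFileURL } = require('node:url');

// The import expression lives inside a string, so a compiler that rewrites
// `import()` into `require()` never sees it. Node parses it at runtime.
const nativeImport = new Function('specifier', 'return import(specifier)');

// Hypothetical helper: resolve a CLI path against cwd and turn it into a
// file:// URL string, which native dynamic import accepts.
function toFileSpecifier(userPath, cwd = process.cwd()) {
  return pathToFileURL(path.resolve(cwd, userPath)).href;
}

// Hypothetical helper: load an extractor module (.mjs or .js) by path.
async function loadExtractorModule(userPath) {
  return nativeImport(toFileSpecifier(userPath));
}

const exampleSpecifier = toFileSpecifier('./my-extractor.mjs', '/tmp');
```

The trade-off is that the import is invisible to bundlers and static analysis, which is acceptable here because the path is only known at run time anyway.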
Vitest runs the TS source through esbuild (which preserves `import()`) but
its module graph does not install the dynamic-import callback that the
Function()-eval shortcut needs, so it would trip on "A dynamic import
callback was not specified." Detect that environment via
`globalThis.__vitest_worker__` and use the in-source `import()` form there
so vitest can hook the call normally. All 13 cli tests pass under vitest;
the production CJS path was smoke-tested by loading an .mjs extractor
against a live URL.
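The environment switch described above might be structured like this. It is a sketch under assumptions: `selectImporter` and the `kind` labels are hypothetical names added for illustration, and the real cli.ts may inline the check instead.

```javascript
// Pick how to perform the dynamic import based on the runtime environment.
// `env` defaults to globalThis; passing a stand-in object makes the choice
// easy to exercise without actually running under vitest.
function selectImporter(env = globalThis) {
  if (env.__vitest_worker__ !== undefined) {
    // Under vitest the source goes through esbuild, which preserves
    // `import()`, so the test runner can hook this call normally.
    return { kind: 'in-source', load: (specifier) => import(specifier) };
  }
  // In the compiled CJS bin, keep the import out of the compiler's sight.
  return {
    kind: 'function-eval',
    load: new Function('specifier', 'return import(specifier)'),
  };
}

const underVitest = selectImporter({ __vitest_worker__: {} });
const inProduction = selectImporter({});
```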
This was referenced Jun 26, 2026
Problem
Site-specific extractors must live under `src/extractors/` and be registered in `src/extractor-registry.ts`. Adding one for a new site requires a fork → build → distribute loop, which is heavier than it needs to be when a user just wants to handle a paginated regional news site or an in-house CMS. There is also no public API for extractor authors: `BaseExtractor` is not exported from any package entry.
Proposal
Two changes that together let users add a site-specific extractor at run time without forking.
1. New sub-entry `defuddle/extractor`, which re-exports the API that extractor authors need:
- `BaseExtractor` (the abstract class to subclass)
- `ExtractorRegistry` (so runtime callers can register without reaching into internal paths)
- `ExtractorMapping`, `ExtractorConstructor` (the registration shape)
- `ExtractorOptions`, `ExtractorResult`, `ExtractedContent`, `ExtractorVariables`
The existing entries (`defuddle`, `defuddle/full`, `defuddle/node`) are unchanged, so the parser API surface and bundle composition stay exactly as before. `ExtractorMapping` and `ExtractorConstructor` are promoted from file-local types to exported ones; no behavioral change.
2. New CLI option `--extractor <path>` on the `parse` command (repeatable).
Each value is converted to a `file://` URL with `pathToFileURL` and loaded with a dynamic import. The default export must be `{ patterns: (string | RegExp)[], extractor: class }`. Each mapping is registered with `ExtractorRegistry` before parsing runs, so the custom extractor competes alongside the built-ins for the current URL. A descriptive error is thrown when the default export does not match the expected shape, so misconfiguration fails fast at the CLI.
Example extractor file: the same `MyExtractor` module shown in the commit message above.
Usage:
    defuddle parse 'https://example.com/article/foo' \
      --markdown \
      --extractor ./my-extractor.mjs
Tests
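`tests/cli.test.ts` adds:
- option introspection (flag name, repeatable, default `[]`)
- `--extractor` loads the supplied file and registers the mapping with `ExtractorRegistry`
- a malformed default export is rejected with a clear error message
Two fixtures under `tests/fixtures/` back the integration cases.
Notes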
- The commit `fix --extractor loading of .mjs files in compiled CJS` is a follow-up that fixes a runtime regression I hit when smoke-testing the published-bin code path. TypeScript's CommonJS lowering turns `await import(specifier)` into a `require(specifier)` equivalent, which cannot load `.mjs` modules from `file://` URLs. The fix hides the dynamic import inside a `new Function()` so tsc does not transform it. Vitest needs the in-source `import()` form (its module graph cannot service the eval'd one), so the two paths are switched on `globalThis.__vitest_worker__`. Happy to split this commit out or document it differently if you'd prefer.
- A separate PR (`--source-url`) is the natural companion when reading from stdin/file: together they let you run a custom extractor against a saved-page snapshot, but neither PR depends on the other.