
CLI: add --extractor flag and defuddle/extractor public entry#328

Open
kaznak wants to merge 3 commits into
kepano:mainfrom
kaznak:feat/cli-extractor-arg


@kaznak kaznak commented Jun 25, 2026


Problem

Site-specific extractors must live under src/extractors/ and be registered in src/extractor-registry.ts. Adding one for a new site requires a fork → build → distribute loop, which is heavier than it needs to be when a user just wants to handle a paginated regional news site or an in-house CMS. There is also no public API for extractor authors: BaseExtractor is not exported from any package entry.

Proposal

Two changes that together let users add a site-specific extractor at run time without forking.

1. New sub-entry defuddle/extractor — re-exports the API that extractor authors need:

  • BaseExtractor (the abstract class to subclass)
  • ExtractorRegistry (so runtime callers can register without reaching into internal paths)
  • ExtractorMapping, ExtractorConstructor (the registration shape)
  • ExtractorOptions, ExtractorResult, ExtractedContent, ExtractorVariables

The existing entries (defuddle, defuddle/full, defuddle/node) are unchanged, so the parser API surface and bundle composition stay exactly as before. ExtractorMapping and ExtractorConstructor are promoted from file-local types to exported ones; no behavioral change.

2. New CLI option --extractor <path> on the parse command (repeatable):

Each value is dynamic-imported via pathToFileURL. The default export must be { patterns: (string | RegExp)[], extractor: class }. Each mapping is registered with ExtractorRegistry before parsing runs, so the custom extractor competes alongside the built-ins for the current URL. A descriptive error is thrown when the default export does not match the expected shape, so misconfiguration fails fast at the CLI.
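The shape check described above can be pictured roughly as follows. This is a minimal sketch, not the code in this PR: the names `ExtractorMappingLike` and `assertExtractorMapping` are hypothetical, the error wording is illustrative, and the preceding step (resolving the path and importing it via `pathToFileURL`) is assumed to have already produced the module's default export.

```typescript
// Hypothetical stand-in for the exported ExtractorMapping type; the real
// definition lives in src/extractor-registry.ts and may differ.
type ExtractorMappingLike = {
    patterns: (string | RegExp)[];
    extractor: new (...args: any[]) => unknown;
};

// Validate the default export of a file passed to --extractor.
// `source` is only used to make the error message point at the offending file.
function assertExtractorMapping(value: unknown, source: string): ExtractorMappingLike {
    if (typeof value !== 'object' || value === null) {
        throw new Error(`${source}: default export must be an object with "patterns" and "extractor"`);
    }
    const { patterns, extractor } = value as Record<string, unknown>;
    if (!Array.isArray(patterns) || !patterns.every(p => typeof p === 'string' || p instanceof RegExp)) {
        throw new Error(`${source}: "patterns" must be an array of strings or RegExp objects`);
    }
    // Classes are functions at runtime, so this is the closest cheap check.
    if (typeof extractor !== 'function') {
        throw new Error(`${source}: "extractor" must be a class`);
    }
    return { patterns, extractor: extractor as ExtractorMappingLike['extractor'] };
}
```

Failing here, before any mapping is registered, is what lets a malformed file surface as a single clear CLI error rather than a confusing failure during parsing.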

Example extractor file:

import { BaseExtractor } from 'defuddle/extractor';

class MyExtractor extends BaseExtractor {
    canExtract() { return false; }
    canExtractAsync() { return true; }
    prefersAsync() { return true; }
    async extractAsync() { /* fetch additional pages, return ExtractorResult */ }
}

export default {
    patterns: [/^https?:\/\/example\.com\/article\//],
    extractor: MyExtractor,
};

Usage:

defuddle parse 'https://example.com/article/foo' \
    --markdown \
    --extractor ./my-extractor.mjs

Tests

tests/cli.test.ts adds:

  • option introspection (flag is registered, repeatable, default [])
  • positive: --extractor loads the supplied file and registers the mapping with ExtractorRegistry
  • negative: malformed default-export is rejected with a clear error message

Two fixtures under tests/fixtures/ back the integration cases.
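The "repeatable, default []" behaviour that the option-introspection test checks can be illustrated with a reducer-style accumulator. This is a hedged sketch only: `collect` and `parseExtractorFlags` are hypothetical names, and the hand-rolled argv loop stands in for whatever option parser the CLI actually uses.

```typescript
// Accumulator: each occurrence of the flag appends to the previous value.
function collect(value: string, previous: string[]): string[] {
    return [...previous, value];
}

// Minimal stand-in for an option parser: gathers every --extractor <path>.
function parseExtractorFlags(argv: string[]): string[] {
    let extractors: string[] = []; // default when the flag is never given
    for (let i = 0; i < argv.length; i++) {
        const next = argv[i + 1];
        if (argv[i] === '--extractor' && next !== undefined) {
            extractors = collect(next, extractors);
            i++; // skip the path we just consumed
        }
    }
    return extractors;
}
```

Returning a fresh array from `collect` rather than mutating `previous` keeps the default `[]` from being shared between invocations.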

Notes

  • The second commit (fix --extractor loading of .mjs files in compiled CJS) is a follow-up that fixes a runtime regression I hit when smoke-testing the published-bin code path. TypeScript's CommonJS lowering turns await import(specifier) into a require(specifier) equivalent, and require() neither accepts file:// URLs nor loads .mjs modules. The fix hides the dynamic import inside a new Function() so tsc does not transform it. Vitest needs the in-source import() form (its module graph cannot service the eval'd one), so the two paths are switched on globalThis.__vitest_worker__. Happy to split this commit out or document it differently if you'd prefer.
  • Independent of #322 (CLI: add --source-url flag for site-specific extraction on stdin/file input). That flag is the natural companion when reading from stdin/file: together they let you run a custom extractor against a saved-page snapshot, but neither PR depends on the other.
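The import workaround from the first note can be sketched as below. This is an illustration under assumptions, not the code in cli.ts: `Importer`, `nativeImport`, `inSourceImport` and `chooseImporter` are hypothetical names, and the check on `__vitest_worker__` is written as a simple presence test, which may differ from the actual condition.

```typescript
type Importer = (specifier: string) => Promise<unknown>;

// The function body is a string, so tsc never sees an import() expression
// here and cannot lower it to require(); Node parses it at runtime instead.
const nativeImport = new Function('specifier', 'return import(specifier);') as Importer;

// In-source form: visible to tsc (and lowered under CommonJS output), but
// also visible to tools such as vitest that need to hook the call.
const inSourceImport: Importer = (specifier) => import(specifier);

// Pick the importer based on a globals-like object, so the choice is testable.
function chooseImporter(env: Record<string, unknown>): Importer {
    return '__vitest_worker__' in env ? inSourceImport : nativeImport;
}
```

In a CLI this would be called once with the real globals, for example `chooseImporter(globalThis as unknown as Record<string, unknown>)`, and the returned function would then be given the `file://` URL string produced by `pathToFileURL`.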

kaznak added 3 commits June 25, 2026 10:48
Site-specific extractors currently must live under src/extractors/ and be
registered in src/extractor-registry.ts. Every new site therefore requires
a fork-build-distribute loop, which is too heavy for end-user customization
(e.g. a paginated-article extractor for a regional news site).

Add a runtime-configurable extension surface in two parts.

1. New sub-entry `defuddle/extractor` re-exporting:

   - BaseExtractor (the abstract class to subclass)
   - ExtractorRegistry (so runtime callers can register without reaching
     into internal paths)
   - ExtractorMapping, ExtractorConstructor (the registration shape)
   - ExtractorOptions, ExtractorResult, ExtractedContent, ExtractorVariables

   This is the public API for extractor authors. The existing entries —
   `defuddle`, `defuddle/full`, `defuddle/node` — are unchanged, so the
   parser API surface and bundle composition stay exactly as before.

   ExtractorMapping and ExtractorConstructor in src/extractor-registry.ts
   were promoted from file-local types to exported ones; no behavioral
   change.

2. New CLI option `--extractor <path>` on the `parse` command (repeatable).

   Each value is dynamic-imported via pathToFileURL; the default export must
   be `{ patterns: (string | RegExp)[], extractor: class }`. Each mapping is
   registered with ExtractorRegistry before parsing runs, so the custom
   extractor competes alongside the built-ins for the current URL.

   A descriptive error is thrown when the default export does not match
   the expected shape, so misconfiguration fails fast at the CLI.

Example extractor file loaded via --extractor:

    import { BaseExtractor } from 'defuddle/extractor';

    class MyExtractor extends BaseExtractor {
        canExtract() { return false; }
        canExtractAsync() { return true; }
        prefersAsync() { return true; }
        async extractAsync() { /* fetch additional pages, return ExtractorResult */ }
    }

    export default {
        patterns: [/^https?:\/\/example\.com\/article\//],
        extractor: MyExtractor,
    };

Tests added in tests/cli.test.ts:

- option introspection (flag name, repeatable string-array default)
- positive: --extractor registers the mapping with ExtractorRegistry
- negative: malformed default-export rejected with a clear error message

The previous commit used a bare `await import(specifier)` in cli.ts, which
TypeScript's CommonJS lowering rewrites into a `require(specifier)`
equivalent. `require()` does not accept `file://` URLs and cannot load
`.mjs` modules, so end-to-end use of `--extractor` against the npm-published
bin would fail at runtime with:

    Error: Cannot find module 'file:///.../extractor.mjs'

Hide the dynamic import inside a `new Function()` so tsc does not transform
it — the import expression is then parsed at runtime by Node, which honours
native ESM dynamic import for both `file://` URLs and `.mjs` files.

Vitest runs the TS source through esbuild (which preserves `import()`) but
its module graph does not install the dynamic-import callback that the
Function()-eval shortcut needs, so it would trip on "A dynamic import
callback was not specified." Detect that environment via
`globalThis.__vitest_worker__` and use the in-source `import()` form there
so vitest can hook the call normally. All 13 cli tests pass under vitest;
the production CJS path was smoke-tested by loading an .mjs extractor
against a live URL.
