CLI: add per-pass removal toggles (--no-content-patterns etc.)#332

Open
kaznak wants to merge 2 commits into kepano:main from kaznak:feat/cli-remove-toggles


kaznak commented Jun 26, 2026

Problem

Defuddle exposes a set of remove* options in its JS API (DefuddleOptions in src/types.ts):

  • removeExactSelectors (default true)
  • removePartialSelectors (default true)
  • removeHiddenElements (default true)
  • removeLowScoring (default true)
  • removeSmallImages (default true)
  • removeContentPatterns (default true)
  • removeImages (default false)

The CLI never wires any of them up. src/cli.ts builds defuddleOpts with only { debug, markdown, separateMarkdown, language }, so the CLI runs with the full pipeline locked on and there is no way to disable a single step without forking.

For users developing site-specific extractors this is the recurring question: which removal pass dropped this element? The cleanest debugging workflow is "run with that pass off, diff against the default output." Without per-pass toggles, the only way to answer that is to fork defuddle and add console.log calls.

Proposal

Add seven CLI flags on parse, each binding 1:1 to an existing remove* API option. Six are --no-* form (matching the API default of true); one is positive (removeImages defaults to false):

--no-content-patterns    Keep boilerplate patterns (read time, breadcrumb, metadata lists, newsletter signups, …)
--no-low-scoring         Keep low-scoring elements
--no-exact-selectors     Keep EXACT_SELECTORS matches (ads, scripts, JW Player, …)
--no-partial-selectors   Keep PARTIAL_SELECTORS matches (class/id containing "ad-", "sidebar", "comment", …)
--no-hidden-elements     Keep CSS-hidden elements (display:none, visibility:hidden, opacity:0)
--no-small-images        Keep small images
--remove-images          Strip <img>/<picture>/<figure>
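
The 1:1 binding above can be written out as a lookup table. This is only an illustrative sketch: `REMOVAL_FLAGS` and `parsedKey` are hypothetical names, not code from this PR, and `parsedKey` mimics the commander convention of stripping the `no-` prefix and camel-casing the rest (so `--no-content-patterns` is read back as `contentPatterns`).

```typescript
// Hypothetical table: CLI flag -> existing DefuddleOptions key and its API default.
interface RemovalFlag {
  flag: string;
  apiOption: string;
  apiDefault: boolean;
}

const REMOVAL_FLAGS: RemovalFlag[] = [
  { flag: "--no-content-patterns", apiOption: "removeContentPatterns", apiDefault: true },
  { flag: "--no-low-scoring", apiOption: "removeLowScoring", apiDefault: true },
  { flag: "--no-exact-selectors", apiOption: "removeExactSelectors", apiDefault: true },
  { flag: "--no-partial-selectors", apiOption: "removePartialSelectors", apiDefault: true },
  { flag: "--no-hidden-elements", apiOption: "removeHiddenElements", apiDefault: true },
  { flag: "--no-small-images", apiOption: "removeSmallImages", apiDefault: true },
  { flag: "--remove-images", apiOption: "removeImages", apiDefault: false },
];

// Property name under which a commander-style parser exposes the flag:
// drop the leading "--" and an optional "no-", then camelCase the remainder.
function parsedKey(flag: string): string {
  return flag
    .replace(/^--(no-)?/, "")
    .replace(/-([a-z])/g, (_match: string, letter: string) => letter.toUpperCase());
}

for (const { flag, apiOption, apiDefault } of REMOVAL_FLAGS) {
  console.log(`${flag} -> options.${parsedKey(flag)} -> ${apiOption} (API default: ${apiDefault})`);
}
```

A negated flag is only needed where the API default is `true`, which is why `--remove-images` is the one entry in positive form.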

Example workflow for extractor development:

# baseline: see everything (use sister --debug PR to log what would be removed)
defuddle parse 'https://example.com/article' \
    --markdown \
    --no-content-patterns \
    --no-low-scoring > baseline.md

# default pipeline output
defuddle parse 'https://example.com/article' \
    --markdown > pipeline.md

# diff tells you what the pipeline stripped
diff baseline.md pipeline.md

Implementation note

Defuddle merges options as:

const options = {
    removeExactSelectors: true,
    /* ...other defaults... */
    ...this.options,
    ...overrideOptions
};

So an explicit undefined from this.options shadows the default true and silently disables the removal pass. The CLI therefore only inserts a remove* key when the user actually set the flag — done via a small maybe() helper that emits an empty object for undefined values, instead of { key: undefined }. This keeps both call paths working:

  • via commander (--no-content-patterns absent → options.contentPatterns === true → removeContentPatterns: true is set, which explicitly matches the default and is a no-op)
  • via direct API import (parseSource('-', {}) → options.contentPatterns === undefined → key omitted → default true applies)
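
To make the shadowing concrete, here is a minimal, self-contained sketch of the spread behaviour and of a `maybe()`-style helper. The PR does not show the helper's source, so the signature below is an assumption, and `DEFAULTS` and `merge` are hypothetical stand-ins for Defuddle's internal merge rather than its actual code.

```typescript
type Options = Record<string, boolean | undefined>;

// Stand-in for the library defaults (assumed subset, for illustration only).
const DEFAULTS: Options = { removeContentPatterns: true, removeImages: false };

// Assumed shape of the helper: emit {} for undefined so the spread adds no key.
function maybe(key: string, value: boolean | undefined): Options {
  return value === undefined ? {} : { [key]: value };
}

// Same merge order as the snippet above: later spreads win, even when undefined.
function merge(userOptions: Options): Options {
  return { ...DEFAULTS, ...userOptions };
}

// An explicit undefined overwrites the default true.
const shadowed = merge({ removeContentPatterns: undefined });

// With maybe(), the key is never written, so the default survives.
const safe = merge({ ...maybe("removeContentPatterns", undefined) });

// A real value still passes through unchanged.
const disabled = merge({ ...maybe("removeContentPatterns", false) });

console.log(shadowed.removeContentPatterns); // undefined
console.log(safe.removeContentPatterns);     // true
console.log(disabled.removeContentPatterns); // false
```

Object spread copies every own enumerable property, including those whose value is `undefined`, which is why omitting the key is the only way to let the default through.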

Tests

tests/cli.test.ts adds:

  • Option introspection: all seven flags are registered on parse.
  • Behavior: --no-content-patterns keeps a 2-item trailing <ul> that the metadata-list heuristic would otherwise strip.
  • Behavior: --remove-images strips an inline <img> that the default pipeline keeps.

All 371 existing tests continue to pass.

Notes

  • Combines naturally with --debug (sister PR feat/cli-debug-output): turn a single pass off and use the debug log to see what the rest of the pipeline still removes.
  • Combines naturally with --extractor (CLI: add --extractor flag and defuddle/extractor public entry #328): inspect what a custom extractor's output looks like before any removal pass touches it (--no-content-patterns --no-low-scoring --no-exact-selectors --no-partial-selectors).
  • The flags map 1:1 to API options; no new behavior is added to the library.

kaznak added 2 commits June 26, 2026 13:51
Add CLI flags that surface the `remove*` options Defuddle already
exposes in its JS API but never accepted on the command line.
Motivation: when developing a site-specific extractor, the first
question is "which removal pass dropped this element?". Without
per-pass toggles the only way to find out is to fork defuddle and
add console.logs.

New flags (all on the `parse` command):

- `--no-content-patterns` — keep boilerplate patterns the
  removeByContentPattern step would otherwise drop (read time,
  breadcrumb, blog metadata lists, newsletter signup, hero header
  duplicates, eyebrow labels).
- `--no-low-scoring` — keep elements the content-scoring pass would
  drop.
- `--no-exact-selectors` / `--no-partial-selectors` — keep elements
  matched by the EXACT_SELECTORS / PARTIAL_SELECTORS denylists.
- `--no-hidden-elements` — keep CSS-hidden elements.
- `--no-small-images` — keep small images.
- `--remove-images` — strip <img>/<picture>/<figure>. The API
  default is `false` so this is the only flag in positive form.

Three tests cover registration plus behavior change for two of the
toggles (the metadata-list bypass for --no-content-patterns and the
image strip for --remove-images).

Implementation note: Defuddle merges options as
`{ ...defaults, ...this.options }`. An explicit `undefined` value
shadows the default `true`, so the CLI only sets a key when the
caller asked for it — done via a small `maybe()` helper that emits
an empty object for `undefined` values.