CLI: add per-pass removal toggles (--no-content-patterns etc.) #332
Open
kaznak wants to merge 2 commits into
Add CLI flags that surface the `remove*` options Defuddle already
exposes in its JS API but never accepted on the command line.
Motivation: when developing a site-specific extractor, the first
question is "which removal pass dropped this element?". Without
per-pass toggles the only way to find out is to fork defuddle and
add console.logs.
New flags (all on the `parse` command):
- `--no-content-patterns` — keep boilerplate patterns the
removeByContentPattern step would otherwise drop (read time,
breadcrumb, blog metadata lists, newsletter signup, hero header
duplicates, eyebrow labels).
- `--no-low-scoring` — keep elements the content-scoring pass would
drop.
- `--no-exact-selectors` / `--no-partial-selectors` — keep elements
matched by the EXACT_SELECTORS / PARTIAL_SELECTORS denylists.
- `--no-hidden-elements` — keep CSS-hidden elements.
- `--no-small-images` — keep small images.
- `--remove-images` — strip <img>/<picture>/<figure>. The API
default is `false` so this is the only flag in positive form.
Three tests cover registration plus behavior change for two of the
toggles (the metadata-list bypass for --no-content-patterns and the
image strip for --remove-images).
Implementation note: Defuddle merges options as
`{ ...defaults, ...this.options }`. An explicit `undefined` value
shadows the default `true`, so the CLI only sets a key when the
caller asked for it — done via a small `maybe()` helper that emits
an empty object for `undefined` values.
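The merge behavior described above can be reproduced in a few lines. This is a minimal sketch, not the code from the PR: `RemovalOptions`, `mergeOptions`, and the body of `maybe()` are assumptions made for illustration; only the `{ ...defaults, ...this.options }` merge shape and the two defaults shown come from the description.

```typescript
// Illustrative subset of the options; only these two defaults are stated in the PR.
interface RemovalOptions {
  removeContentPatterns?: boolean;
  removeImages?: boolean;
}

const defaults: RemovalOptions = {
  removeContentPatterns: true,
  removeImages: false,
};

// Same merge shape the PR describes: later spreads win, even when the value is undefined.
function mergeOptions(options: RemovalOptions): RemovalOptions {
  return { ...defaults, ...options };
}

// Hypothetical maybe(): returns {} for undefined so the key never reaches the merge.
function maybe(key: keyof RemovalOptions, value: boolean | undefined): RemovalOptions {
  const out: RemovalOptions = {};
  if (value !== undefined) out[key] = value;
  return out;
}

// An explicit undefined shadows the default.
const shadowed = mergeOptions({ removeContentPatterns: undefined });
console.log(shadowed.removeContentPatterns); // undefined

// Omitting the key lets the default apply.
const omitted = mergeOptions(maybe('removeContentPatterns', undefined));
console.log(omitted.removeContentPatterns); // true

// A real value still overrides the default.
const disabled = mergeOptions(maybe('removeContentPatterns', false));
console.log(disabled.removeContentPatterns); // false
```

The first case is the one the helper exists to avoid: the key is present with the value `undefined`, so the spread copies it over the default.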
Problem
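`Defuddle` exposes a set of `remove*` options in its JS API (`DefuddleOptions` in `src/types.ts`):
- `removeExactSelectors` (default `true`)
- `removePartialSelectors` (default `true`)
- `removeHiddenElements` (default `true`)
- `removeLowScoring` (default `true`)
- `removeSmallImages` (default `true`)
- `removeContentPatterns` (default `true`)
- `removeImages` (default `false`)
The CLI never wires any of them up. `src/cli.ts` builds `defuddleOpts` with only `{ debug, markdown, separateMarkdown, language }`, so the CLI runs with the full pipeline locked on and there is no way to disable a single step without forking.
For users developing site-specific extractors this is the recurring question: which removal pass dropped this element? The cleanest debugging workflow is "run with that pass off, diff against the default output." Without per-pass toggles, the only way to answer that is to fork defuddle and add `console.log` calls.
Proposal
Add seven CLI flags on `parse`, each binding 1:1 to an existing `remove*` API option. Six are `--no-*` form (matching the API default of `true`); one is positive (`removeImages` defaults to `false`):
- `--no-exact-selectors` → `removeExactSelectors: false`
- `--no-partial-selectors` → `removePartialSelectors: false`
- `--no-hidden-elements` → `removeHiddenElements: false`
- `--no-low-scoring` → `removeLowScoring: false`
- `--no-small-images` → `removeSmallImages: false`
- `--no-content-patterns` → `removeContentPatterns: false`
- `--remove-images` → `removeImages: true`
Example workflow for extractor development: run `parse` once with the default pipeline and once with a single pass disabled (for example `--no-content-patterns`), then diff the two outputs.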
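The diff step of that workflow can be sketched in TypeScript. This is an illustration only: `linesRemovedByPass` is a hypothetical helper, not part of Defuddle or this PR, and it assumes the two `parse` outputs have already been captured as strings (one from the default run, one from a run with a single `--no-*` flag).

```typescript
// Hypothetical helper: report lines that appear only when a removal pass is disabled.
// Those lines are what the pass strips in the default pipeline.
function linesRemovedByPass(defaultOutput: string, passDisabledOutput: string): string[] {
  // Normalize to trimmed, non-empty lines so whitespace differences are ignored.
  const normalize = (text: string): string[] =>
    text
      .split('\n')
      .map((line) => line.trim())
      .filter((line) => line.length > 0);

  const keptByDefault = new Set(normalize(defaultOutput));
  return normalize(passDisabledOutput).filter((line) => !keptByDefault.has(line));
}

// Made-up sample outputs, standing in for two captured runs.
const defaultRun = '# Title\n\nBody text';
const contentPatternsOff = '# Title\n\n5 min read\n\nBody text';

console.log(linesRemovedByPass(defaultRun, contentPatternsOff)); // [ '5 min read' ]
```

A set-based comparison ignores line order and duplicates, which is usually enough to spot which element a pass dropped; a real diff tool gives more context when order matters.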
Implementation note
Defuddle merges options as `{ ...defaults, ...this.options }`.
So an explicit `undefined` from `this.options` shadows the default `true` and silently disables the removal pass. The CLI therefore only inserts a `remove*` key when the user actually set the flag. This is done via a small `maybe()` helper that emits an empty object for `undefined` values instead of `{ key: undefined }`. This keeps both call paths working:
- CLI invocation (`--no-content-patterns` absent → `options.contentPatterns === true` → `removeContentPatterns: true` is set, explicitly matching the default, so it is a no-op)
- Direct call (`parseSource('-', {})` → `options.contentPatterns === undefined` → key omitted → default `true` applies)
Tests
`tests/cli.test.ts` adds:
- The new flags are registered on `parse`.
- `--no-content-patterns` keeps a 2-item trailing `<ul>` that the metadata-list heuristic would otherwise strip.
- `--remove-images` strips an inline `<img>` that the default pipeline keeps.
All 371 existing tests continue to pass.
Notes
- `--debug` (sister PR `feat/cli-debug-output`): turn a single pass off and use the debug log to see what the rest of the pipeline still removes.
- `--extractor` (CLI: add --extractor flag and defuddle/extractor public entry #328): inspect what a custom extractor's output looks like before any removal pass touches it (`--no-content-patterns --no-low-scoring --no-exact-selectors --no-partial-selectors`).