feat(source): add ‘source clean’ command#261

Open
Flosters wants to merge 3 commits into teng-lin:main from
Flosters:feat/source-clean
Conversation

@Flosters Flosters commented Apr 9, 2026

Summary

After failed deep-research runs, bulk imports, or bot-blocked crawls, notebooks can accumulate junk sources: errored ones, gateway/anti-bot pages (Cloudflare, 403, 404), and duplicates. There is currently no way to clean these up in bulk — users must delete them one by one.

This PR adds notebooklm source clean, which automatically identifies and removes junk sources in one command.

What it removes

| Category | Criteria |
|---|---|
| Error/unknown | status is error or unknown |
| Gateway / anti-bot | Title matches patterns: 403, 404, Forbidden, Access Denied, Just a Moment, Attention Required, Security Check, CAPTCHA |
| Duplicates | Same URL already seen (normalized: scheme+host+path, no query/fragment); keeps the oldest copy |
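
The three criteria above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the dict keys (id, title, status, url, created_at), the names GATEWAY_PATTERN, normalize_url, and find_junk, and the exact regex are all assumptions made for the example. It also assumes that a source already flagged as junk does not count as "seen" for deduplication.

```python
import re
from urllib.parse import urlparse, urlunparse

# Hypothetical regex built from the title patterns listed in the table.
GATEWAY_PATTERN = re.compile(
    r"\b(403|404)\b|forbidden|access denied|just a moment"
    r"|attention required|security check|captcha",
    re.IGNORECASE,
)


def normalize_url(url: str) -> str:
    """Keep scheme, host, and path; drop params, query, and fragment."""
    parts = urlparse(url)
    return urlunparse((parts.scheme, parts.netloc, parts.path, "", "", ""))


def find_junk(sources):
    """Return ids of junk sources, scanning oldest-first so the oldest copy is kept.

    `sources` is a list of dicts with hypothetical keys:
    id, title, status (already a string), url, created_at.
    """
    seen = set()
    junk = []
    for s in sorted(sources, key=lambda s: s["created_at"]):
        # Criteria 1 and 2: bad status or a gateway/anti-bot title.
        if s["status"] in ("error", "unknown") or GATEWAY_PATTERN.search(s["title"] or ""):
            junk.append(s["id"])
            continue
        # Criterion 3: a normalized URL that an older, healthy source already has.
        if s["url"]:
            key = normalize_url(s["url"])
            if key in seen:
                junk.append(s["id"])
                continue
            seen.add(key)
    return junk
```

Title matching of this kind can produce false positives (a legitimate page whose title contains "404", for example), which is one reason to run with --dry-run first.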

Flags

notebooklm source clean             # interactive confirmation
notebooklm source clean --dry-run   # preview without deleting
notebooklm source clean -y          # skip confirmation
notebooklm source clean -n <id>     # target specific notebook

Implementation notes

  • Sources are sorted oldest-first before deduplication so the earliest copy is always kept
  • Deletions are chunked in batches of 10 with a 0.5s inter-chunk delay to avoid rate limiting on large notebooks
  • Uses the existing asyncio.gather pattern, consistent with other bulk operations in the codebase
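
The batching described in these notes could look something like the sketch below. It is an illustration under assumptions, not the PR's code: delete_in_chunks and its delete_one parameter are hypothetical names, and delete_one stands in for whatever coroutine actually deletes a source.

```python
import asyncio


async def delete_in_chunks(delete_one, ids, chunk_size=10, delay=0.5):
    """Run the coroutine function `delete_one(id)` for every id, one chunk at a time."""
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        # Deletions inside a chunk run concurrently.
        await asyncio.gather(*(delete_one(sid) for sid in chunk))
        # Pause between chunks, but not after the last one.
        if start + chunk_size < len(ids):
            await asyncio.sleep(delay)
```

With the defaults shown, 50 junk sources would go out as five chunks separated by four 0.5s pauses.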

Test plan

  • --dry-run prints count and exits without deleting
  • Sources with status=error are removed
  • Duplicate URLs: the oldest copy is kept and every newer copy is deleted
  • Gateway-titled sources (e.g. title "403 Forbidden") are removed
  • Clean notebook prints "Notebook is already clean"
  • -y skips confirmation prompt
  • Deletions are batched (no 429 on 50+ junk sources)

Summary by CodeRabbit

  • New Features
    • Added notebooklm source clean CLI command to automatically remove problematic sources from notebooks.
    • Detects sources with error/unknown status, gateway/security-block patterns (e.g., access denied, captcha), and duplicate URLs.
    • Performs batched deletions with progress counts; reports when a notebook is already clean.
    • Includes --dry-run to preview actions and --yes to skip confirmation.

agustinsilvazambrano added 2 commits April 8, 2026 21:27
Adds `notebooklm source clean` to automatically remove junk sources from
a notebook in bulk. Useful after failed deep-research runs, bot-blocked
crawls, or bulk imports that left behind error or duplicate sources.

Removes sources that match any of:
- Status is 'error' or 'unknown'
- Title matches gateway/anti-bot patterns (403, 404, Access Denied,
  Cloudflare 'Just a Moment', CAPTCHA, etc.)
- URL is a duplicate of an already-seen source (keeps oldest)

Flags:
  --dry-run   Preview what would be deleted without deleting
  -y/--yes    Skip confirmation prompt
  -n          Target a specific notebook

Deletions are batched in chunks of 10 with a 0.5s delay to avoid
hitting rate limits on large notebooks.

coderabbitai bot commented Apr 9, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2c3749ae-9f35-47e6-a705-f8cdb05a69f2

📥 Commits

Reviewing files that changed from the base of the PR and between da0b8bd and 1f47282.

📒 Files selected for processing (1)
  • src/notebooklm/cli/source.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/notebooklm/cli/source.py

Walkthrough

Added a new source clean CLI subcommand that resolves a notebook, lists sources, identifies candidates (error/unknown status, gateway/anti-bot title patterns, or duplicate URLs via normalization), and optionally deletes them in asynchronous batches with dry-run and confirmation controls.

Changes

| Cohort / File(s) | Summary |
|---|---|
| New Source Clean Command (src/notebooklm/cli/source.py) | Introduces source_clean CLI subcommand (source clean) with notebook resolution, source listing, candidate selection (error/unknown status, gateway/anti-bot title regexes, normalized-URL duplicate detection), --dry-run and --yes/-y flags, and chunked async deletion (batches of 10, short inter-batch delay). Adds local URL normalization imports (urlparse, urlunparse). |

Sequence Diagram

sequenceDiagram
    actor User
    participant CLI as "CLI Handler"
    participant Resolver as "Notebook Resolver"
    participant SourceSvc as "Source Lister/Service"
    participant Selector as "Candidate Selector"
    participant Deleter as "Chunked Deleter"
    participant Feedback as "User Feedback"

    User->>CLI: source clean --notebook X [--dry-run] [--yes]
    CLI->>Resolver: resolve_notebook_id(X)
    Resolver-->>CLI: notebook_id
    CLI->>SourceSvc: list_sources(notebook_id)
    SourceSvc-->>CLI: sources[]
    CLI->>Selector: identify_candidates(sources)
    Note over Selector: Filter by status, title patterns,\nnormalized URL duplicates
    Selector-->>CLI: candidate_ids[]
    alt dry-run
        CLI->>Feedback: report candidate count
    else proceed
        alt no --yes
            CLI->>User: confirm deletion?
            User-->>CLI: approval
        end
        CLI->>Deleter: delete_in_batches(candidate_ids, batch_size=10)
        loop per batch
            Deleter->>SourceSvc: delete_source(id)
            SourceSvc-->>Deleter: success / error
            Deleter->>Deleter: wait(short_delay)
        end
        Deleter-->>CLI: report successes & failures
        CLI->>Feedback: final report
    end
    Feedback-->>User: operation complete

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 I hopped through links both broken and spare,
Found gateways, duplicates, errors laid bare.
In batches I nibbled the rot away,
Dry-run to peek, or confirm — then sway.
Clean notebooks at dusk — a rabbit's delight. ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat(source): add 'source clean' command' directly and clearly summarizes the main change: a new CLI command is being added to the source module. |


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/notebooklm/cli/source.py`:
- Around line 982-992: The deletion loop currently swallows all errors by
calling asyncio.gather(..., return_exceptions=True) and always prints success;
update the loop around client.sources.delete / asyncio.gather to inspect the
gathered results, count actual successful deletions vs exceptions, and log or
print any failures (include exception messages and the corresponding sid) as
they occur; after the loop, change the console.print call to report the real
number of successful deletions (and optionally failed count) using the counted
successes so users aren’t shown a false “Successfully cleaned” message for
delete_list, chunk_size, delete_tasks, and nb_id_resolved.
- Around line 945-950: The code incorrectly converts the enum s.status to a
string with str(...).lower(), so comparisons never match; replace that
conversion with the helper source_status_to_str(s.status) (handle None if
needed) and use that result to check if status is "error" or "unknown", then add
s.id to to_delete as before (look at variables s, to_delete, and function
source_status_to_str for where to change).
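
The first finding above (results of asyncio.gather(..., return_exceptions=True) being ignored) can be addressed along these lines. This is a hedged sketch rather than the fix that was committed: delete_chunk and delete_one are hypothetical names, with delete_one standing in for the real delete call.

```python
import asyncio


async def delete_chunk(delete_one, chunk):
    """Delete one chunk of ids and return (deleted_ids, failures)."""
    results = await asyncio.gather(
        *(delete_one(sid) for sid in chunk),
        return_exceptions=True,  # exceptions come back as values, in input order
    )
    deleted, failures = [], []
    for sid, result in zip(chunk, results):
        if isinstance(result, BaseException):
            failures.append((sid, result))  # keep the id with its error for reporting
        else:
            deleted.append(sid)
    return deleted, failures
```

Because gather returns results in the same order as its inputs, zipping them with the chunk pairs each exception with the source id that caused it, so the final message can report real success and failure counts.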
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4ae7d06c-d332-4bb1-9a3a-f0412794bf36

📥 Commits

Reviewing files that changed from the base of the PR and between a997718 and da0b8bd.

📒 Files selected for processing (1)
  • src/notebooklm/cli/source.py

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a new source clean command to the CLI, designed to automatically remove duplicate, error, and access-blocked sources from a NotebookLM notebook. The command identifies sources based on their status, title, and normalized URL, and includes options for dry-run and confirmation. A critical issue was identified in the status checking logic: the s.status enum is incorrectly converted to a string, which prevents sources in an error or unknown state from being properly detected and cleaned. The suggested fix is to use the source_status_to_str() helper function.


for s in sorted_sources:
    title = (s.title or "").strip()
    status = str(s.status).lower() if s.status else "unknown"
high

The current logic for checking the source status is incorrect. The s.status attribute is an integer from the SourceStatus enum. Converting it to a string via str(s.status) will produce a digit string (e.g., '3'), not the word 'error'. As a result, the check status in ["error", "unknown"] will never be true, and sources in an error state won't be cleaned up.

You should use the source_status_to_str() helper function, which is designed for this purpose and already used elsewhere in the file.

Suggested change
- status = str(s.status).lower() if s.status else "unknown"
+ status = source_status_to_str(s.status)
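
The failure mode is easy to reproduce in isolation. The enum below is a stand-in with made-up members and values, and status_to_str is a hypothetical substitute for source_status_to_str(), whose real behavior is not shown in this thread.

```python
from enum import IntEnum


class SourceStatus(IntEnum):
    # Hypothetical members and values, for illustration only.
    PROCESSING = 1
    READY = 2
    ERROR = 3


def status_to_str(status):
    """Hypothetical stand-in for source_status_to_str(): map the enum to its lowercase name."""
    return status.name.lower() if status is not None else "unknown"


# str() on an IntEnum member yields the digits on Python 3.11+ and a qualified
# name such as "SourceStatus.ERROR" on older versions. Neither form, lowercased,
# equals "error", so the original comparison can never succeed.
raw = str(SourceStatus.ERROR).lower()
matches_raw = raw in ("error", "unknown")

# Converting through a name-based helper gives the string the check expects.
matches_helper = status_to_str(SourceStatus.ERROR) in ("error", "unknown")
```

Here matches_raw is False and matches_helper is True, which is the gap the suggested change closes.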

- Use source_status_to_str() instead of str().lower() so error/unknown
  status sources are correctly identified and removed
- Track per-result success/failure from asyncio.gather so the output
  reflects what actually happened instead of always reporting success

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>