feat(source): add ‘source clean’ command by Flosters · Pull Request #261 · teng-lin/notebooklm-py

Flosters · 2026-04-09T01:54:05Z

Summary

After failed deep-research runs, bulk imports, or bot-blocked crawls, notebooks can accumulate junk sources: errored ones, gateway/anti-bot pages (Cloudflare, 403, 404), and duplicates. There is currently no way to clean these up in bulk — users must delete them one by one.

This PR adds `notebooklm source clean`, which automatically identifies and removes junk sources in one command.

What it removes

| Category | Criteria |
| --- | --- |
| Error/unknown | `status` is `error` or `unknown` |
| Gateway / anti-bot | Title matches patterns: `403`, `404`, `Forbidden`, `Access Denied`, `Just a Moment`, `Attention Required`, `Security Check`, `CAPTCHA` |
| Duplicates | Same URL already seen (normalized: scheme+host+path, no query/fragment); keeps the oldest copy |
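The three criteria above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Source` dataclass, `GATEWAY_RE`, `normalize_url`, and `find_junk` are hypothetical names, the status is assumed to already be a string, and lowercasing the scheme and host is an extra assumption beyond the "scheme+host+path" rule stated in the table.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List, Optional
from urllib.parse import urlparse, urlunparse


@dataclass
class Source:
    """Hypothetical stand-in for the library's source object."""
    id: str
    title: str
    url: Optional[str]
    status: str          # assumed to be a string here, e.g. "ready" or "error"
    created_at: float    # assumed timestamp used for oldest-first ordering


# Title patterns from the table above, matched case-insensitively.
GATEWAY_RE = re.compile(
    r"403|404|forbidden|access denied|just a moment|"
    r"attention required|security check|captcha",
    re.IGNORECASE,
)


def normalize_url(url: str) -> str:
    """Keep scheme, host, and path; drop params, query, and fragment."""
    p = urlparse(url)
    # Lowercasing scheme and host is an assumption of this sketch.
    return urlunparse((p.scheme.lower(), p.netloc.lower(), p.path, "", "", ""))


def find_junk(sources: Iterable[Source]) -> List[str]:
    """Return ids of sources matching any of the three criteria."""
    seen = set()
    junk = []
    # Oldest first, so the earliest copy of each URL is the one that is kept.
    for s in sorted(sources, key=lambda s: s.created_at):
        if s.status in ("error", "unknown") or GATEWAY_RE.search(s.title or ""):
            junk.append(s.id)
            continue
        if s.url:
            key = normalize_url(s.url)
            if key in seen:
                junk.append(s.id)
                continue
            seen.add(key)
    return junk
```

In this sketch a source already flagged by status or title does not register its URL as seen, so a later healthy copy of the same URL survives; whether the PR behaves the same way is not stated. Bare numeric patterns such as `403` can also match ordinary titles, which is one reason to check `--dry-run` output first.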

Flags

notebooklm source clean             # interactive confirmation
notebooklm source clean --dry-run   # preview without deleting
notebooklm source clean -y          # skip confirmation
notebooklm source clean -n <id>     # target specific notebook

Implementation notes

Sources are sorted oldest-first before deduplication so the earliest copy is always kept
Deletions are chunked in batches of 10 with a 0.5s inter-chunk delay to avoid rate limiting on large notebooks
Uses existing asyncio.gather pattern consistent with other bulk operations in the codebase
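The chunking described in these notes might look like the sketch below. It assumes a hypothetical async callable `delete_one(source_id)` standing in for the client's delete call; the function name `delete_in_chunks` and its parameters are illustrative, with defaults taken from the batch size and delay quoted above.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def delete_in_chunks(
    delete_one: Callable[[str], Awaitable[None]],  # hypothetical delete call
    ids: Sequence[str],
    chunk_size: int = 10,
    delay: float = 0.5,
) -> None:
    """Delete ids concurrently within each chunk, pausing between chunks."""
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        # Up to chunk_size deletions run concurrently.
        await asyncio.gather(*(delete_one(sid) for sid in chunk))
        # Sleep only when another chunk follows.
        if start + chunk_size < len(ids):
            await asyncio.sleep(delay)
```

Running at most ten deletions at a time with a pause in between bounds the request rate, at the cost of a slightly longer run on large notebooks.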

Test plan

--dry-run prints count and exits without deleting
Sources with status=error are removed
Duplicate URLs: the oldest copy is kept and every newer copy is deleted
Gateway-titled sources (e.g. title "403 Forbidden") are removed
Clean notebook prints "Notebook is already clean"
-y skips confirmation prompt
Deletions are batched (no 429 on 50+ junk sources)

Summary by CodeRabbit

New Features
- Added notebooklm source clean CLI command to automatically remove problematic sources from notebooks.
- Detects sources with error/unknown status, gateway/security-block patterns (e.g., access denied, captcha), and duplicate URLs.
- Performs batched deletions with progress counts; reports when a notebook is already clean.
- Includes --dry-run to preview actions and --yes to skip confirmation.

Adds `notebooklm source clean` to automatically remove junk sources from a notebook in bulk. Useful after failed deep-research runs, bot-blocked crawls, or bulk imports that left behind error or duplicate sources.

Removes sources that match any of:
- Status is 'error' or 'unknown'
- Title matches gateway/anti-bot patterns (403, 404, Access Denied, Cloudflare 'Just a Moment', CAPTCHA, etc.)
- URL is a duplicate of an already-seen source (keeps oldest)

Flags:
--dry-run   Preview what would be deleted without deleting
-y/--yes    Skip confirmation prompt
-n          Target a specific notebook

Deletions are batched in chunks of 10 with a 0.5s delay to avoid hitting rate limits on large notebooks.

coderabbitai · 2026-04-09T01:54:23Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2c3749ae-9f35-47e6-a705-f8cdb05a69f2

📥 Commits

Reviewing files that changed from the base of the PR and between da0b8bd and 1f47282.

📒 Files selected for processing (1)

src/notebooklm/cli/source.py

🚧 Files skipped from review as they are similar to previous changes (1)

src/notebooklm/cli/source.py

📝 Walkthrough

Walkthrough

Added a new source clean CLI subcommand that resolves a notebook, lists sources, identifies candidates (error/unknown status, gateway/anti-bot title patterns, or duplicate URLs via normalization), and optionally deletes them in asynchronous batches with dry-run and confirmation controls.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| New Source Clean Command `src/notebooklm/cli/source.py` | Introduces `source_clean` CLI subcommand (`source clean`) with notebook resolution, source listing, candidate selection (error/unknown status, gateway/anti-bot title regexes, normalized-URL duplicate detection), `--dry-run` and `--yes/-y` flags, and chunked async deletion (batches of 10, short inter-batch delay). Adds local URL normalization imports (`urlparse`, `urlunparse`). |

Sequence Diagram

sequenceDiagram
    actor User
    participant CLI as "CLI Handler"
    participant Resolver as "Notebook Resolver"
    participant SourceSvc as "Source Lister/Service"
    participant Selector as "Candidate Selector"
    participant Deleter as "Chunked Deleter"
    participant Feedback as "User Feedback"

    User->>CLI: source clean --notebook X [--dry-run] [--yes]
    CLI->>Resolver: resolve_notebook_id(X)
    Resolver-->>CLI: notebook_id
    CLI->>SourceSvc: list_sources(notebook_id)
    SourceSvc-->>CLI: sources[]
    CLI->>Selector: identify_candidates(sources)
    Note over Selector: Filter by status, title patterns,\nnormalized URL duplicates
    Selector-->>CLI: candidate_ids[]
    alt dry-run
        CLI->>Feedback: report candidate count
    else proceed
        alt no --yes
            CLI->>User: confirm deletion?
            User-->>CLI: approval
        end
        CLI->>Deleter: delete_in_batches(candidate_ids, batch_size=10)
        loop per batch
            Deleter->>SourceSvc: delete_source(id)
            SourceSvc-->>Deleter: success / error
            Deleter->>Deleter: wait(short_delay)
        end
        Deleter-->>CLI: report successes & failures
        CLI->>Feedback: final report
    end
    Feedback-->>User: operation complete

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 I hopped through links both broken and spare,
Found gateways, duplicates, errors laid bare.
In batches I nibbled the rot away,
Dry-run to peek, or confirm — then sway.
Clean notebooks at dusk — a rabbit's delight. ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat(source): add 'source clean' command' directly and clearly summarizes the main change: a new CLI command is being added to the source module. |


coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/notebooklm/cli/source.py`:
- Around line 982-992: The deletion loop currently swallows all errors by
calling asyncio.gather(..., return_exceptions=True) and always prints success;
update the loop around client.sources.delete / asyncio.gather to inspect the
gathered results, count actual successful deletions vs exceptions, and log or
print any failures (include exception messages and the corresponding sid) as
they occur; after the loop, change the console.print call to report the real
number of successful deletions (and optionally failed count) using the counted
successes so users aren’t shown a false “Successfully cleaned” message for
delete_list, chunk_size, delete_tasks, and nb_id_resolved.
- Around line 945-950: The code incorrectly converts the enum s.status to a
string with str(...).lower(), so comparisons never match; replace that
conversion with the helper source_status_to_str(s.status) (handle None if
needed) and use that result to check if status is "error" or "unknown", then add
s.id to to_delete as before (look at variables s, to_delete, and function
source_status_to_str for where to change).
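The first finding above concerns `asyncio.gather(..., return_exceptions=True)`, which returns raised exceptions as ordinary result values, in the same order as the awaitables passed in. A minimal sketch of counting real successes follows; `delete_one` and `delete_and_report` are hypothetical names, not the PR's code.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Sequence, Tuple


async def delete_and_report(
    delete_one: Callable[[str], Awaitable[None]],  # hypothetical delete call
    ids: Sequence[str],
) -> Tuple[int, Dict[str, BaseException]]:
    """Run deletions concurrently; return (success count, failures by id)."""
    results = await asyncio.gather(
        *(delete_one(sid) for sid in ids), return_exceptions=True
    )
    # Results line up with ids, so each exception can be tied to its source id.
    failures = {
        sid: res for sid, res in zip(ids, results)
        if isinstance(res, BaseException)
    }
    return len(ids) - len(failures), failures
```

A caller can then print the success count and list each failed id with its exception message, rather than reporting success unconditionally.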

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4ae7d06c-d332-4bb1-9a3a-f0412794bf36

📥 Commits

Reviewing files that changed from the base of the PR and between a997718 and da0b8bd.

📒 Files selected for processing (1)

src/notebooklm/cli/source.py


gemini-code-assist

Code Review

This pull request introduces a new source clean command to the CLI, designed to automatically remove duplicate, error, and access-blocked sources from a NotebookLM notebook. The command identifies sources based on their status, title, and normalized URL, and includes options for dry-run and confirmation. A critical issue was identified in the status checking logic: the s.status enum is incorrectly converted to a string, which prevents sources in an error or unknown state from being properly detected and cleaned. The suggested fix is to use the source_status_to_str() helper function.

gemini-code-assist · 2026-04-09T02:10:55Z

src/notebooklm/cli/source.py

+
+            for s in sorted_sources:
+                title = (s.title or "").strip()
+                status = str(s.status).lower() if s.status else "unknown"


The current logic for checking the source status is incorrect. The s.status attribute is an integer from the SourceStatus enum. Converting it to a string via str(s.status) will produce a digit string (e.g., '3'), not the word 'error'. As a result, the check status in ["error", "unknown"] will never be true, and sources in an error state won't be cleaned up.

You should use the source_status_to_str() helper function, which is designed for this purpose and already used elsewhere in the file.

Suggested change

-            status = str(s.status).lower() if s.status else "unknown"
+            status = source_status_to_str(s.status)
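The mismatch can be reproduced in isolation. The sketch below assumes `SourceStatus` is an `IntEnum`; the member values are made up, and `status_to_str` is a hypothetical stand-in for the project's `source_status_to_str()`, whose actual behaviour is not shown in this thread.

```python
from enum import IntEnum
from typing import Optional


class SourceStatus(IntEnum):
    """Hypothetical enum; member values are illustrative only."""
    PROCESSING = 1
    READY = 2
    ERROR = 3


def status_to_str(status: Optional[SourceStatus]) -> str:
    """Hypothetical stand-in for source_status_to_str()."""
    return status.name.lower() if status is not None else "unknown"


status = SourceStatus.ERROR

# str() on an IntEnum member yields the digits on Python 3.11+ and
# "SourceStatus.ERROR" on earlier versions; neither lowercases to "error".
naive = str(status).lower()
matches_naive = naive in ("error", "unknown")

# Mapping through the member name gives the word the check expects.
matches_helper = status_to_str(status) in ("error", "unknown")
```

Here `matches_naive` is false and `matches_helper` is true, which is the behaviour the review describes.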

- Use source_status_to_str() instead of str().lower() so error/unknown status sources are correctly identified and removed
- Track per-result success/failure from asyncio.gather so the output reflects what actually happened instead of always reporting success

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

agustinsilvazambrano added 2 commits April 8, 2026 21:27

style: ruff format source.py

da0b8bd

Flosters wants to merge 3 commits into teng-lin:main from Flosters:feat/source-clean
