feat(source): add ‘source clean’ command #261
Adds `notebooklm source clean` to automatically remove junk sources from a notebook in bulk. Useful after failed deep-research runs, bot-blocked crawls, or bulk imports that left behind error or duplicate sources.
Removes sources that match any of:
- Status is 'error' or 'unknown'
- Title matches gateway/anti-bot patterns (403, 404, Access Denied, Cloudflare 'Just a Moment', CAPTCHA, etc.)
- URL is a duplicate of an already-seen source (keeps oldest)
Flags:
- `--dry-run`: Preview what would be deleted without deleting
- `-y`/`--yes`: Skip confirmation prompt
- `-n`: Target a specific notebook
Deletions are batched in chunks of 10 with a 0.5s delay to avoid hitting rate limits on large notebooks.
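The three removal criteria above can be sketched roughly as follows. This is an illustration only, not the PR's code: the `Source` dataclass, `normalize_url`, `find_junk`, and the exact regular expression are hypothetical, the status is assumed to be already converted to a string, and the URL normalization (lowercased host, trailing slash stripped) is a guess at what "duplicate" means here.

```python
import re
from dataclasses import dataclass
from typing import List, Optional
from urllib.parse import urlsplit


@dataclass
class Source:
    """Hypothetical minimal shape of a source; not the library's real model."""
    id: str
    title: Optional[str]
    status: str              # assumed to be a string such as "ready" or "error"
    url: Optional[str]
    created_at: float        # assumed timestamp used to decide which copy is oldest


# Title patterns taken from the PR description; matching via regex is an assumption.
JUNK_TITLE = re.compile(
    r"\b(403|404)\b|forbidden|access denied|just a moment|"
    r"attention required|security check|captcha",
    re.IGNORECASE,
)


def normalize_url(url: str) -> str:
    """Assumed normalization: lowercase scheme/host, drop the trailing slash."""
    parts = urlsplit(url.strip())
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{parts.path.rstrip('/')}{query}"


def find_junk(sources: List[Source]) -> List[str]:
    """Return ids of sources matching any criterion, visiting oldest first."""
    seen_urls = set()
    junk = []
    for s in sorted(sources, key=lambda src: src.created_at):
        title = (s.title or "").strip()
        if s.status in ("error", "unknown") or JUNK_TITLE.search(title):
            junk.append(s.id)
            continue
        if s.url:
            key = normalize_url(s.url)
            if key in seen_urls:
                junk.append(s.id)   # a later copy of an already-seen URL
                continue
            seen_urls.add(key)
    return junk
```

In this sketch a junk source does not register its URL as seen, so a healthy later copy of the same URL would be kept. Whether the PR behaves the same way is not stated.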
No actionable comments were generated in the recent review. 🎉
ℹ️ Recent review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID:
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 Walkthrough
Added a new
Changes
Sequence Diagram
sequenceDiagram
actor User
participant CLI as "CLI Handler"
participant Resolver as "Notebook Resolver"
participant SourceSvc as "Source Lister/Service"
participant Selector as "Candidate Selector"
participant Deleter as "Chunked Deleter"
participant Feedback as "User Feedback"
User->>CLI: source clean --notebook X [--dry-run] [--yes]
CLI->>Resolver: resolve_notebook_id(X)
Resolver-->>CLI: notebook_id
CLI->>SourceSvc: list_sources(notebook_id)
SourceSvc-->>CLI: sources[]
CLI->>Selector: identify_candidates(sources)
Note over Selector: Filter by status, title patterns,\nnormalized URL duplicates
Selector-->>CLI: candidate_ids[]
alt dry-run
CLI->>Feedback: report candidate count
else proceed
alt no --yes
CLI->>User: confirm deletion?
User-->>CLI: approval
end
CLI->>Deleter: delete_in_batches(candidate_ids, batch_size=10)
loop per batch
Deleter->>SourceSvc: delete_source(id)
SourceSvc-->>Deleter: success / error
Deleter->>Deleter: wait(short_delay)
end
Deleter-->>CLI: report successes & failures
CLI->>Feedback: final report
end
Feedback-->>User: operation complete
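The branching in the diagram (dry run, confirmation, delete) can be expressed as a small decision function. This is a sketch under assumptions: `decide_clean_action`, its return strings, and the `confirm` callback are hypothetical names for illustration and do not come from the PR.

```python
from typing import Callable, Sequence


def decide_clean_action(
    candidate_ids: Sequence[str],
    dry_run: bool,
    assume_yes: bool,
    confirm: Callable[[str], bool],
) -> str:
    """Return which branch of the flow applies for the given flags."""
    if not candidate_ids:
        return "nothing-to-do"          # e.g. report that the notebook is already clean
    if dry_run:
        return "preview"                # report the candidate count, delete nothing
    if not assume_yes and not confirm(f"Delete {len(candidate_ids)} sources?"):
        return "aborted"                # user declined the prompt
    return "delete"                     # proceed to batched deletion
```

Keeping the decision separate from the I/O makes each branch easy to test without a real notebook.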
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/notebooklm/cli/source.py`:
- Around line 982-992: The deletion loop currently swallows all errors by
calling asyncio.gather(..., return_exceptions=True) and always prints success;
update the loop around client.sources.delete / asyncio.gather to inspect the
gathered results, count actual successful deletions vs exceptions, and log or
print any failures (include exception messages and the corresponding sid) as
they occur; after the loop, change the console.print call to report the real
number of successful deletions (and optionally failed count) using the counted
successes so users aren’t shown a false “Successfully cleaned” message (see
delete_list, chunk_size, delete_tasks, and nb_id_resolved).
- Around line 945-950: The code incorrectly converts the enum s.status to a
string with str(...).lower(), so comparisons never match; replace that
conversion with the helper source_status_to_str(s.status) (handle None if
needed) and use that result to check if status is "error" or "unknown", then add
s.id to to_delete as before (look at variables s, to_delete, and function
source_status_to_str for where to change).
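A minimal sketch of what the first finding asks for, assuming a generic async `delete_one(sid)` coroutine in place of `client.sources.delete` (whose exact signature is not shown here). The name `delete_in_chunks` and its return shape are illustrative and not taken from the PR. The chunk size of 10 and the 0.5s delay mirror the values in the commit message.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, Tuple


async def delete_in_chunks(
    delete_one: Callable[[str], Awaitable[object]],
    source_ids: Sequence[str],
    chunk_size: int = 10,
    delay: float = 0.5,
) -> Tuple[List[str], List[Tuple[str, BaseException]]]:
    """Delete ids chunk by chunk and report which deletions actually succeeded."""
    succeeded: List[str] = []
    failed: List[Tuple[str, BaseException]] = []
    for start in range(0, len(source_ids), chunk_size):
        chunk = source_ids[start:start + chunk_size]
        # return_exceptions=True hands exceptions back as results instead of raising,
        # so every result must be inspected rather than assumed successful.
        results = await asyncio.gather(
            *(delete_one(sid) for sid in chunk), return_exceptions=True
        )
        for sid, result in zip(chunk, results):
            if isinstance(result, BaseException):
                failed.append((sid, result))
            else:
                succeeded.append(sid)
        # Pause between chunks, but not after the last one.
        if start + chunk_size < len(source_ids):
            await asyncio.sleep(delay)
    return succeeded, failed
```

The caller can then print `len(succeeded)` and list each `(sid, exception)` pair from `failed`, so the final message reflects what actually happened.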
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4ae7d06c-d332-4bb1-9a3a-f0412794bf36
📒 Files selected for processing (1)
src/notebooklm/cli/source.py
Code Review
This pull request introduces a new source clean command to the CLI, designed to automatically remove duplicate, error, and access-blocked sources from a NotebookLM notebook. The command identifies sources based on their status, title, and normalized URL, and includes options for dry-run and confirmation. A critical issue was identified in the status checking logic: the s.status enum is incorrectly converted to a string, which prevents sources in an error or unknown state from being properly detected and cleaned. The suggested fix is to use the source_status_to_str() helper function.
src/notebooklm/cli/source.py (Outdated)
    for s in sorted_sources:
        title = (s.title or "").strip()
        status = str(s.status).lower() if s.status else "unknown"
The current logic for checking the source status is incorrect. The s.status attribute is an integer from the SourceStatus enum. Converting it to a string via str(s.status) will produce a digit string (e.g., '3'), not the word 'error'. As a result, the check status in ["error", "unknown"] will never be true, and sources in an error state won't be cleaned up.
You should use the source_status_to_str() helper function, which is designed for this purpose and already used elsewhere in the file.
Suggested change:
    - status = str(s.status).lower() if s.status else "unknown"
    + status = source_status_to_str(s.status)
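The failure mode described in this comment can be reproduced in isolation. The sketch below is hedged: the `SourceStatus` members and their integer values are hypothetical, and `status_to_str` is only a stand-in for the project's `source_status_to_str` helper, whose real implementation is not shown in this thread.

```python
from enum import IntEnum


class SourceStatus(IntEnum):
    """Hypothetical enum; the real member names and values may differ."""
    PROCESSING = 1
    READY = 2
    ERROR = 3


_STATUS_NAMES = {
    SourceStatus.PROCESSING: "processing",
    SourceStatus.READY: "ready",
    SourceStatus.ERROR: "error",
}


def status_to_str(status) -> str:
    """Stand-in for source_status_to_str: map a status code to a word."""
    if status is None:
        return "unknown"
    try:
        return _STATUS_NAMES[SourceStatus(status)]
    except ValueError:
        return "unknown"   # unrecognized code


def is_junk_status_buggy(status) -> bool:
    # str() yields a digit string or an enum repr, never the bare word "error".
    text = str(status).lower() if status else "unknown"
    return text in ("error", "unknown")


def is_junk_status_fixed(status) -> bool:
    return status_to_str(status) in ("error", "unknown")
```

With these definitions, `is_junk_status_buggy(SourceStatus.ERROR)` is false while `is_junk_status_fixed(SourceStatus.ERROR)` is true, which is the behavior difference the review describes.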
- Use source_status_to_str() instead of str().lower() so error/unknown status sources are correctly identified and removed
- Track per-result success/failure from asyncio.gather so the output reflects what actually happened instead of always reporting success
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
After failed deep-research runs, bulk imports, or bot-blocked crawls, notebooks can accumulate junk sources: errored ones, gateway/anti-bot pages (Cloudflare, 403, 404), and duplicates. There is currently no way to clean these up in bulk — users must delete them one by one.
This PR adds `notebooklm source clean`, which automatically identifies and removes junk sources in one command.
What it removes
- `status` is `error` or `unknown`
- Title matches one of: `403`, `404`, `Forbidden`, `Access Denied`, `Just a Moment`, `Attention Required`, `Security Check`, `CAPTCHA`
Flags
Implementation notes
- `asyncio.gather` pattern consistent with other bulk operations in the codebase
Test plan
- `--dry-run` prints count and exits without deleting
- Sources with `status=error` are removed
- Sources with gateway titles (e.g. "403 Forbidden") are removed
- An already-clean notebook prints "Notebook is already clean"
- `-y` skips confirmation prompt
Summary by CodeRabbit
- `notebooklm source clean` CLI command to automatically remove problematic sources from notebooks.
- `--dry-run` to preview actions and `--yes` to skip confirmation.