RepoScan helps you build and grow repository datasets across Git hosting platforms.
It discovers repositories from search queries, imports repository URLs, clones results locally, and expands discovery from existing repositories with Claude-assisted topic and query generation.
RepoScan currently supports GitHub, GitLab, and Gitea.
It is designed for engineers and researchers who want a reproducible way to:
- build repository discovery pipelines
- collect and curate open-source project datasets
- expand a repository corpus from related topics and search queries
- enrich discovery runs with user-defined topic hints
RepoScan does not run vulnerability scans.
Most repository collection workflows start simple and become messy quickly:
- search queries drift over time
- imported URLs and cloned repositories spread across ad hoc scripts
- repeated search runs waste API budget
- "related project discovery" becomes manual and inconsistent
RepoScan gives that workflow a single, scriptable entry point. It keeps discovery, import, clone, and Claude-assisted expansion in one place, while staying friendly to local automation and repeatable command-line use.
- Multi-platform discovery: search GitHub, GitLab, and Gitea from one CLI
- URL import pipeline: ingest curated repository lists from plain text files
- Clone queue worker: process repositories in batches or run a long-lived foreground worker
- Claude-assisted expansion: analyze already cloned repositories and generate new topics, search queries, and seed repositories
- User-defined topic expansion: merge your own topic hints into agent-expand with --topic or --topic-file
- Reproducible workflow: use uv for install, execution, and verification
RepoScan is usable today for repository discovery, import, clone management, and Claude-assisted expansion.
The public repository is actively being shaped into a long-term open-source project. Expect iteration in documentation, packaging, and automation, but the command-line workflows described here are intended to stay practical and scriptable.
Install dependencies:
uv sync
uv sync --extra claude
Discover repositories:
uv run reposcan discover --platform github --query "topic:rag stars:>=1000"
Import curated URLs:
uv run reposcan import-urls --file mixed_urls.txt --label imported
Clone discovered repositories:
uv run reposcan clone-pending --limit 20
Expand discovery from existing cloned repositories:
uv run --extra claude reposcan agent-expand --limit 3 --topic rag --topic ai-agent
Inspect the current CLI surface at any time:
uv run reposcan --help
- Read CONTRIBUTING.md before opening pull requests.
- Review CODE_OF_CONDUCT.md for expected community behavior.
- Report security-sensitive issues through the process described in SECURITY.md.
- See CHANGELOG.md for public repository release notes.
The current project exposes these commands:
- discover
- import-urls
- clone-pending
- agent-expand
- agent-check
RepoScan uses uv for dependency management and execution.
uv sync
Install the optional Claude integration when using agent-expand:
uv sync --extra claude
If dependencies change:
uv lock
RepoScan loads configuration from config/reposcan.toml by default.
You can also point to another file with --config or REPOSCAN_CONFIG.
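The lookup described above can be pictured as a small resolution function. This is a minimal sketch, not RepoScan's actual code: the function name resolve_config_path is hypothetical, and the ordering (flag first, then environment variable, then default) is an assumption, since the text does not say which of --config and REPOSCAN_CONFIG wins when both are set.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Default path named in this README.
DEFAULT_CONFIG = Path("config/reposcan.toml")


def resolve_config_path(
    cli_value: Optional[str],
    env: Mapping[str, str] = os.environ,
) -> Path:
    """Pick a config file: --config, then REPOSCAN_CONFIG, then the default.

    Hypothetical helper. The flag-over-environment ordering is an assumption.
    """
    if cli_value:
        return Path(cli_value)
    env_value = env.get("REPOSCAN_CONFIG", "").strip()
    if env_value:
        return Path(env_value)
    return DEFAULT_CONFIG
```

Passing the environment as a parameter keeps the function easy to exercise without touching the real process environment.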
Use the example file as a starting point:
cp config/reposcan.toml.example config/reposcan.toml
Use the standalone template when you want an independent database:
cp config/standalone.toml config/reposcan.toml
Or pass it explicitly:
uv run reposcan --config config/standalone.toml discover --platform github --query "topic:rag stars:>=1000"
Create a local .env from the template:
cp .env.example .env
Typical variables:
export DB_PASSWORD="..."
export GITHUB_TOKEN="..."
export GITHUB_TOKENS="token1,token2"
export GITLAB_TOKEN="..."
export GITEA_TOKEN="..."
export ANTHROPIC_API_KEY="..."
export ANTHROPIC_BASE_URL="..."
export CLAUDE_MODEL="..."
export CLAUDE_MAX_TURNS="30"
export CLAUDE_PERMISSION_MODE="bypassPermissions"
Notes:
- .env is loaded automatically from the project root.
- GITHUB_TOKENS takes precedence over GITHUB_TOKEN when set.
- CLAUDE_MODEL is applied through Claude Code model environment aliases as well.
- logs/ and .env should stay out of version control.
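The precedence between GITHUB_TOKENS and GITHUB_TOKEN can be illustrated with a short sketch. The helper name resolve_github_tokens is hypothetical, and two details are assumptions rather than documented behavior: surrounding whitespace is trimmed, and a GITHUB_TOKENS value with no usable entries is treated as unset.

```python
from typing import List, Mapping


def resolve_github_tokens(env: Mapping[str, str]) -> List[str]:
    """Return the GitHub tokens to use, preferring the comma-separated list.

    Hypothetical helper; RepoScan's own parsing may differ.
    """
    # Split the comma-separated list and drop empty entries.
    multi = [t.strip() for t in env.get("GITHUB_TOKENS", "").split(",") if t.strip()]
    if multi:
        return multi
    # Fall back to the single-token variable.
    single = env.get("GITHUB_TOKEN", "").strip()
    return [single] if single else []
```

For example, with GITHUB_TOKENS="token1,token2" and GITHUB_TOKEN also set, this sketch returns only the two list entries.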
uv run reposcan discover --platform github --query "topic:rag stars:>=1000"
uv run reposcan discover --platform gitlab --query "ai agent"
uv run reposcan discover --platform gitea --query "rag"
uv run reposcan import-urls --file mixed_urls.txt --label imported
Example mixed_urls.txt:
https://github.com/owner/repo
https://gitlab.com/group/project
https://gitea.example.com/owner/repo
git@github.com:owner/repo.git
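The example file mixes HTTPS URLs with an SCP-style SSH address. As a rough illustration of what normalizing such lines involves, the sketch below reduces each one to a host and a repository path. RepoRef and parse_repo_url are hypothetical names, and nothing here is taken from RepoScan's import code. Lines that do not look like repository URLs are assumed to be skipped.

```python
import re
from typing import NamedTuple, Optional
from urllib.parse import urlparse


class RepoRef(NamedTuple):
    host: str
    path: str  # e.g. "owner/repo" or "group/project"


# SCP-style SSH address such as git@github.com:owner/repo.git
_SCP_STYLE = re.compile(r"^[\w.-]+@([\w.-]+):(.+)$")


def parse_repo_url(line: str) -> Optional[RepoRef]:
    """Parse one line of a URL list; return None when it is not a repo URL."""
    text = line.strip()
    match = _SCP_STYLE.match(text)
    if match:
        host, path = match.group(1), match.group(2)
    else:
        parsed = urlparse(text)
        if parsed.scheme not in ("http", "https") or not parsed.hostname:
            return None
        host, path = parsed.hostname, parsed.path
    path = path.strip("/")
    if path.endswith(".git"):
        path = path[: -len(".git")]
    # A repository path needs at least an owner and a name.
    if "/" not in path:
        return None
    return RepoRef(host.lower(), path)
```

Reducing both URL styles to the same pair makes it straightforward to detect that an HTTPS and an SSH entry point at the same repository.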
Run one batch:
uv run reposcan clone-pending --limit 20
Run as a foreground worker:
uv run reposcan clone-pending
Without --limit, RepoScan checks immediately, drains the current queue, sleeps for one hour, and checks again until stopped. On SIGINT or SIGTERM, it finishes the current repository and exits cleanly.
By default, cloned repositories are stored under ./data/repos/.
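The foreground worker behavior (check, drain, sleep for an hour, stop cleanly on a signal) follows a common pattern that can be sketched with a threading.Event. This is an illustrative sketch only: run_worker, install_signal_handlers, fetch_pending, and clone_one are hypothetical names, and RepoScan's real worker may be structured differently.

```python
import signal
import threading
from typing import Callable, Iterable

POLL_INTERVAL_SECONDS = 3600  # one hour between queue checks


def run_worker(
    fetch_pending: Callable[[], Iterable[str]],
    clone_one: Callable[[str], None],
    stop: threading.Event,
    poll_interval: float = POLL_INTERVAL_SECONDS,
) -> None:
    """Drain the queue, sleep, and repeat until the stop event is set."""
    while not stop.is_set():
        for repo in fetch_pending():
            clone_one(repo)  # the in-flight repository always finishes
            if stop.is_set():
                return
        # Sleeps up to poll_interval, but wakes early if stop is set.
        stop.wait(poll_interval)


def install_signal_handlers(stop: threading.Event) -> None:
    """Translate SIGINT and SIGTERM into a stop request (main thread only)."""
    for signum in (signal.SIGINT, signal.SIGTERM):
        signal.signal(signum, lambda _signum, _frame: stop.set())
```

Because the handlers only set a flag, a signal never interrupts a clone midway. The loop checks the flag between repositories, and the interruptible wait avoids a full hour of delay before exit.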
agent-expand runs Claude in read-only mode over already cloned repositories.
Claude generates:
- topics
- search_queries
- seed_repositories
RepoScan then:
- verifies seed repositories through the GitHub API before import
- normalizes generated search queries to the configured --min-stars floor
- records GitHub query runs in reposcan_github_query_runs
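One way to picture the --min-stars normalization step is as a rewrite of the stars: qualifier in a GitHub search query. The sketch below is an assumption about what such normalization could look like, not RepoScan's implementation. apply_min_stars is a hypothetical name, only the first stars: qualifier is considered, and any form other than >=N or >N (for example a range) is simply replaced by the floor.

```python
import re

# A whitespace-delimited "stars:..." qualifier.
_STARS_TOKEN = re.compile(r"(?<!\S)stars:(\S+)")
# Lower-bound forms this sketch understands: >=N and >N.
_LOWER_BOUND = re.compile(r"(>=|>)(\d+)")


def apply_min_stars(query: str, min_stars: int) -> str:
    """Ensure the query carries a stars lower bound of at least min_stars."""
    floor = f"stars:>={min_stars}"
    match = _STARS_TOKEN.search(query)
    if match is None:
        # No stars qualifier at all: append the floor.
        return f"{query.strip()} {floor}".strip()
    bound = _LOWER_BOUND.fullmatch(match.group(1))
    if bound:
        # ">N" is equivalent to ">=N+1" for integer star counts.
        value = int(bound.group(2)) + (1 if bound.group(1) == ">" else 0)
        if value >= min_stars:
            return query.strip()  # already at or above the floor
    # Too low, or a form this sketch does not interpret: replace it.
    return (query[: match.start()] + floor + query[match.end():]).strip()
```

With a floor of 1000, "topic:rag" becomes "topic:rag stars:>=1000", while "topic:rag stars:>=5000" is left unchanged because it is already stricter.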
Basic usage:
uv run --extra claude reposcan agent-expand --limit 3 --dry-run
uv run --extra claude reposcan agent-expand --limit 3
uv run --extra claude reposcan agent-expand --limit 3 --per-topic 20 --per-query 20
uv run --extra claude reposcan agent-expand --limit 3 --force
uv run --extra claude reposcan --log-level DEBUG agent-expand --limit 1
RepoScan can merge user-defined topics into the Claude expansion flow while keeping the repository-driven analysis intact.
Use repeatable --topic flags:
uv run --extra claude reposcan agent-expand --limit 3 --topic rag --topic ai-agent
Or load topics from a file:
uv run --extra claude reposcan agent-expand --limit 3 --topic-file custom_topics.txt
Behavior:
- RepoScan still analyzes each cloned repository first.
- Claude still generates topics, queries, and seed repositories from repository contents.
- User topics are passed into the discovery prompt as extra guidance.
- Claude-generated topics and user topics are merged with case-insensitive de-duplication.
- Topic-based GitHub searches run against the merged set.
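The merge step in the list above amounts to an order-preserving, case-insensitive de-duplication. A minimal sketch, assuming generated topics come first and the first spelling seen is the one kept (merge_topics is a hypothetical name, and the ordering is an assumption):

```python
from typing import Iterable, List


def merge_topics(generated: Iterable[str], user: Iterable[str]) -> List[str]:
    """Merge two topic lists, dropping case-insensitive duplicates."""
    merged: List[str] = []
    seen = set()
    for topic in [*generated, *user]:
        cleaned = topic.strip()
        key = cleaned.casefold()  # case-insensitive comparison key
        if cleaned and key not in seen:
            seen.add(key)
            merged.append(cleaned)
    return merged
```

Under these assumptions, merging the generated topics RAG and ai-agent with the user topics rag and vector-db yields RAG, ai-agent, vector-db.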
RepoScan uses loguru for terminal output and JSONL debug logs.
- normal runs show concise progress
- --log-level DEBUG exposes more diagnostics in the terminal
- every CLI run writes a debug log file under logs/
Example:
uv run --extra claude reposcan --log-level DEBUG agent-expand --limit 1
Recommended local verification commands:
uv run python -m unittest discover -s tests
uv run python -m compileall -q src tests
uv run reposcan --help
You can also inspect the command-specific interface:
uv run reposcan agent-expand --help
This public repository intentionally excludes local agent-instruction files, private environment files, runtime logs, and internal workflow artifacts.
RepoScan is released under the MIT License.