3em0/reposcan-public

RepoScan

RepoScan helps you build and grow repository datasets across Git hosting platforms.

It discovers repositories from search queries, imports repository URLs, clones results locally, and expands discovery from existing repositories with Claude-assisted topic and query generation.

RepoScan currently supports GitHub, GitLab, and Gitea.

It is designed for engineers and researchers who want a reproducible way to:

  • build repository discovery pipelines
  • collect and curate open-source project datasets
  • expand a repository corpus from related topics and search queries
  • enrich discovery runs with user-defined topic hints

RepoScan does not run vulnerability scans.

Why RepoScan

Most repository collection workflows start simple and become messy quickly:

  • search queries drift over time
  • imported URLs and cloned repositories spread across ad hoc scripts
  • repeated search runs waste API budget
  • "related project discovery" becomes manual and inconsistent

RepoScan gives that workflow a single, scriptable entry point. It keeps discovery, import, clone, and Claude-assisted expansion in one place, while staying friendly to local automation and repeatable command-line use.

Highlights

  • Multi-platform discovery: search GitHub, GitLab, and Gitea from one CLI
  • URL import pipeline: ingest curated repository lists from plain text files
  • Clone queue worker: process repositories in batches or run a long-lived foreground worker
  • Claude-assisted expansion: analyze already cloned repositories and generate new topics, search queries, and seed repositories
  • User-defined topic expansion: merge your own topic hints into agent-expand with --topic or --topic-file
  • Reproducible workflow: use uv for install, execution, and verification

Project Status

RepoScan is usable today for repository discovery, import, clone management, and Claude-assisted expansion.

The public repository is actively being shaped into a long-term open-source project. Expect iteration in documentation, packaging, and automation, but the command-line workflows described here are intended to stay practical and scriptable.

Quick Start

Install dependencies:

uv sync
uv sync --extra claude

Discover repositories:

uv run reposcan discover --platform github --query "topic:rag stars:>=1000"

Import curated URLs:

uv run reposcan import-urls --file mixed_urls.txt --label imported

Clone discovered repositories:

uv run reposcan clone-pending --limit 20

Expand discovery from existing cloned repositories:

uv run --extra claude reposcan agent-expand --limit 3 --topic rag --topic ai-agent

Inspect the current CLI surface at any time:

uv run reposcan --help

CLI Surface

The current project exposes these commands:

  • discover
  • import-urls
  • clone-pending
  • agent-expand
  • agent-check

Install

RepoScan uses uv for dependency management and execution.

uv sync

Install the optional Claude integration when using agent-expand:

uv sync --extra claude

If dependencies change, regenerate the lockfile:

uv lock

Configuration

RepoScan loads configuration from config/reposcan.toml by default. You can also point to another file with --config or REPOSCAN_CONFIG.
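
The lookup described above can be sketched in a few lines of Python. This is only an illustration, not RepoScan's actual code: the function name `resolve_config_path` is hypothetical, and the assumption that `--config` wins over `REPOSCAN_CONFIG` when both are given is not stated in this README.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Default location named in this README.
DEFAULT_CONFIG = Path("config/reposcan.toml")


def resolve_config_path(
    cli_value: Optional[str],
    env: Mapping[str, str] = os.environ,
) -> Path:
    """Pick a config file: --config, then REPOSCAN_CONFIG, then the default.

    The ordering between the flag and the variable is an assumption.
    """
    if cli_value:
        return Path(cli_value)
    env_value = env.get("REPOSCAN_CONFIG")
    if env_value:
        return Path(env_value)
    return DEFAULT_CONFIG
```

Passing the environment in as a mapping keeps the helper easy to exercise without touching the real process environment.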

AgentScan-style config

Use the example file as a starting point:

cp config/reposcan.toml.example config/reposcan.toml

Standalone config

Use the standalone template when you want an independent database:

cp config/standalone.toml config/reposcan.toml

Or pass it explicitly:

uv run reposcan --config config/standalone.toml discover --platform github --query "topic:rag stars:>=1000"

Environment

Create a local .env from the template:

cp .env.example .env

Typical variables:

export DB_PASSWORD="..."
export GITHUB_TOKEN="..."
export GITHUB_TOKENS="token1,token2"
export GITLAB_TOKEN="..."
export GITEA_TOKEN="..."
export ANTHROPIC_API_KEY="..."
export ANTHROPIC_BASE_URL="..."
export CLAUDE_MODEL="..."
export CLAUDE_MAX_TURNS="30"
export CLAUDE_PERMISSION_MODE="bypassPermissions"

Notes:

  • .env is loaded automatically from the project root.
  • When GITHUB_TOKENS is set, it takes precedence over GITHUB_TOKEN.
  • CLAUDE_MODEL is also applied through the Claude Code model environment aliases.
  • logs/ and .env should stay out of version control.
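
The token precedence rule in the notes above might look roughly like the following. It is a sketch under assumptions: the helper name `github_tokens` is hypothetical, and the comma-separated format is inferred from the `token1,token2` example rather than from RepoScan's source.

```python
import os
from typing import List, Mapping


def github_tokens(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the GitHub tokens to use, preferring GITHUB_TOKENS."""
    # Assumed format: comma-separated values, blanks ignored.
    multi = env.get("GITHUB_TOKENS", "")
    tokens = [part.strip() for part in multi.split(",") if part.strip()]
    if tokens:
        return tokens
    # Fall back to the single-token variable only when the list is empty.
    single = env.get("GITHUB_TOKEN", "").strip()
    return [single] if single else []
```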

Core Workflows

Discover repositories

uv run reposcan discover --platform github --query "topic:rag stars:>=1000"
uv run reposcan discover --platform gitlab --query "ai agent"
uv run reposcan discover --platform gitea --query "rag"

Import repository URLs

uv run reposcan import-urls --file mixed_urls.txt --label imported

Example mixed_urls.txt:

https://github.com/owner/repo
https://gitlab.com/group/project
https://gitea.example.com/owner/repo
git@github.com:owner/repo.git
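
The example file mixes HTTPS URLs with an SCP-style SSH URL. One way to reduce both forms to a host and repository path is sketched below. The function `parse_repo_url` is hypothetical and does not describe how import-urls parses its input; it only shows that the two URL shapes can be handled uniformly.

```python
import re
from typing import Optional, Tuple
from urllib.parse import urlparse

# Matches SCP-style SSH URLs such as git@github.com:owner/repo.git
SCP_STYLE = re.compile(r"^git@(?P<host>[^:]+):(?P<path>.+)$")


def parse_repo_url(line: str) -> Optional[Tuple[str, str]]:
    """Return (host, 'owner/repo') for one line, or None if unrecognized."""
    text = line.strip()
    if not text:
        return None
    match = SCP_STYLE.match(text)
    if match:
        host, path = match.group("host"), match.group("path")
    else:
        parsed = urlparse(text)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            return None
        host, path = parsed.netloc, parsed.path
    path = path.strip("/")
    if path.endswith(".git"):
        path = path[: -len(".git")]
    # Require at least an owner and a repository segment.
    return (host.lower(), path) if "/" in path else None
```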

Clone repositories

Run one batch:

uv run reposcan clone-pending --limit 20

Run as a foreground worker:

uv run reposcan clone-pending

Without --limit, RepoScan checks immediately, drains the current queue, sleeps for one hour, and checks again until stopped. On SIGINT or SIGTERM, it finishes the current repository and exits cleanly.
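
The worker behavior described above (check, drain, sleep, repeat, and stop between repositories on a signal) follows a common pattern. The sketch below illustrates that pattern with hypothetical names (`run_worker`, `fetch_pending`, `clone_one`); it is not RepoScan's implementation.

```python
import signal
import threading
from typing import Callable, Iterable


def run_worker(
    fetch_pending: Callable[[], Iterable[str]],
    clone_one: Callable[[str], None],
    stop: threading.Event,
    interval_seconds: float = 3600.0,
) -> None:
    """Drain the queue, sleep, and repeat until `stop` is set."""
    while not stop.is_set():
        for repo in fetch_pending():
            # The flag is checked only between repositories, so a clone
            # that is already running is allowed to finish.
            if stop.is_set():
                break
            clone_one(repo)
        # Event.wait returns early when the flag is set, so shutdown
        # does not have to wait out the full interval.
        stop.wait(interval_seconds)


def install_signal_handlers(stop: threading.Event) -> None:
    """Set the stop flag on SIGINT or SIGTERM (call from the main thread)."""
    for sig in (signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, lambda signum, frame: stop.set())
```

Using an event rather than raising from the signal handler is what lets the current repository complete before the loop exits.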

By default, cloned repositories are stored under ./data/repos/.

Expand discovery from existing cloned repositories

agent-expand runs Claude in read-only mode over already cloned repositories. Claude generates:

  • topics
  • search_queries
  • seed_repositories

RepoScan then:

  • verifies seed repositories through the GitHub API before import
  • normalizes generated search queries to the configured --min-stars floor
  • records GitHub query runs in reposcan_github_query_runs
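
As an illustration of the query normalization step, the sketch below rewrites a GitHub search query so that it carries a single `stars:>=N` qualifier. This README does not specify the exact rule, so the behavior here (drop any existing `stars:` qualifier and append the floor) and the function name `apply_min_stars` are assumptions.

```python
import re

# Any existing stars qualifier, e.g. stars:>10 or stars:50..200
STARS = re.compile(r"\bstars:\S+", re.IGNORECASE)


def apply_min_stars(query: str, min_stars: int) -> str:
    """Replace any stars qualifier with a stars:>=min_stars floor."""
    without_stars = STARS.sub("", query)
    # Collapse whitespace left behind by the removed qualifier.
    base = " ".join(without_stars.split())
    return f"{base} stars:>={min_stars}".strip()
```

A stricter variant could keep a generated qualifier that is already above the floor; the simple replacement is used here to keep the example short.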

Basic usage:

uv run --extra claude reposcan agent-expand --limit 3 --dry-run
uv run --extra claude reposcan agent-expand --limit 3
uv run --extra claude reposcan agent-expand --limit 3 --per-topic 20 --per-query 20
uv run --extra claude reposcan agent-expand --limit 3 --force
uv run --extra claude reposcan --log-level DEBUG agent-expand --limit 1

Add user-defined topics to agent-expand

RepoScan can merge user-defined topics into the Claude expansion flow while keeping the repository-driven analysis intact.

Use repeatable --topic flags:

uv run --extra claude reposcan agent-expand --limit 3 --topic rag --topic ai-agent

Or load topics from a file:

uv run --extra claude reposcan agent-expand --limit 3 --topic-file custom_topics.txt

Behavior:

  • RepoScan still analyzes each cloned repository first.
  • Claude still generates topics, queries, and seed repositories from repository contents.
  • User topics are passed into the discovery prompt as extra guidance.
  • Claude-generated topics and user topics are merged with case-insensitive de-duplication.
  • Topic-based GitHub searches run against the merged set.
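
The case-insensitive merge in the list above can be sketched as follows. The function name `merge_topics` is hypothetical, and the ordering (Claude-generated topics first, first spelling wins) is an assumption rather than documented behavior.

```python
from typing import Iterable, List


def merge_topics(generated: Iterable[str], user: Iterable[str]) -> List[str]:
    """Merge two topic lists, dropping case-insensitive duplicates."""
    seen = set()
    merged: List[str] = []
    for topic in [*generated, *user]:
        cleaned = topic.strip()
        key = cleaned.casefold()  # case-insensitive comparison key
        if cleaned and key not in seen:
            seen.add(key)
            merged.append(cleaned)
    return merged
```

For example, merging `["RAG", "ai-agent"]` with `["rag", "vector-db"]` keeps `RAG` once and adds `vector-db`.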

Logging

RepoScan uses loguru for terminal output and JSONL debug logs.

  • normal runs show concise progress
  • --log-level DEBUG exposes more diagnostics in the terminal
  • every CLI run writes a debug log file under logs/

Example:

uv run --extra claude reposcan --log-level DEBUG agent-expand --limit 1

Verification

Recommended local verification commands:

uv run python -m unittest discover -s tests
uv run python -m compileall -q src tests
uv run reposcan --help

You can also inspect the command-specific interface:

uv run reposcan agent-expand --help

Repository Notes

This public repository intentionally excludes local agent-instruction files, private environment files, runtime logs, and internal workflow artifacts.

License

RepoScan is released under the MIT License.