teabranch/agentic-code-indexer

🤖 Agentic Code Indexer

An intelligent code analysis and graph-based indexing system that creates a comprehensive, searchable representation of your codebase using Neo4j, LLMs, and semantic embeddings.

✨ Features

  • 🌐 Multi-language Support: Python, C#, JavaScript/TypeScript
  • πŸ“Š Graph-based Representation: Rich code relationships in Neo4j
  • 🧠 LLM-powered Summarization: Hierarchical code summaries using Claude
  • πŸ” Hybrid Search System: Vector similarity + entity lookup + graph context expansion
  • 🎯 Natural Language Queries: "Find authentication methods" or "PaymentService class"
  • πŸ•ΈοΈ GraphRAG Context: Expand search results with related code relationships
  • 🌐 REST API: FastAPI-based search API with interactive documentation
  • ⚑ Incremental Processing: Change detection with SHA-256 checksums
  • πŸš€ Concurrent Processing: Async/await for high-performance indexing
  • 🎨 Beautiful CLI: Rich terminal interface with progress tracking

🏗️ Architecture

┌─────────────────────┐    ┌─────────────────────┐    ┌─────────────────────┐
│   File Traversal    │───▶│  Language Chunkers  │───▶│   Graph Ingestion   │
│  & Change Detection │    │  (Python/C#/JS/TS)  │    │     (Neo4j)         │
└─────────────────────┘    └─────────────────────┘    └─────────────────────┘
                                                                  │
┌─────────────────────┐    ┌─────────────────────┐                │
│  Embedding Gen.     │◀───│  LLM Summarization  │◀───────────────┘
│ (Jina Embeddings)   │    │  (Anthropic Claude) │
└─────────────────────┘    └─────────────────────┘

🚀 Quick Start

Prerequisites

  • Python 3.8+
  • Node.js 18+
  • .NET 6+
  • Neo4j Database
  • Anthropic API Key (optional, for LLM features)

Installation

  1. Clone the repository
git clone https://github.com/teabranch/agentic-code-indexer.git
cd agentic-code-indexer
  2. Start Neo4j Database
docker-compose up -d
  3. Install Python dependencies
cd src/agentic_code_indexer
pip install -r requirements.txt
  4. Set up environment variables
export ANTHROPIC_API_KEY="your-api-key-here"  # Optional
export NEO4J_PASSWORD="your-neo4j-password"

Basic Usage

  1. Index your codebase
# Index current directory with database initialization
python -m agentic_code_indexer index . --init-db

# Index specific directory
python -m agentic_code_indexer index /path/to/your/project

# Skip LLM features (faster, no API required)
python -m agentic_code_indexer index . --skip-llm
  2. Check indexing status
python -m agentic_code_indexer status
  3. Generate summaries and embeddings
python -m agentic_code_indexer summarize
  4. Search your codebase
# Natural language search
python -m agentic_code_indexer search "authentication methods"
python -m agentic_code_indexer search "PaymentService class" --types Class
python -m agentic_code_indexer search "error handling" --context --code

# Explain how a query would be processed
python -m agentic_code_indexer explain "user authentication"

# Start the search API server
python -m agentic_code_indexer api --host 0.0.0.0 --port 8000
# Then visit http://localhost:8000/docs for interactive API documentation

📚 Component Overview

Phase 1: Foundation (✅ Complete)

  • 🗃️ Neo4j Database Setup: Comprehensive schema with constraints and vector indexes
  • 🐍 Python Chunker: LibCST-based AST analysis with scope resolution
  • 🔷 C# Chunker: Microsoft.CodeAnalysis with semantic symbol resolution
  • 🟨 JavaScript/TypeScript Chunker: TypeScript Compiler API + Acorn parser
  • 📋 Common Data Format: Pydantic models for cross-language compatibility

Phase 2: Main Pipeline (✅ Complete)

  • 📁 File Traversal: Recursive directory scanning with change detection
  • 🔄 Chunker Orchestration: Coordinates all language-specific chunkers
  • 📊 Graph Ingestion: Efficient batched Neo4j operations with MERGE clauses
  • 🧠 Hierarchical Summarization: Bottom-up LLM processing (Parameters → Variables → Methods → Classes → Files)
  • 🔍 Embedding Generation: Local vector generation using Jina embeddings
  • ⚙️ Transaction Management: Error handling, retry mechanisms, and batch optimization
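
The bottom-up order named in the Hierarchical Summarization bullet can be illustrated with a small sketch. The `Node` class and `order_for_summarization` function are hypothetical names used only for this illustration; they are not taken from the project's code, which may schedule the work differently.

```python
from dataclasses import dataclass

# Summarization order taken from the bullet above: leaves first, files last.
LEVELS = ["Parameter", "Variable", "Method", "Class", "File"]


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a graph node awaiting a summary."""
    node_id: str
    label: str  # one of LEVELS


def order_for_summarization(nodes):
    """Return nodes sorted so that lower levels are summarized first."""
    rank = {label: i for i, label in enumerate(LEVELS)}
    # sorted() is stable, so nodes of the same level keep their input order.
    return sorted(nodes, key=lambda n: rank[n.label])


nodes = [
    Node("f1", "File"),
    Node("m1", "Method"),
    Node("p1", "Parameter"),
    Node("c1", "Class"),
    Node("v1", "Variable"),
]
ordered = order_for_summarization(nodes)
print([n.label for n in ordered])
```

Processing in this order means that, by the time a class or file is summarized, the summaries of the elements it contains are already available as input.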

Phase 3: Retrieval System (✅ Complete)

  • 🔍 Vector Search Engine: Semantic similarity search using Neo4j vector indexes
  • 🕸️ Graph Traversal Engine: GraphRAG-style context expansion with relationship following
  • 🎯 Hybrid Search System: Combines vector similarity, entity lookup, and graph context
  • 🤖 Query Intent Parsing: Intelligent analysis of natural language queries
  • 📊 Call & Inheritance Hierarchy: Analyze method calls and class inheritance patterns
  • 🌐 REST API: FastAPI-based search API with comprehensive endpoints
  • 💻 Interactive CLI: Rich terminal search interface with explanations

🛠️ Advanced Usage

Custom Configuration

# Use different Neo4j instance
python -m agentic_code_indexer index . \
  --neo4j-uri bolt://your-server:7687 \
  --neo4j-user your-username \
  --neo4j-password your-password

# Adjust performance settings
python -m agentic_code_indexer index . \
  --max-concurrent 10 \
  --batch-size 2000

# Verbose logging
python -m agentic_code_indexer index . --verbose
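
To give a feel for what the `--max-concurrent` and `--batch-size` settings control, here is a minimal sketch of bounded concurrency followed by batching. It assumes an asyncio-based worker pool; `process_all`, `batched`, and `fake_chunker` are hypothetical names for this illustration and do not describe the indexer's actual internals.

```python
import asyncio


def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


async def process_all(paths, max_concurrent, worker):
    """Run worker(path) for every path, with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(path):
        async with semaphore:
            return await worker(path)

    # gather() returns results in the same order as the input paths.
    return await asyncio.gather(*(guarded(p) for p in paths))


async def fake_chunker(path):
    """Placeholder worker; a real one would parse the file."""
    await asyncio.sleep(0)
    return {"path": path}


results = asyncio.run(process_all(["a.py", "b.cs", "c.ts"], 2, fake_chunker))
batches = list(batched(results, 2))
print([len(b) for b in batches])
```

A higher concurrency limit keeps more files in flight at once, while a larger batch size means fewer, larger writes to the database; the best values depend on the machine and the Neo4j instance.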

Recovery Operations

# Reset processing status (if interrupted)
python -m agentic_code_indexer reset --confirm

# Re-run just summarization
python -m agentic_code_indexer summarize --batch-size 50

📊 Database Schema

The system creates a rich graph model in Neo4j:

Node Types

  • File: Source code files with checksums and metadata
  • Class/Interface: Type definitions with inheritance relationships
  • Method/Function: Callable code elements with parameters
  • Variable/Parameter: Data elements with type information
  • Import: Dependency declarations

Relationships

  • CONTAINS: Hierarchical containment (File → Class → Method)
  • DEFINES: Definition relationships (Class → Method)
  • CALLS: Function/method invocations
  • EXTENDS/IMPLEMENTS: Inheritance relationships
  • IMPORTS: Module dependencies
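
As an illustration of how a relationship such as CALLS can be followed to a limited depth, the sketch below walks an in-memory adjacency map breadth-first. The `callees_within` function and the sample edges are hypothetical; in the real system this traversal would run against Neo4j rather than a Python dictionary.

```python
from collections import deque


def callees_within(calls, start, max_depth):
    """Breadth-first walk over CALLS edges, returning {node: depth}."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if depths[current] == max_depth:
            continue  # do not expand beyond the requested depth
        for callee in calls.get(current, []):
            if callee not in depths:  # skip visited nodes, which also handles cycles
                depths[callee] = depths[current] + 1
                queue.append(callee)
    del depths[start]  # report only the nodes reached from start
    return depths


# Hypothetical CALLS edges: caller -> list of callees.
calls = {
    "login": ["validate", "create_session"],
    "validate": ["hash_password"],
    "hash_password": ["login"],  # cycle back to the start
}
print(callees_within(calls, "login", 2))
```

Tracking the depth of each visited node both enforces the depth limit and prevents infinite loops when the call graph contains recursion.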

Vector Indexes

  • 768-dimensional embeddings on all major node types
  • Cosine similarity for semantic search
  • Optimized for jina-embeddings-v2-base-code model
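
For readers unfamiliar with cosine similarity, the following sketch computes it in plain Python and ranks a few candidates against a query vector. The vectors are toy 3-dimensional values chosen for readability, and `top_k` is a hypothetical helper; the actual search is performed by the Neo4j vector index, not by code like this.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def top_k(query, candidates, k):
    """Rank (name, vector) pairs by similarity to query, best first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dimensional vectors; the real index uses 768 dimensions.
query = [1.0, 0.0, 0.0]
candidates = [
    ("same_direction", [2.0, 0.0, 0.0]),
    ("orthogonal", [0.0, 1.0, 0.0]),
    ("opposite", [-1.0, 0.0, 0.0]),
]
ranked = top_k(query, candidates, 2)
print([name for name, _ in ranked])
```

Because cosine similarity depends only on direction, `same_direction` scores highest even though its vector is twice as long as the query.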

🔍 Search Examples

Natural Language Search

# Find authentication-related code
python -m agentic_code_indexer search "user authentication login"

# Search for specific classes
python -m agentic_code_indexer search "PaymentService" --types Class

# Find error handling patterns
python -m agentic_code_indexer search "exception handling try catch" --context

# Search with source code included
python -m agentic_code_indexer search "database connection" --code --verbose

# Explain search strategy
python -m agentic_code_indexer explain "API rate limiting middleware"

REST API Examples

# Start the API server
python -m agentic_code_indexer api

# Search via HTTP
curl -X POST "http://localhost:8000/search" \
  -H "Content-Type: application/json" \
  -d '{"query": "authentication methods", "max_results": 5, "include_context": true}'

# Get call hierarchy for a method
curl -X POST "http://localhost:8000/hierarchy/call" \
  -H "Content-Type: application/json" \
  -d '{"node_id": "method_123", "direction": "both", "max_depth": 2}'

# Get inheritance hierarchy for a class
curl -X POST "http://localhost:8000/hierarchy/inheritance" \
  -H "Content-Type: application/json" \
  -d '{"node_id": "class_456"}'
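
The same search call can be prepared from Python using only the standard library. This is a sketch that mirrors the curl example above: the endpoint path and JSON fields are copied from it, while `build_search_request` is a hypothetical helper and the response format is not assumed.

```python
import json
import urllib.request


def build_search_request(query, max_results=5, include_context=True,
                         base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for the /search endpoint."""
    payload = {
        "query": query,
        "max_results": max_results,
        "include_context": include_context,
    }
    return urllib.request.Request(
        url=f"{base_url}/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("authentication methods")
print(request.get_method(), request.full_url)

# To send it against a running server (the response shape is not documented here):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Separating request construction from sending makes the payload easy to inspect before the API server is running.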

Example Cypher Queries

// Find all classes that implement a specific interface
MATCH (c:Class)-[:IMPLEMENTS]->(i:Interface {name: "IUserRepository"})
RETURN c.name, c.generated_summary

// Semantic search for payment-related code
CALL db.index.vector.queryNodes('embedding_index', 10, $payment_embedding)
YIELD node, score
RETURN node.name, node.generated_summary, score

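// Find potentially complex methods (rough heuristic: the source contains both "if" and "for" as substrings; this is not a true cyclomatic-complexity measure)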
MATCH (m:Method)
WHERE m.raw_code CONTAINS "if" AND m.raw_code CONTAINS "for"
RETURN m.full_name, m.generated_summary
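
The substring check in the last query also matches identifiers such as `notify` or `format`. For Python sources, a closer approximation of cyclomatic complexity can be computed by counting decision points in the syntax tree. The sketch below is one possible approach using the standard `ast` module; `approx_complexity` is a hypothetical helper, it is not part of the indexer, and it does not apply to C# or JavaScript code.

```python
import ast

# Node types counted as decision points in this approximation.
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                 ast.IfExp, ast.comprehension)


def approx_complexity(source):
    """Approximate cyclomatic complexity: 1 + number of decision points."""
    tree = ast.parse(source)
    count = 1
    for node in ast.walk(tree):
        if isinstance(node, _BRANCH_NODES):
            count += 1
        elif isinstance(node, ast.BoolOp):
            # 'a and b and c' adds two extra paths
            count += len(node.values) - 1
    return count


sample = '''
def classify(items):
    result = []
    for item in items:
        if item > 0 and item % 2 == 0:
            result.append("even")
        elif item > 0:
            result.append("odd")
    return result
'''
print(approx_complexity(sample))  # 5: base 1 + for + if + and + elif
```

A function that merely calls `notify(x)` scores 1 under this measure, whereas a substring test for "if" would flag it.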

🧪 Testing

Each chunker has its own test suite:

# Test Python chunker
cd src/python-chunker && python -m pytest

# Test C# chunker  
cd src/csharp-chunker/CSharpChunker && dotnet test

# Test Node.js chunker
cd src/nodejs-chunker && npm test

📈 Performance

  • Throughput: ~50-100 files/second (depends on file size and complexity)
  • Concurrency: Configurable concurrent processing (default: 5 workers)
  • Memory: Efficient streaming with batched database operations
  • Incremental: Only processes changed files using SHA-256 checksums
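
The incremental behaviour described above can be sketched as a comparison between stored and freshly computed SHA-256 digests. `file_checksum` and `changed_files` are hypothetical helpers written for this illustration; the indexer stores checksums on File nodes in Neo4j, whereas this sketch keeps them in a dictionary.

```python
import hashlib
import tempfile
from pathlib import Path


def file_checksum(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(paths, known_checksums):
    """Return paths whose checksum is new or differs from the stored one."""
    changed = []
    for path in paths:
        if known_checksums.get(str(path)) != file_checksum(path):
            changed.append(path)
    return changed


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "example.py"
    source.write_text("print('hello')\n")
    stored = {str(source): file_checksum(source)}

    unchanged = changed_files([source], stored)  # nothing to re-index
    source.write_text("print('goodbye')\n")
    modified = changed_files([source], stored)   # file is picked up again

print(len(unchanged), len(modified))
```

Reading in fixed-size chunks keeps memory use flat for large files, and a file with no stored checksum is treated as changed so that new files are indexed on first sight.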

🀝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass
  5. Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Microsoft CodeAnalysis for C# semantic analysis
  • LibCST for Python concrete syntax trees
  • TypeScript Compiler API for JavaScript/TypeScript analysis
  • Neo4j for graph database capabilities
  • Anthropic Claude for intelligent code summarization
  • Jina AI for state-of-the-art code embeddings

Cite this project

Code citation

@software{agentic-code-indexer,
  author = {TeaBranch},
  title = {agentic-code-indexer: An intelligent code analysis and graph-based indexing system that creates a comprehensive, searchable representation of your codebase using Neo4j, LLMs, and semantic embeddings.},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub Repository},
  howpublished = {\url{https://github.com/teabranch/agentic-code-indexer}},
  commit = {use the commit hash you're working with}
}

Text citation

TeaBranch. (2025). agentic-code-indexer: An intelligent code analysis and graph-based indexing system that creates a comprehensive, searchable representation of your codebase using Neo4j, LLMs, and semantic embeddings. [Computer software]. GitHub. https://github.com/teabranch/agentic-code-indexer


Ready to explore your codebase like never before? 🚀

Get started with the Agentic Code Indexer and unlock the full potential of graph-based code analysis!
