An intelligent code analysis and graph-based indexing system that creates a comprehensive, searchable representation of your codebase using Neo4j, LLMs, and semantic embeddings.
- Multi-language Support: Python, C#, JavaScript/TypeScript
- Graph-based Representation: Rich code relationships in Neo4j
- LLM-powered Summarization: Hierarchical code summaries using Claude
- Hybrid Search System: Vector similarity + entity lookup + graph context expansion
- Natural Language Queries: "Find authentication methods" or "PaymentService class"
- GraphRAG Context: Expand search results with related code relationships
- REST API: FastAPI-based search API with interactive documentation
- Incremental Processing: Change detection with SHA-256 checksums
- Concurrent Processing: Async/await for high-performance indexing
- Beautiful CLI: Rich terminal interface with progress tracking
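To make the hybrid search feature concrete, here is a minimal sketch of how vector similarity, entity lookup, and graph context expansion could be merged into one ranking. The function name `hybrid_rank`, the two weights, and the in-memory dictionaries are assumptions made for illustration; the project itself runs these steps against Neo4j and may score results differently.

```python
from typing import Dict, List, Set, Tuple


def hybrid_rank(
    vector_hits: Dict[str, float],    # node id -> similarity score from a vector search
    entity_hits: Set[str],            # node ids whose name matched the query exactly
    neighbors: Dict[str, List[str]],  # node id -> related node ids (graph context)
    entity_boost: float = 0.5,        # assumed weight, not taken from the project
    context_decay: float = 0.5,       # assumed weight, not taken from the project
) -> List[Tuple[str, float]]:
    """Merge vector, entity, and graph-context signals into one ranked list."""
    scores: Dict[str, float] = dict(vector_hits)

    # Entity lookup: an exact name match adds a fixed boost to any vector score.
    for node_id in entity_hits:
        scores[node_id] = scores.get(node_id, 0.0) + entity_boost

    # Graph context expansion: neighbors of direct hits inherit a decayed score.
    direct = dict(scores)
    for node_id, score in direct.items():
        for related in neighbors.get(node_id, []):
            if related not in direct:
                scores[related] = max(scores.get(related, 0.0), score * context_decay)

    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranked = hybrid_rank(
    vector_hits={"AuthService.login": 0.82, "TokenStore.get": 0.64},
    entity_hits={"AuthService"},
    neighbors={"AuthService.login": ["PasswordHasher.verify"]},
)
for node_id, score in ranked:
    print(f"{score:.2f}  {node_id}")
```

In this sketch, context nodes always rank below the hit that pulled them in, so expansion adds related code without displacing direct matches.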
┌─────────────────────┐    ┌─────────────────────┐    ┌─────────────────────┐
│ File Traversal      │───▶│ Language Chunkers   │───▶│ Graph Ingestion     │
│ & Change Detection  │    │ (Python/C#/JS/TS)   │    │ (Neo4j)             │
└─────────────────────┘    └─────────────────────┘    └─────────────────────┘
                                                                 │
┌─────────────────────┐    ┌─────────────────────┐               │
│ Embedding Gen.      │◀───│ LLM Summarization   │◀──────────────┘
│ (Jina Embeddings)   │    │ (Anthropic Claude)  │
└─────────────────────┘    └─────────────────────┘
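The hand-off from file traversal to the language chunkers in the diagram can be pictured as a routing step on file extension. The sketch below is an assumption about how that routing might look: the mapping, the chunker names (borrowed from the `src/python-chunker`, `src/csharp-chunker`, and `src/nodejs-chunker` directories), and the function `group_by_chunker` are hypothetical, and the real orchestrator may recognize other extensions.

```python
from pathlib import Path
from typing import Dict, Iterable, List

# Assumed extension-to-chunker mapping; not taken from the project's source.
CHUNKER_BY_EXTENSION: Dict[str, str] = {
    ".py": "python",
    ".cs": "csharp",
    ".js": "nodejs",
    ".jsx": "nodejs",
    ".ts": "nodejs",
    ".tsx": "nodejs",
}


def group_by_chunker(paths: Iterable[str]) -> Dict[str, List[str]]:
    """Group source files by the chunker that should parse them; skip other files."""
    groups: Dict[str, List[str]] = {}
    for path in paths:
        chunker = CHUNKER_BY_EXTENSION.get(Path(path).suffix.lower())
        if chunker is not None:
            groups.setdefault(chunker, []).append(path)
    return groups


groups = group_by_chunker(
    ["app/main.py", "Api/UserController.cs", "web/index.ts", "README.md"]
)
print(groups)
```

Grouping first lets each chunker process its files in one batch, which matters when a chunker runs as a separate process.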
- Python 3.8+
- Node.js 18+
- .NET 6+
- Neo4j Database
- Anthropic API Key (optional, for LLM features)
- Clone the repository
git clone https://github.com/your-org/agentic-code-indexer.git
cd agentic-code-indexer
- Start Neo4j Database
docker-compose up -d
- Install Python dependencies
cd src/agentic_code_indexer
pip install -r requirements.txt
- Set up environment variables
export ANTHROPIC_API_KEY="your-api-key-here" # Optional
export NEO4J_PASSWORD="your-neo4j-password"
- Index your codebase
# Index current directory with database initialization
python -m agentic_code_indexer index . --init-db
# Index specific directory
python -m agentic_code_indexer index /path/to/your/project
# Skip LLM features (faster, no API required)
python -m agentic_code_indexer index . --skip-llm
- Check indexing status
python -m agentic_code_indexer status
- Generate summaries and embeddings
python -m agentic_code_indexer summarize
- Search your codebase
# Natural language search
python -m agentic_code_indexer search "authentication methods"
python -m agentic_code_indexer search "PaymentService class" --types Class
python -m agentic_code_indexer search "error handling" --context --code
# Explain how a query would be processed
python -m agentic_code_indexer explain "user authentication"
# Start the search API server
python -m agentic_code_indexer api --host 0.0.0.0 --port 8000
# Then visit http://localhost:8000/docs for interactive API documentation
- Neo4j Database Setup: Comprehensive schema with constraints and vector indexes
- Python Chunker: LibCST-based AST analysis with scope resolution
- C# Chunker: Microsoft.CodeAnalysis with semantic symbol resolution
- JavaScript/TypeScript Chunker: TypeScript Compiler API + Acorn parser
- Common Data Format: Pydantic models for cross-language compatibility
- File Traversal: Recursive directory scanning with change detection
- Chunker Orchestration: Coordinates all language-specific chunkers
- Graph Ingestion: Efficient batched Neo4j operations with MERGE clauses
- Hierarchical Summarization: Bottom-up LLM processing (Parameters → Variables → Methods → Classes → Files)
- Embedding Generation: Local vector generation using Jina embeddings
- Transaction Management: Error handling, retry mechanisms, and batch optimization
- Vector Search Engine: Semantic similarity search using Neo4j vector indexes
- Graph Traversal Engine: GraphRAG-style context expansion with relationship following
- Hybrid Search System: Combines vector similarity, entity lookup, and graph context
- Query Intent Parsing: Intelligent analysis of natural language queries
- Call & Inheritance Hierarchy: Analyze method calls and class inheritance patterns
- REST API: FastAPI-based search API with comprehensive endpoints
- Interactive CLI: Rich terminal search interface with explanations
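The bottom-up order used for hierarchical summarization (parameters first, files last) means each node is summarized only after its children, so a parent's prompt can include the child summaries. The following is a minimal sketch of that ordering. The node shape (`label`, `children`), the function `summarize_bottom_up`, and the `fake_llm` stand-in are all hypothetical; no real LLM call is made.

```python
from typing import Callable, Dict, List

# Processing order from the list above: leaf-level nodes first, files last.
LEVELS = ["Parameter", "Variable", "Method", "Class", "File"]


def summarize_bottom_up(
    nodes: Dict[str, dict],
    summarize: Callable[[dict, List[str]], str],
) -> Dict[str, str]:
    """Summarize every node after its children, passing child summaries upward.

    `nodes` maps a node id to {"label": ..., "children": [...]}; this shape is
    an assumption made for the sketch, not the project's schema.
    """
    summaries: Dict[str, str] = {}
    ordered = sorted(nodes, key=lambda node_id: LEVELS.index(nodes[node_id]["label"]))
    for node_id in ordered:
        node = nodes[node_id]
        child_summaries = [summaries[c] for c in node["children"] if c in summaries]
        summaries[node_id] = summarize(node, child_summaries)
    return summaries


def fake_llm(node: dict, child_summaries: List[str]) -> str:
    """Stand-in for a real LLM call; it only reports how much context it received."""
    return f"{node['label']} summary built from {len(child_summaries)} child summaries"


graph = {
    "auth.py": {"label": "File", "children": ["AuthService"]},
    "AuthService": {"label": "Class", "children": ["login"]},
    "login": {"label": "Method", "children": ["username", "password"]},
    "username": {"label": "Parameter", "children": []},
    "password": {"label": "Parameter", "children": []},
}
result = summarize_bottom_up(graph, fake_llm)
print(result["login"])
```

Sorting by level is sufficient only when children always sit on a lower level than their parent; nested classes or nested functions would need a true topological sort.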
# Use different Neo4j instance
python -m agentic_code_indexer index . \
--neo4j-uri bolt://your-server:7687 \
--neo4j-user your-username \
--neo4j-password your-password
# Adjust performance settings
python -m agentic_code_indexer index . \
--max-concurrent 10 \
--batch-size 2000
# Verbose logging
python -m agentic_code_indexer index . --verbose
# Reset processing status (if interrupted)
python -m agentic_code_indexer reset --confirm
# Re-run just summarization
python -m agentic_code_indexer summarize --batch-size 50
The system creates a rich graph model in Neo4j:
- File: Source code files with checksums and metadata
- Class/Interface: Type definitions with inheritance relationships
- Method/Function: Callable code elements with parameters
- Variable/Parameter: Data elements with type information
- Import: Dependency declarations
- CONTAINS: Hierarchical containment (File → Class → Method)
- DEFINES: Definition relationships (Class → Method)
- CALLS: Function/method invocations
- EXTENDS/IMPLEMENTS: Inheritance relationships
- IMPORTS: Module dependencies
- 768-dimensional embeddings on all major node types
- Cosine similarity for semantic search
- Optimized for the jina-embeddings-v2-base-code model
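Cosine similarity, the metric named above, compares the direction of two embedding vectors and ignores their length. In the indexer this comparison happens inside the Neo4j vector index; the sketch below only illustrates the metric in plain Python. The helper names `cosine_similarity` and `top_k` are hypothetical, and the demo uses 3-dimensional vectors for readability where the real embeddings have 768 dimensions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Rank candidate embeddings by similarity to the query and keep the best k."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Toy 3-dimensional embeddings; real ones would come from the embedding model.
embeddings = {
    "login": [0.9, 0.1, 0.0],
    "render": [0.0, 1.0, 0.0],
    "logout": [0.5, 0.5, 0.0],
}
best = top_k([1.0, 0.0, 0.0], embeddings, k=2)
print([name for name, _ in best])
```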
# Find authentication-related code
python -m agentic_code_indexer search "user authentication login"
# Search for specific classes
python -m agentic_code_indexer search "PaymentService" --types Class
# Find error handling patterns
python -m agentic_code_indexer search "exception handling try catch" --context
# Search with source code included
python -m agentic_code_indexer search "database connection" --code --verbose
# Explain search strategy
python -m agentic_code_indexer explain "API rate limiting middleware"
# Start the API server
python -m agentic_code_indexer api
# Search via HTTP
curl -X POST "http://localhost:8000/search" \
-H "Content-Type: application/json" \
-d '{"query": "authentication methods", "max_results": 5, "include_context": true}'
# Get call hierarchy for a method
curl -X POST "http://localhost:8000/hierarchy/call" \
-H "Content-Type: application/json" \
-d '{"node_id": "method_123", "direction": "both", "max_depth": 2}'
# Get inheritance hierarchy for a class
curl -X POST "http://localhost:8000/hierarchy/inheritance" \
-H "Content-Type: application/json" \
-d '{"node_id": "class_456"}'
// Find all classes that implement a specific interface
MATCH (c:Class)-[:IMPLEMENTS]->(i:Interface {name: "IUserRepository"})
RETURN c.name, c.generated_summary
// Semantic search for payment-related code
CALL db.index.vector.queryNodes('embedding_index', 10, $payment_embedding)
YIELD node, score
RETURN node.name, node.generated_summary, score
// Find potentially complex methods (rough heuristic: code containing both "if" and "for"; this is not a true cyclomatic complexity measure)
MATCH (m:Method)
WHERE m.raw_code CONTAINS "if" AND m.raw_code CONTAINS "for"
RETURN m.full_name, m.generated_summary
Each chunker includes comprehensive test coverage:
# Test Python chunker
cd src/python-chunker && python -m pytest
# Test C# chunker
cd src/csharp-chunker/CSharpChunker && dotnet test
# Test Node.js chunker
cd src/nodejs-chunker && npm test
- Throughput: ~50-100 files/second (depends on file size and complexity)
- Concurrency: Configurable concurrent processing (default: 5 workers)
- Memory: Efficient streaming with batched database operations
- Incremental: Only processes changed files using SHA-256 checksums
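Incremental processing rests on comparing a file's current SHA-256 checksum with the one recorded at the previous run. The sketch below shows that comparison using only the standard library. The helpers `sha256_of` and `changed_files` and the in-memory `known` dictionary are hypothetical; the indexer keeps its checksums on File nodes in Neo4j, and its actual code may differ.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in fixed-size chunks so large files never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(paths: List[Path], known: Dict[str, str]) -> List[Path]:
    """Return files that are new or whose checksum differs from the stored one."""
    return [p for p in paths if known.get(str(p)) != sha256_of(p)]


with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp) / "a.py"
    b = Path(tmp) / "b.py"
    a.write_text("print('a')\n")
    b.write_text("print('b')\n")

    # First run: nothing is known yet, so every file counts as changed.
    known: Dict[str, str] = {}
    first = changed_files([a, b], known)
    known.update({str(p): sha256_of(p) for p in first})

    # Second run after editing one file: only that file is reported.
    b.write_text("print('b2')\n")
    second = changed_files([a, b], known)
    print([p.name for p in second])  # prints ['b.py']
```

Content hashing detects edits regardless of file timestamps, at the cost of reading every file once per run.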
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- Microsoft CodeAnalysis for C# semantic analysis
- LibCST for Python concrete syntax trees
- TypeScript Compiler API for JavaScript/TypeScript analysis
- Neo4j for graph database capabilities
- Anthropic Claude for intelligent code summarization
- Jina AI for state-of-the-art code embeddings
@software{agentic-code-indexer,
author = {TeaBranch},
title = {agentic-code-indexer: An intelligent code analysis and graph-based indexing system that creates a comprehensive, searchable representation of your codebase using Neo4j, LLMs, and semantic embeddings.},
year = {2025},
publisher = {GitHub},
journal = {GitHub Repository},
howpublished = {\url{https://github.com/teabranch/agentic-code-indexer}},
commit = {use the commit hash you're working with}
}
TeaBranch. (2025). agentic-code-indexer: An intelligent code analysis and graph-based indexing system that creates a comprehensive, searchable representation of your codebase using Neo4j, LLMs, and semantic embeddings. [Computer software]. GitHub. https://github.com/teabranch/agentic-code-indexer
Ready to explore your codebase like never before?
Get started with the Agentic Code Indexer and unlock the full potential of graph-based code analysis!
