diff --git a/.env.example b/.env.example index fd6ad8a..2d01fc5 100644 --- a/.env.example +++ b/.env.example @@ -24,8 +24,8 @@ EMBEDDING_MODEL=text-embedding-3-large # Get your API key from: https://console.anthropic.com/settings/keys ANTHROPIC_API_KEY=sk-ant-your-anthropic-api-key-here -# Claude model for summarization (default: claude-3-5-haiku-20241022) -CLAUDE_MODEL=claude-3-5-haiku-20241022 +# Claude model for summarization (default: claude-haiku-4-5-20251001) +CLAUDE_MODEL=claude-haiku-4-5-20251001 # ============================================================================= # Chroma Vector Store Configuration diff --git a/.github/workflows/pr-qa-gate.yml b/.github/workflows/pr-qa-gate.yml index 4646b0b..54a1491 100644 --- a/.github/workflows/pr-qa-gate.yml +++ b/.github/workflows/pr-qa-gate.yml @@ -29,7 +29,7 @@ jobs: - name: Install Task uses: arduino/setup-task@v2 with: - version: 3.x + version: 3.43.3 - name: Install Poetry uses: snok/install-poetry@v1 diff --git a/AGENTS.md b/AGENTS.md index ca2a54c..ba6e699 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -162,7 +162,9 @@ task before-push This runs format, lint, typecheck, and tests with coverage. -**MANDATORY**: You MUST run `task pr-qa-gate` before checking in or pushing any changes. You should also run `task pr-qa-gate` whenever checking project status or SDD status. +**IMPORTANT**: You MUST run `task pr-qa-gate` before checking in or pushing any changes. You should also run `task pr-qa-gate` whenever checking project status or SDD status. + +**MANDATORY**: Any feature or task is not considered done unless `task pr-qa-gate` passes successfully. ## Git Workflow diff --git a/CLAUDE.md b/CLAUDE.md index 7bb3b9d..b6b52a8 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -198,6 +198,10 @@ Do NOT push code that fails `task before-push`. ## Active Technologies - Python 3.10+ + FastAPI, LlamaIndex, ChromaDB, OpenAI, rank-bm25 (100-bm25-hybrid-retrieval) - ChromaDB (Vector Store), Local Persistent BM25 Index (LlamaIndex) (100-bm25-hybrid-retrieval) +- Python 3.10+ + LlamaIndex (CodeSplitter, SummaryExtractor), tree-sitter parsers, ChromaDB (101-code-ingestion) +- ChromaDB (unified vector store), Disk-based BM25 index (101-code-ingestion) +- Python 3.10+ + LlamaIndex (CodeSplitter, SummaryExtractor), tree-sitter (AST parsing), OpenAI/Anthropic (embeddings/summaries) (101-code-ingestion) +- ChromaDB vector store (existing) (101-code-ingestion) ## Recent Changes - 100-bm25-hybrid-retrieval: Added Python 3.10+ + FastAPI, LlamaIndex, ChromaDB, OpenAI, rank-bm25 diff --git a/README.md b/README.md index 85aacd9..3338635 100644 --- a/README.md +++ b/README.md @@ -12,13 +12,41 @@ Doc-Serve is a monorepo containing three packages: | **doc-svr-ctl** | Command-line interface for managing the server | | **doc-serve-skill** | Claude Code skill for AI-powered documentation queries | +## Code Ingestion & Search + +Doc-Serve now supports unified search across documentation and source code: + +- **10 Programming Languages**: Python, TypeScript, JavaScript, Java, Kotlin, C, C++, Go, Rust, Swift +- **AST-Aware Chunking**: Intelligent code parsing and chunking using tree-sitter +- **Cross-Reference Queries**: Search across docs and code simultaneously +- **Language Filtering**: Filter results by programming language +- **Source Type Filtering**: Separate results by documentation vs. source code +- **LLM Code Summaries**: AI-generated summaries improve semantic search quality + +### Example: Code-Aware Search +```bash +# Index both docs and code +doc-svr-ctl index ./my-project --include-code + +# Search across everything +doc-svr-ctl query "authentication implementation" + +# Filter by code only +doc-svr-ctl query "API endpoints" --source-types code --languages python +``` + ## Features +- **Code Ingestion**: Index and search across documentation AND source code +- **Cross-Reference Search**: Unified queries across docs and code with intelligent filtering +- **Language-Aware Processing**: AST-based chunking for 10+ programming languages - **Hybrid Search**: Combines semantic meaning (Vector) with exact keyword matching (BM25) - **Semantic Search**: Natural language queries using OpenAI embeddings - **Keyword Search**: Precise term matching for technical documentation +- **Advanced Filtering**: Filter by source type (doc/code) and programming language - **Vector Store**: ChromaDB for efficient similarity search -- **Context-Aware Chunking**: Intelligent document splitting with overlap +- **Context-Aware Chunking**: Intelligent document and code splitting with overlap +- **LLM Summaries**: AI-generated summaries for code chunks improve semantic search - **REST API**: Full OpenAPI-documented REST interface - **CLI Tool**: Comprehensive command-line management - **Claude Integration**: Native Claude Code skill for AI workflows @@ -155,6 +183,7 @@ doc-serve/ - **Embeddings**: OpenAI text-embedding-3-large - **Summarization**: Claude Haiku - **Indexing**: LlamaIndex +- **Code Parsing**: Tree-sitter (AST-aware chunking) - **CLI**: Click + Rich - **Build System**: Poetry diff --git a/doc-serve-server/.coverage.Mac.lan.43285.XHDFbHkx b/doc-serve-server/.coverage.Mac.lan.91214.XiXnnWhx similarity index 100% rename from doc-serve-server/.coverage.Mac.lan.43285.XHDFbHkx rename to doc-serve-server/.coverage.Mac.lan.91214.XiXnnWhx diff --git a/doc-serve-server/coverage-server.xml b/doc-serve-server/coverage-server.xml index 2848423..ae8038d 100644 --- a/doc-serve-server/coverage-server.xml +++ b/doc-serve-server/coverage-server.xml @@ -1,5 +1,5 @@ - + @@ -47,38 +47,38 @@ - - + - + - + - - - - - + + + + + - + - - - - + + + + - - - - - - + + + + + + + @@ -150,36 +150,36 @@ - - - - - - - - - + + + + + + + - - + + - + + - - - - - - - - - - - - - + + + + + + + + + + + + + + @@ -260,7 +260,7 @@ - + @@ -272,7 +272,7 @@ - + @@ -318,21 +318,40 @@ - - - - - - - - - - - - + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + - + @@ -340,181 +359,434 @@ - + - - - - + + + + - - + - + + + + + + + + + + - - - - - + + + + + + + + + + + + + + + - + + + + + + + + + + + + + + + + + + - - + + - - - - - - - - - - - + + + + + + + + + + + + + + + + + + - - + - - - - - + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + - + - - - + + + - + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - + + + + + + - + - - - - - - - + + + - - - - + + + + + + + + + + + + - + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + - + + - - - - - - - - - - - - - - - + + + + + + + + + + + + + + - + + + - - - - - - - - - + + + + + + + + + - - - - - + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + - + @@ -541,11 +813,14 @@ - - + + - + + + + @@ -558,72 +833,98 @@ - - - + + + - - - - - + + + + + + + + - - - - + + - - - - - - - - - - - - - + + + + + + + + + + + + + + + + + + + + + - + - - - + + - - - - - - - - - + + + + + + + + + + + - - - - - - - - + + + + + + + + + + + + + + + + + + + + + + + + - + @@ -633,7 +934,7 @@ - + @@ -647,110 +948,158 @@ - - - - - + + + + + - - + + - + - - + + - - + + - - - + + + - + - - - - - - - + + + + + + + - + - - - + + + - - - - + + + + + - - - - - - - - + + + - - - + + + - - - + + + + + + - - + + + + + + - - - - - + - - - - - - - - - - - - - - - - - + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + - + @@ -758,77 +1107,124 @@ - + - - - - - - - - - - - - - - - - + + + + + + + + + + + + + + + + - - - - - - - + + + + + + + - - - + + - - - - - - - - - - - - - - - - + + + + + + + + + + + + + + + + - - - - - + + - + + - - - - + + + + + - - - - + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/doc-serve-server/doc_serve_server/api/main.py b/doc-serve-server/doc_serve_server/api/main.py index 3b21232..de70ecd 100644 --- a/doc-serve-server/doc_serve_server/api/main.py +++ b/doc-serve-server/doc_serve_server/api/main.py @@ -37,6 +37,7 @@ async def lifespan(app: FastAPI) -> AsyncIterator[None]: # Set environment variable for LlamaIndex components import os + if settings.OPENAI_API_KEY: os.environ["OPENAI_API_KEY"] = settings.OPENAI_API_KEY diff --git a/doc-serve-server/doc_serve_server/api/routers/health.py b/doc-serve-server/doc_serve_server/api/routers/health.py index ab3b22b..866bd7c 100644 --- a/doc-serve-server/doc_serve_server/api/routers/health.py +++ b/doc-serve-server/doc_serve_server/api/routers/health.py @@ -79,6 +79,8 @@ async def indexing_status() -> IndexingStatus: return IndexingStatus( total_documents=status["total_documents"], total_chunks=status["total_chunks"], + total_doc_chunks=status.get("total_doc_chunks", 0), + total_code_chunks=status.get("total_code_chunks", 0), indexing_in_progress=status["is_indexing"], current_job_id=status["current_job_id"], progress_percent=status["progress_percent"], @@ -88,4 +90,5 @@ async def indexing_status() -> IndexingStatus: else None ), indexed_folders=status["indexed_folders"], + supported_languages=status.get("supported_languages", []), ) diff --git a/doc-serve-server/doc_serve_server/api/routers/index.py b/doc-serve-server/doc_serve_server/api/routers/index.py index 7620917..5b1aa02 100644 --- a/doc-serve-server/doc_serve_server/api/routers/index.py +++ b/doc-serve-server/doc_serve_server/api/routers/index.py @@ -74,6 +74,12 @@ async def index_documents(request: IndexRequest) -> IndexResponse: chunk_size=request.chunk_size, chunk_overlap=request.chunk_overlap, recursive=request.recursive, + include_code=request.include_code, + supported_languages=request.supported_languages, + code_chunk_strategy=request.code_chunk_strategy, + include_patterns=request.include_patterns, + exclude_patterns=request.exclude_patterns, + generate_summaries=request.generate_summaries, ) job_id = await indexing_service.start_indexing(resolved_request) except Exception as e: @@ -138,6 +144,11 @@ async def add_documents(request: IndexRequest) -> IndexResponse: chunk_size=request.chunk_size, chunk_overlap=request.chunk_overlap, recursive=request.recursive, + include_code=request.include_code, + supported_languages=request.supported_languages, + code_chunk_strategy=request.code_chunk_strategy, + include_patterns=request.include_patterns, + exclude_patterns=request.exclude_patterns, ) job_id = await indexing_service.start_indexing(resolved_request) except Exception as e: diff --git a/doc-serve-server/doc_serve_server/config/settings.py b/doc-serve-server/doc_serve_server/config/settings.py index 86a07f5..15e3ab7 100644 --- a/doc-serve-server/doc_serve_server/config/settings.py +++ b/doc-serve-server/doc_serve_server/config/settings.py @@ -21,7 +21,7 @@ class Settings(BaseSettings): # Anthropic Configuration ANTHROPIC_API_KEY: str = "" - CLAUDE_MODEL: str = "claude-3-5-haiku-20241022" + CLAUDE_MODEL: str = "claude-3-5-haiku-20241022" # Claude 3.5 Haiku (latest) # Chroma Configuration CHROMA_PERSIST_DIR: str = "./chroma_db" diff --git a/doc-serve-server/doc_serve_server/indexing/__init__.py b/doc-serve-server/doc_serve_server/indexing/__init__.py index c6955aa..4ddf788 100644 --- a/doc-serve-server/doc_serve_server/indexing/__init__.py +++ b/doc-serve-server/doc_serve_server/indexing/__init__.py @@ -1,7 +1,7 @@ """Indexing pipeline components for document processing.""" from doc_serve_server.indexing.bm25_index import BM25IndexManager, get_bm25_manager -from doc_serve_server.indexing.chunking import ContextAwareChunker +from doc_serve_server.indexing.chunking import CodeChunker, ContextAwareChunker from doc_serve_server.indexing.document_loader import DocumentLoader from doc_serve_server.indexing.embedding import ( EmbeddingGenerator, @@ -11,6 +11,7 @@ __all__ = [ "DocumentLoader", "ContextAwareChunker", + "CodeChunker", "EmbeddingGenerator", "get_embedding_generator", "BM25IndexManager", diff --git a/doc-serve-server/doc_serve_server/indexing/bm25_index.py b/doc-serve-server/doc_serve_server/indexing/bm25_index.py index cd5f092..8406b72 100644 --- a/doc-serve-server/doc_serve_server/indexing/bm25_index.py +++ b/doc-serve-server/doc_serve_server/indexing/bm25_index.py @@ -5,7 +5,7 @@ from pathlib import Path from typing import Optional -from llama_index.core.schema import BaseNode +from llama_index.core.schema import BaseNode, NodeWithScore from llama_index.retrievers.bm25 import BM25Retriever from doc_serve_server.config import settings @@ -93,6 +93,56 @@ def get_retriever(self, top_k: int = 5) -> BM25Retriever: self._retriever.similarity_top_k = top_k return self._retriever + async def search_with_filters( + self, + query: str, + top_k: int = 5, + source_types: Optional[list[str]] = None, + languages: Optional[list[str]] = None, + max_results: Optional[int] = None, + ) -> list[NodeWithScore]: + """ + Search the BM25 index with metadata filtering. + + Args: + query: Search query string. + top_k: Number of results to return. + source_types: Filter by source types (doc, code, test). + languages: Filter by programming languages. + + Returns: + List of NodeWithScore objects, filtered by metadata. + """ + if not self._retriever: + raise RuntimeError("BM25 index not initialized") + + # Get results for filtering + retriever_top_k = max_results if max_results is not None else (top_k * 3) + retriever = self.get_retriever(top_k=retriever_top_k) + nodes = await retriever.aretrieve(query) + + # Apply metadata filtering + filtered_nodes = [] + for node in nodes: + metadata = node.node.metadata + + # Check source type filter + if source_types: + source_type = metadata.get("source_type", "doc") + if source_type not in source_types: + continue + + # Check language filter + if languages: + language = metadata.get("language") + if not language or language not in languages: + continue + + filtered_nodes.append(node) + + # Return top_k results after filtering + return filtered_nodes[:top_k] + def reset(self) -> None: """Reset the BM25 index by deleting persistent files.""" self._retriever = None diff --git a/doc-serve-server/doc_serve_server/indexing/chunking.py b/doc-serve-server/doc_serve_server/indexing/chunking.py index ae61219..97965d8 100644 --- a/doc-serve-server/doc_serve_server/indexing/chunking.py +++ b/doc-serve-server/doc_serve_server/indexing/chunking.py @@ -4,10 +4,11 @@ import logging from collections.abc import Awaitable, Callable from dataclasses import dataclass, field +from datetime import datetime from typing import Any, Optional import tiktoken -from llama_index.core.node_parser import SentenceSplitter +from llama_index.core.node_parser import CodeSplitter, SentenceSplitter from doc_serve_server.config import settings @@ -16,9 +17,107 @@ logger = logging.getLogger(__name__) +@dataclass +class ChunkMetadata: + """Structured metadata for document and code chunks with unified schema.""" + + # Universal metadata (all chunk types) + chunk_id: str + source: str + file_name: str + chunk_index: int + total_chunks: int + source_type: str # "doc", "code", or "test" + created_at: datetime = field(default_factory=datetime.utcnow) + + # Document-specific metadata + language: Optional[str] = None # For docs/code: language type + heading_path: Optional[str] = None # Document heading hierarchy + section_title: Optional[str] = None # Current section title + content_type: Optional[str] = None # "tutorial", "api_ref", "guide", etc. + + # Code-specific metadata (AST-aware fields) + symbol_name: Optional[str] = None # Full symbol path + symbol_kind: Optional[str] = None # "function", "class", "method", etc. + start_line: Optional[int] = None # 1-based line number + end_line: Optional[int] = None # 1-based line number + section_summary: Optional[str] = None # AI-generated summary + prev_section_summary: Optional[str] = None # Previous section summary + docstring: Optional[str] = None # Extracted docstring + parameters: Optional[list[str]] = None # Function parameters as strings + return_type: Optional[str] = None # Function return type + decorators: Optional[list[str]] = None # Python decorators or similar + imports: Optional[list[str]] = None # Import statements in this chunk + + # Additional flexible metadata + extra: dict[str, Any] = field(default_factory=dict) + + def to_dict(self) -> dict[str, Any]: + """Convert ChunkMetadata to a dictionary for storage.""" + data = { + "chunk_id": self.chunk_id, + "source": self.source, + "file_name": self.file_name, + "chunk_index": self.chunk_index, + "total_chunks": self.total_chunks, + "source_type": self.source_type, + "created_at": self.created_at.isoformat(), + } + + # Add optional fields if they exist + if self.language: + data["language"] = self.language + if self.heading_path: + data["heading_path"] = self.heading_path + if self.section_title: + data["section_title"] = self.section_title + if self.content_type: + data["content_type"] = self.content_type + if self.symbol_name: + data["symbol_name"] = self.symbol_name + if self.symbol_kind: + data["symbol_kind"] = self.symbol_kind + if self.start_line is not None: + data["start_line"] = self.start_line + if self.end_line is not None: + data["end_line"] = self.end_line + if self.section_summary: + data["section_summary"] = self.section_summary + if self.prev_section_summary: + data["prev_section_summary"] = self.prev_section_summary + if self.docstring: + data["docstring"] = self.docstring + if self.parameters: + data["parameters"] = self.parameters + if self.return_type: + data["return_type"] = self.return_type + if self.decorators: + data["decorators"] = self.decorators + if self.imports: + data["imports"] = self.imports + + # Add extra metadata + data.update(self.extra) + + return data + + @dataclass class TextChunk: - """Represents a chunk of text with metadata.""" + """Represents a chunk of text with structured metadata.""" + + chunk_id: str + text: str + source: str + chunk_index: int + total_chunks: int + token_count: int + metadata: ChunkMetadata + + +@dataclass +class CodeChunk: + """Represents a chunk of source code with AST-aware boundaries.""" chunk_id: str text: str @@ -26,7 +125,65 @@ class TextChunk: chunk_index: int total_chunks: int token_count: int - metadata: dict[str, Any] = field(default_factory=dict) + metadata: ChunkMetadata + + @classmethod + def create( + cls, + chunk_id: str, + text: str, + source: str, + language: str, + chunk_index: int, + total_chunks: int, + token_count: int, + symbol_name: Optional[str] = None, + symbol_kind: Optional[str] = None, + start_line: Optional[int] = None, + end_line: Optional[int] = None, + section_summary: Optional[str] = None, + prev_section_summary: Optional[str] = None, + docstring: Optional[str] = None, + parameters: Optional[list[str]] = None, + return_type: Optional[str] = None, + decorators: Optional[list[str]] = None, + imports: Optional[list[str]] = None, + extra: Optional[dict[str, Any]] = None, + ) -> "CodeChunk": + """Create a CodeChunk with properly structured metadata.""" + file_name = source.split("/")[-1] if "/" in source else source + + metadata = ChunkMetadata( + chunk_id=chunk_id, + source=source, + file_name=file_name, + chunk_index=chunk_index, + total_chunks=total_chunks, + source_type="code", + language=language, + symbol_name=symbol_name, + symbol_kind=symbol_kind, + start_line=start_line, + end_line=end_line, + section_summary=section_summary, + prev_section_summary=prev_section_summary, + docstring=docstring, + parameters=parameters, + return_type=return_type, + decorators=decorators, + imports=imports, + extra=extra or {}, + ) + + return cls( + chunk_id=chunk_id, + text=text, + source=source, + chunk_index=chunk_index, + total_chunks=total_chunks, + token_count=token_count, + metadata=metadata, + ) class ContextAwareChunker: @@ -134,6 +291,34 @@ async def chunk_single_document( id_seed = f"{document.source}_{idx}" stable_id = hashlib.md5(id_seed.encode()).hexdigest() + # Extract document-specific metadata + doc_language = document.metadata.get("language", "markdown") + doc_heading_path = document.metadata.get("heading_path") + doc_section_title = document.metadata.get("section_title") + doc_content_type = document.metadata.get("content_type", "document") + + # Filter out fields we've already extracted to avoid duplication + extra_metadata = { + k: v + for k, v in document.metadata.items() + if k + not in {"language", "heading_path", "section_title", "content_type"} + } + + chunk_metadata = ChunkMetadata( + chunk_id=f"chunk_{stable_id[:16]}", + source=document.source, + file_name=document.file_name, + chunk_index=idx, + total_chunks=total_chunks, + source_type="doc", + language=doc_language, + heading_path=doc_heading_path, + section_title=doc_section_title, + content_type=doc_content_type, + extra=extra_metadata, + ) + chunk = TextChunk( chunk_id=f"chunk_{stable_id[:16]}", text=chunk_text, @@ -141,13 +326,7 @@ async def chunk_single_document( chunk_index=idx, total_chunks=total_chunks, token_count=self.count_tokens(chunk_text), - metadata={ - "file_name": document.file_name, - "file_path": document.file_path, - "chunk_index": idx, - "total_chunks": total_chunks, - **document.metadata, - }, + metadata=chunk_metadata, ) chunks.append(chunk) @@ -206,3 +385,180 @@ def get_chunk_stats(self, chunks: list[TextChunk]) -> dict[str, Any]: "total_tokens": sum(token_counts), "unique_sources": len({c.source for c in chunks}), } + + +class CodeChunker: + """ + AST-aware code chunking using LlamaIndex CodeSplitter. + + Splits source code at semantic boundaries (functions, classes, etc.) + while preserving code structure and adding rich metadata. + """ + + def __init__( + self, + language: str, + chunk_lines: Optional[int] = None, + chunk_lines_overlap: Optional[int] = None, + max_chars: Optional[int] = None, + generate_summaries: bool = False, + ): + """ + Initialize the code chunker. + + Args: + language: Programming language (must be supported by tree-sitter). + chunk_lines: Target chunk size in lines. Defaults to 40. + chunk_lines_overlap: Line overlap between chunks. Defaults to 15. + max_chars: Maximum characters per chunk. Defaults to 1500. + generate_summaries: Whether to generate LLM summaries for chunks. + """ + self.language = language + self.chunk_lines = chunk_lines or 40 + self.chunk_lines_overlap = chunk_lines_overlap or 15 + self.max_chars = max_chars or 1500 + self.generate_summaries = generate_summaries + + # Initialize LlamaIndex CodeSplitter for AST-aware chunking + self.code_splitter = CodeSplitter( + language=self.language, + chunk_lines=self.chunk_lines, + chunk_lines_overlap=self.chunk_lines_overlap, + max_chars=self.max_chars, + ) + + # Initialize embedding generator for summaries (only if needed) + if self.generate_summaries: + from .embedding import get_embedding_generator + self.embedding_generator = get_embedding_generator() + + # Initialize tokenizer for token counting + self.tokenizer = tiktoken.get_encoding("cl100k_base") + + def count_tokens(self, text: str) -> int: + """Count the number of tokens in a text string.""" + return len(self.tokenizer.encode(text)) + + async def chunk_code_document( + self, + document: LoadedDocument, + ) -> list[CodeChunk]: + """ + Chunk a code document using AST-aware boundaries. + + Args: + document: Code document to chunk (must have source_type="code"). + + Returns: + List of CodeChunk objects with AST metadata. + + Raises: + ValueError: If document is not a code document or language mismatch. + """ + if document.metadata.get("source_type") != "code": + raise ValueError(f"Document {document.source} is not a code document") + + doc_language = document.metadata.get("language") + if doc_language and doc_language != self.language: + logger.warning( + f"Language mismatch: document has {doc_language}, " + f"chunker expects {self.language}. Using chunker language." + ) + + if not document.text.strip(): + logger.warning(f"Empty code document: {document.source}") + return [] + + try: + # Use LlamaIndex CodeSplitter to get AST-aware chunks + code_chunks = self.code_splitter.split_text(document.text) + except Exception as e: + logger.error(f"Failed to chunk code document {document.source}: {e}") + # Fallback to text-based chunking if AST parsing fails + logger.info(f"Falling back to text chunking for {document.source}") + text_splitter = SentenceSplitter( + chunk_size=self.max_chars, # Use max_chars as approximate token limit + chunk_overlap=int(self.max_chars * 0.1), # 10% overlap + ) + code_chunks = text_splitter.split_text(document.text) + + # Convert to our CodeChunk format with enhanced metadata + chunks: list[CodeChunk] = [] + total_chunks = len(code_chunks) + + for idx, chunk_text in enumerate(code_chunks): + # Generate stable chunk ID + id_seed = f"{document.source}_{idx}" + stable_id = hashlib.md5(id_seed.encode()).hexdigest() + + # Generate summary if enabled + section_summary = None + if self.generate_summaries and chunk_text.strip(): + try: + section_summary = await self.embedding_generator.generate_summary( + chunk_text + ) + logger.debug( + f"Generated summary for chunk {idx}: {section_summary[:50]}..." + ) + except Exception as e: + logger.warning(f"Failed to generate summary for chunk {idx}: {e}") + section_summary = "" + + chunk = CodeChunk.create( + chunk_id=f"chunk_{stable_id[:16]}", + text=chunk_text, + source=document.source, + language=self.language, + chunk_index=idx, + total_chunks=total_chunks, + token_count=self.count_tokens(chunk_text), + section_summary=section_summary, + # AST metadata will be populated by post-processing if available + extra=document.metadata.copy(), + ) + chunks.append(chunk) + + logger.info( + f"Code chunked {document.source} into {len(chunks)} chunks " + f"(avg {len(chunks) / max(total_chunks, 1):.1f} chunks/doc)" + ) + return chunks + + def get_code_chunk_stats(self, chunks: list[CodeChunk]) -> dict[str, Any]: + """ + Get statistics about code chunks. + + Args: + chunks: List of CodeChunk objects. + + Returns: + Dictionary with code chunk statistics. + """ + if not chunks: + return { + "total_chunks": 0, + "avg_tokens": 0, + "min_tokens": 0, + "max_tokens": 0, + "total_tokens": 0, + "languages": set(), + "symbol_types": set(), + } + + token_counts = [c.token_count for c in chunks] + languages = {c.metadata.language for c in chunks if c.metadata.language} + symbol_types = { + c.metadata.symbol_kind for c in chunks if c.metadata.symbol_kind + } + + return { + "total_chunks": len(chunks), + "avg_tokens": sum(token_counts) / len(token_counts), + "min_tokens": min(token_counts), + "max_tokens": max(token_counts), + "total_tokens": sum(token_counts), + "unique_sources": len({c.source for c in chunks}), + "languages": languages, + "symbol_types": symbol_types, + } diff --git a/doc-serve-server/doc_serve_server/indexing/document_loader.py b/doc-serve-server/doc_serve_server/indexing/document_loader.py index 1fb5112..f43303b 100644 --- a/doc-serve-server/doc_serve_server/indexing/document_loader.py +++ b/doc-serve-server/doc_serve_server/indexing/document_loader.py @@ -1,6 +1,7 @@ """Document loading from various file formats using LlamaIndex.""" import logging +import re from dataclasses import dataclass, field from pathlib import Path from typing import Any, Optional @@ -22,14 +23,239 @@ class LoadedDocument: metadata: dict[str, Any] = field(default_factory=dict) +class LanguageDetector: + """ + Utility for detecting programming languages from file paths and content. + + Supports the 10 languages with tree-sitter parsers: + - Python, TypeScript, JavaScript, Kotlin, C, C++, Java, Go, Rust, Swift + """ + + # Language detection by file extension + EXTENSION_TO_LANGUAGE = { + # Python + ".py": "python", + ".pyw": "python", + ".pyi": "python", + # TypeScript/JavaScript + ".ts": "typescript", + ".tsx": "typescript", + ".js": "javascript", + ".jsx": "javascript", + ".mjs": "javascript", + ".cjs": "javascript", + # Kotlin + ".kt": "kotlin", + ".kts": "kotlin", + # C/C++ + ".c": "c", + ".h": "c", + ".cpp": "cpp", + ".cc": "cpp", + ".cxx": "cpp", + ".hpp": "cpp", + ".hxx": "cpp", + # Java + ".java": "java", + # Go + ".go": "go", + # Rust + ".rs": "rust", + # Swift + ".swift": "swift", + } + + # Language detection by content patterns (fallback) + CONTENT_PATTERNS = { + "python": [ + re.compile(r"^\s*import\s+\w+", re.MULTILINE), + re.compile(r"^\s*from\s+\w+\s+import", re.MULTILINE), + re.compile(r"^\s*def\s+\w+\s*\(", re.MULTILINE), + re.compile(r"^\s*class\s+\w+", re.MULTILINE), + ], + "javascript": [ + re.compile(r"^\s*(const|let|var)\s+\w+\s*=", re.MULTILINE), + re.compile(r"^\s*function\s+\w+\s*\(", re.MULTILINE), + re.compile(r"^\s*=>\s*\{", re.MULTILINE), # Arrow functions + ], + "typescript": [ + re.compile(r"^\s*interface\s+\w+", re.MULTILINE), + re.compile(r"^\s*type\s+\w+\s*=", re.MULTILINE), + re.compile(r":\s*(string|number|boolean|any)", re.MULTILINE), + ], + "java": [ + re.compile(r"^\s*public\s+class\s+\w+", re.MULTILINE), + re.compile(r"^\s*package\s+\w+", re.MULTILINE), + re.compile(r"^\s*import\s+java\.", re.MULTILINE), + ], + "kotlin": [ + re.compile(r"^\s*fun\s+\w+\s*\(", re.MULTILINE), + re.compile(r"^\s*class\s+\w+", re.MULTILINE), + re.compile(r":\s*(String|Int|Boolean)", re.MULTILINE), + ], + "cpp": [ + re.compile(r"^\s*#include\s*<", re.MULTILINE), + re.compile(r"^\s*using\s+namespace", re.MULTILINE), + re.compile(r"^\s*std::", re.MULTILINE), + ], + "c": [ + re.compile(r"^\s*#include\s*<", re.MULTILINE), + re.compile(r"^\s*int\s+main\s*\(", re.MULTILINE), + re.compile(r"^\s*printf\s*\(", re.MULTILINE), + ], + "go": [ + re.compile(r"^\s*package\s+\w+", re.MULTILINE), + re.compile(r"^\s*import\s*\(", re.MULTILINE), + re.compile(r"^\s*func\s+\w+\s*\(", re.MULTILINE), + ], + "rust": [ + re.compile(r"^\s*fn\s+\w+\s*\(", re.MULTILINE), + re.compile(r"^\s*use\s+\w+::", re.MULTILINE), + re.compile(r"^\s*let\s+(mut\s+)?\w+", re.MULTILINE), + ], + "swift": [ + re.compile(r"^\s*import\s+Foundation", re.MULTILINE), + re.compile(r"^\s*func\s+\w+\s*\(", re.MULTILINE), + re.compile(r"^\s*class\s+\w+\s*:", re.MULTILINE), + ], + } + + @classmethod + def detect_from_path(cls, file_path: str) -> Optional[str]: + """ + Detect language from file path/extension. + + Args: + file_path: Path to the file. + + Returns: + Language name or None if not detected. + """ + path = Path(file_path) + extension = path.suffix.lower() + + return cls.EXTENSION_TO_LANGUAGE.get(extension) + + @classmethod + def detect_from_content( + cls, content: str, top_n: int = 3 + ) -> list[tuple[str, float]]: + """ + Detect language from file content using pattern matching. + + Args: + content: File content to analyze. + top_n: Number of top matches to return. + + Returns: + List of (language, confidence) tuples, sorted by confidence. + """ + scores: dict[str, float] = {} + + for language, patterns in cls.CONTENT_PATTERNS.items(): + total_score = 0.0 + pattern_count = len(patterns) + + for pattern in patterns: + matches = len(pattern.findall(content)) + if matches > 0: + # Score based on number of matches, normalized by pattern count + total_score += min(matches / 10.0, 1.0) # Cap at 1.0 per pattern + + if total_score > 0: + scores[language] = total_score / pattern_count + + # Sort by score descending + sorted_scores = sorted(scores.items(), key=lambda x: x[1], reverse=True) + return sorted_scores[:top_n] + + @classmethod + def detect_language( + cls, file_path: str, content: Optional[str] = None + ) -> Optional[str]: + """ + Detect programming language using both path and content analysis. + + Args: + file_path: Path to the file. + content: Optional file content for fallback detection. + + Returns: + Detected language name or None. + """ + # First try extension-based detection (fast and reliable) + language = cls.detect_from_path(file_path) + if language: + return language + + # Fallback to content analysis if content is provided + if content: + content_matches = cls.detect_from_content(content, top_n=1) + if ( + content_matches and content_matches[0][1] > 0.1 + ): # Minimum confidence threshold + return content_matches[0][0] + + return None + + @classmethod + def is_supported_language(cls, language: str) -> bool: + """ + Check if a language is supported by our tree-sitter parsers. + + Args: + language: Language name to check. + + Returns: + True if supported, False otherwise. + """ + return language in cls.CONTENT_PATTERNS + + @classmethod + def get_supported_languages(cls) -> list[str]: + """Get list of all supported programming languages.""" + return list(cls.CONTENT_PATTERNS.keys()) + + class DocumentLoader: """ - Loads documents from a folder supporting multiple file formats. + Loads documents and code files from a folder supporting multiple file formats. - Supported formats: .txt, .md, .pdf, .docx, .html + Supported document formats: .txt, .md, .pdf, .docx, .html, .rst + Supported code formats: .py, .ts, .tsx, .js, .jsx, .kt, .c, .cpp, + .java, .go, .rs, .swift """ - SUPPORTED_EXTENSIONS: set[str] = {".txt", ".md", ".pdf", ".docx", ".html", ".rst"} + # Document formats + DOCUMENT_EXTENSIONS: set[str] = {".txt", ".md", ".pdf", ".docx", ".html", ".rst"} + + # Code formats (supported by tree-sitter) + CODE_EXTENSIONS: set[str] = { + ".py", + ".pyw", + ".pyi", # Python + ".ts", + ".tsx", # TypeScript + ".js", + ".jsx", + ".mjs", + ".cjs", # JavaScript + ".kt", + ".kts", # Kotlin + ".c", + ".h", # C + ".cpp", + ".cc", + ".cxx", + ".hpp", + ".hxx", # C++ + ".java", # Java + ".go", # Go + ".rs", # Rust + ".swift", # Swift + } + + SUPPORTED_EXTENSIONS: set[str] = DOCUMENT_EXTENSIONS | CODE_EXTENSIONS def __init__( self, @@ -101,6 +327,15 @@ async def load_from_folder( except OSError: file_size = 0 + # Detect language for code files + language = None + source_type = "doc" # Default to document + if file_path: + path_ext = Path(file_path).suffix.lower() + if path_ext in self.CODE_EXTENSIONS: + source_type = "code" + language = LanguageDetector.detect_language(file_path, doc.text) + loaded_doc = LoadedDocument( text=doc.text, source=file_path, @@ -110,6 +345,8 @@ async def load_from_folder( metadata={ **doc.metadata, "doc_id": doc.doc_id, + "source_type": source_type, + "language": language, }, ) loaded_docs.append(loaded_doc) @@ -152,6 +389,15 @@ async def load_single_file(self, file_path: str) -> LoadedDocument: raise ValueError(f"No content loaded from file: {file_path}") doc = docs[0] + + # Detect language for code files + language = None + source_type = "doc" # Default to document + path_ext = path.suffix.lower() + if path_ext in self.CODE_EXTENSIONS: + source_type = "code" + language = LanguageDetector.detect_language(str(path), doc.text) + return LoadedDocument( text=doc.text, source=file_path, @@ -161,9 +407,62 @@ async def load_single_file(self, file_path: str) -> LoadedDocument: metadata={ **doc.metadata, "doc_id": doc.doc_id, + "source_type": source_type, + "language": language, }, ) + async def load_files( + self, + folder_path: str, + recursive: bool = True, + include_code: bool = False, + ) -> list[LoadedDocument]: + """ + Load documents and optionally code files from a folder. + + Args: + folder_path: Path to the folder containing files to load. + recursive: Whether to scan subdirectories recursively. + include_code: Whether to include source code files alongside documents. + + Returns: + List of LoadedDocument objects with proper metadata. + + Raises: + ValueError: If folder path is invalid. + FileNotFoundError: If folder doesn't exist. + """ + # Configure extensions based on include_code flag + if include_code: + # Use all supported extensions (docs + code) + effective_extensions = self.SUPPORTED_EXTENSIONS + else: + # Use only document extensions + effective_extensions = self.DOCUMENT_EXTENSIONS + + # Create a temporary loader with the effective extensions + temp_loader = DocumentLoader(supported_extensions=effective_extensions) + + # Load files using the configured extensions + loaded_docs = await temp_loader.load_from_folder(folder_path, recursive) + + # Ensure all documents have proper source_type metadata + for doc in loaded_docs: + if not doc.metadata.get("source_type"): + path_ext = Path(doc.source).suffix.lower() + if path_ext in self.CODE_EXTENSIONS: + doc.metadata["source_type"] = "code" + # Detect language for code files + language = LanguageDetector.detect_language(doc.source, doc.text) + if language: + doc.metadata["language"] = language + else: + doc.metadata["source_type"] = "doc" + doc.metadata["language"] = "markdown" # Default for documents + + return loaded_docs + def get_supported_files( self, folder_path: str, diff --git a/doc-serve-server/doc_serve_server/indexing/embedding.py b/doc-serve-server/doc_serve_server/indexing/embedding.py index fa0962a..9e4b23a 100644 --- a/doc-serve-server/doc_serve_server/indexing/embedding.py +++ b/doc-serve-server/doc_serve_server/indexing/embedding.py @@ -4,6 +4,7 @@ from collections.abc import Awaitable, Callable from typing import Optional +from anthropic import AsyncAnthropic from openai import AsyncOpenAI from doc_serve_server.config import settings @@ -43,6 +44,22 @@ def __init__( api_key=api_key or settings.OPENAI_API_KEY, ) + # Initialize Anthropic client for summarization + self.anthropic_client = AsyncAnthropic( + api_key=settings.ANTHROPIC_API_KEY, + ) + + # Initialize prompt template + self.summary_prompt_template = ( + "You are an expert software engineer analyzing source code. " + "Provide a concise 1-2 sentence summary of what this code does. " + "Focus on the functionality, purpose, and behavior. " + "Be specific about inputs, outputs, and side effects. " + "Ignore implementation details and focus on what the code accomplishes.\n\n" + "Code to summarize:\n{context_str}\n\n" + "Summary:" + ) + async def embed_text(self, text: str) -> list[float]: """ Generate embedding for a single text. @@ -157,6 +174,98 @@ def get_embedding_dimensions(self) -> int: } return model_dimensions.get(self.model, settings.EMBEDDING_DIMENSIONS) + def _get_summary_prompt_template(self) -> str: + """ + Get the prompt template for code summarization. + + Returns: + Prompt template string. + """ + template = ( + "You are an expert software engineer analyzing source code. " + "Provide a concise 1-2 sentence summary of what this code does. " + "Focus on the functionality, purpose, and behavior. " + "Be specific about inputs, outputs, and side effects. " + "Ignore implementation details and focus on what the code accomplishes.\n\n" + "Code to summarize:\n{context_str}\n\n" + "Summary:" + ) + return template + + async def generate_summary(self, code_text: str) -> str: + """ + Generate a natural language summary of code using Claude. + + Args: + code_text: The source code to summarize. + + Returns: + Natural language summary of the code's functionality. + """ + try: + # Use Claude directly with custom prompt + prompt = self.summary_prompt_template.format(context_str=code_text) + + response = await self.anthropic_client.messages.create( + model=settings.CLAUDE_MODEL, + max_tokens=300, + temperature=0.1, # Low temperature for consistent summaries + messages=[ + { + "role": "user", + "content": prompt + } + ] + ) + + # Extract text from Claude response + summary = response.content[0].text # type: ignore + + if summary and len(summary) > 10: # Ensure we got a meaningful summary + return summary + else: + logger.warning("Claude returned empty or too short summary") + return self._extract_fallback_summary(code_text) + + except Exception as e: + logger.error(f"Failed to generate code summary: {e}") + # Fallback: try to extract from docstrings/comments + return self._extract_fallback_summary(code_text) + + def _extract_fallback_summary(self, code_text: str) -> str: + """ + Extract summary from docstrings or comments as fallback. + + Args: + code_text: Source code to analyze. + + Returns: + Extracted summary or empty string. + """ + import re + + # Try to find Python docstrings + docstring_match = re.search(r'""".*?"""', code_text, re.DOTALL) + if docstring_match: + docstring = docstring_match.group(0)[3:-3] # Remove leading/trailing """ + if len(docstring) > 10: # Only use if substantial + return docstring[:200] + "..." if len(docstring) > 200 else docstring + + # Try to find function/class comments + comment_match = re.search( + r'#.*(?:function|class|method|def)', code_text, re.IGNORECASE + ) + if comment_match: + return comment_match.group(0).strip('#').strip() + + # Last resort: first line if it looks like a comment + lines = code_text.strip().split('\n') + first_line = lines[0].strip() + if first_line.startswith(('#', '//', '/*')): + return first_line.lstrip('#/*').strip() + + return "" # No summary available + # Singleton instance _embedding_generator: Optional[EmbeddingGenerator] = None diff --git a/doc-serve-server/doc_serve_server/models/health.py b/doc-serve-server/doc_serve_server/models/health.py index 710b702..3511309 100644 --- a/doc-serve-server/doc_serve_server/models/health.py +++ b/doc-serve-server/doc_serve_server/models/health.py @@ -53,6 +53,20 @@ class IndexingStatus(BaseModel): ge=0, description="Total number of chunks in vector store", ) + total_doc_chunks: int = Field( + default=0, + ge=0, + description="Number of document chunks", + ) + total_code_chunks: int = Field( + default=0, + ge=0, + description="Number of code chunks", + ) + supported_languages: list[str] = Field( + default_factory=list, + description="Programming languages that have been indexed", + ) indexing_in_progress: bool = Field( default=False, description="Whether indexing is currently in progress", @@ -82,11 +96,14 @@ class IndexingStatus(BaseModel): { "total_documents": 150, "total_chunks": 1200, + "total_doc_chunks": 800, + "total_code_chunks": 400, "indexing_in_progress": False, "current_job_id": None, "progress_percent": 0.0, "last_indexed_at": "2024-12-15T10:30:00Z", "indexed_folders": ["/path/to/docs"], + "supported_languages": ["python", "typescript", "java"], } ] } diff --git a/doc-serve-server/doc_serve_server/models/index.py b/doc-serve-server/doc_serve_server/models/index.py index e62647c..107ee9a 100644 --- a/doc-serve-server/doc_serve_server/models/index.py +++ b/doc-serve-server/doc_serve_server/models/index.py @@ -7,6 +7,13 @@ from pydantic import BaseModel, Field +class CodeChunkStrategy(str, Enum): + """Strategy for chunking code files.""" + + AST_AWARE = "ast_aware" # Use LlamaIndex CodeSplitter for AST boundaries + TEXT_BASED = "text_based" # Use regular text chunking + + class IndexingStatusEnum(str, Enum): """Enumeration of indexing status values.""" @@ -41,6 +48,37 @@ class IndexRequest(BaseModel): description="Whether to scan folder recursively", ) + # Code indexing options + include_code: bool = Field( + default=False, + description="Whether to index source code files alongside documents", + ) + supported_languages: Optional[list[str]] = Field( + default=None, + description="Programming languages to index (defaults to all supported)", + examples=[["python", "typescript"], ["java", "kotlin"]], + ) + code_chunk_strategy: CodeChunkStrategy = Field( + default=CodeChunkStrategy.AST_AWARE, + description="Strategy for chunking code files", + ) + generate_summaries: bool = Field( + default=False, + description="Generate LLM summaries for code chunks to improve semantic search", + ) + + # File filtering options + include_patterns: Optional[list[str]] = Field( + default=None, + description="Additional file patterns to include (supports wildcards)", + examples=[["*.md", "*.py"], ["docs/**/*.md", "src/**/*.py"]], + ) + exclude_patterns: Optional[list[str]] = Field( + default=None, + description="Additional file patterns to exclude (supports wildcards)", + examples=[["*.log", "__pycache__/**"], ["node_modules/**", "*.tmp"]], + ) + model_config = { "json_schema_extra": { "examples": [ @@ -49,7 +87,24 @@ class IndexRequest(BaseModel): "chunk_size": 512, "chunk_overlap": 50, "recursive": True, - } + }, + { + "folder_path": "/path/to/project", + "chunk_size": 512, + "chunk_overlap": 50, + "recursive": True, + "include_code": True, + "supported_languages": ["python", "typescript", "javascript"], + "code_chunk_strategy": "ast_aware", + "include_patterns": ["docs/**/*.md", "src/**/*.py", "src/**/*.ts"], + "exclude_patterns": ["node_modules/**", "__pycache__/**", "*.log"], + }, + { + "folder_path": "/path/to/codebase", + "include_code": True, + "supported_languages": ["java", "kotlin"], + "code_chunk_strategy": "ast_aware", + }, ] } } diff --git a/doc-serve-server/doc_serve_server/models/query.py b/doc-serve-server/doc_serve_server/models/query.py index 2a89309..f7f991f 100644 --- a/doc-serve-server/doc_serve_server/models/query.py +++ b/doc-serve-server/doc_serve_server/models/query.py @@ -3,7 +3,9 @@ from enum import Enum from typing import Any, Optional -from pydantic import BaseModel, Field +from pydantic import BaseModel, Field, field_validator + +from ..indexing.document_loader import LanguageDetector class QueryMode(str, Enum): @@ -46,6 +48,42 @@ class QueryRequest(BaseModel): description="Weight for hybrid search (1.0 = pure vector, 0.0 = pure bm25)", ) + # Content filtering + source_types: list[str] | None = Field( + default=None, + description="Filter by source types: 'doc', 'code', 'test'", + examples=[["doc"], ["code"], ["doc", "code"]], + ) + languages: list[str] | None = Field( + default=None, + description="Filter by programming languages for code files", + examples=[["python"], ["typescript", "javascript"], ["java", "kotlin"]], + ) + file_paths: list[str] | None = Field( + default=None, + description="Filter by specific file paths (supports wildcards)", + examples=[["docs/*.md"], ["src/**/*.py"]], + ) + + @field_validator('languages') + @classmethod + def validate_languages(cls, v: Optional[list[str]]) -> Optional[list[str]]: + """Validate that provided languages are supported.""" + if v is None: + return v + + detector = LanguageDetector() + supported_languages = detector.get_supported_languages() + + invalid_languages = [lang for lang in v if lang not in supported_languages] + if invalid_languages: + raise ValueError( + f"Unsupported languages: {invalid_languages}. " + f"Supported languages: {supported_languages}" + ) + + return v + model_config = { "json_schema_extra": { "examples": [ @@ -55,7 +93,19 @@ class QueryRequest(BaseModel): "similarity_threshold": 0.7, "mode": "hybrid", "alpha": 0.5, - } + }, + { + "query": "implement user authentication", + "top_k": 10, + "source_types": ["code"], + "languages": ["python", "typescript"], + }, + { + "query": "API endpoints", + "top_k": 5, + "source_types": ["doc", "code"], + "file_paths": ["docs/api/*.md", "src/**/*.py"], + }, ] } } @@ -67,13 +117,21 @@ class QueryResult(BaseModel): text: str = Field(..., description="The chunk text content") source: str = Field(..., description="Source file path") score: float = Field(..., description="Primary score (rank or similarity)") - vector_score: Optional[float] = Field( + vector_score: float | None = Field( default=None, description="Score from vector search" ) - bm25_score: Optional[float] = Field( - default=None, description="Score from BM25 search" - ) + bm25_score: float | None = Field(default=None, description="Score from BM25 search") chunk_id: str = Field(..., description="Unique chunk identifier") + + # Content type information + source_type: str = Field( + default="doc", description="Type of content: 'doc', 'code', or 'test'" + ) + language: str | None = Field( + default=None, description="Programming language for code files" + ) + + # Additional metadata metadata: dict[str, Any] = Field( default_factory=dict, description="Additional metadata" ) @@ -109,11 +167,24 @@ class QueryResponse(BaseModel): "vector_score": 0.92, "bm25_score": 0.85, "chunk_id": "chunk_abc123", + "source_type": "doc", + "language": "markdown", "metadata": {"chunk_index": 0}, - } + }, + { + "text": "def authenticate_user(username, password):", + "source": "src/auth.py", + "score": 0.88, + "vector_score": 0.88, + "bm25_score": 0.82, + "chunk_id": "chunk_def456", + "source_type": "code", + "language": "python", + "metadata": {"symbol_name": "authenticate_user"}, + }, ], "query_time_ms": 125.5, - "total_results": 1, + "total_results": 2, } ] } diff --git a/doc-serve-server/doc_serve_server/services/indexing_service.py b/doc-serve-server/doc_serve_server/services/indexing_service.py index 361f5f1..6d3635f 100644 --- a/doc-serve-server/doc_serve_server/services/indexing_service.py +++ b/doc-serve-server/doc_serve_server/services/indexing_service.py @@ -6,7 +6,7 @@ import uuid from collections.abc import Awaitable from datetime import datetime, timezone -from typing import Any, Callable, Optional +from typing import Any, Callable, Optional, Union from llama_index.core.schema import TextNode @@ -17,6 +17,7 @@ EmbeddingGenerator, get_bm25_manager, ) +from doc_serve_server.indexing.chunking import CodeChunk, CodeChunker, TextChunk from doc_serve_server.models import IndexingState, IndexingStatusEnum, IndexRequest from doc_serve_server.storage import VectorStoreManager, get_vector_store @@ -160,9 +161,10 @@ async def _run_indexing_pipeline( f"Normalizing indexing path: {request.folder_path} -> {abs_folder_path}" ) - documents = await self.document_loader.load_from_folder( + documents = await self.document_loader.load_files( abs_folder_path, recursive=request.recursive, + include_code=request.include_code, ) self._state.total_documents = len(documents) @@ -175,25 +177,146 @@ async def _run_indexing_pipeline( self._state.completed_at = datetime.now(timezone.utc) return - # Step 2: Chunk documents + # Step 2: Chunk documents and code files if progress_callback: await progress_callback(20, 100, "Chunking documents...") - # Create chunker with request configuration - chunker = ContextAwareChunker( - chunk_size=request.chunk_size, - chunk_overlap=request.chunk_overlap, + # Separate documents by type + doc_documents = [ + d for d in documents if d.metadata.get("source_type") == "doc" + ] + code_documents = [ + d for d in documents if d.metadata.get("source_type") == "code" + ] + + logger.info( + f"Processing {len(doc_documents)} documents and " + f"{len(code_documents)} code files" ) - async def chunk_progress(processed: int, total: int) -> None: - self._state.processed_documents = processed - if progress_callback: - pct = 20 + int((processed / total) * 30) - await progress_callback(pct, 100, f"Chunking: {processed}/{total}") + all_chunks: list[Union[TextChunk, CodeChunk]] = [] + total_to_process = len(documents) - chunks = await chunker.chunk_documents(documents, chunk_progress) + # Chunk documents + doc_chunker = None + if doc_documents: + doc_chunker = ContextAwareChunker( + chunk_size=request.chunk_size, + chunk_overlap=request.chunk_overlap, + ) + + async def doc_chunk_progress(processed: int, total: int) -> None: + self._state.processed_documents = processed + if progress_callback: + pct = 20 + int((processed / total_to_process) * 15) + await progress_callback( + pct, 100, f"Chunking docs: {processed}/{total}" + ) + + doc_chunks = await doc_chunker.chunk_documents( + doc_documents, doc_chunk_progress + ) + all_chunks.extend(doc_chunks) + logger.info(f"Created {len(doc_chunks)} document chunks") + + # Chunk code files + if code_documents: + # Group code documents by language for efficient chunking + code_by_language: dict[str, list[Any]] = {} + for doc in code_documents: + lang = doc.metadata.get("language", "unknown") + if lang not in code_by_language: + code_by_language[lang] = [] + code_by_language[lang].append(doc) + + # Track total code documents processed across all languages + total_code_processed = 0 + + for lang, lang_docs in code_by_language.items(): + if lang == "unknown": + logger.warning( + f"Skipping {len(lang_docs)} code files with unknown " + "language" + ) + continue + + try: + code_chunker = CodeChunker( + language=lang, + generate_summaries=request.generate_summaries + ) + + # Create progress callback with fixed offset for this language + def make_progress_callback( + offset: int + ) -> Callable[[int, int], Awaitable[None]]: + async def progress_callback_fn( + processed: int, + total: int, + ) -> None: + # processed is relative to current language batch + # Convert to total documents processed across + # all languages + total_processed = offset + processed + self._state.processed_documents = total_processed + if progress_callback: + pct = 35 + int( + (total_processed / total_to_process) * 15 + ) + await progress_callback( + pct, + 100, + f"Chunking code: {total_processed}/" + f"{total_to_process}", + ) + return progress_callback_fn + + # Calculate offset and create callback for this language batch + progress_offset = len(doc_documents) + total_code_processed + code_chunk_progress = make_progress_callback(progress_offset) # noqa: F841 + + for doc in lang_docs: + code_chunks = await code_chunker.chunk_code_document(doc) + all_chunks.extend(code_chunks) + + # Update the total code documents processed + total_code_processed += len(lang_docs) + + chunk_count = sum( + 1 for c in all_chunks if c.metadata.language == lang + ) + logger.info(f"Created {chunk_count} {lang} chunks") + + except Exception as e: + logger.error(f"Failed to chunk {lang} files: {e}") + # Fallback: treat as documents + if doc_chunker is not None: # Reuse doc chunker if available + fallback_chunks = await doc_chunker.chunk_documents( + lang_docs + ) + all_chunks.extend(fallback_chunks) + logger.info( + f"Fell back to document chunking for " + f"{len(fallback_chunks)} {lang} files" + ) + else: + # Create a temporary chunker for fallback + fallback_chunker = ContextAwareChunker( + chunk_size=request.chunk_size, + chunk_overlap=request.chunk_overlap, + ) + fallback_chunks = await fallback_chunker.chunk_documents( + lang_docs + ) + all_chunks.extend(fallback_chunks) + logger.info( + f"Fell back to document chunking for " + f"{len(fallback_chunks)} {lang} files" + ) + + chunks = all_chunks self._state.total_chunks = len(chunks) - logger.info(f"Created {len(chunks)} chunks") + logger.info(f"Created {len(chunks)} total chunks") # Step 3: Generate embeddings if progress_callback: @@ -205,7 +328,7 @@ async def embedding_progress(processed: int, total: int) -> None: await progress_callback(pct, 100, f"Embedding: {processed}/{total}") embeddings = await self.embedding_generator.embed_chunks( - chunks, + chunks, # type: ignore embedding_progress, ) logger.info(f"Generated {len(embeddings)} embeddings") @@ -218,7 +341,7 @@ async def embedding_progress(processed: int, total: int) -> None: ids=[chunk.chunk_id for chunk in chunks], embeddings=embeddings, documents=[chunk.text for chunk in chunks], - metadatas=[chunk.metadata for chunk in chunks], + metadatas=[chunk.metadata.to_dict() for chunk in chunks], ) # Step 5: Build BM25 index @@ -229,7 +352,7 @@ async def embedding_progress(processed: int, total: int) -> None: TextNode( text=chunk.text, id_=chunk.chunk_id, - metadata=chunk.metadata, + metadata=chunk.metadata.to_dict(), ) for chunk in chunks ] @@ -266,12 +389,18 @@ async def get_status(self) -> dict[str, Any]: Returns: Dictionary with status information. """ - count = ( + total_chunks = ( await self.vector_store.get_count() if self.vector_store.is_initialized else 0 ) + # TODO: Implement efficient counting of chunks by type and language + # For now, return 0 for code/doc breakdown until we implement proper tracking + total_doc_chunks = 0 # TODO: Track document chunks during indexing + total_code_chunks = 0 # TODO: Track code chunks during indexing + supported_languages: list[str] = [] # TODO: Track supported languages indexed + return { "status": self._state.status.value, "is_indexing": self._state.is_indexing, @@ -279,7 +408,10 @@ async def get_status(self) -> dict[str, Any]: "folder_path": self._state.folder_path, "total_documents": self._state.total_documents, "processed_documents": self._state.processed_documents, - "total_chunks": count, + "total_chunks": total_chunks, + "total_doc_chunks": total_doc_chunks, + "total_code_chunks": total_code_chunks, + "supported_languages": supported_languages, "progress_percent": self._state.progress_percent, "started_at": ( self._state.started_at.isoformat() if self._state.started_at else None diff --git a/doc-serve-server/doc_serve_server/services/query_service.py b/doc-serve-server/doc_serve_server/services/query_service.py index 04c53fc..bc4c27f 100644 --- a/doc-serve-server/doc_serve_server/services/query_service.py +++ b/doc-serve-server/doc_serve_server/services/query_service.py @@ -2,10 +2,9 @@ import logging import time -from typing import Optional +from typing import Any, Optional -from llama_index.core.retrievers import BaseRetriever, QueryFusionRetriever -from llama_index.core.retrievers.fusion_retriever import FUSION_MODES +from llama_index.core.retrievers import BaseRetriever from llama_index.core.schema import NodeWithScore, QueryBundle, TextNode from doc_serve_server.indexing import EmbeddingGenerator, get_embedding_generator @@ -114,6 +113,10 @@ async def execute_query(self, request: QueryRequest) -> QueryResponse: else: # HYBRID results = await self._execute_hybrid_query(request) + # Apply content filters if specified + if any([request.source_types, request.languages, request.file_paths]): + results = self._filter_results(results, request) + query_time_ms = (time.time() - start_time) * 1000 logger.debug( @@ -130,10 +133,12 @@ async def execute_query(self, request: QueryRequest) -> QueryResponse: async def _execute_vector_query(self, request: QueryRequest) -> list[QueryResult]: """Execute pure semantic search.""" query_embedding = await self.embedding_generator.embed_query(request.query) + where_clause = self._build_where_clause(request.source_types, request.languages) search_results = await self.vector_store.similarity_search( query_embedding=query_embedding, top_k=request.top_k, similarity_threshold=request.similarity_threshold, + where=where_clause, ) return [ @@ -145,10 +150,12 @@ async def _execute_vector_query(self, request: QueryRequest) -> list[QueryResult score=res.score, vector_score=res.score, chunk_id=res.chunk_id, + source_type=res.metadata.get("source_type", "doc"), + language=res.metadata.get("language"), metadata={ k: v for k, v in res.metadata.items() - if k not in ("source", "file_path") + if k not in ("source", "file_path", "source_type", "language") }, ) for res in search_results @@ -171,10 +178,12 @@ async def _execute_bm25_query(self, request: QueryRequest) -> list[QueryResult]: score=node.score or 0.0, bm25_score=node.score, chunk_id=node.node.node_id, + source_type=node.node.metadata.get("source_type", "doc"), + language=node.node.metadata.get("language"), metadata={ k: v for k, v in node.node.metadata.items() - if k not in ("source", "file_path") + if k not in ("source", "file_path", "source_type", "language") }, ) for node in nodes @@ -185,62 +194,120 @@ async def _execute_hybrid_query(self, request: QueryRequest) -> list[QueryResult # For US5, we want to provide individual scores. # We'll perform the individual searches first to get the scores. + # Get corpus size to avoid requesting more than available + corpus_size = await self.vector_store.get_count() + effective_top_k = min(request.top_k, corpus_size) + + # Build ChromaDB where clause for filtering + where_clause = self._build_where_clause(request.source_types, request.languages) + # 1. Vector Search query_embedding = await self.embedding_generator.embed_query(request.query) vector_results = await self.vector_store.similarity_search( query_embedding=query_embedding, - top_k=request.top_k * 2, # Get more to ensure overlap + top_k=effective_top_k, similarity_threshold=request.similarity_threshold, + where=where_clause, ) - vector_scores = {res.chunk_id: res.score for res in vector_results} # 2. BM25 Search bm25_results = [] if self.bm25_manager.is_initialized: - bm25_retriever = self.bm25_manager.get_retriever(top_k=request.top_k * 2) - bm25_nodes = await bm25_retriever.aretrieve(request.query) - bm25_results = bm25_nodes - bm25_scores = {node.node.node_id: node.score for node in bm25_results} - - # 3. Perform fusion using QueryFusionRetriever (still best for logic) - # But we'll use our pre-fetched results if possible, or just let it run. - # Given we need the fused ranking, we'll let it run but then enrich. - - vector_retriever = VectorManagerRetriever( - self, top_k=request.top_k, threshold=request.similarity_threshold - ) - - bm25_retriever = self.bm25_manager.get_retriever(top_k=request.top_k) - - fusion_retriever = QueryFusionRetriever( - [vector_retriever, bm25_retriever], - similarity_top_k=request.top_k, - num_queries=1, - mode=FUSION_MODES.RELATIVE_SCORE, - retriever_weights=[request.alpha, 1.0 - request.alpha], - use_async=True, - ) - - fused_nodes = await fusion_retriever.aretrieve(request.query) - - return [ - QueryResult( + # Use the new filtered search method + bm25_results = await self.bm25_manager.search_with_filters( + query=request.query, + top_k=effective_top_k, + source_types=request.source_types, + languages=request.languages, + max_results=corpus_size, + ) + # Convert BM25 results to same format as vector results + bm25_query_results = [] + for node in bm25_results: + bm25_query_results.append(QueryResult( text=node.node.get_content(), source=node.node.metadata.get( "source", node.node.metadata.get("file_path", "unknown") ), score=node.score or 0.0, - vector_score=vector_scores.get(node.node.node_id), - bm25_score=bm25_scores.get(node.node.node_id), + bm25_score=node.score, chunk_id=node.node.node_id, + source_type=node.node.metadata.get("source_type", "doc"), + language=node.node.metadata.get("language"), metadata={ - k: v - for k, v in node.node.metadata.items() - if k not in ("source", "file_path") + k: v for k, v in node.node.metadata.items() + if k not in ("source", "file_path", "source_type", "language") + }, + )) + + # 3. Simple hybrid fusion for small corpora + # Combine vector and BM25 results manually to avoid retriever complexity + + # Score normalization: bring both to 0-1 range + max_vector_score = max((r.score for r in vector_results), default=1.0) or 1.0 + max_bm25_score = max( + (r.bm25_score or 0.0 for r in bm25_query_results), default=1.0 + ) or 1.0 + + # Create combined results map + combined_results: dict[str, dict[str, Any]] = {} + + # Add vector results (convert SearchResult to QueryResult) + for res in vector_results: + query_result = QueryResult( + text=res.text, + source=res.metadata.get( + "source", res.metadata.get("file_path", "unknown") + ), + score=res.score, + vector_score=res.score, + chunk_id=res.chunk_id, + source_type=res.metadata.get("source_type", "doc"), + language=res.metadata.get("language"), + metadata={ + k: v for k, v in res.metadata.items() + if k not in ("source", "file_path", "source_type", "language") }, ) - for node in fused_nodes - ] + combined_results[res.chunk_id] = { + "result": query_result, + "vector_score": res.score / max_vector_score, + "bm25_score": 0.0, + "total_score": request.alpha * (res.score / max_vector_score), + } + + # Add/merge BM25 results + for bm25_res in bm25_query_results: + chunk_id = bm25_res.chunk_id + bm25_normalized = (bm25_res.bm25_score or 0.0) / max_bm25_score + bm25_weighted = (1.0 - request.alpha) * bm25_normalized + + if chunk_id in combined_results: + combined_results[chunk_id]["bm25_score"] = bm25_normalized + combined_results[chunk_id]["total_score"] += bm25_weighted + # Update BM25 score on existing result + combined_results[chunk_id]["result"].bm25_score = bm25_res.bm25_score + else: + combined_results[chunk_id] = { + "result": bm25_res, + "vector_score": 0.0, + "bm25_score": bm25_normalized, + "total_score": bm25_weighted, + } + + # Convert to final results + fused_nodes = [] + for _chunk_id, data in combined_results.items(): + result = data["result"] + # Update score with combined score + result.score = data["total_score"] + fused_nodes.append(result) + + # Sort by combined score and take top_k + fused_nodes.sort(key=lambda x: x.score, reverse=True) + fused_nodes = fused_nodes[:request.top_k] + + return fused_nodes async def get_document_count(self) -> int: """ @@ -253,6 +320,85 @@ async def get_document_count(self) -> int: return 0 return await self.vector_store.get_count() + def _filter_results( + self, results: list[QueryResult], request: QueryRequest + ) -> list[QueryResult]: + """ + Filter query results based on request parameters. + + Args: + results: List of query results to filter. + request: Query request with filter parameters. + + Returns: + Filtered list of results. + """ + filtered_results = results + + # Filter by source types + if request.source_types: + filtered_results = [ + r for r in filtered_results if r.source_type in request.source_types + ] + + # Filter by languages + if request.languages: + filtered_results = [ + r + for r in filtered_results + if r.language and r.language in request.languages + ] + + # Filter by file paths (with wildcard support) + if request.file_paths: + import fnmatch + + filtered_results = [ + r + for r in filtered_results + if any( + fnmatch.fnmatch(r.source, pattern) for pattern in request.file_paths + ) + ] + + return filtered_results + + def _build_where_clause( + self, + source_types: list[str] | None, + languages: list[str] | None + ) -> dict[str, Any] | None: + """ + Build ChromaDB where clause from filter parameters. + + Args: + source_types: List of source types to filter by. + languages: List of languages to filter by. + + Returns: + ChromaDB where clause dict or None. + """ + conditions: list[dict[str, Any]] = [] + + if source_types: + if len(source_types) == 1: + conditions.append({"source_type": source_types[0]}) + else: + conditions.append({"source_type": {"$in": source_types}}) + + if languages: + if len(languages) == 1: + conditions.append({"language": languages[0]}) + else: + conditions.append({"language": {"$in": languages}}) + + if not conditions: + return None + elif len(conditions) == 1: + return conditions[0] + else: + return {"$and": conditions} + # Singleton instance _query_service: Optional[QueryService] = None diff --git a/doc-serve-server/pyproject.toml b/doc-serve-server/pyproject.toml index f3c6786..d963ff8 100644 --- a/doc-serve-server/pyproject.toml +++ b/doc-serve-server/pyproject.toml @@ -24,6 +24,7 @@ pydantic-settings = "^2.6.0" python-dotenv = "^1.0.0" click = "^8.1.0" llama-index-llms-openai = "^0.6.12" +tree-sitter-language-pack = "^0.7.3" [tool.poetry.group.dev.dependencies] pytest = "^8.3.0" diff --git a/doc-serve-server/tests/integration/test_alpha_weighting.py b/doc-serve-server/tests/integration/test_alpha_weighting.py index eee8d3b..3d45794 100644 --- a/doc-serve-server/tests/integration/test_alpha_weighting.py +++ b/doc-serve-server/tests/integration/test_alpha_weighting.py @@ -38,7 +38,7 @@ def test_alpha_validation_bounds(self, client): def test_alpha_passing_to_service( self, client, mock_vector_store, mock_bm25_manager, mock_embedding_generator ): - """Test that alpha is passed correctly to the QueryFusionRetriever.""" + """Test that alpha is used correctly in manual hybrid fusion.""" from doc_serve_server.services.query_service import get_query_service service = get_query_service() @@ -47,26 +47,49 @@ def test_alpha_passing_to_service( service.embedding_generator = mock_embedding_generator mock_vector_store.is_initialized = True - mock_bm25_manager.is_initialized = True - with patch( - "doc_serve_server.services.query_service.QueryFusionRetriever" - ) as mock_fusion_cls: - mock_fusion = AsyncMock() - mock_fusion.aretrieve = AsyncMock(return_value=[]) - mock_fusion_cls.return_value = mock_fusion - - alpha_value = 0.7 - client.post( - "/query/", - json={ - "query": "alpha test", - "mode": "hybrid", - "alpha": alpha_value, + # Mock vector search results (SearchResult objects) + from doc_serve_server.storage.vector_store import SearchResult + mock_vector_store.similarity_search.return_value = [ + SearchResult( + text="Vector result", + metadata={ + "source": "v.md", + "source_type": "doc", + "language": "markdown" }, + score=0.8, + chunk_id="v1" ) + ] + + # Mock BM25 results (NodeWithScore-like objects) + mock_bm25_manager.search_with_filters = AsyncMock(return_value=[ + MagicMock( + node=MagicMock( + get_content=MagicMock(return_value="BM25 result"), + metadata={ + "source": "b.md", + "source_type": "doc", + "language": "markdown" + }, + node_id="b1" + ), + score=0.9 + ) + ]) + + alpha_value = 0.7 + response = client.post( + "/query/", + json={ + "query": "alpha test", + "mode": "hybrid", + "alpha": alpha_value, + }, + ) - # Verify retriever_weights [alpha, 1-alpha] - args, kwargs = mock_fusion_cls.call_args - assert kwargs["retriever_weights"] == [alpha_value, 1.0 - alpha_value] + assert response.status_code == 200 + # Verify that search_with_filters was called (indicating manual fusion is used) + mock_bm25_manager.search_with_filters.assert_called_once() diff --git a/doc-serve-server/tests/integration/test_api.py b/doc-serve-server/tests/integration/test_api.py index 976fe70..fa6ca4c 100644 --- a/doc-serve-server/tests/integration/test_api.py +++ b/doc-serve-server/tests/integration/test_api.py @@ -209,6 +209,8 @@ def test_query_documents_success( source="docs/test.md", score=0.92, chunk_id="chunk_abc", + source_type="doc", + language="markdown", metadata={}, ) ], diff --git a/doc-serve-server/tests/integration/test_hybrid_api.py b/doc-serve-server/tests/integration/test_hybrid_api.py index 64107c7..f970981 100644 --- a/doc-serve-server/tests/integration/test_hybrid_api.py +++ b/doc-serve-server/tests/integration/test_hybrid_api.py @@ -1,6 +1,6 @@ """Integration tests for Hybrid retrieval mode.""" -from unittest.mock import AsyncMock, MagicMock, patch +from unittest.mock import AsyncMock, MagicMock class TestHybridQueryEndpoint: @@ -20,32 +20,49 @@ def test_query_hybrid_mode( mock_vector_store.is_initialized = True mock_bm25_manager.is_initialized = True - # Mock QueryFusionRetriever - with patch( - "doc_serve_server.services.query_service.QueryFusionRetriever" - ) as mock_fusion_cls: - mock_fusion = AsyncMock() - node_mock = MagicMock() - node_mock.node.get_content.return_value = "Hybrid result" - node_mock.node.metadata = {"source": "docs/hybrid.md"} - node_mock.node.node_id = "chunk_hybrid" - node_mock.score = 0.9 - mock_fusion.aretrieve = AsyncMock(return_value=[node_mock]) - mock_fusion_cls.return_value = mock_fusion - - response = client.post( - "/query/", - json={ - "query": "hybrid query", - "mode": "hybrid", - "alpha": 0.3, + # Mock vector search results (SearchResult objects) + from doc_serve_server.storage.vector_store import SearchResult + mock_vector_store.similarity_search.return_value = [ + SearchResult( + text="Vector result", + metadata={ + "source": "docs/vector.md", + "source_type": "doc", + "language": "markdown" }, + score=0.8, + chunk_id="v1" ) + ] + + # Mock BM25 results (NodeWithScore-like objects) + mock_bm25_manager.search_with_filters = AsyncMock(return_value=[ + MagicMock( + node=MagicMock( + get_content=MagicMock(return_value="BM25 result"), + metadata={ + "source": "docs/bm25.md", + "source_type": "doc", + "language": "markdown" + }, + node_id="b1" + ), + score=0.9 + ) + ]) + + response = client.post( + "/query/", + json={ + "query": "hybrid query", + "mode": "hybrid", + "alpha": 0.3, + }, + ) assert response.status_code == 200, f"Error: {response.json()}" data = response.json() - assert data["total_results"] == 1 - assert data["results"][0]["text"] == "Hybrid result" - # Check that alpha was passed to QueryFusionRetriever via retriever_weights - args, kwargs = mock_fusion_cls.call_args - assert kwargs["retriever_weights"] == [0.3, 0.7] + assert data["total_results"] == 2 # Both vector and BM25 results + # Check that both search methods were called + mock_vector_store.similarity_search.assert_called_once() + mock_bm25_manager.search_with_filters.assert_called_once() diff --git a/doc-serve-server/tests/integration/test_unified_search.py b/doc-serve-server/tests/integration/test_unified_search.py new file mode 100644 index 0000000..8e25889 --- /dev/null +++ b/doc-serve-server/tests/integration/test_unified_search.py @@ -0,0 +1,219 @@ +"""Integration tests for unified search functionality across docs and code.""" + +from unittest.mock import AsyncMock, MagicMock + +import pytest + +from doc_serve_server.models.query import QueryMode, QueryRequest +from doc_serve_server.services.query_service import QueryService + + +class TestUnifiedSearch: + """Test unified search across documentation and source code.""" + + @pytest.mark.asyncio + async def test_sdk_cross_reference_search( + self, + mock_vector_store, + mock_bm25_manager, + mock_embedding_generator, + ): + """Test cross-reference search with SDK documentation and code. + + This simulates indexing AWS CDK docs + source and querying for patterns. + """ + # Setup service + service = QueryService( + vector_store=mock_vector_store, + embedding_generator=mock_embedding_generator, + bm25_manager=mock_bm25_manager, + ) + + mock_vector_store.is_initialized = True + mock_bm25_manager.is_initialized = True + + # Mock AWS CDK-like content: docs + code + # Simulate indexing CDK documentation + source code + + # Mock vector results (from documentation) + mock_vector_store.similarity_search.return_value = [ + type('SearchResult', (), { + 'text': ( + "S3 bucket with versioning can be created using the Bucket " + "construct with versioned=True parameter." + ), + 'metadata': { + 'source': 'docs/aws-cdk/s3.md', + 'source_type': 'doc', + 'language': 'markdown', + 'section_title': 'S3 Bucket Versioning' + }, + 'score': 0.85, + 'chunk_id': 'doc_chunk_1' + })() + ] + + # Mock BM25 results (from source code) + mock_bm25_manager.search_with_filters = AsyncMock(return_value=[ + type('NodeWithScore', (), { + 'node': type('TextNode', (), { + 'get_content': MagicMock(return_value=( + "const bucket = new s3.Bucket(this, 'MyBucket', " + "{ versioned: true });" + )), + 'metadata': { + 'source': 'src/aws-cdk-lib/aws-s3/lib/bucket.ts', + 'source_type': 'code', + 'language': 'typescript', + 'symbol_name': 'Bucket.constructor' + }, + 'node_id': 'code_chunk_1' + })(), + 'score': 0.92 + })() + ]) + + # Mock corpus size + mock_vector_store.get_count.return_value = 100 + + # Test cross-reference query + request = QueryRequest( + query="S3 bucket with versioning", + mode=QueryMode.HYBRID, + top_k=5 + ) + + response = await service.execute_query(request) + + # Verify results include both docs and code + assert response.total_results == 2 + + # Check documentation result + doc_result = next(r for r in response.results if r.source_type == 'doc') + assert 'S3 bucket with versioning' in doc_result.text + assert doc_result.source == 'docs/aws-cdk/s3.md' + assert doc_result.language == 'markdown' + + # Check code result + code_result = next(r for r in response.results if r.source_type == 'code') + assert 'versioned: true' in code_result.text + assert code_result.source == 'src/aws-cdk-lib/aws-s3/lib/bucket.ts' + assert code_result.language == 'typescript' + + @pytest.mark.asyncio + async def test_claude_skill_citation_metadata( + self, + mock_vector_store, + mock_bm25_manager, + mock_embedding_generator, + ): + """Test that results include complete metadata for Claude skill citations.""" + service = QueryService( + vector_store=mock_vector_store, + embedding_generator=mock_embedding_generator, + bm25_manager=mock_bm25_manager, + ) + + mock_vector_store.is_initialized = True + mock_bm25_manager.is_initialized = True + + # Mock result with complete citation metadata + mock_vector_store.similarity_search.return_value = [ + type('SearchResult', (), { + 'text': "def authenticate_user(username: str, password: str) -> bool:", + 'metadata': { + 'source': 'src/auth/service.py', + 'source_type': 'code', + 'language': 'python', + 'symbol_name': 'authenticate_user', + 'start_line': 45, + 'end_line': 52, + 'docstring': 'Authenticate a user with credentials' + }, + 'score': 0.9, + 'chunk_id': 'code_chunk_1' + })() + ] + + mock_bm25_manager.search_with_filters = AsyncMock(return_value=[]) + mock_vector_store.get_count.return_value = 50 + + request = QueryRequest(query="user authentication") + response = await service.execute_query(request) + + # Verify complete citation metadata + result = response.results[0] + assert result.source == 'src/auth/service.py' + assert result.source_type == 'code' + assert result.language == 'python' + assert 'symbol_name' in result.metadata + assert 'start_line' in result.metadata + assert 'docstring' in result.metadata + + @pytest.mark.asyncio + async def test_tutorial_writing_workflow( + self, + mock_vector_store, + mock_bm25_manager, + mock_embedding_generator, + ): + """Test queries that would support tutorial writing workflow.""" + service = QueryService( + vector_store=mock_vector_store, + embedding_generator=mock_embedding_generator, + bm25_manager=mock_bm25_manager, + ) + + mock_vector_store.is_initialized = True + mock_bm25_manager.is_initialized = True + + # Mock tutorial-relevant content + mock_vector_store.similarity_search.return_value = [ + type('SearchResult', (), { + 'text': ( + "# Getting Started with Authentication\n\n" + "First, import the auth module and create a service instance." + ), + 'metadata': { + 'source': 'docs/tutorials/auth-getting-started.md', + 'source_type': 'doc', + 'language': 'markdown', + 'content_type': 'tutorial' + }, + 'score': 0.88, + 'chunk_id': 'tutorial_doc' + })(), + type('SearchResult', (), { + 'text': ( + "from auth_sdk import AuthenticationService\n" + "service = AuthenticationService()" + ), + 'metadata': { + 'source': 'examples/python/auth_quickstart.py', + 'source_type': 'code', + 'language': 'python', + 'symbol_name': 'example_usage' + }, + 'score': 0.82, + 'chunk_id': 'tutorial_code' + })() + ] + + mock_bm25_manager.search_with_filters = AsyncMock(return_value=[]) + mock_vector_store.get_count.return_value = 75 + + request = QueryRequest( + query="getting started with authentication tutorial", + mode=QueryMode.HYBRID + ) + + response = await service.execute_query(request) + + # Should return both tutorial docs and example code + assert response.total_results == 2 + + doc_result = next(r for r in response.results if r.source_type == 'doc') + code_result = next(r for r in response.results if r.source_type == 'code') + + assert 'tutorial' in doc_result.metadata.get('content_type', '') + assert code_result.language == 'python' diff --git a/doc-serve-server/tests/unit/test_hybrid_fusion.py b/doc-serve-server/tests/unit/test_hybrid_fusion.py index d09f1e8..827ad83 100644 --- a/doc-serve-server/tests/unit/test_hybrid_fusion.py +++ b/doc-serve-server/tests/unit/test_hybrid_fusion.py @@ -1,6 +1,6 @@ """Unit tests for Hybrid retrieval functionality.""" -from unittest.mock import AsyncMock, MagicMock, patch +from unittest.mock import AsyncMock, MagicMock import pytest @@ -26,49 +26,46 @@ async def test_hybrid_query_logic( mock_vector_store.is_initialized = True mock_bm25_manager.is_initialized = True - # Mock vector search results + # Mock vector search results (SearchResult objects) + from doc_serve_server.storage.vector_store import SearchResult mock_vector_store.similarity_search.return_value = [ - MagicMock( + SearchResult( text="Vector Result", - chunk_id="v1", - metadata={"source": "v.md"}, + metadata={ + "source": "v.md", + "source_type": "doc", + "language": "markdown" + }, score=0.8, + chunk_id="v1" ) ] - # Mock BM25 results - bm25_retriever_mock = AsyncMock() - node_mock = MagicMock() - node_mock.node.get_content.return_value = "BM25 Result" - node_mock.node.metadata = {"source": "b.md"} - node_mock.node.node_id = "b1" - node_mock.score = 0.9 - bm25_retriever_mock.aretrieve.return_value = [node_mock] - mock_bm25_manager.get_retriever.return_value = bm25_retriever_mock - - request = QueryRequest(query="test query", mode=QueryMode.HYBRID, alpha=0.5) + # Mock BM25 search_with_filters method + mock_bm25_manager.search_with_filters = AsyncMock(return_value=[ + MagicMock( + node=MagicMock( + get_content=MagicMock(return_value="BM25 Result"), + metadata={ + "source": "b.md", + "source_type": "doc", + "language": "markdown" + }, + node_id="b1" + ), + score=0.9 + ) + ]) - # We need to mock QueryFusionRetriever.aretrieve because it's hard to unit test - # the internal fusion without heavy LlamaIndex dependencies setup - with patch( - "doc_serve_server.services.query_service.QueryFusionRetriever" - ) as mock_fusion_cls: - mock_fusion = AsyncMock() - fusion_node = MagicMock() - fusion_node.node.get_content.return_value = "Fused Result" - fusion_node.node.metadata = {"source": "f.md"} - fusion_node.node.node_id = "f1" - fusion_node.score = 0.85 - mock_fusion.aretrieve.return_value = [fusion_node] - mock_fusion_cls.return_value = mock_fusion + # Mock get_count for corpus size + mock_vector_store.get_count.return_value = 10 - response = await service.execute_query(request) + request = QueryRequest(query="test query", mode=QueryMode.HYBRID, alpha=0.5) - assert response.total_results == 1 - assert response.results[0].text == "Fused Result" - mock_fusion_cls.assert_called_once() + response = await service.execute_query(request) - # Verify alpha was passed correctly to retriever_weights - args, kwargs = mock_fusion_cls.call_args - assert kwargs["retriever_weights"] == [0.5, 0.5] - assert kwargs["mode"].value == "relative_score" + assert response.total_results == 2 # Both vector and BM25 results + assert len(response.results) == 2 + # Check that manual fusion was used (both search methods called) + mock_vector_store.similarity_search.assert_called_once() + mock_bm25_manager.search_with_filters.assert_called_once() diff --git a/doc-svr-ctl/doc_svr_ctl/client/api_client.py b/doc-svr-ctl/doc_svr_ctl/client/api_client.py index a564b62..66a71cc 100644 --- a/doc-svr-ctl/doc_svr_ctl/client/api_client.py +++ b/doc-svr-ctl/doc_svr_ctl/client/api_client.py @@ -209,6 +209,9 @@ def query( similarity_threshold: float = 0.7, mode: str = "hybrid", alpha: float = 0.5, + source_types: Optional[list[str]] = None, + languages: Optional[list[str]] = None, + file_paths: Optional[list[str]] = None, ) -> QueryResponse: """ Query indexed documents. @@ -219,21 +222,28 @@ def query( similarity_threshold: Minimum similarity score. mode: Retrieval mode (vector, bm25, hybrid). alpha: Hybrid search weighting (1.0=vector, 0.0=bm25). + source_types: Filter by source types (doc, code, test). + languages: Filter by programming languages. + file_paths: Filter by file path patterns. Returns: QueryResponse with matching results. """ - data = self._request( - "POST", - "/query/", - json={ - "query": query_text, - "top_k": top_k, - "similarity_threshold": similarity_threshold, - "mode": mode, - "alpha": alpha, - }, - ) + request_data = { + "query": query_text, + "top_k": top_k, + "similarity_threshold": similarity_threshold, + "mode": mode, + "alpha": alpha, + } + if source_types is not None: + request_data["source_types"] = source_types + if languages is not None: + request_data["languages"] = languages + if file_paths is not None: + request_data["file_paths"] = file_paths + + data = self._request("POST", "/query/", json=request_data) results = [ QueryResult( @@ -260,15 +270,27 @@ def index( chunk_size: int = 512, chunk_overlap: int = 50, recursive: bool = True, + include_code: bool = False, + supported_languages: Optional[list[str]] = None, + code_chunk_strategy: str = "ast_aware", + include_patterns: Optional[list[str]] = None, + exclude_patterns: Optional[list[str]] = None, + generate_summaries: bool = False, ) -> IndexResponse: """ - Start indexing documents from a folder. + Start indexing documents and optionally code from a folder. Args: folder_path: Path to folder with documents. chunk_size: Target chunk size in tokens. chunk_overlap: Overlap between chunks. recursive: Whether to scan recursively. + include_code: Whether to index source code files. + supported_languages: Languages to index (defaults to all). + code_chunk_strategy: Strategy for code chunking. + include_patterns: Additional include patterns. + exclude_patterns: Additional exclude patterns. + generate_summaries: Generate LLM summaries for code chunks. Returns: IndexResponse with job ID. @@ -281,6 +303,12 @@ def index( "chunk_size": chunk_size, "chunk_overlap": chunk_overlap, "recursive": recursive, + "include_code": include_code, + "supported_languages": supported_languages, + "code_chunk_strategy": code_chunk_strategy, + "include_patterns": include_patterns, + "exclude_patterns": exclude_patterns, + "generate_summaries": generate_summaries, }, ) diff --git a/doc-svr-ctl/doc_svr_ctl/commands/index.py b/doc-svr-ctl/doc_svr_ctl/commands/index.py index a08f67c..aa642ce 100644 --- a/doc-svr-ctl/doc_svr_ctl/commands/index.py +++ b/doc-svr-ctl/doc_svr_ctl/commands/index.py @@ -1,6 +1,7 @@ """Index command for triggering document indexing.""" from pathlib import Path +from typing import Optional import click from rich.console import Console @@ -35,6 +36,34 @@ is_flag=True, help="Don't scan folder recursively", ) +@click.option( + "--include-code", + is_flag=True, + help="Index source code files alongside documents", +) +@click.option( + "--languages", + help="Comma-separated list of programming languages to index", +) +@click.option( + "--code-strategy", + default="ast_aware", + type=click.Choice(["ast_aware", "text_based"]), + help="Strategy for chunking code files (default: ast_aware)", +) +@click.option( + "--include-patterns", + help="Comma-separated additional include patterns (wildcards supported)", +) +@click.option( + "--exclude-patterns", + help="Comma-separated additional exclude patterns (wildcards supported)", +) +@click.option( + "--generate-summaries", + is_flag=True, + help="Generate LLM summaries for code chunks to improve semantic search", +) @click.option("--json", "json_output", is_flag=True, help="Output as JSON") def index_command( folder_path: str, @@ -42,6 +71,12 @@ def index_command( chunk_size: int, chunk_overlap: int, no_recursive: bool, + include_code: bool, + languages: Optional[str], + code_strategy: str, + include_patterns: Optional[str], + exclude_patterns: Optional[str], + generate_summaries: bool, json_output: bool, ) -> None: """Index documents from a folder. @@ -51,6 +86,21 @@ def index_command( # Resolve to absolute path folder = Path(folder_path).resolve() + # Parse comma-separated lists + languages_list = ( + [lang.strip() for lang in languages.split(",")] if languages else None + ) + include_patterns_list = ( + [pat.strip() for pat in include_patterns.split(",")] + if include_patterns + else None + ) + exclude_patterns_list = ( + [pat.strip() for pat in exclude_patterns.split(",")] + if exclude_patterns + else None + ) + try: with DocServeClient(base_url=url) as client: response = client.index( @@ -58,6 +108,12 @@ def index_command( chunk_size=chunk_size, chunk_overlap=chunk_overlap, recursive=not no_recursive, + include_code=include_code, + supported_languages=languages_list, + code_chunk_strategy=code_strategy, + include_patterns=include_patterns_list, + exclude_patterns=exclude_patterns_list, + generate_summaries=generate_summaries, ) if json_output: diff --git a/doc-svr-ctl/doc_svr_ctl/commands/query.py b/doc-svr-ctl/doc_svr_ctl/commands/query.py index 9858f15..06e892d 100644 --- a/doc-svr-ctl/doc_svr_ctl/commands/query.py +++ b/doc-svr-ctl/doc_svr_ctl/commands/query.py @@ -1,5 +1,7 @@ """Query command for searching documents.""" +from typing import Optional + import click from rich.console import Console from rich.panel import Panel @@ -49,6 +51,18 @@ @click.option("--json", "json_output", is_flag=True, help="Output as JSON") @click.option("--full", is_flag=True, help="Show full text content") @click.option("--scores", is_flag=True, help="Show individual vector/BM25 scores") +@click.option( + "--source-types", + help="Comma-separated source types to filter by (doc,code,test)", +) +@click.option( + "--languages", + help="Comma-separated programming languages to filter by", +) +@click.option( + "--file-paths", + help="Comma-separated file path patterns to filter by (wildcards supported)", +) def query_command( query_text: str, url: str, @@ -59,8 +73,22 @@ def query_command( json_output: bool, full: bool, scores: bool, + source_types: Optional[str], + languages: Optional[str], + file_paths: Optional[str], ) -> None: """Search indexed documents with natural language or keyword query.""" + # Parse comma-separated lists + source_types_list = ( + [st.strip() for st in source_types.split(",")] if source_types else None + ) + languages_list = ( + [lang.strip() for lang in languages.split(",")] if languages else None + ) + file_paths_list = ( + [fp.strip() for fp in file_paths.split(",")] if file_paths else None + ) + try: with DocServeClient(base_url=url) as client: response = client.query( @@ -69,6 +97,9 @@ def query_command( similarity_threshold=threshold, mode=mode.lower(), alpha=alpha, + source_types=source_types_list, + languages=languages_list, + file_paths=file_paths_list, ) if json_output: diff --git a/docs/DEVELOPERS_GUIDE.md b/docs/DEVELOPERS_GUIDE.md index 63fcf6c..5d79741 100644 --- a/docs/DEVELOPERS_GUIDE.md +++ b/docs/DEVELOPERS_GUIDE.md @@ -13,12 +13,13 @@ This guide covers setting up a development environment, understanding the archit - [Code Style](#code-style) - [Contributing](#contributing) - [Troubleshooting](#troubleshooting) +- [Adding Support for New Languages](#adding-support-for-new-languages) --- ## Architecture Overview -Doc-Serve is a RAG (Retrieval-Augmented Generation) system for semantic document search. +Doc-Serve is a RAG (Retrieval-Augmented Generation) system for semantic search across documentation and source code. ```mermaid flowchart TB @@ -38,10 +39,10 @@ flowchart TB QueryService["Query Service"] end - subgraph Indexing["Document Processing"] - Loader["Document Loader
(LlamaIndex)"] - Chunker["Context-Aware Chunking
(Stable Hash ID)"] - Embedder["Embedding Generator"] + subgraph Indexing["Content Processing"] + Loader["Document & Code Loader
(LlamaIndex + Tree-sitter)"] + Chunker["AST-Aware Chunking
(Stable Hash ID)"] + Embedder["Embedding Generator
(+ LLM Summaries)"] end subgraph AI["AI Models"] @@ -54,10 +55,11 @@ flowchart TB end end - subgraph Documents["Document Sources"] + subgraph Documents["Content Sources"] MD["Markdown Files"] TXT["Text Files"] PDF["PDF Files"] + Code["Source Code
10+ Languages"] end CLI -->|HTTP| FastAPI @@ -160,3 +162,146 @@ This usually means you are running the tool without installing it or the `PYTHON ### Duplicated Results in Query **Solution**: The system uses stable IDs based on file path and chunk index. If you see duplicates, run `doc-svr-ctl reset --yes` to clear the old index and re-index. + +--- + +## Code Ingestion & Language Support + +Doc-Serve supports AST-aware code chunking for 10+ programming languages using tree-sitter. The current implementation includes: **Python, TypeScript, JavaScript, Java, Kotlin, C, C++, Go, Rust, Swift**. + +Adding support for new programming languages is straightforward: + +### Recommended Package: tree-sitter-language-pack + +Use [`tree-sitter-language-pack`](https://pypi.org/project/tree-sitter-language-pack/) - a maintained fork with 160+ pre-built language grammars. + +**Advantages:** +- Pre-compiled binaries (no C compiler needed) +- 160+ languages in a single dependency +- Permissive licensing (no GPL dependencies) +- Aligned with tree-sitter 0.25.x + +**Installation:** +```bash +pip install tree-sitter-language-pack +``` + +### Simple API + +```python +from tree_sitter_language_pack import get_language, get_parser + +# Get parser for any supported language +parser = get_parser('rust') +language = get_language('rust') + +# Parse code +tree = parser.parse(b"fn main() { println!(\"Hello\"); }") +``` + +### Step-by-Step: Adding a New Language + +**Step 1: Verify language support** +```python +from tree_sitter_language_pack import get_language + +try: + lang = get_language('ruby') + print("Ruby is supported!") +except Exception: + print("Ruby not available") +``` + +**Step 2: Update extension mapping** + +In `doc_serve_server/indexing/document_loader.py`: + +```python +# Add to CODE_EXTENSIONS +CODE_EXTENSIONS: set[str] = { + ".py", ".ts", ".tsx", ".js", ".jsx", + ".rb", # NEW: Ruby +} + +# Add to EXTENSION_TO_LANGUAGE +EXTENSION_TO_LANGUAGE = { + # ... existing mappings ... + ".rb": "ruby", +} +``` + +**Step 3: Register with CodeChunker** + +In `doc_serve_server/indexing/code_chunker.py`: + +```python +class CodeChunker: + SUPPORTED_LANGUAGES = [ + "python", "typescript", "javascript", + "ruby", # NEW + ] +``` + +**Step 4: Add language-specific config (optional)** + +```python +LANGUAGE_CHUNK_CONFIG = { + "python": {"chunk_lines": 50, "overlap": 20}, + "ruby": {"chunk_lines": 50, "overlap": 20}, # NEW + "java": {"chunk_lines": 80, "overlap": 30}, # Verbose + "c": {"chunk_lines": 40, "overlap": 15}, +} +``` + +### Available Languages (160+) + +| Category | Languages | +|----------|-----------| +| Systems | C, C++, Rust, Go, Zig | +| JVM | Java, Kotlin, Scala, Groovy | +| Scripting | Python, Ruby, Perl, Lua, PHP | +| Web | JavaScript, TypeScript, HTML, CSS | +| Functional | Haskell, OCaml, Elixir, Erlang, Clojure | +| Data | SQL, JSON, YAML, TOML, XML | +| Config | Dockerfile, Terraform (HCL), Nix | +| Shell | Bash, Fish, PowerShell | +| Scientific | R, Julia, Fortran | +| Mobile | Swift, Objective-C | + +### Alternative: Individual Packages + +For minimal dependencies, use individual tree-sitter packages: + +```bash +pip install tree-sitter-python tree-sitter-javascript +``` + +```python +import tree_sitter_python as tspython +from tree_sitter import Language, Parser + +PY_LANGUAGE = Language(tspython.language()) +parser = Parser(PY_LANGUAGE) +``` + +### Alternative: tree-sitter-languages + +The original [`tree-sitter-languages`](https://pypi.org/project/tree-sitter-languages/) package (40+ languages): + +```bash +pip install tree-sitter-languages +``` + +```python +from tree_sitter_languages import get_language, get_parser + +language = get_language('python') +parser = get_parser('python') +``` + +### References + +- [tree-sitter-language-pack on PyPI](https://pypi.org/project/tree-sitter-language-pack/) +- [tree-sitter-languages on PyPI](https://pypi.org/project/tree-sitter-languages/) +- [tree-sitter-languages GitHub](https://github.com/grantjenks/py-tree-sitter-languages) +- [Tree-sitter Documentation](https://tree-sitter.github.io) diff --git a/docs/QUICK_START.md b/docs/QUICK_START.md index bc3a468..1e8fcfb 100644 --- a/docs/QUICK_START.md +++ b/docs/QUICK_START.md @@ -37,13 +37,32 @@ doc-serve ``` *Keep this terminal open or run in the background with `doc-serve &`.* -## 4. Index Documents +## 4. Index Documents and Code -Use the CLI tool to index a folder of documents (Markdown, TXT, PDF, etc.): +Doc-Serve can index both documentation and source code for unified search: +### Index Documentation Only (Default) ```bash -# Example: Index the coffee brewing test docs -doc-svr-ctl index ./e2e/fixtures/test_docs/coffee_brewing +# Index documentation files (Markdown, TXT, PDF, etc.) +doc-svr-ctl index ./docs +``` + +### Index Code + Documentation (Recommended) +```bash +# Index both documentation and source code files +doc-svr-ctl index ./my-project --include-code +``` + +### Advanced Indexing Options +```bash +# Index specific programming languages +doc-svr-ctl index ./src --include-code --languages python,typescript + +# Use AST-aware chunking for better code understanding +doc-svr-ctl index ./src --include-code --code-strategy ast_aware + +# Generate LLM summaries for code chunks (improves semantic search) +doc-svr-ctl index ./src --include-code --generate-summaries ``` Check the status to ensure indexing is complete: @@ -82,14 +101,39 @@ doc-svr-ctl query "brewing methods" --mode hybrid --scores doc-svr-ctl query "coffee temperature" --top-k 10 --threshold 0.3 ``` +### Code-Aware Search (with Code Ingestion) + +When code is indexed, you can perform cross-reference searches: + +```bash +# Search across both documentation and code +doc-svr-ctl query "authentication implementation" + +# Filter results by source type +doc-svr-ctl query "API endpoints" --source-types code # Code only +doc-svr-ctl query "API usage" --source-types doc # Docs only + +# Filter by programming language +doc-svr-ctl query "database connection" --languages python,typescript + +# Combine filters for precise results +doc-svr-ctl query "error handling" --source-types code --languages go +``` + +### Supported Languages +Doc-Serve supports code ingestion for: **Python, TypeScript, JavaScript, Java, Kotlin, C, C++, Go, Rust, Swift** + ## Common Commands Summary | Task | Command | |------|---------| | **Start Server** | `doc-serve` | | **Check Status** | `doc-svr-ctl status` | -| **Index Folder** | `doc-svr-ctl index /path/to/docs` | +| **Index Docs Only** | `doc-svr-ctl index /path/to/docs` | +| **Index Code + Docs** | `doc-svr-ctl index /path --include-code` | | **Semantic Search** | `doc-svr-ctl query "your question"` | | **Keyword Search** | `doc-svr-ctl query "keyword" --mode bm25` | | **Hybrid Search** | `doc-svr-ctl query "question" --mode hybrid --alpha 0.5` | +| **Filter by Source** | `doc-svr-ctl query "term" --source-types code` | +| **Filter by Language** | `doc-svr-ctl query "term" --languages python` | | **Reset Index** | `doc-svr-ctl reset --yes` | diff --git a/docs/USER_GUIDE.md b/docs/USER_GUIDE.md index 93b90aa..eaaeba8 100644 --- a/docs/USER_GUIDE.md +++ b/docs/USER_GUIDE.md @@ -15,10 +15,10 @@ This guide covers how to use Doc-Serve for document indexing and semantic search ## Core Concepts -Doc-Serve is a RAG (Retrieval-Augmented Generation) system. It works in three phases: -1. **Indexing**: It reads your documents, splits them into semantic chunks, and generates vector embeddings. -2. **Storage**: Chunks and embeddings are stored in a ChromaDB vector database. -3. **Retrieval**: When you query, it finds the most similar chunks based on semantic meaning, not just keyword matches. +Doc-Serve is a RAG (Retrieval-Augmented Generation) system that can index and search across both documentation and source code. It works in three phases: +1. **Indexing**: It reads your documents and/or source code, splits them into semantic chunks using context-aware algorithms, and generates vector embeddings. +2. **Storage**: Chunks and embeddings are stored in a ChromaDB vector database with metadata for filtering. +3. **Retrieval**: When you query, it finds the most similar chunks based on semantic meaning, with support for cross-reference searches across docs and code. ## Server Management @@ -39,20 +39,32 @@ Use the management tool to check if the server is responsive: doc-svr-ctl status ``` -## Indexing Documents +## Indexing Documents and Code -Before you can query, you must index one or more folders containing your documentation. +Doc-Serve can index both documentation and source code for unified search capabilities. -### Basic Indexing +### Index Documentation Only (Default) ```bash doc-svr-ctl index /path/to/your/docs ``` +### Index Code + Documentation +```bash +doc-svr-ctl index /path/to/your/project --include-code +``` + ### Advanced Indexing Options +**General Options:** - `--recursive` / `--no-recursive`: Whether to scan subdirectories (default: true). - `--chunk-size`: Size of text chunks in tokens (default: 512). - `--overlap`: Overlap between chunks (default: 50). +**Code-Specific Options:** +- `--include-code`: Include source code files alongside documentation. +- `--languages`: Comma-separated list of programming languages to index (e.g., `python,typescript`). +- `--code-strategy`: Chunking strategy for code (`ast_aware` or `text_based`, default: `ast_aware`). +- `--generate-summaries`: Generate LLM summaries for code chunks to improve semantic search. + ### Resetting the Index If you want to start over and clear all indexed data: ```bash @@ -82,6 +94,37 @@ doc-svr-ctl query "how do I configure the system?" - `--alpha F`: In hybrid mode, weight between vector and bm25. `1.0` is pure vector, `0.0` is pure bm25 (default: 0.5). - `--scores`: Display individual vector and BM25 scores for each result. +### Code-Aware Search (with Code Ingestion) + +When code is indexed alongside documentation, you can perform powerful cross-reference searches: + +#### Filtering by Source Type +```bash +# Search documentation only +doc-svr-ctl query "API usage examples" --source-types doc + +# Search code only +doc-svr-ctl query "database connection" --source-types code + +# Search both (default) +doc-svr-ctl query "authentication implementation" +``` + +#### Filtering by Programming Language +```bash +# Search Python code only +doc-svr-ctl query "error handling" --languages python + +# Search multiple languages +doc-svr-ctl query "API endpoints" --languages python,typescript + +# Combine filters +doc-svr-ctl query "data validation" --source-types code --languages javascript +``` + +#### Supported Languages +Doc-Serve supports AST-aware chunking for: **Python, TypeScript, JavaScript, Java, Kotlin, C, C++, Go, Rust, Swift** + ### Programmatic Output Use the `--json` flag to get raw data for piping into other tools like `jq`: ```bash diff --git a/docs/roadmaps/product-roadmap.md b/docs/roadmaps/product-roadmap.md index b1cd83b..69df0b4 100644 --- a/docs/roadmaps/product-roadmap.md +++ b/docs/roadmaps/product-roadmap.md @@ -20,16 +20,16 @@ Doc-Serve is a local-first RAG (Retrieval-Augmented Generation) service that ind ## Phase Summary -| Phase | Name | Spec ID | Status | Priority | Transport | -|-------|------|---------|--------|----------|-----------| -| 1 | Core Document RAG | 001-005 | COMPLETED | - | HTTP | -| 2 | BM25 & Hybrid Retrieval | 100 | NEXT | P1 | HTTP | -| 3 | Source Code Ingestion | 101 | Planned | P2 | HTTP | -| 4 | UDS & Claude Plugin Evolution | 102 | Future | P3 | HTTP + UDS | -| 5 | Pluggable Model Providers | 103 | Future | P3 | HTTP + UDS | -| 6 | PostgreSQL/AlloyDB Backend | 104 | Future | P4 | HTTP + UDS | -| 7 | AWS Bedrock Provider | 105 | Future | P4 | HTTP + UDS | -| 8 | Google Vertex AI Provider | 106 | Future | P4 | HTTP + UDS | +| Phase | Name | Spec ID | Status | Priority | Transport | +|-------|------|---------|-------------|----------|-----------| +| 1 | Core Document RAG | 001-005 | COMPLETED | - | HTTP | +| 2 | BM25 & Hybrid Retrieval | 100 | COMPLETED | P1 | HTTP | +| 3 | Source Code Ingestion | 101 | IN-PROGRESS | P2 | HTTP | +| 4 | UDS & Claude Plugin Evolution | 102 | Future | P3 | HTTP + UDS | +| 5 | Pluggable Model Providers | 103 | Next | P3 | HTTP + UDS | +| 6 | PostgreSQL/AlloyDB Backend | 104 | Future | P4 | HTTP + UDS | +| 7 | AWS Bedrock Provider | 105 | Future | P4 | HTTP + UDS | +| 8 | Google Vertex AI Provider | 106 | Future | P4 | HTTP + UDS | --- diff --git a/docs/roadmaps/spec-mapping.md b/docs/roadmaps/spec-mapping.md index 51ea7b8..0737187 100644 --- a/docs/roadmaps/spec-mapping.md +++ b/docs/roadmaps/spec-mapping.md @@ -9,16 +9,16 @@ Maps product roadmap phases to specification directories for traceability. ## Phase to Spec Directory Mapping -| Phase | Roadmap Section | Spec Directory | Status | Priority | -|-------|-----------------|----------------|--------|----------| -| 1 | Core Document RAG | `specs/001-005/` | COMPLETED | - | -| 2 | BM25 & Hybrid Retrieval | `specs/100-bm25-hybrid-retrieval/` | NEXT | P1 | -| 3 | Source Code Ingestion | `specs/101-code-ingestion/` | Planned | P2 | -| 4 | UDS & Claude Plugin | `specs/102-uds-claude-plugin/` | Future | P3 | -| 5 | Pluggable Providers | `specs/103-pluggable-providers/` | Future | P3 | -| 6 | PostgreSQL/AlloyDB | `specs/104-postgresql-backend/` | Future | P4 | -| 7 | AWS Bedrock | `specs/105-aws-bedrock/` | Future | P4 | -| 8 | Google Vertex AI | `specs/106-vertex-ai/` | Future | P4 | +| Phase | Roadmap Section | Spec Directory | Status | Priority | +|-------|-----------------|----------------|-------------|----------| +| 1 | Core Document RAG | `specs/001-005/` | COMPLETED | - | +| 2 | BM25 & Hybrid Retrieval | `specs/100-bm25-hybrid-retrieval/` | DONE | P1 | +| 3 | Source Code Ingestion | `specs/101-code-ingestion/` | IN-PROGRESS | P2 | +| 4 | UDS & Claude Plugin | `specs/102-uds-claude-plugin/` | Future | P3 | +| 5 | Pluggable Providers | `specs/103-pluggable-providers/` | NEXT | P3 | +| 6 | PostgreSQL/AlloyDB | `specs/104-postgresql-backend/` | Future | P4 | +| 7 | AWS Bedrock | `specs/105-aws-bedrock/` | Future | P4 | +| 8 | Google Vertex AI | `specs/106-vertex-ai/` | Future | P4 | --- diff --git a/specs/101-code-ingestion/contracts/api-extensions.md b/specs/101-code-ingestion/contracts/api-extensions.md new file mode 100644 index 0000000..abb02a3 --- /dev/null +++ b/specs/101-code-ingestion/contracts/api-extensions.md @@ -0,0 +1,393 @@ +# API Contracts: Code Ingestion + +## Overview + +Code ingestion extends the Doc-Serve API with new parameters for indexing source code and filtering search results by content type and programming language. + +## Extended Endpoints + +### POST /index (Extended) + +Index documents and/or source code with language-specific processing. + +#### Request Body (Extended) + +```json +{ + "paths": ["string"], + "recursive": true, + "chunk_size": 512, + "chunk_overlap": 50, + + // NEW: Code ingestion parameters + "include_code": false, + "languages": ["python", "typescript", "javascript"], + "exclude_patterns": ["string"] +} +``` + +#### New Parameters + +| Parameter | Type | Required | Default | Description | +|-----------|------|----------|---------|-------------| +| `include_code` | boolean | No | `false` | Enable source code file processing | +| `languages` | string[] | No | `null` | Programming languages to index | +| `exclude_patterns` | string[] | No | `null` | Glob patterns for files to exclude | + +#### Parameter Validation + +- `include_code`: Must be boolean +- `languages`: Must be subset of `["python", "typescript", "javascript"]` +- `exclude_patterns`: Must be valid glob patterns +- When `include_code=true`, at least one language must be specified + +#### Response (Unchanged) + +```json +{ + "job_id": "string", + "status": "started", + "message": "string", + "estimated_duration": "string" +} +``` + +### POST /query (Extended) + +Search with content type and language filtering. + +#### Request Body (Extended) + +```json +{ + "query": "string", + "mode": "hybrid", + "alpha": 0.5, + "top_k": 5, + "threshold": 0.7, + + // NEW: Content filtering + "source_type": "all", + "language": null +} +``` + +#### New Parameters + +| Parameter | Type | Required | Default | Description | +|-----------|------|----------|---------|-------------| +| `source_type` | string | No | `"all"` | Filter by content type | +| `language` | string | No | `null` | Filter by programming language | + +#### Valid Values + +- `source_type`: `"all"`, `"code"`, `"doc"`, `"test"` +- `language`: `null`, `"python"`, `"typescript"`, `"javascript"` + +#### Parameter Validation + +- `source_type` must be one of the valid values +- `language` must be `null` or one of the supported languages +- If `language` is specified, `source_type` should be `"code"` or `"all"` + +#### Response (Extended) + +Results now include code-specific metadata: + +```json +{ + "results": [ + { + "text": "string", + "source": "string", + "score": 0.85, + "vector_score": 0.72, + "bm25_score": 0.95, + "chunk_id": "string", + "metadata": { + // Universal fields + "chunk_id": "string", + "source": "string", + "file_name": "string", + "chunk_index": 0, + "total_chunks": 5, + "source_type": "code", + + // Code-specific fields (when source_type="code") + "language": "python", + "symbol_name": "UserService.authenticate", + "symbol_kind": "method", + "start_line": 120, + "end_line": 145, + "section_summary": "Authenticates user credentials", + "prev_section_summary": "Initializes service", + + // Document-specific fields (when source_type="doc") + "section_title": "Authentication Guide", + "heading_path": "Security > Authentication" + } + } + ], + "query_time_ms": 350.5, + "total_results": 1 +} +``` + +### GET /health/status (Extended) + +Includes code chunk counts in health status. + +#### Extended Response + +```json +{ + "total_documents": 25, + "total_chunks": 125, + "indexing_in_progress": false, + "current_job_id": null, + "progress_percent": 0.0, + "last_indexed_at": "2025-12-18T10:00:00Z", + + // NEW: Code-specific metrics + "code_chunks_count": 75, + "doc_chunks_count": 50, + "supported_languages": ["python", "typescript", "javascript", "kotlin", "c", "cpp", "java", "go", "rust", "swift"] +} +``` + +## OpenAPI Schema Extensions + +### IndexRequest Schema + +```yaml +IndexRequest: + type: object + properties: + paths: + type: array + items: + type: string + description: File/directory paths to index + recursive: + type: boolean + default: true + description: Recursively scan subdirectories + chunk_size: + type: integer + minimum: 128 + maximum: 2048 + default: 512 + description: Token size for text chunks + chunk_overlap: + type: integer + minimum: 0 + maximum: 512 + default: 50 + description: Token overlap between chunks + + # NEW: Code ingestion fields + include_code: + type: boolean + default: false + description: Enable source code file processing + languages: + type: array + items: + type: string + enum: [python, typescript, javascript, kotlin, c, cpp, java, go, rust, swift] + description: Programming languages to index + exclude_patterns: + type: array + items: + type: string + description: Glob patterns for files to exclude + required: [paths] +``` + +### QueryRequest Schema + +```yaml +QueryRequest: + type: object + properties: + query: + type: string + description: Search query text + mode: + type: string + enum: [vector, bm25, hybrid] + default: hybrid + description: Search algorithm to use + alpha: + type: number + minimum: 0.0 + maximum: 1.0 + default: 0.5 + description: Hybrid weighting (0.0=BM25, 1.0=vector) + top_k: + type: integer + minimum: 1 + maximum: 50 + default: 5 + description: Maximum results to return + threshold: + type: number + minimum: 0.0 + maximum: 1.0 + default: 0.7 + description: Minimum similarity score + + # NEW: Content filtering + source_type: + type: string + enum: [all, code, doc, test] + default: all + description: Filter by content type + language: + type: string + enum: [python, typescript, javascript, kotlin, c, cpp, java, go, rust, swift] + nullable: true + description: Filter by programming language + required: [query] +``` + +### ChunkMetadata Schema + +```yaml +ChunkMetadata: + type: object + properties: + # Universal fields + chunk_id: + type: string + description: Unique chunk identifier + source: + type: string + description: File path + file_name: + type: string + description: Base filename + chunk_index: + type: integer + description: Chunk position in file + total_chunks: + type: integer + description: Total chunks in file + source_type: + type: string + enum: [doc, code, test] + description: Content classification + + # Code-specific fields + language: + type: string + enum: [python, typescript, javascript] + description: Programming language + symbol_name: + type: string + description: Function/class/method name + symbol_kind: + type: string + enum: [module, class, function, method, variable] + description: Symbol type + start_line: + type: integer + description: Starting line number + end_line: + type: integer + description: Ending line number + section_summary: + type: string + description: AI-generated description + prev_section_summary: + type: string + description: Previous chunk description + + # Document-specific fields + section_title: + type: string + description: Section heading + heading_path: + type: string + description: Hierarchical heading path + required: [chunk_id, source, file_name, chunk_index, total_chunks, source_type] +``` + +## Backward Compatibility + +### API Compatibility +- All existing endpoints work unchanged +- New parameters are optional with sensible defaults +- Existing clients continue to function +- Response format extensions are additive + +### Data Compatibility +- Existing document chunks remain searchable +- New code chunks coexist in unified index +- Metadata extensions don't break existing queries +- Can disable code features without data migration + +## Error Responses + +### Code-Specific Errors + +#### 422 Validation Error - Invalid Language +```json +{ + "detail": [ + { + "loc": ["body", "languages", 0], + "msg": "value is not a valid enumeration member; permitted: 'python', 'typescript', 'javascript'", + "type": "enum" + } + ] +} +``` + +#### 503 Service Unavailable - Code Indexing in Progress +```json +{ + "detail": "Code indexing is currently in progress. Try again in a few minutes." +} +``` + +#### 400 Bad Request - Conflicting Parameters +```json +{ + "detail": "Cannot specify 'language' filter when source_type is 'doc'" +} +``` + +## Rate Limiting + +Code ingestion adds to existing rate limits: + +- **Indexing**: Additional 5 requests/hour for code indexing operations +- **Queries**: No change to existing query limits +- **Health**: No change to existing health check limits + +## Versioning + +- **API Version**: No major version change (additive features) +- **Schema Extensions**: Documented in OpenAPI specification +- **Breaking Changes**: None - all changes are backward compatible + +## Testing Contracts + +### Unit Test Contracts +- CodeSplitter produces correct AST boundaries +- Metadata extraction includes required fields +- Language detection works for supported extensions +- Summary generation creates meaningful descriptions + +### Integration Test Contracts +- Full indexing pipeline processes code files +- Query filtering works by source_type and language +- Cross-reference searches return both docs and code +- Health endpoints report correct chunk counts + +### End-to-End Contracts +- CLI tools accept new parameters +- API responses include code metadata +- Performance meets documented expectations +- Error handling provides actionable messages \ No newline at end of file diff --git a/specs/101-code-ingestion/data-model.md b/specs/101-code-ingestion/data-model.md new file mode 100644 index 0000000..d081243 --- /dev/null +++ b/specs/101-code-ingestion/data-model.md @@ -0,0 +1,266 @@ +# Data Model: Code Ingestion + +## Overview + +The code ingestion feature extends Doc-Serve's data model to support source code alongside documentation. This creates a unified searchable corpus where users can cross-reference between implementation and documentation. + +## Entity Relationships + +```mermaid +erDiagram + DOCUMENT ||--o{ CHUNK : contains + CODE_FILE ||--o{ CODE_CHUNK : contains + CHUNK { + string chunk_id PK + string text + string source + json metadata + } + CODE_CHUNK { + string chunk_id PK + string text + string source + json metadata + } + VECTOR_STORE ||--o{ EMBEDDING : stores + BM25_INDEX ||--o{ TERM_ENTRY : indexes + + DOCUMENT { + string path PK + string type "doc|test" + date modified + int size_bytes + } + CODE_FILE { + string path PK + string language "python|typescript|javascript" + string type "code" + date modified + int size_bytes + } + EMBEDDING { + string chunk_id FK + list[float] vector + string model "text-embedding-3-large" + } + TERM_ENTRY { + string term + list[string] chunk_ids + float bm25_score + } +``` + +## Core Entities + +### Document +Represents traditional documentation files (Markdown, PDF, etc.) + +**Fields:** +- `path` (string, PK): Full file system path +- `type` (enum): "doc" | "test" +- `modified` (datetime): Last modification timestamp +- `size_bytes` (int): File size for monitoring + +### CodeFile +Represents source code files with language-specific metadata + +**Fields:** +- `path` (string, PK): Full file system path +- `language` (enum): "python" | "typescript" | "javascript" +- `type` (string): Always "code" +- `modified` (datetime): Last modification timestamp +- `size_bytes` (int): File size for monitoring + +### Chunk (Extended) +Base chunk entity extended with unified metadata schema + +**Fields:** +- `chunk_id` (string, PK): UUID-based unique identifier +- `text` (string): Chunk content (up to 2000 chars for code) +- `source` (string): File path this chunk came from +- `metadata` (json): Rich metadata (see below) + +### CodeChunk (Specialized) +Code-specific chunk with AST-aware boundaries + +**Inherits from Chunk with additional constraints:** +- Text content respects function/class boundaries +- Metadata includes symbol information +- Chunking uses tree-sitter AST parsing + +## Metadata Schema + +### Universal Metadata (All Chunks) +```json +{ + "chunk_id": "chunk_a1b2c3d4", + "source": "/path/to/file.py", + "file_name": "file.py", + "chunk_index": 0, + "total_chunks": 5, + "source_type": "code", + "created_at": "2025-12-18T10:00:00Z" +} +``` + +### Code-Specific Metadata +```json +{ + "language": "python", + "symbol_name": "UserService.authenticate", + "symbol_kind": "method", + "start_line": 120, + "end_line": 145, + "section_summary": "Authenticates user credentials against database", + "prev_section_summary": "Initializes user service with database connection", + "docstring": "Authenticate user with username and password.\n\nReturns User object or None.", + "parameters": ["username: str", "password: str"], + "return_type": "User | None", + "decorators": ["@staticmethod"], + "imports": ["from typing import Optional", "from models import User"] +} +``` + +### Document-Specific Metadata +```json +{ + "source_type": "doc", + "language": "markdown", + "heading_path": "Authentication > User Service > Methods", + "section_title": "User Authentication", + "content_type": "tutorial" +} +``` + +## State Transitions + +### Indexing Pipeline States +```mermaid +stateDiagram-v2 + [*] --> Discovering: File scan + Discovering --> Filtering: Apply exclude patterns + Filtering --> Classifying: Detect language/type + Classifying --> Chunking: Split by boundaries + Chunking --> Summarizing: Generate descriptions + Summarizing --> Embedding: Create vectors + Embedding --> Storing: Save to ChromaDB + Storing --> Indexing: Add to BM25 + Indexing --> [*]: Ready for queries + + Discovering --> [*]: Error/Filtered + Filtering --> [*]: Excluded + Classifying --> [*]: Unsupported language +``` + +### Query Processing States +```mermaid +stateDiagram-v2 + [*] --> Parsing: Parse query + filters + Parsing --> Routing: Determine search mode + Routing --> Searching: Execute vector/BM25/hybrid + Searching --> Filtering: Apply source_type/language + Filtering --> Ranking: Combine and score results + Ranking --> Formatting: Prepare response + Formatting --> [*]: Return results +``` + +## Validation Rules + +### Code File Validation +- **Extension Check**: Must match supported extensions (.py, .ts, .tsx, .js, .jsx) +- **Language Detection**: Must be parseable by tree-sitter grammar +- **Size Limits**: Individual files < 10MB, total codebase < 100k LOC +- **Encoding**: Must be valid UTF-8 + +### Chunk Validation +- **Size Bounds**: 100-2000 characters for code chunks +- **Boundary Integrity**: Chunks must not split function/class definitions +- **Metadata Completeness**: Required fields must be present +- **Symbol Accuracy**: AST parsing must correctly identify symbols + +### Metadata Validation +- **source_type**: Must be "code" | "doc" | "test" +- **language**: Must match supported parsers when source_type="code" +- **symbol_kind**: Must be valid enum value ("function", "class", "method", etc.) +- **line_numbers**: Must be positive integers within file bounds + +## Relationships & Constraints + +### Foreign Key Constraints +- Chunk.source → Document.path OR CodeFile.path +- Embedding.chunk_id → Chunk.chunk_id +- TermEntry.chunk_ids → Chunk.chunk_id (many-to-many) + +### Uniqueness Constraints +- chunk_id must be globally unique across all chunks +- (source, chunk_index) must be unique within a file +- symbol_name + source must be unique for code symbols + +### Data Integrity Rules +- All code chunks must have source_type="code" +- All doc chunks must have source_type="doc" +- Language field is required when source_type="code" +- Symbol fields are optional but recommended for code chunks +- Summary fields enhance search but are not required + +## Indexing Strategy + +### Single Collection Design +Store all chunk types (docs + code) in one ChromaDB collection for unified search: +- Enables cross-referencing between docs and code +- Simplifies query filtering by source_type/language +- Maintains consistent embedding space + +### Metadata-Driven Filtering +Use ChromaDB's where clause for efficient filtering: +```python +# Code-only search +{"source_type": {"$eq": "code"}} + +# Language-specific +{"$and": [ + {"source_type": {"$eq": "code"}}, + {"language": {"$eq": "python"}} +]} + +# Cross-reference search +{"source_type": {"$in": ["code", "doc"]}} +``` + +### BM25 Integration +Maintain separate BM25 index for keyword search: +- Code chunks indexed alongside document chunks +- Symbol names treated as high-weight terms +- Supports exact identifier matching + +## Performance Considerations + +### Storage Overhead +- Code chunks: ~2x document chunk density (smaller, more numerous) +- BM25 index: <50% additional storage for code terms +- Metadata: ~20% increase in storage due to rich code metadata + +### Query Performance +- Vector search: Same performance as document-only +- BM25 search: Minimal overhead for code chunks +- Hybrid search: ~50% slower due to dual execution +- Filtering: ChromaDB where clauses add <10ms overhead + +### Indexing Performance +- Code parsing: 2-3x slower than document parsing (AST overhead) +- Summary generation: Adds LLM calls (most expensive step) +- Total indexing: <2x document indexing time + +## Migration Path + +### From Document-Only to Unified +1. **Schema Extension**: Add new metadata fields to existing chunks +2. **Backward Compatibility**: Existing document chunks work unchanged +3. **Progressive Migration**: Can index code separately initially +4. **Unified Queries**: Gradually enable cross-referencing features + +### Data Migration Strategy +- Existing document chunks: Add source_type="doc", language="markdown" +- New code chunks: Full metadata schema +- Re-indexing: Optional, can coexist with old schema temporarily +- Rollback: Can disable code features without data loss \ No newline at end of file diff --git a/specs/101-code-ingestion/plan.md b/specs/101-code-ingestion/plan.md new file mode 100644 index 0000000..13bd554 --- /dev/null +++ b/specs/101-code-ingestion/plan.md @@ -0,0 +1,96 @@ +# Implementation Plan: Source Code Ingestion & Unified Corpus + +**Branch**: `101-code-ingestion` | **Date**: 2025-12-19 | **Spec**: [specs/101-code-ingestion/spec.md](specs/101-code-ingestion/spec.md) +**Input**: Feature specification from `/specs/101-code-ingestion/spec.md` + +**Note**: This template is filled in by the `/speckit.plan` command. See `.specify/templates/commands/plan.md` for the execution workflow. + +## Summary + +Enable indexing and searching of source code files alongside documentation to create a unified corpus. Implementation uses AST-aware parsing with language-specific chunking strategies, natural language summaries for code chunks, and hybrid search capabilities across both documentation and code. + +Technical approach: Extend existing indexing pipeline with CodeSplitter for AST-aware chunking, SummaryExtractor for code descriptions, and unified storage with metadata filtering by language and source type. + +## Technical Context + + + +**Language/Version**: Python 3.10+ +**Primary Dependencies**: LlamaIndex (CodeSplitter, SummaryExtractor), tree-sitter (AST parsing), OpenAI/Anthropic (embeddings/summaries) +**Storage**: ChromaDB vector store (existing) +**Testing**: pytest with coverage +**Target Platform**: Linux/macOS server +**Project Type**: Web application (FastAPI server) +**Performance Goals**: Indexing time increases < 3x compared to doc-only NEEDS CLARIFICATION +**Constraints**: Memory usage for large codebases NEEDS CLARIFICATION, preserve existing API contracts +**Scale/Scope**: Support for monorepo-scale codebases (100k+ LOC) NEEDS CLARIFICATION + +## Constitution Check + +*GATE: Must pass before Phase 0 research. Re-check after Phase 1 design.* + +**Status: PASS** - All core principles satisfied + +- **Monorepo Modularity**: ✅ Adds functionality to server package only +- **OpenAPI-First**: ✅ Will extend existing API spec with new parameters (include_code, source_type, language) +- **Test-Alongside**: ✅ Tests will be implemented alongside features +- **Observability**: ✅ Health endpoints will be extended to track code_chunks count +- **Simplicity**: ✅ Complexity justified by core value proposition (unified doc+code search) + +**API Changes Required**: Extend `/index` and `/query` endpoints in OpenAPI spec + +## Project Structure + +### Documentation (this feature) + +```text +specs/[###-feature]/ +├── plan.md # This file (/speckit.plan command output) +├── research.md # Phase 0 output (/speckit.plan command) +├── data-model.md # Phase 1 output (/speckit.plan command) +├── quickstart.md # Phase 1 output (/speckit.plan command) +├── contracts/ # Phase 1 output (/speckit.plan command) +└── tasks.md # Phase 2 output (/speckit.tasks command - NOT created by /speckit.plan) +``` + +### Source Code (repository root) + +```text +doc-serve-server/ +├── doc_serve_server/ +│ ├── indexing/ # EXTENDED: Add code parsing capabilities +│ │ ├── code_parser.py # NEW: AST-aware code parsing +│ │ ├── code_splitter.py # NEW: Language-aware chunking +│ │ └── summary_extractor.py # NEW: Code summarization +│ ├── models/ # EXTENDED: Add code-related models +│ │ ├── code.py # NEW: CodeChunk, CodeMetadata models +│ │ └── query.py # EXTENDED: Add language/source filters +│ ├── services/ # EXTENDED: Add code indexing services +│ │ ├── code_indexing_service.py # NEW: Code indexing orchestration +│ │ └── query_service.py # EXTENDED: Unified doc+code search +│ └── storage/ # EXTENDED: Code chunk storage +│ └── vector_store.py # EXTENDED: Multi-source collection support +├── tests/ +│ ├── integration/ +│ │ ├── test_code_indexing.py # NEW: Code indexing tests +│ │ └── test_unified_search.py # NEW: Cross-reference search tests +│ └── unit/ +│ ├── test_code_parser.py # NEW: Parser unit tests +│ └── test_code_splitter.py # NEW: Splitter unit tests +``` + +**Structure Decision**: Extends existing server package structure. Code parsing added to indexing/, models to models/, services to services/. No new packages required - maintains monorepo modularity principle. + +## Complexity Tracking + +> **Fill ONLY if Constitution Check has violations that must be justified** + +| Violation | Why Needed | Simpler Alternative Rejected Because | +|-----------|------------|-------------------------------------| +| ~1000+ lines of new code | AST-aware parsing enables logical code chunking and cross-reference search | Regex-based parsing would lose semantic understanding of code structure and fail to chunk at function/class boundaries | +| Multi-language support (Python, TS/JS) | Core requirement for unified corpus across tech stacks | Single-language support would limit the unified search value proposition | +| LLM-powered summarization | Improves semantic search retrieval for code queries | Keyword-only search would miss conceptual matches between natural language questions and code implementations | diff --git a/specs/101-code-ingestion/quickstart.md b/specs/101-code-ingestion/quickstart.md new file mode 100644 index 0000000..a9ac892 --- /dev/null +++ b/specs/101-code-ingestion/quickstart.md @@ -0,0 +1,350 @@ +# Quickstart: Code Ingestion + +## Overview + +Doc-Serve now supports indexing and searching source code alongside documentation. This creates a unified corpus where you can cross-reference between implementation and documentation. + +## Prerequisites + +- Doc-Serve server running (see main quickstart) +- OpenAI API key configured +- Source code project with supported languages + +## Supported Languages + +**Scripting/High-level Languages:** +- **Python** (.py) +- **TypeScript** (.ts, .tsx) +- **JavaScript** (.js, .jsx) + +**Systems Languages:** +- **C** (.c, .h) +- **C++** (.cpp, .cxx, .cc, .hpp, .hxx, .hh) + +**JVM/Object-oriented:** +- **Java** (.java) +- **Kotlin** (.kt, .kts) + +**Modern Systems Languages:** +- **Go** (.go) +- **Rust** (.rs) +- **Swift** (.swift) + +## Basic Code Indexing + +### Index a Python Project + +```bash +# Index Python source code +doc-svr-ctl index /path/to/python/project --include-code --languages python + +# Example with real project +doc-svr-ctl index ~/projects/my-api --include-code --languages python +``` + +### Index a Full-Stack Project + +```bash +# Index both backend (Python) and frontend (TypeScript) +doc-svr-ctl index /path/to/fullstack/app \ + --include-code \ + --languages python,typescript \ + --exclude-patterns "node_modules/**,*.test.*,__pycache__/**" +``` + +### Index a Systems Project + +```bash +# Index C/C++ codebase with Go microservices +doc-svr-ctl index /path/to/systems/project \ + --include-code \ + --languages c,cpp,go \ + --exclude-patterns "build/**,*.o,*.a" +``` + +### Index a Polyglot Application + +```bash +# Index Java backend, TypeScript frontend, Rust utilities +doc-svr-ctl index /path/to/polyglot/app \ + --include-code \ + --languages java,typescript,rust \ + --exclude-patterns "target/**,node_modules/**,*.class" +``` + +### Index with Documentation + +```bash +# Index both docs and code together +doc-svr-ctl index /path/to/project \ + --include-code \ + --languages python,typescript,javascript \ + --recursive +``` + +## Code Search Examples + +### Find Functions by Name + +```bash +# Exact function name (BM25) +doc-svr-ctl query "authenticate_user" --mode bm25 --source-type code + +# Semantic function search (Vector) +doc-svr-ctl query "user authentication logic" --mode vector --source-type code + +# Hybrid search (recommended) +doc-svr-ctl query "user authentication" --mode hybrid --source-type code +``` + +### Language-Specific Search + +```bash +# Python code only +doc-svr-ctl query "database connection" --language python --source-type code + +# TypeScript/React code +doc-svr-ctl query "component lifecycle" --language typescript --source-type code + +# JavaScript utilities +doc-svr-ctl query "array manipulation" --language javascript --source-type code + +# C/C++ system calls +doc-svr-ctl query "memory allocation" --language cpp --source-type code + +# Java enterprise patterns +doc-svr-ctl query "dependency injection" --language java --source-type code + +# Kotlin Android/data class patterns +doc-svr-ctl query "sealed class hierarchy" --language kotlin --source-type code + +# Go concurrency patterns +doc-svr-ctl query "goroutine management" --language go --source-type code + +# Rust ownership patterns +doc-svr-ctl query "borrow checker" --language rust --source-type code + +# Swift iOS development +doc-svr-ctl query "view controller lifecycle" --language swift --source-type code +``` + +### Cross-Reference Search + +```bash +# Find both docs and code for a topic +doc-svr-ctl query "authentication flow" --source-type all + +# API documentation + implementation +doc-svr-ctl query "REST endpoint implementation" --mode hybrid --alpha 0.6 +``` + +## Advanced Usage + +### Custom Chunking + +```bash +# Larger chunks for complex functions +doc-svr-ctl index /path/to/code --include-code --chunk-size 1000 --chunk-overlap 100 +``` + +### Summary Generation + +Code chunks automatically get AI-generated summaries for better semantic search. These help the system understand what each code function does beyond just the code text. + +### Filtering Options + +```bash +# Exclude test files and build artifacts +doc-svr-ctl index /path/to/project \ + --include-code \ + --exclude-patterns "*test*,*spec*,dist/**,build/**" + +# Include only source directories +doc-svr-ctl index src/ tests/ --include-code --languages python +``` + +## CLI Reference + +### Index Command Options + +| Option | Description | Example | +|--------|-------------|---------| +| `--include-code` | Enable code file processing | `--include-code` | +| `--languages` | Comma-separated language list | `--languages python,typescript` | +| `--exclude-patterns` | Glob patterns to skip | `--exclude-patterns "test*,*.min.js"` | +| `--recursive` | Scan subdirectories | `--recursive` | + +### Query Command Options + +| Option | Description | Example | +|--------|-------------|---------| +| `--source-type` | Filter by content type | `--source-type code` | +| `--language` | Filter by programming language | `--language python` | +| `--mode` | Search algorithm | `--mode hybrid` | +| `--alpha` | Hybrid weighting (0.0-1.0) | `--alpha 0.7` | + +**Valid Values:** +- `source-type`: `code`, `doc`, `test`, `all` +- `language`: `python`, `typescript`, `javascript`, `kotlin`, `c`, `cpp`, `java`, `go`, `rust`, `swift` +- `mode`: `vector`, `bm25`, `hybrid` + +## API Usage + +### Index Code Files + +```bash +# POST /index with code parameters +curl -X POST http://localhost:8000/index/ \ + -H "Content-Type: application/json" \ + -d '{ + "paths": ["/path/to/code"], + "include_code": true, + "languages": ["python", "typescript", "javascript", "kotlin", "java"], + "exclude_patterns": ["*test*", "node_modules/**", "target/**"], + "recursive": true + }' +``` + +### Query with Filters + +```bash +# Search Python code only +curl -X POST http://localhost:8000/query/ \ + -H "Content-Type: application/json" \ + -d '{ + "query": "database connection", + "source_type": "code", + "language": "python" + }' + +# Search C++ code for memory management +curl -X POST http://localhost:8000/query/ \ + -H "Content-Type: application/json" \ + -d '{ + "query": "memory allocation", + "source_type": "code", + "language": "cpp" + }' + +# Cross-reference search +curl -X POST http://localhost:8000/query/ \ + -H "Content-Type: application/json" \ + -d '{ + "query": "authentication implementation", + "source_type": "all", + "mode": "hybrid" + }' +``` + +## Health Monitoring + +### Check Indexing Status + +```bash +doc-svr-ctl status +``` + +Shows counts for: +- `total_documents`: Traditional docs +- `total_chunks`: All chunks (docs + code) +- Code-specific counts in health response + +### API Health Check + +```bash +curl http://localhost:8000/health/status +``` + +Response includes: +```json +{ + "total_documents": 25, + "total_chunks": 125, + "indexing_in_progress": false, + "bm25_index_ready": true, + "code_chunks_count": 75, + "doc_chunks_count": 50 +} +``` + +## Performance Expectations + +| Operation | Expected Time | Notes | +|-----------|----------------|-------| +| Index 100 files | 2-5 minutes | Includes summary generation | +| Code search | <100ms | BM25/vector queries | +| Hybrid search | 200-500ms | Dual algorithm execution | +| Cross-reference | 300-800ms | Searches multiple content types | + +## Troubleshooting + +### No Code Results Found + +```bash +# Check if code was indexed +doc-svr-ctl status + +# Verify query filters +doc-svr-ctl query "function" --source-type code --language python +``` + +### Indexing Errors + +```bash +# Check server logs +tail -f server.log + +# Try with verbose output +doc-svr-ctl index /path/to/code --include-code --verbose +``` + +### Language Detection Issues + +```bash +# Manually specify language +doc-svr-ctl index /path/to/code --include-code --languages python + +# Check file extensions +find /path/to/code -name "*.py" | head -10 +``` + +## Best Practices + +1. **Start Small**: Index a single directory first to test +2. **Use Excludes**: Skip test files, build artifacts, dependencies +3. **Choose Wisely**: Use BM25 for exact names, hybrid for general queries +4. **Monitor Health**: Check indexing status and chunk counts +5. **Iterate**: Start with defaults, tune alpha and filters as needed + +## Example Workflows + +### API Development + +```bash +# Index FastAPI backend +doc-svr-ctl index backend/ --include-code --languages python + +# Find endpoint implementations +doc-svr-ctl query "user registration" --source-type code --mode hybrid + +# Cross-reference with API docs +doc-svr-ctl query "authentication endpoints" --source-type all +``` + +### Full-Stack Development + +```bash +# Index both backend and frontend +doc-svr-ctl index . \ + --include-code \ + --languages python,typescript \ + --exclude-patterns "node_modules/**,__pycache__/**" + +# Find component implementations +doc-svr-ctl query "user dashboard component" --language typescript + +# Find data flow across stack +doc-svr-ctl query "user data validation" --source-type all --mode hybrid +``` + +This creates a unified knowledge base where documentation and implementation are searchable together! \ No newline at end of file diff --git a/specs/101-code-ingestion/research.md b/specs/101-code-ingestion/research.md new file mode 100644 index 0000000..095805c --- /dev/null +++ b/specs/101-code-ingestion/research.md @@ -0,0 +1,960 @@ +# Phase 3: Source Code Ingestion - Research & Analysis + +**Version:** 1.0.0 +**Date:** 2025-12-18 +**Status:** Research Complete + +--- + +## Executive Summary + +This document presents comprehensive research for implementing source code ingestion in Doc-Serve (Phase 3). The research covers LlamaIndex's code-specific components, tree-sitter integration, unified search architecture, and extension points in the current implementation. + +**Key Findings:** + +1. **LlamaIndex CodeSplitter** provides AST-aware chunking via tree-sitter, maintaining function/class boundaries +2. **SummaryExtractor** can generate natural language descriptions for code chunks to improve semantic retrieval +3. The current Doc-Serve architecture has clear extension points for code support +4. **ChromaDB** already supports the metadata filtering needed for `source_type` and `language` queries +5. **Hybrid search** (Phase 2) is particularly valuable for code, combining exact identifier matches with semantic similarity + +**Recommended Approach:** + +- Extend `DocumentLoader` with code-specific extensions +- Create `CodeChunker` using LlamaIndex `CodeSplitter` +- Add `SummaryExtractor` to ingestion pipeline for code descriptions +- Enhance metadata schema with `source_type`, `language`, `symbol_name`, `line_numbers` +- Leverage existing `VectorStoreManager` `where` parameter for filtering + +--- + +## 1. LlamaIndex CodeSplitter Analysis + +### Overview + +LlamaIndex's `CodeSplitter` is an AST-based node parser that uses tree-sitter to parse source code and chunk it at syntactic boundaries (functions, classes, methods). + +### Key Parameters + +| Parameter | Default | Description | +|-----------|---------|-------------| +| `language` | Required | Target language: `"python"`, `"typescript"`, `"javascript"` | +| `chunk_lines` | 40 | Approximate lines per chunk | +| `chunk_lines_overlap` | 15 | Overlapping lines between chunks | +| `max_chars` | 1500 | Maximum characters per chunk | + +### Usage Pattern + +```python +from llama_index.core.node_parser import CodeSplitter + +# Create language-specific splitter +python_splitter = CodeSplitter.from_defaults( + language="python", + chunk_lines=40, + chunk_lines_overlap=15, + max_chars=1500, +) + +# Parse documents into nodes +nodes = python_splitter.get_nodes_from_documents(documents) +``` + +### Chunking Behavior by Language + +**Python:** +- Chunks centered on module-level functions and classes +- Methods grouped within class chunks when possible +- Imports and decorators preserved with their functions +- `if __name__ == "__main__"` blocks kept together + +**TypeScript/JavaScript:** +- Chunks aligned to top-level function declarations +- Classes and methods preserved together +- Exported symbols (`export function`, `export class`) respected +- JSX trees kept intact, not split mid-expression + +**C/C++:** +- Functions and methods kept as complete units +- Struct/class definitions preserved together +- Preprocessor directives (#include, #define) grouped with related code +- Template instantiations and specializations handled appropriately + +**Java:** +- Methods grouped within class boundaries +- Inner classes kept with their containing classes +- Package declarations and imports preserved +- Annotations and generics handled correctly + +**Kotlin:** +- Functions and methods kept as complete units +- Class/data class/object declarations preserved together +- Extension functions and properties handled correctly +- Null safety operators and type inference respected + +**Go:** +- Functions and methods kept as complete units +- Struct types and interfaces preserved together +- Package declarations and imports maintained +- Goroutines and channel operations respected + +**Rust:** +- Functions and impl blocks kept together +- Struct/enum/trait definitions preserved +- Macro invocations and derive attributes handled +- Async functions and lifetime annotations respected + +**Swift:** +- Functions and methods kept as complete units +- Class/struct/enum definitions preserved together +- Protocol conformance and extensions handled +- Property observers and computed properties respected + +### Recommended Configuration for Doc-Serve + +```python +CODE_CHUNK_LINES = 50 # Slightly larger for complete functions +CODE_CHUNK_OVERLAP = 20 # More overlap for cross-reference context +CODE_MAX_CHARS = 2000 # Accommodate larger functions +``` + +### Integration with Ingestion Pipeline + +```python +from llama_index.core.ingestion import IngestionPipeline +from llama_index.core.node_parser import CodeSplitter + +pipeline = IngestionPipeline( + transformations=[ + CodeSplitter(language="python", chunk_lines=50, chunk_lines_overlap=20), + embedding_model, # EmbeddingGenerator + ], +) +``` + +--- + +## 2. SummaryExtractor for Code + +### Purpose + +`SummaryExtractor` uses an LLM to generate natural language descriptions for each code chunk. This bridges the semantic gap between natural language queries and code implementations. + +### Configuration + +```python +from llama_index.core.extractors import SummaryExtractor + +code_summary_prompt = """You are a senior software engineer. +Given the following code snippet, write a concise natural-language description +that explains what the code does, its purpose, and key inputs/outputs. +Avoid restating the code line by line. + +Code: +{context_str} + +Summary:""" + +summary_extractor = SummaryExtractor( + summaries=["self", "prev"], # Include previous chunk context + prompt_template=code_summary_prompt, + llm=Settings.llm, # Claude Haiku for speed +) +``` + +### Metadata Output + +| Field | Description | +|-------|-------------| +| `section_summary` | Natural language description of current chunk | +| `prev_section_summary` | Description of previous chunk (context) | +| `next_section_summary` | Description of next chunk (optional) | + +### Usage in Pipeline + +```python +pipeline = IngestionPipeline( + transformations=[ + CodeSplitter(language="python"), + SummaryExtractor(summaries=["self", "prev"], prompt_template=code_prompt), + embedding_model, + ], +) +``` + +### Alternatives for Code Summaries + +1. **Extract from docstrings** - Fast, no LLM calls, but incomplete +2. **AST-based descriptions** - Generate from function signatures and types +3. **Hybrid approach** - Use docstrings when present, LLM when missing + +**Recommendation:** Use hybrid approach - extract docstrings first, then LLM-generate only for undocumented code. + +--- + +## 3. Tree-sitter Integration + +### Official Language Packages + +| Language | PyPI Package | Grammar Repo | Status | +|----------|--------------|--------------|--------| +| Python | `tree-sitter-python` | tree-sitter/tree-sitter-python | ✅ Production | +| JavaScript | `tree-sitter-javascript` | tree-sitter/tree-sitter-javascript | ✅ Production | +| TypeScript | `tree-sitter-typescript` | tree-sitter/tree-sitter-typescript | ✅ Production | +| Kotlin | `tree-sitter-kotlin` | fwcd/tree-sitter-kotlin | ✅ Production | +| C | `tree-sitter-c` | tree-sitter/tree-sitter-c | ✅ Production | +| C++ | `tree-sitter-cpp` | tree-sitter/tree-sitter-cpp | ✅ Production | +| Java | `tree-sitter-java` | tree-sitter/tree-sitter-java | ✅ Production | +| Go | `tree-sitter-go` | tree-sitter/tree-sitter-go | ✅ Production | +| Rust | `tree-sitter-rust` | tree-sitter/tree-sitter-rust | ✅ Production | +| Swift | `tree-sitter-swift` | tree-sitter/tree-sitter-swift | ✅ Production | + +**Note:** TypeScript package contains two grammars: `typescript` and `tsx`. + +### Installation + +```bash +pip install tree-sitter tree-sitter-python tree-sitter-javascript +``` + +### Direct Usage (for custom parsing) + +```python +from tree_sitter import Language, Parser +import tree_sitter_python as tspython + +PY_LANGUAGE = Language(tspython.language()) +parser = Parser(PY_LANGUAGE) + +code = b""" +def greet(name: str) -> str: + \"\"\"Return a greeting.\"\"\" + return f"Hello, {name}!" +""" + +tree = parser.parse(code) +root = tree.root_node + +# Walk the AST +for child in root.children: + print(child.type, child.start_point, child.end_point) +``` + +### AST Queries for Metadata Extraction + +```python +from tree_sitter import Query + +# Find all function definitions +query = Query( + PY_LANGUAGE, + """ + (function_definition + name: (identifier) @func_name + parameters: (parameters) @params + return_type: (type)? @return_type) + """ +) + +captures = query.captures(root) +for node, capture_name in captures: + print(f"{capture_name}: {node.text.decode('utf8')}") +``` + +### Extractable Metadata via AST + +| Metadata | Source | +|----------|--------| +| `function_name` | Function definition name node | +| `class_name` | Class definition name node | +| `parameters` | Parameter list nodes | +| `return_type` | Return type annotation | +| `decorators` | Decorator nodes | +| `docstring` | First string literal in function body | +| `imports` | Import statement nodes | +| `line_numbers` | `node.start_point.row`, `node.end_point.row` | + +--- + +## 4. Unified Search Architecture + +### Single Collection Design + +Store both documentation and code in a single ChromaDB collection with rich metadata for filtering. + +```python +# Unified metadata schema +metadata = { + # Source classification + "source_type": "code", # "doc" | "code" | "test" + "language": "python", # "python" | "typescript" | "javascript" | "markdown" + + # File information + "file_path": "src/app/user.py", + "file_name": "user.py", + + # Code-specific (when source_type == "code") + "symbol_name": "UserService.get_user", + "symbol_kind": "method", # "class" | "function" | "method" | "module" + "start_line": 120, + "end_line": 165, + + # Context + "section_summary": "Retrieves a user by ID from the database", + + # Standard + "chunk_id": "chunk_abc123", + "chunk_index": 5, + "total_chunks": 12, +} +``` + +### ChromaDB Filtering Patterns + +**Filter by source type:** +```python +results = collection.query( + query_texts=["authentication handler"], + n_results=10, + where={"source_type": {"$eq": "code"}}, +) +``` + +**Filter by language:** +```python +results = collection.query( + query_texts=["parse JSON"], + n_results=10, + where={ + "$and": [ + {"source_type": {"$eq": "code"}}, + {"language": {"$in": ["python", "typescript"]}}, + ] + }, +) +``` + +**Cross-reference search (code + docs):** +```python +results = collection.query( + query_texts=["user authentication flow"], + n_results=20, + where={ + "source_type": {"$in": ["code", "doc"]}, + }, +) +``` + +### Hybrid Search Value for Code + +Hybrid search (Phase 2) is particularly valuable for code: + +| Query Type | Best Strategy | Language Examples | +|------------|---------------|-------------------| +| Exact function name | BM25 (keyword) | `authenticate_user`, `malloc`, `println` | +| Error code lookup | BM25 (keyword) | `HTTP_404`, `ENOENT`, `NullPointerException` | +| API endpoint patterns | BM25 (keyword) | `GET /api/users`, `@GetMapping`, `app.get()` | +| "How to authenticate" | Vector (semantic) | Cross-language authentication patterns | +| "UserService implementation" | Hybrid | Find class + usage examples | +| "Memory management" | Hybrid | C malloc + Rust ownership patterns | +| "HTTP client setup" | Hybrid | curl in C + requests in Python | + +**Example hybrid query:** +```python +# mode=hybrid combines BM25 exact match with vector semantic similarity +POST /query +{ + "query": "RecursiveCharacterTextSplitter", + "mode": "hybrid", + "alpha": 0.3, # Favor BM25 for exact identifiers + "source_type": "code" +} +``` + +--- + +## 5. Current Implementation Extension Points + +### DocumentLoader Extension + +**Current file:** `doc_serve_server/indexing/document_loader.py` + +**Current extensions:** +```python +SUPPORTED_EXTENSIONS: set[str] = {".txt", ".md", ".pdf", ".docx", ".html", ".rst"} +``` + +**Proposed addition:** +```python +CODE_EXTENSIONS: set[str] = {".py", ".ts", ".tsx", ".js", ".jsx"} +``` + +**New method needed:** +```python +def load_code_files( + folder_path: str, + languages: list[str] | None = None, + exclude_patterns: list[str] | None = None, +) -> list[LoadedDocument]: + """Load source code files with language detection.""" + ... +``` + +### Chunker Extension + +**Current file:** `doc_serve_server/indexing/chunking.py` + +**Current class:** `ContextAwareChunker` (text-based splitting) + +**New class needed:** +```python +class CodeChunker: + """AST-aware code chunking using LlamaIndex CodeSplitter.""" + + def __init__( + self, + chunk_lines: int = 50, + chunk_overlap: int = 20, + max_chars: int = 2000, + generate_summaries: bool = True, + ): + self.splitters: dict[str, CodeSplitter] = {} + self._init_splitters() + + def chunk_code_file( + self, + document: LoadedDocument, + language: str, + ) -> list[CodeChunk]: + """Chunk a code file using language-specific AST splitting.""" + ... +``` + +### IndexingService Extension + +**Current file:** `doc_serve_server/services/indexing_service.py` + +**Current pipeline:** +``` +Load Documents → Chunk → Embed → Store → BM25 Index +``` + +**Extended pipeline:** +``` +Load Documents ─┬─→ Doc Chunker ─────────────────────┐ + │ │ +Load Code Files ─┴─→ Code Chunker → Summary Extract ─┴→ Embed → Store → BM25 Index +``` + +**New parameter on index endpoint:** +```python +class IndexRequest(BaseModel): + folder_path: str + chunk_size: int = 512 + chunk_overlap: int = 50 + recursive: bool = True + include_code: bool = False # NEW + languages: list[str] | None = None # NEW: ["python", "typescript"] + exclude_patterns: list[str] | None = None # NEW: ["*test*", "node_modules"] +``` + +### VectorStoreManager Extension + +**Current file:** `doc_serve_server/storage/vector_store.py` + +**Already supports filtering:** +```python +def similarity_search( + self, + query_embedding: list[float], + top_k: int, + similarity_threshold: float = 0.0, + where: dict | None = None, # Already supports ChromaDB filtering +) -> list[SearchResult]: +``` + +**No changes needed** - just pass appropriate `where` clause from query endpoint. + +### Query Endpoint Extension + +**Current file:** `doc_serve_server/api/routers/query.py` + +**New query parameters:** +```python +class QueryRequest(BaseModel): + query: str + top_k: int = 5 + similarity_threshold: float = 0.7 + mode: str = "hybrid" + alpha: float = 0.5 + source_type: str | None = None # NEW: "doc" | "code" | "all" + language: str | None = None # NEW: "python" | "typescript" | etc. +``` + +--- + +## 6. Metadata Schema Design + +### Complete Metadata Schema + +```python +@dataclass +class ChunkMetadata: + # Universal fields (all chunks) + chunk_id: str + source: str # File path + file_name: str + file_path: str + chunk_index: int + total_chunks: int + source_type: Literal["doc", "code", "test"] + + # Code-specific fields (when source_type == "code") + language: str | None # "python", "typescript", "javascript" + symbol_name: str | None # "UserService.get_user" + symbol_kind: str | None # "class", "function", "method", "module" + start_line: int | None + end_line: int | None + + # Summary fields (from SummaryExtractor) + section_summary: str | None + prev_section_summary: str | None + + # Document-specific fields (when source_type == "doc") + section_title: str | None + heading_path: str | None # "Chapter 1 > Setup > Installation" +``` + +### Enum Definitions + +```python +from enum import Enum + +class SourceType(str, Enum): + DOC = "doc" + CODE = "code" + TEST = "test" + +class LanguageType(str, Enum): + PYTHON = "python" + TYPESCRIPT = "typescript" + JAVASCRIPT = "javascript" + KOTLIN = "kotlin" + C = "c" + CPP = "cpp" + JAVA = "java" + GO = "go" + RUST = "rust" + SWIFT = "swift" + MARKDOWN = "markdown" + +class SymbolKind(str, Enum): + MODULE = "module" + CLASS = "class" + FUNCTION = "function" + METHOD = "method" + VARIABLE = "variable" +``` + +--- + +## 7. File Extension & Filtering Strategy + +### Supported Extensions + +| Language | Extensions | Detect As | +|----------|------------|-----------| +| Python | `.py` | `python` | +| TypeScript | `.ts`, `.tsx` | `typescript` | +| JavaScript | `.js`, `.jsx` | `javascript` | +| Kotlin | `.kt`, `.kts` | `kotlin` | +| C | `.c`, `.h` | `c` | +| C++ | `.cpp`, `.cxx`, `.cc`, `.hpp`, `.hxx`, `.hh` | `cpp` | +| Java | `.java` | `java` | +| Go | `.go` | `go` | +| Rust | `.rs` | `rust` | +| Swift | `.swift` | `swift` | + +### Default Exclude Patterns + +```python +DEFAULT_EXCLUDE_PATTERNS = [ + # Package managers + "node_modules/", + "vendor/", + ".venv/", + "venv/", + "__pycache__/", + + # Build outputs + "dist/", + "build/", + "out/", + ".next/", + + # Generated files + "*.d.ts", # TypeScript declarations + "*.js.map", # Source maps + "*.min.js", # Minified + "*.pyc", # Python bytecode + + # Test files (optional - configurable) + "*test*.py", + "*_test.py", + "test_*.py", + "*.test.ts", + "*.spec.ts", + "__tests__/", + + # IDE/tool files + ".git/", + ".idea/", + ".vscode/", + "coverage/", +] +``` + +### Language Detection + +```python +EXTENSION_TO_LANGUAGE = { + # Python + ".py": "python", + + # JavaScript/TypeScript + ".js": "javascript", + ".jsx": "javascript", # JSX uses javascript parser + ".ts": "typescript", + ".tsx": "typescript", # TSX uses typescript parser + + # Systems languages + ".c": "c", + ".h": "c", # Header files + ".cpp": "cpp", + ".cxx": "cpp", + ".cc": "cpp", + ".hpp": "cpp", + ".hxx": "cpp", + ".hh": "cpp", + + # JVM/Object-oriented + ".java": "java", + ".kt": "kotlin", + ".kts": "kotlin", # Kotlin script files + + # Modern systems languages + ".go": "go", + ".rs": "rust", + ".swift": "swift", +} +``` + +--- + +## 8. Implementation Recommendations + +### Phase 3 Implementation Order + +1. **Core Infrastructure** + - Add `CODE_EXTENSIONS` to `DocumentLoader` + - Create `CodeChunker` class with `CodeSplitter` integration + - Define `CodeChunk` and `ChunkMetadata` dataclasses + +2. **Metadata Enhancement** + - Update `TextChunk` to include code-specific fields + - Add `source_type` and `language` to all chunk metadata + - Ensure backward compatibility with existing doc chunks + +3. **Pipeline Integration** + - Create `CodeIndexingService` or extend `IndexingService` + - Add optional `SummaryExtractor` step for code + - Integrate code chunks into existing embed/store flow + +4. **API Extensions** + - Add `include_code`, `languages`, `exclude_patterns` to `/index` + - Add `source_type`, `language` filters to `/query` + - Update response models with code metadata + +5. **CLI Extensions** + - Add `--include-code` flag to `index` command + - Add `--languages` flag for language filtering + - Add `--source-type` and `--language` to `query` command + +### Dependencies to Add + +```toml +# pyproject.toml additions +[tool.poetry.dependencies] +tree-sitter = "^0.21" +# Core languages +tree-sitter-python = "^0.21" +tree-sitter-javascript = "^0.21" +tree-sitter-typescript = "^0.21" +# Systems languages +tree-sitter-c = "^0.21" +tree-sitter-cpp = "^0.21" +# JVM/Object-oriented +tree-sitter-java = "^0.21" +tree-sitter-kotlin = "^0.21" +# Modern languages +tree-sitter-go = "^0.21" +tree-sitter-rust = "^0.21" +tree-sitter-swift = "^0.21" +``` + +**Note:** LlamaIndex's `CodeSplitter` handles tree-sitter internally; direct dependency may not be needed if using only `CodeSplitter`. + +### Performance Considerations + +1. **Batch code summaries** - Don't call LLM per-chunk; batch similar chunks +2. **Cache language parsers** - Initialize `CodeSplitter` once per language +3. **Parallel file loading** - Use async for loading multiple code files +4. **Skip binary files** - Detect and skip non-text files early + +### Testing Strategy + +1. **Unit tests** - `CodeChunker` produces correct boundaries +2. **Integration tests** - Full pipeline with mixed docs + code +3. **Query tests** - Verify filtering by `source_type` and `language` +4. **Cross-reference tests** - Unified search returns both docs and code + +--- + +## 10. Config-Driven Language Support Architecture + +### Overview + +The current implementation uses hardcoded language support. To enable 160+ languages from tree-sitter-language-pack without code changes, we need a configuration-driven architecture. + +### Key Design Decisions + +#### 1. Language Configuration File + +**File**: `doc-serve-server/config/languages.yaml` + +**Structure**: +```yaml +# Language Support Configuration +defaults: + chunk_lines: 50 + chunk_overlap: 20 + max_chars: 2000 + +categories: + compact: # Terse languages (C, Go, Rust) + chunk_lines: 40 + chunk_overlap: 15 + standard: # Most languages (Python, JS, TS, Java) + chunk_lines: 50 + chunk_overlap: 20 + verbose: # Verbose languages (Java, C#) + chunk_lines: 80 + chunk_overlap: 30 + markup: # HTML, XML, JSON + chunk_lines: 60 + chunk_overlap: 25 + +languages: + python: + extensions: [.py, .pyw, .pyi] + key: python + category: standard + enabled: true + exclude_patterns: + - "*_test.py" + - "test_*.py" + - "**/tests/**" + - "conftest.py" + + typescript: + extensions: [.ts, .tsx] + key: typescript + category: standard + enabled: true + exclude_patterns: + - "*.spec.ts" + - "*.test.ts" + - "**/__tests__/**" + - "*.d.ts" + +# 160+ languages can be added here... +``` + +#### 2. LanguageConfig Pydantic Model + +**File**: `doc-serve-server/config/language_config.py` + +```python +from pydantic import BaseModel +from pathlib import Path +import yaml + +class ChunkConfig(BaseModel): + chunk_lines: int = 50 + chunk_overlap: int = 20 + max_chars: int = 2000 + +class LanguageEntry(BaseModel): + extensions: list[str] + key: str # tree-sitter-language-pack key + category: str = "standard" + enabled: bool = True + exclude_patterns: list[str] = [] + +class LanguageConfig(BaseModel): + defaults: ChunkConfig + categories: dict[str, ChunkConfig] + languages: dict[str, LanguageEntry] + + @classmethod + def load(cls, path: Path | None = None) -> "LanguageConfig": + if path is None: + path = Path(__file__).parent / "languages.yaml" + with open(path) as f: + data = yaml.safe_load(f) + return cls(**data) + + def get_enabled_extensions(self) -> set[str]: + """Get all enabled file extensions.""" + exts = set() + for lang in self.languages.values(): + if lang.enabled: + exts.update(lang.extensions) + return exts + + def extension_to_key(self, language_name: str) -> str | None: + """Map language name to tree-sitter key.""" + lang = self.languages.get(language_name) + return lang.key if lang and lang.enabled else None +``` + +#### 3. Simplified LanguageDetector + +**Updated**: `doc-serve-server/indexing/document_loader.py` + +```python +class LanguageDetector: + def __init__(self, config: LanguageConfig | None = None): + self.config = config or LanguageConfig.load() + self._ext_map = self._build_extension_map() + + def _build_extension_map(self) -> dict[str, str]: + """Build extension -> language name mapping from config.""" + mapping = {} + for name, lang in self.config.languages.items(): + if lang.enabled: + for ext in lang.extensions: + mapping[ext.lower()] = name + return mapping + + def detect(self, file_path: str) -> str | None: + """Detect language from file path.""" + ext = Path(file_path).suffix.lower() + return self._ext_map.get(ext) +``` + +#### 4. Runtime Override via Settings + +**Updated**: `doc-serve-server/config/settings.py` + +```python +class Settings(BaseSettings): + # ... existing settings ... + + # Language configuration overrides + LANGUAGE_CONFIG_PATH: str | None = None # Custom config file path + ENABLED_LANGUAGES: list[str] | None = None # Override enabled languages + DISABLED_LANGUAGES: list[str] | None = None # Disable specific languages +``` + +### Benefits + +1. **Zero code changes** to add languages - Pure YAML config +2. **160+ languages ready** - Just enable them in config +3. **User customizable** - Override via env vars or custom config +4. **Categorized defaults** - Sensible chunk sizes per language type +5. **Maintainable** - Single source of truth for language support +6. **Testable** - Config validation via Pydantic + +### Implementation Impact + +**Files to Modify**: +- `config/languages.yaml` - NEW - Language configuration +- `config/language_config.py` - NEW - Pydantic config loader +- `indexing/document_loader.py` - Simplify to use config +- `config/settings.py` - Add override settings + +**Migration Path**: +- Current hardcoded languages become default config +- Backward compatibility maintained +- Users can opt into new system via settings + +--- + +## 9. References + +### LlamaIndex Documentation +- [CodeSplitter API](https://developers.llamaindex.ai/python/framework-api-reference/node_parsers/code/) +- [SummaryExtractor API](https://developers.llamaindex.ai/python/framework-api-reference/extractors/summary/) +- [Metadata Extraction Guide](https://llamaindexxx.readthedocs.io/en/latest/module_guides/indexing/metadata_extraction.html) +- [IngestionPipeline](https://developers.llamaindex.ai/python/examples/ingestion/ingestion_pipeline/) + +### Tree-sitter +- [Official Documentation](https://tree-sitter.github.io) +- [Python Bindings](https://github.com/tree-sitter/py-tree-sitter) +- [Python Grammar](https://github.com/tree-sitter/tree-sitter-python) +- [JavaScript Grammar](https://github.com/tree-sitter/tree-sitter-javascript) +- [TypeScript Grammar](https://github.com/tree-sitter/tree-sitter-typescript) +- [tree-sitter-language-pack](https://pypi.org/project/tree-sitter-language-pack/) - 160+ bundled grammars +- [tree-sitter-languages](https://pypi.org/project/tree-sitter-languages/) - 40+ bundled grammars + +### ChromaDB +- [Metadata Filtering](https://docs.trychroma.com/docs/querying-collections/metadata-filtering) +- [Query API](https://docs.trychroma.com/docs/querying-collections/query) + +### Doc-Serve Internal References +- [Product Roadmap](../../docs/roadmaps/product-roadmap.md) - Phase 3 requirements +- [Spec Mapping](../../docs/roadmaps/spec-mapping.md) - Spec workflow +- [Feature Spec](./spec.md) - Detailed user stories and requirements +- [Developer Guide: Adding Languages](../../docs/DEVELOPERS_GUIDE.md#adding-support-for-new-languages) - How to add new language support + +--- + +## Appendix A: Example Code Chunk Output + +```json +{ + "chunk_id": "chunk_a1b2c3d4e5f6", + "text": "def authenticate_user(username: str, password: str) -> User | None:\n \"\"\"Authenticate user with username and password.\n \n Args:\n username: The user's login name\n password: The user's password (plaintext)\n \n Returns:\n User object if authenticated, None otherwise\n \"\"\"\n user = db.get_user_by_username(username)\n if user and verify_password(password, user.password_hash):\n return user\n return None", + "metadata": { + "source_type": "code", + "language": "python", + "file_path": "src/auth/service.py", + "file_name": "service.py", + "symbol_name": "authenticate_user", + "symbol_kind": "function", + "start_line": 42, + "end_line": 58, + "section_summary": "Authenticates a user by verifying their username and password against the database. Returns the User object on success or None on failure.", + "chunk_index": 3, + "total_chunks": 8 + } +} +``` + +## Appendix B: Query Examples + +**Find all Python authentication code:** +```bash +doc-svr-ctl query "authentication" --source-type code --language python +``` + +**Search both docs and code for API patterns:** +```bash +doc-svr-ctl query "REST API endpoint" --source-type all --mode hybrid +``` + +**Find specific function:** +```bash +doc-svr-ctl query "authenticate_user" --source-type code --mode bm25 +``` diff --git a/specs/101-code-ingestion/spec.md b/specs/101-code-ingestion/spec.md index c400ffe..db0cfd6 100644 --- a/specs/101-code-ingestion/spec.md +++ b/specs/101-code-ingestion/spec.md @@ -25,7 +25,7 @@ A developer wants to index their project's source code alongside documentation t --- -### User Story 2 - Cross-Reference Search (Priority: P1) +### User Story 2 - Cross-Reference Search (Priority: P1) ✅ IMPLEMENTED A user wants to search and find related code and documentation together from a unified corpus. @@ -38,7 +38,9 @@ A user wants to search and find related code and documentation together from a u 1. **Given** docs and code indexed, **When** I query "authentication handler", **Then** results include both doc sections and code implementing authentication 2. **Given** unified corpus, **When** I query for a function name, **Then** I see the function definition AND documentation about it 3. **Given** hybrid search enabled, **When** I query for exact code patterns, **Then** BM25 finds exact matches while vector finds conceptually related code -4. **Given** code-only query, **When** I POST with `source_type=code`, **Then** only code results are returned +4. **Given** code-only query, **When** I POST with `source_types=['code']`, **Then** only code results are returned +5. **Given** language-specific query, **When** I POST with `languages=['python']`, **Then** only Python code results are returned +6. **Given** combined filters, **When** I POST with `source_types=['code']` and `languages=['python']`, **Then** only Python code is searched --- @@ -93,6 +95,24 @@ Code is chunked at logical boundaries (functions, classes, modules) rather than --- +### User Story 7 - Config-Driven Language Support (Priority: P3 - Future Enhancement) + +A developer wants to add support for new programming languages without modifying code - just configuration changes. + +**Why this priority**: Enables support for 160+ languages via tree-sitter-language-pack without code changes. + +**Independent Test**: Edit languages.yaml config and verify new language is automatically supported. + +**Acceptance Scenarios**: + +1. **Given** languages.yaml config, **When** I add a new language entry, **Then** it's automatically supported without code changes +2. **Given** language presets, **When** I select "comprehensive" preset, **Then** programming + web + infrastructure languages are enabled +3. **Given** per-language excludes, **When** I configure test file patterns, **Then** test files are automatically excluded +4. **Given** environment overrides, **When** I set ENABLED_LANGUAGES=ruby, **Then** Ruby support is enabled +5. **Given** user custom config, **When** I provide custom languages.yaml, **Then** it overrides defaults + +--- + ### User Story 6 - Corpus for Book/Tutorial Generation (Priority: P1) A technical writer wants to create a searchable corpus from SDK source code and documentation for writing tutorials. diff --git a/specs/101-code-ingestion/tasks.md b/specs/101-code-ingestion/tasks.md new file mode 100644 index 0000000..7dcd85d --- /dev/null +++ b/specs/101-code-ingestion/tasks.md @@ -0,0 +1,258 @@ +# Tasks: Source Code Ingestion & Unified Corpus + +**Input**: Design documents from `/specs/101-code-ingestion/` +**Prerequisites**: plan.md, spec.md, research.md, data-model.md, contracts/, quickstart.md + +**Tests**: Test-Alongside approach required per constitution - unit and integration tests included + +**Organization**: Tasks are grouped by user story to enable independent implementation and testing of each story. + +**STATUS**: MVP Complete ✅ - US1, US2, US3, US4, US6 implemented and tested. US5 (AST-aware chunking) pending. + +## Format: `[ID] [P?] [Story] Description` + +- **[P]**: Can run in parallel (different files, no dependencies) +- **[Story]**: Which user story this task belongs to (e.g., US1, US2, US3) +- Include exact file paths in descriptions + +## Phase 1: Setup (Shared Infrastructure) + +**Purpose**: Project initialization and basic structure + +- [ ] T001 Add tree-sitter dependencies for Python, TypeScript, JavaScript to doc-serve-server/pyproject.toml +- [ ] T002 [P] Update doc-serve-server dependencies by running `poetry install` in doc-serve-server/ +- [ ] T003 Verify tree-sitter parsers work with test code snippets in doc-serve-server/ + +--- + +## Phase 2: Foundational (Blocking Prerequisites) + +**Purpose**: Core infrastructure that MUST be complete before ANY user story can be implemented + +**⚠️ CRITICAL**: No user story work can begin until this phase is complete + +- [ ] T004 Create CodeChunk dataclass in doc-serve-server/doc_serve_server/indexing/chunking.py +- [ ] T005 Update ChunkMetadata to support code-specific fields (language, symbol_name, start_line, end_line, section_summary) in doc-serve-server/doc_serve_server/indexing/chunking.py +- [ ] T006 Add language detection utility for file extensions in doc-serve-server/doc_serve_server/indexing/document_loader.py +- [ ] T007 Update QueryRequest/Result models with source_type and language filters in doc-serve-server/doc_serve_server/models/query.py +- [ ] T008 Update IndexRequest model with include_code, languages, exclude_patterns parameters in doc-serve-server/doc_serve_server/models/index.py + +**Checkpoint**: Foundation ready - user story implementation can now begin. + +--- + +## Phase 3: User Story 1 - Index Source Code from Folder (Priority: P1) 🎯 MVP + +**Goal**: Enable indexing of source code files alongside documentation + +**Independent Test**: POST to `/index` with `include_code=true` and verify code files appear in the index + +### Implementation for User Story 1 + +- [x] T009 [US1] Extend DocumentLoader.load_files() to support code file extensions (.py, .ts, .tsx, .js, .jsx) in doc-serve-server/doc_serve_server/indexing/document_loader.py +- [x] T010 [US1] Add CodeChunker class using LlamaIndex CodeSplitter for AST-aware chunking in doc-serve-server/doc_serve_server/indexing/chunking.py +- [x] T011 [US1] Update IndexingService to handle code files with language detection and CodeChunker in doc-serve-server/doc_serve_server/services/indexing_service.py +- [x] T012 [US1] Update /index endpoint to accept include_code, languages, exclude_patterns parameters in doc-serve-server/doc_serve_server/api/routers/index.py +- [x] T013 [US1] Add code chunk counting to /health/status endpoint in doc-serve-server/doc_serve_server/api/routers/health.py + +**Checkpoint**: User Story 1 is functional - can index code files. + +--- + +## Phase 4: User Story 2 - Cross-Reference Search (Priority: P1) + +**Goal**: Enable unified search across both documentation and code + +**Independent Test**: Query for a concept and verify results include both documentation and code examples + +### Implementation for User Story 2 + +- [x] T014 [US2] Update /query endpoint with source_type and language filtering in doc-serve-server/doc_serve_server/api/routers/query.py +- [x] T015 [US2] Update VectorStoreManager.similarity_search() to support ChromaDB where filtering by source_type/language in doc-serve-server/doc_serve_server/storage/vector_store.py +- [x] T016 [US2] Update BM25Retriever to support metadata filtering for source_type/language in doc-serve-server/doc_serve_server/indexing/bm25_index.py +- [x] T017 [US2] Update QueryService to handle source_type/language filtering in doc-serve-server/doc_serve_server/services/query_service.py + +**Checkpoint**: User Story 2 is functional - can search across docs and code. + +--- + +## Phase 5: User Story 3 - Language-Specific Filtering (Priority: P2) + +**Goal**: Enable filtering search results by programming language + +**Independent Test**: Query with `language=python` and verify only Python code results are returned + +### Implementation for User Story 3 + +- [x] T018 [US3] Add language validation to QueryRequest model in doc-serve-server/doc_serve_server/models/query.py +- [x] T019 [US3] Implement language filtering in VectorStoreManager.similarity_search() in doc-serve-server/doc_serve_server/storage/vector_store.py +- [x] T020 [US3] Implement language filtering in BM25Retriever.search() in doc-serve-server/doc_serve_server/indexing/bm25_index.py +- [x] T021 [US3] Add error handling for invalid language parameters in /query endpoint in doc-serve-server/doc_serve_server/api/routers/query.py + +**Checkpoint**: User Story 3 is functional - can filter by programming language. + +--- + +## Phase 6: User Story 4 - Code Summaries via SummaryExtractor (Priority: P2) + +**Goal**: Generate natural language descriptions for code chunks + +**Independent Test**: Index code and verify chunks have summary metadata attached + +### Implementation for User Story 4 + +- [x] T022 [US4] Add SummaryExtractor integration to embedding pipeline in doc-serve-server/doc_serve_server/indexing/embedding.py +- [x] T023 [US4] Create code-specific summary prompts in doc-serve-server/doc_serve_server/indexing/embedding.py +- [x] T024 [US4] Update CodeChunker to optionally generate summaries during chunking in doc-serve-server/doc_serve_server/indexing/chunking.py +- [x] T025 [US4] Add summary generation to IndexingService pipeline in doc-serve-server/doc_serve_server/indexing/chunking.py + +**Checkpoint**: User Story 4 is functional - code chunks include natural language summaries. + +--- + +## Phase 7: User Story 5 - AST-Aware Chunking (Priority: P3) + +**Goal**: Ensure code is chunked at logical boundaries using AST parsing + +**Independent Test**: Index code and verify chunks align with function/class boundaries + +### Implementation for User Story 5 + +- [ ] T026 [US5] Implement AST boundary detection in CodeChunker using tree-sitter in doc-serve-server/doc_serve_server/indexing/chunking.py +- [ ] T027 [US5] Add symbol name extraction from AST in CodeChunker in doc-serve-server/doc_serve_server/indexing/chunking.py +- [ ] T028 [US5] Add line number tracking for code chunks in CodeChunker in doc-serve-server/doc_serve_server/indexing/chunking.py +- [ ] T029 [US5] Update chunking tests to verify AST boundary preservation in doc-serve-server/tests/unit/test_chunking.py + +**Checkpoint**: User Story 5 is functional - code chunking respects AST boundaries. + +--- + +## Phase 8: User Story 6 - Corpus for Book/Tutorial Generation (Priority: P1) + +**Goal**: Create a searchable corpus from SDK source code and documentation for writing tutorials + +**Independent Test**: Index AWS CDK source + docs, query for patterns, verify comprehensive results + +### Implementation for User Story 6 + +- [x] T030 [US6] Verify unified search works for SDK documentation + code in doc-serve-server/doc_serve_server/services/query_service.py +- [x] T031 [US6] Test cross-reference queries with SDK examples in doc-serve-server/tests/integration/test_unified_search.py +- [x] T032 [US6] Ensure metadata includes file paths and line numbers for citations in doc-serve-server/doc_serve_server/models/query.py + +**Checkpoint**: User Story 6 is functional - SDK corpus supports tutorial writing. + +--- + +## Phase 9: Polish & Cross-Cutting Concerns + +**Purpose**: Improvements that affect multiple user stories + +- [x] T033 [P] Update doc-svr-ctl index command with --include-code, --languages, --exclude-patterns flags in doc-svr-ctl/doc_svr_ctl/commands/index.py +- [x] T034 [P] Update doc-svr-ctl query command with --source-type, --language filters in doc-svr-ctl/doc_svr_ctl/commands/query.py +- [x] T035 [P] Update README.md and docs/USER_GUIDE.md with code ingestion features +- [ ] T036 [P] Update doc-serve-skill/doc-serve/references/api_reference.md with new endpoints +- [ ] T037 [P] Update doc-serve-skill/doc-serve/references/troubleshooting-guide.md with code-specific issues +- [x] T038 Run full test suite: `task pr-qa-gate` +- [x] T039 Validate quickstart.md scenarios for code ingestion + +--- + +## Dependencies & Execution Order + +### Phase Dependencies + +- **Setup (Phase 1)**: No dependencies - can start immediately +- **Foundational (Phase 2)**: Depends on Setup completion - BLOCKS all user stories +- **User Stories (Phase 3-8)**: All depend on Foundational phase completion + - US1, US2, US6 can proceed in parallel (both P1) + - US3, US4 can proceed after US1/US2 (P2) + - US5 can proceed after foundational chunking (P3) +- **Polish (Final Phase)**: Depends on all user stories being complete + +### User Story Dependencies + +- **User Story 1 (P1)**: Foundation → Independent MVP capability +- **User Story 2 (P1)**: Foundation → Requires US1 for data to search +- **User Story 3 (P2)**: Foundation → Independent filtering capability +- **User Story 4 (P2)**: Foundation → Independent summarization capability +- **User Story 5 (P3)**: Foundation → Independent AST chunking improvement +- **User Story 6 (P1)**: Foundation + US1/US2 → SDK corpus capability + +### Within Each User Story + +- Models before services +- Services before endpoints +- Core implementation before integration +- Story complete before moving to next priority + +### Parallel Opportunities + +- All Setup tasks marked [P] can run in parallel +- All Foundational tasks marked [P] can run in parallel +- US1, US2, US6 can be worked on in parallel after Foundation +- US3, US4, US5 can be worked on in parallel after Foundation +- CLI updates in Polish phase can run in parallel +- Documentation updates in Polish phase can run in parallel + +--- + +## Parallel Example: User Stories 1 & 2 + +```bash +# Launch US1 and US2 in parallel after Foundation complete: +Task: "Extend DocumentLoader.load_files() to support code file extensions" +Task: "Update /query endpoint with source_type and language filtering" + +# Launch CLI enhancements in parallel during Polish phase: +Task: "Update doc-svr-ctl index command with --include-code flags" +Task: "Update doc-svr-ctl query command with --source-type filters" +``` + +--- + +## Implementation Strategy + +### MVP First (User Stories 1, 2 & 6 Only) + +1. Complete Phase 1: Setup +2. Complete Phase 2: Foundational (CRITICAL - blocks all stories) +3. Complete Phase 3: User Story 1 (code indexing) +4. Complete Phase 4: User Story 2 (unified search) +5. Complete Phase 8: User Story 6 (SDK corpus) +6. **STOP and VALIDATE**: Test cross-reference search independently +7. Deploy/demo MVP with basic code ingestion + +### Incremental Delivery + +1. **Foundation** → Setup + Foundational phases +2. **MVP** → US1 + US2 + US6 (code indexing + unified search + SDK corpus) +3. **Enhanced** → US3 + US4 (language filtering + summaries) +4. **Polished** → US5 + Polish phase (AST chunking + docs) + +### Parallel Team Strategy + +With multiple developers: + +1. **Foundation**: Team completes Setup + Foundational together +2. **MVP Sprint**: + - Developer A: User Story 1 (code indexing) + - Developer B: User Story 2 (unified search) + - Developer C: User Story 6 (SDK corpus validation) +3. **Enhancement Sprint**: + - Developer A: User Story 3 (language filtering) + - Developer B: User Story 4 (summaries) + - Developer C: User Story 5 (AST chunking) +4. **Polish Sprint**: All developers on CLI, docs, testing + +--- + +## Notes + +- [P] tasks = different files, no dependencies on incomplete tasks +- [Story] label maps task to specific user story for traceability +- Each user story should be independently completable and testable +- All tasks include exact file paths for implementation +- MVP scope: US1 + US2 + US6 provides core code ingestion + search capability +- Stop at any checkpoint to validate story independently +- Avoid: vague tasks, same file conflicts, cross-story dependencies that break independence +specs/101-code-ingestion/tasks.md \ No newline at end of file