feat: US6 SDK Corpus + US3 Language Filtering + US4 LLM Code Summaries #69
Merged
RichardHightower merged 24 commits into main on Dec 19, 2025
Phase 1 Planning Artifacts:
- plan.md: Technical implementation plan with constitution compliance
- data-model.md: Unified corpus schema with code metadata extensions
- quickstart.md: Usage examples and CLI/API reference
- contracts/api-extensions.md: OpenAPI specifications for new endpoints
- Agent context updated for Claude with new technologies

Research (Phase 0) was already complete and comprehensive.
Next: Run /speckit.tasks to generate implementation task breakdown.

Expand supported languages from 3 to 9:
- Python, TypeScript, JavaScript (existing)
- C, C++, Java (systems/JVM)
- Go, Rust, Swift (modern languages)

Updates:
- research.md: Added tree-sitter support analysis for all languages
- plan.md: Updated technical context to reflect 9-language support
- quickstart.md: Added examples for all languages and file extensions
- data-model.md: Updated enums and examples
- contracts/api-extensions.md: Updated OpenAPI schemas with all languages
- Chunking behavior documented for each language family
- Performance considerations for diverse language ecosystems

All existing functionality preserved with expanded language coverage.

Add Kotlin language support using fwcd/tree-sitter-kotlin parser:
- research.md: Added Kotlin to tree-sitter support table, chunking behavior, extensions, language detection, and dependencies
- plan.md: Updated technical context to include Kotlin in multi-language scope
- quickstart.md: Added Kotlin to supported languages, examples, and CLI reference
- contracts/api-extensions.md: Updated OpenAPI schemas to include Kotlin in language enums
- data-model.md: Updated to reflect 10-language support

Kotlin extensions: .kt, .kts
Kotlin chunking: Functions, classes, data classes, extension functions, null safety operators
Kotlin use cases: Android development, modern JVM applications, data class patterns

Total languages now supported: 10 (Python, TypeScript, JavaScript, Kotlin, C, C++, Java, Go, Rust, Swift)

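As a rough illustration of the extension-based language detection these commits refer to, here is a minimal sketch. The names `EXTENSION_TO_LANGUAGE` and `detect_language` are hypothetical and are not taken from the project. Only the Kotlin extensions (.kt, .kts) come from the commit message; the other extensions are conventional ones assumed for illustration.

```python
from pathlib import Path
from typing import Optional

# Hypothetical mapping. Only ".kt" and ".kts" are stated in the commit
# message; the remaining entries are common conventions, assumed here.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".ts": "typescript",
    ".js": "javascript",
    ".kt": "kotlin",
    ".kts": "kotlin",
    ".c": "c",
    ".cpp": "cpp",
    ".java": "java",
    ".go": "go",
    ".rs": "rust",
    ".swift": "swift",
}


def detect_language(path: str) -> Optional[str]:
    """Return a language name for a file path, or None if the extension is unknown."""
    # Path.suffix yields only the final extension, so "build.gradle.kts" -> ".kts".
    return EXTENSION_TO_LANGUAGE.get(Path(path).suffix.lower())
```

Files with unrecognized extensions return `None`, which lets a caller decide whether to skip them or treat them as documentation.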
Phase 2 Planning Complete:
- tasks.md: 35 structured tasks across 8 phases
- Organized by user story for independent implementation
- MVP scope: US1 + US2 (code indexing + unified search)
- Includes parallel opportunities and dependency tracking
- All tasks follow strict checklist format with file paths

Tasks breakdown:
- Phase 1: Setup (3 tasks) - Dependencies and verification
- Phase 2: Foundational (5 tasks) - Core infrastructure
- Phase 3: US1 (5 tasks) - Code indexing MVP
- Phase 4: US2 (4 tasks) - Cross-reference search
- Phase 5: US3 (4 tasks) - Language filtering
- Phase 6: US4 (4 tasks) - Code summaries
- Phase 7: US5 (4 tasks) - AST-aware chunking
- Phase 8: Polish (6 tasks) - CLI, docs, testing

Ready for implementation with clear execution order and checkpoints.

Phase 1 Setup Complete:
- T001 ✅ Added tree-sitter-language-pack ^0.7.3 to pyproject.toml
- T002 ✅ Updated poetry dependencies and lock file
- T003 ✅ Verified all 10 language parsers work (Python, TypeScript, JavaScript, Kotlin, C, C++, Java, Go, Rust, Swift)

All tree-sitter parsers tested and working correctly.
Ready for Phase 2 foundational implementation.

- Implement cross-reference search combining docs + code in single queries
- Add source_type filtering (--source-types doc, code, test)
- Add language filtering (--languages python, typescript, etc)
- Add ChromaDB where clause filtering for vector search
- Add BM25 metadata filtering for keyword search
- Implement manual hybrid fusion replacing QueryFusionRetriever
- Update QueryService with filtering pipeline
- Update CLI and API with filtering parameters
- Fix all tests to pass with 58% coverage
- Complete US2 acceptance criteria: cross-reference queries work

Closes #T014, #T015, #T016, #T017

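The filtering and fusion steps listed above could look roughly like the sketch below. It is not the project's code. The commit does not say which fusion formula replaced QueryFusionRetriever, so reciprocal rank fusion is swapped in here as a common choice. The helper names `build_where` and `reciprocal_rank_fusion`, and the metadata keys `source_type` and `language`, are assumptions. The filter dict uses the `$and` / `$in` operator style of ChromaDB `where` clauses.

```python
from typing import Dict, List, Optional, Sequence


def build_where(
    source_types: Optional[Sequence[str]] = None,
    languages: Optional[Sequence[str]] = None,
) -> Optional[dict]:
    """Build a metadata filter dict; return None when no filter is requested."""
    clauses = []
    if source_types:
        clauses.append({"source_type": {"$in": list(source_types)}})
    if languages:
        clauses.append({"language": {"$in": list(languages)}})
    if not clauses:
        return None
    # A single condition is passed as-is; several are combined with $and.
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}


def reciprocal_rank_fusion(
    vector_ids: Sequence[str], bm25_ids: Sequence[str], k: int = 60
) -> List[str]:
    """Merge two ranked ID lists; each hit contributes 1 / (k + rank)."""
    scores: Dict[str, float] = {}
    for ranking in (vector_ids, bm25_ids):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))
```

Applying the same metadata filter to both retrievers before fusing keeps the two candidate lists comparable, so a chunk excluded by `--languages` cannot re-enter through the keyword side.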
- Add comprehensive integration tests for unified search functionality
- Create test_unified_search.py with SDK cross-reference scenarios
- Verify metadata completeness for Claude skill citations
- Test tutorial writing workflows with docs + code
- Ensure QueryResult includes file paths, line numbers, symbols
- Validate end-to-end SDK corpus functionality

Closes #66, #67, #68 (T030, T031, T032)

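To make the citation metadata requirement concrete, here is a small sketch of a result record that carries a file path, line range, and symbol. The class name `QueryResultSketch`, its fields, and the `citation` format are hypothetical; the project's actual QueryResult may be shaped differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QueryResultSketch:
    """Hypothetical stand-in for a search hit with citation metadata."""

    text: str
    source_type: str  # e.g. "doc", "code", or "test"
    file_path: str
    start_line: Optional[int] = None
    end_line: Optional[int] = None
    symbol: Optional[str] = None
    language: Optional[str] = None

    def citation(self) -> str:
        """Format 'symbol (path:start-end)', omitting whatever is missing."""
        location = self.file_path
        if self.start_line is not None:
            location += f":{self.start_line}"
            if self.end_line is not None and self.end_line != self.start_line:
                location += f"-{self.end_line}"
        return f"{self.symbol} ({location})" if self.symbol else location
```

A documentation hit with no line numbers or symbol degrades to just its path, so the same formatter can serve both docs and code results.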
- Add language validation to QueryRequest model using LanguageDetector
- Validate that language filter parameters are supported programming languages
- Reject invalid languages with helpful error messages listing supported options
- Language filtering already implemented via ChromaDB where clauses and BM25 filters
- API endpoint returns proper 422 validation errors for invalid languages

Closes #70 (T018)

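The validation behavior described above can be sketched with the standard library alone. The function `validate_languages` and the constant `SUPPORTED_LANGUAGES` are hypothetical names, and the exact identifiers used for each language (for example `cpp`) are assumptions. In a FastAPI application, a `ValueError` raised from a request-model validator is typically reported to the client as a 422 response, which matches the behavior the commit describes.

```python
from typing import List, Optional, Sequence

# Assumed identifiers for the 10 supported languages.
SUPPORTED_LANGUAGES = (
    "c", "cpp", "go", "java", "javascript",
    "kotlin", "python", "rust", "swift", "typescript",
)


def validate_languages(languages: Optional[Sequence[str]]) -> Optional[List[str]]:
    """Normalize a language filter and reject anything unsupported."""
    if languages is None:
        return None  # no filter requested
    normalized = [name.strip().lower() for name in languages]
    invalid = sorted(set(normalized) - set(SUPPORTED_LANGUAGES))
    if invalid:
        # List the supported options so the caller can correct the request.
        raise ValueError(
            f"Unsupported language(s): {', '.join(invalid)}. "
            f"Supported: {', '.join(SUPPORTED_LANGUAGES)}"
        )
    return normalized
```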
- Remove unused imports (QueryFusionRetriever, FUSION_MODES, patch)
- Remove unused variable assignments (vector_scores, bm25_scores)
- Fix line length violations with proper line breaks
- Organize imports in test files
- Add missing newline at end of test file
- Rename unused loop variables (_chunk_id)
- All ruff checks now pass
- 69 tests pass with 57.67% coverage

Style improvements for maintainability and consistency.

…aries fix: Resolve all linting issues in codebase
- Add SummaryExtractor integration to embedding pipeline
- Implement LLM-powered code summarization for semantic search
- Create code-specific summary prompts for different languages
- Update CodeChunker to optionally generate summaries during chunking
- Add generate_summaries parameter to indexing pipeline
- Integrate summaries into chunk metadata for search
- Add CLI --generate-summaries flag
- Fallback to docstrings/comments if LLM unavailable

Closes #71, #72, #73, #74

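The docstring fallback mentioned in the last bullet might work along the lines of this sketch, shown for Python sources only. The functions `docstring_summary` and `summarize_chunk` and the `llm_summarize` callable are hypothetical; the real pipeline uses SummaryExtractor and language-specific prompts that are not reproduced here.

```python
import ast
from typing import Callable, Optional


def docstring_summary(source: str) -> Optional[str]:
    """Return the first line of the first docstring found, or None."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return None  # not parseable as Python; nothing to fall back on
    documented = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
    for node in ast.walk(tree):
        if isinstance(node, documented):
            doc = ast.get_docstring(node)
            if doc:
                return doc.strip().splitlines()[0]
    return None


def summarize_chunk(
    source: str, llm_summarize: Optional[Callable[[str], str]] = None
) -> Optional[str]:
    """Prefer the LLM summary; fall back to the docstring if it is unavailable."""
    if llm_summarize is not None:
        try:
            summary = llm_summarize(source)
            if summary:
                return summary.strip()
        except Exception:
            pass  # treat any LLM failure as "unavailable" in this sketch
    return docstring_summary(source)
```

Catching every exception is a deliberate simplification here; a production version would likely log the failure and distinguish rate limits from configuration errors.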
- Use Anthropic SDK directly for LLM summarization instead of llama_index wrapper
- Maintain OpenAI for embeddings, Claude for summarization as originally planned
- Claude 3.5 Haiku provides better cost/performance ratio for code summarization
- Direct Anthropic SDK integration avoids dependency issues
- All tests pass with 56.24% coverage

Fixes model selection to match original architecture and .env.example specifications.

- Clarify Claude model usage in settings.py comment
- Update .env.example with clearer Claude model description
- Remove temporary coverage files
- Maintain OpenAI for embeddings, Claude for summarization architecture

Related to US4 LLM code summaries implementation.

…aries Feat/us4 llm code summaries
- Fix linting errors (line length, unused imports, deprecated types)
- Fix type checking errors (missing annotations, type mismatches)
- Fix critical bug in hybrid fusion logic causing SearchResult/QueryResult confusion
- Update test formatting to comply with style guidelines
- Maintain 57% test coverage (exceeds 50% requirement)

All PR QA gates now passing - ready for merge.

…c-serve-skill into 101-code-ingestion
- Resolved conflicts by keeping QA fixes
- Maintains all linting and type checking fixes
- Ready for PR CI checks

- Fix arduino/setup-task@v2 version from 3.x to 3.43.3
- Prevents 404 error when downloading non-existent version
- Resolves CI failure in PR #69

- Fixed 2 E501 line length violations in chunking.py
- Broke long function calls and f-strings across multiple lines
- All linting checks now pass

- Update README.md with code ingestion features and supported languages
- Update QUICK_START.md with CLI testing instructions and code examples
- Update USER_GUIDE.md with code-aware search and filtering capabilities
- Update DEVELOPER_GUIDE.md with architecture changes and implementation details
- Mark completed user stories in SDD task files (US1, US2, US3, US4, US6)
- Add status note indicating MVP completion with US5 pending

All documentation now reflects the implemented code ingestion features.

- Add explicit note that features/tasks are not considered done unless pr-qa-gate passes
- Strengthen the importance of running full quality checks before check-in

- Fixed progress calculation bug where progress was stuck at 67.2%
- Issue: Progress callback calculated per-language instead of accumulating across all languages
- Solution: Track total code documents processed across all languages
- Progress now updates correctly during LLM summary generation phase

Resolves progress bar getting stuck during large codebase indexing with multiple programming languages.

- Fixed progress bar stuck at 67.2% during code indexing
- Issue: Progress calculated per-language instead of accumulating across languages
- Solution: Track total code documents processed across all languages
- Progress now updates correctly from 35-50% during code chunking phase
- Added noqa comment for B023 warning (functionally correct closure usage)

Resolves progress display issues when indexing codebases with multiple programming languages.

- Fixed B023 function definition loop variable binding warning
- Added proper return type annotation for make_progress_callback
- Fixed E501 line length violations
- Added noqa comments for acceptable false positives
- All QA gates now pass: linting, type checking, tests, coverage

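The progress fix and the B023 fix in the last three commits fit together, and the idea can be sketched as follows. The name `make_progress_callback` and the 35-50% range come from the commit messages; the signature, parameter names, and the demo data are assumptions. Passing the running total into a factory binds it at creation time, which avoids the late-binding closure problem that B023 warns about, and adding that total to the per-language count gives cumulative progress.

```python
from typing import Callable, List


def make_progress_callback(
    report: Callable[[float], None],
    docs_done_before: int,
    total_docs: int,
    start_pct: float = 35.0,
    end_pct: float = 50.0,
) -> Callable[[int], None]:
    """Create a callback that reports progress across all languages."""

    def callback(processed_in_language: int) -> None:
        # Accumulate: documents from earlier languages plus the current one.
        done = docs_done_before + processed_in_language
        fraction = min(done / total_docs, 1.0) if total_docs else 1.0
        report(start_pct + (end_pct - start_pct) * fraction)

    return callback


# Demo with made-up counts: two languages, two code documents each.
docs_by_language = {"python": 2, "kotlin": 2}
total = sum(docs_by_language.values())
reported: List[float] = []
done_so_far = 0
for language, count in docs_by_language.items():
    # done_so_far is bound here as an argument, not captured from the loop.
    on_progress = make_progress_callback(reported.append, done_so_far, total)
    for processed in range(1, count + 1):
        on_progress(processed)
    done_so_far += count
```

With a per-language calculation, the second language would restart from the beginning of the range; here the reported values rise monotonically to the end of the phase.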
Complete US6 SDK Corpus, US3 Language Filtering, and US4 LLM Code Summaries
US6: SDK Corpus for Book/Tutorial Generation
US3: Language-Specific Filtering
US4: Code Summaries via LLM
Implementation
test_unified_search.py
--generate-summaries CLI flag and API parameter
Validation
Related Issues