Feat/embedding retrieve rerank traverse#63
Draft
thientu wants to merge 8 commits into
Unified plan that scopes the deferred embedding channel from hybrid-retrieval-reranking.md (Phase 4) and ontology-semantic-search-mvp.md (Future Enhancements). Locks 4 design decisions: all-Rust stack (fastembed + usearch + ort), embed text blob + ontology description + aliases, adaptive traversal per seed type, incremental embed via embedding_state table, lazy-download with embed --init, ANN-only reranker fallback (option A), worktree exclusion default-on.
…d + usearch

Drops the originally-planned ort dep: fastembed-rs covers both embedding inference and the bge-reranker-v2-m3 cross-encoder via TextRerank natively, so a separate ONNX runtime binding is redundant. Default build unchanged; new deps are optional behind the embeddings feature.
Adds src/embeddings/ behind the embeddings feature:
- text_blob: classify + build text blobs for code, ontology, doc nodes.
  No code bodies; deterministic u64 usearch key + content_hash from SHA-256.
- state: embedding_state CozoDB table (qualified_name, usearch_key,
  content_hash, state, embedded_at). mark_stale / list_stale / list_orphans /
  upsert_fresh / delete_state_rows / count_by_state. Indexer hook calls
  mark_stale_for_qualified_names after every insert_elements batch so the
  next embed run is incremental.
- models: fastembed wrappers for Embedder (BGESmallENV15, 384-dim) and
  Reranker (BGERerankerV2M3). cache_dir under dirs::cache_dir/leankg/models.
  init_models for embed --init pre-download.
- index: usearch HNSW wrapper, cosine + f32, file persistence.
- build: orchestrates incremental vs full rebuild, batched embed, orphan
  reaping, writes embeddings.usearch + embeddings.meta.json.
Wiring: schema.rs creates the embedding_state table when the feature is
compiled in; indexer marks touched elements stale; lib.rs exports the module.
Default builds (no feature) are unaffected. Phases 2-6 (retrieval pipeline,
adaptive traversal, MCP tool, CLI, tests) stack on top in subsequent commits.
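To make the incremental lifecycle concrete, here is a minimal in-memory sketch of the embedding_state helpers named above. It is not the PR's CozoDB code: the StateTable, StateRow and EmbedState types, the HashMap backing store, and the placeholder key 0 for not-yet-embedded rows are all assumptions for illustration. Only the helper names and the column names come from the commit message.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical two-value state; the PR stores a `state` column in CozoDB.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EmbedState {
    Stale,
    Fresh,
}

/// One row of the (assumed) table, keyed externally by qualified_name.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct StateRow {
    pub usearch_key: u64,
    pub content_hash: String,
    pub state: EmbedState,
}

/// In-memory stand-in for the embedding_state table.
#[derive(Default)]
pub struct StateTable {
    rows: HashMap<String, StateRow>,
}

impl StateTable {
    /// Flag a name as needing re-embedding. Repeated calls never duplicate a row.
    pub fn mark_stale(&mut self, qualified_name: &str) {
        self.rows
            .entry(qualified_name.to_string())
            .and_modify(|row| row.state = EmbedState::Stale)
            .or_insert(StateRow {
                usearch_key: 0, // placeholder until the first embed run (assumption)
                content_hash: String::new(),
                state: EmbedState::Stale,
            });
    }

    /// Record a successful embed: the row becomes Fresh with its key and hash.
    pub fn upsert_fresh(&mut self, qualified_name: &str, usearch_key: u64, content_hash: &str) {
        self.rows.insert(
            qualified_name.to_string(),
            StateRow {
                usearch_key,
                content_hash: content_hash.to_string(),
                state: EmbedState::Fresh,
            },
        );
    }

    /// Names the next incremental run must embed, sorted for determinism.
    pub fn list_stale(&self) -> Vec<String> {
        let mut names: Vec<String> = self
            .rows
            .iter()
            .filter(|(_, row)| row.state == EmbedState::Stale)
            .map(|(name, _)| name.clone())
            .collect();
        names.sort();
        names
    }

    /// Rows whose element is no longer present in the graph.
    pub fn list_orphans(&self, live_names: &HashSet<String>) -> Vec<String> {
        let mut names: Vec<String> = self
            .rows
            .keys()
            .filter(|name| !live_names.contains(*name))
            .cloned()
            .collect();
        names.sort();
        names
    }

    /// Remove rows, for example after orphan reaping.
    pub fn delete_state_rows(&mut self, names: &[String]) {
        for name in names {
            self.rows.remove(name);
        }
    }

    pub fn count_by_state(&self, state: EmbedState) -> usize {
        self.rows.values().filter(|row| row.state == state).count()
    }
}
```

In this model an incremental run would call list_stale, embed those names, then upsert_fresh each one; list_orphans plus delete_state_rows covers reaping.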
Adds src/retrieval/ behind the embeddings feature:
- ann: Stage 2 wrapper. Embeds the query via fastembed, runs usearch
top-K search, returns raw (key, distance) pairs. No CozoDB access.
- rerank: Stage 3 wrapper. Loads fastembed TextRerank (bge-reranker-v2-m3).
Any load or inference failure is non-fatal — the stage degrades to
ANN-order pass-through and tags the result as Fallback (Q4 option A).
- pipeline: orchestrates Stage 2 → worktree/env filter → Stage 3. Returns
RetrievalResult { seeds, reranker_status, candidate counts, stale flag }.
Q2 worktree filter defaults on (.worktrees/, .claude/worktrees/,
.opencode/worktrees/). env filter defaults to 'local'.
Element lookup is O(n) per query for now, which is fine under 50k nodes; it
can be optimized to a batched Datalog lookup later. Phases 3-6 stack on top.
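The filter-then-rerank flow with the option A fallback can be sketched as follows. This is an illustration under assumptions, not the PR's pipeline: Candidate, RerankerStatus::Applied, is_worktree_path and filter_and_rerank are hypothetical names, and the reranker is modelled as a closure returning one score per candidate instead of a real fastembed call. The three excluded directory prefixes are the ones listed in the commit message.

```rust
/// Worktree directories excluded by default (from the commit message).
const WORKTREE_MARKERS: [&str; 3] = [
    ".worktrees/",
    ".claude/worktrees/",
    ".opencode/worktrees/",
];

/// True when the path lies under one of the worktree directories.
pub fn is_worktree_path(path: &str) -> bool {
    let normalized = path.replace('\\', "/");
    WORKTREE_MARKERS.iter().any(|marker| {
        normalized.starts_with(marker) || normalized.contains(&format!("/{marker}"))
    })
}

/// Hypothetical ANN hit, already resolved to a file path.
#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub key: u64,
    pub path: String,
    pub distance: f32,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RerankerStatus {
    Applied,
    Fallback,
}

/// Filter worktree hits, then rerank. Any reranker error, or a score list of
/// the wrong length, degrades to ANN order and is reported as Fallback.
pub fn filter_and_rerank<F>(
    mut candidates: Vec<Candidate>,
    include_worktrees: bool,
    top_n: usize,
    rerank: F,
) -> (Vec<Candidate>, RerankerStatus)
where
    F: FnOnce(&[Candidate]) -> Result<Vec<f32>, String>,
{
    if !include_worktrees {
        candidates.retain(|c| !is_worktree_path(&c.path));
    }
    match rerank(&candidates) {
        Ok(scores) if scores.len() == candidates.len() => {
            // Higher score means more relevant; sort descending.
            let mut scored: Vec<(f32, Candidate)> = scores.into_iter().zip(candidates).collect();
            scored.sort_by(|a, b| b.0.total_cmp(&a.0));
            let top = scored.into_iter().take(top_n).map(|(_, c)| c).collect();
            (top, RerankerStatus::Applied)
        }
        _ => {
            // ANN-order pass-through: candidates are assumed sorted by distance.
            candidates.truncate(top_n);
            (candidates, RerankerStatus::Fallback)
        }
    }
}
```

Modelling the reranker as a fallible closure keeps the fallback path testable without downloading any model.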
Adds traverse_seeds to src/graph/traversal.rs. BFS from each seed with
per-element-type rules:
- workflow seeds: 2 hops, has_step/next_step/branches_to/implemented_by/
  entry_point_of/step_in_process/has_failure_mode, fanout 20
- workflow_step/decision_point/failure_mode: 2 hops, fanout 15
- domain_entity/service/api_endpoint/data_store: 1 hop, fanout 15
- known_issue/playbook/team_knowledge: 1 hop, fanout 10
- function/class: 1 hop (calls/imports/tested_by/etc), fanout 10
- file/module: 1 hop, fanout 10
- unknown / doc: 1 hop, documented_by/documents_concept, fanout 5
Global cap 60 traversed neighbors. Bidirectional: walks both outgoing
(get_relationships) and incoming (get_relationships_for_target) edges.
Function is feature-independent: takes plain (qualified_name, element_type)
tuples so it stays reusable without the embeddings feature. The retrieval
pipeline / MCP handler will adapt their Seed list to this shape in Phase 4.
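A minimal sketch of the per-type rules and the capped BFS, under stated assumptions: the graph is a plain adjacency map in which the caller has already merged outgoing and incoming edges, fanout is applied per expanded node, and relationship-type filtering is omitted. The Rule struct and rule_for are hypothetical, and this traverse_seeds signature is illustrative only; the hop counts, fanout values and the cap of 60 are taken from the list above.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Global cap on traversed neighbors (from the commit message).
pub const GLOBAL_CAP: usize = 60;

/// Hypothetical rule record: BFS depth and how many neighbors to take per node.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Rule {
    pub max_hops: usize,
    pub fanout: usize,
}

/// Per-element-type rules; relationship-type filters are left out of this sketch.
pub fn rule_for(element_type: &str) -> Rule {
    match element_type {
        "workflow" => Rule { max_hops: 2, fanout: 20 },
        "workflow_step" | "decision_point" | "failure_mode" => Rule { max_hops: 2, fanout: 15 },
        "domain_entity" | "service" | "api_endpoint" | "data_store" => {
            Rule { max_hops: 1, fanout: 15 }
        }
        "known_issue" | "playbook" | "team_knowledge" => Rule { max_hops: 1, fanout: 10 },
        "function" | "class" | "file" | "module" => Rule { max_hops: 1, fanout: 10 },
        _ => Rule { max_hops: 1, fanout: 5 }, // unknown / doc
    }
}

/// BFS from each (qualified_name, element_type) seed over an adjacency map.
/// Returns newly discovered neighbors in visit order, never the seeds themselves.
pub fn traverse_seeds(
    graph: &HashMap<String, Vec<String>>,
    seeds: &[(String, String)],
) -> Vec<String> {
    let mut visited: HashSet<String> = seeds.iter().map(|(name, _)| name.clone()).collect();
    let mut traversed = Vec::new();

    for (name, element_type) in seeds {
        let rule = rule_for(element_type);
        let mut queue = VecDeque::from([(name.clone(), 0usize)]);

        while let Some((node, depth)) = queue.pop_front() {
            if depth >= rule.max_hops {
                continue; // nodes at the hop limit are not expanded further
            }
            if let Some(neighbors) = graph.get(&node) {
                for next in neighbors.iter().take(rule.fanout) {
                    if traversed.len() >= GLOBAL_CAP {
                        return traversed;
                    }
                    if visited.insert(next.clone()) {
                        traversed.push(next.clone());
                        queue.push_back((next.clone(), depth + 1));
                    }
                }
            }
        }
    }
    traversed
}
```

Taking plain string tuples, as the commit describes, is what keeps a function like this usable without the embeddings feature.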
Adds the embedding-backed retrieval tool, gated by the embeddings feature so
default builds neither advertise nor dispatch it.
Tool schema (src/mcp/tools.rs): query + env + top_k (default 50) +
rerank_top_n (default 10) + traverse (default true) + include_worktrees
(default false, Q2 worktree filter) + debug (default false) + project.
Handler (src/mcp/handler.rs::kg_semantic_context):
- 404s with a clear error if .leankg/embeddings.usearch is missing
- Initializes SemanticRetrievalPipeline (loads usearch + fastembed models)
- Calls pipeline.retrieve with the Q2 worktree filter and Q4 fallback
  already baked into the pipeline
- Stage 4 adaptive traversal via traverse_seeds when traverse=true
- Q3 stale-embeddings detection (embeddings_are_stale): compares
  embeddings.meta.json.built_at vs leankg.db mtime, conservative default
- Response shape mirrors kg_context: query, env, seeds[], traversed[].
  Diagnostics only included when debug=true (reranker status, candidate
  counts, latency per stage, edges).
- Token budget enforcement happens in execute_tool like every other tool.
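The Q3 staleness check can be sketched as a pure comparison of two timestamps. The signature below is hypothetical and may not match the PR's embeddings_are_stale. It also assumes that "conservative default" means "report stale whenever either timestamp is unavailable", which the commit message does not spell out. Parsing built_at out of embeddings.meta.json is left to the caller, and file_mtime is an illustrative helper.

```rust
use std::fs;
use std::path::Path;
use std::time::SystemTime;

/// Last-modified time of a file, or None if it is missing or unreadable.
pub fn file_mtime(path: &Path) -> Option<SystemTime> {
    fs::metadata(path).ok()?.modified().ok()
}

/// Stale when the database changed after the index was built.
/// Assumption: any missing timestamp is treated as stale (conservative).
pub fn embeddings_are_stale(built_at: Option<SystemTime>, db_mtime: Option<SystemTime>) -> bool {
    match (built_at, db_mtime) {
        (Some(built), Some(db)) => db > built,
        _ => true,
    }
}
```

A caller would pass the parsed built_at together with the result of file_mtime on the leankg.db path and surface the boolean as the stale flag in the response.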
Adds two gated subcommands to mirror the MCP tool from a terminal:
- embed [--init|--full] [--batch-size N] [--project PATH]
--init: pre-download embedding + reranker models to cache (no build)
default: incremental build (stale-only)
--full: full rebuild (recovery / model swap)
- semantic-context QUERY [--env local] [--top-k 50] [--rerank-top-n 10]
[--no-traverse] [--include-worktrees] [--debug]
[--project PATH]
Both wired through main.rs helpers run_embed / run_semantic_context.
main.rs now declares mod embeddings / mod retrieval behind the same
feature gate so the binary can use them. Default builds (no feature)
skip the variants entirely; --help will not list them.
Docs:
- docs/mcp-tools.md: new Semantic Retrieval section documenting
  kg_semantic_context (schema, args, response shape, fallback behavior).
- docs/mcp-setup.md: new Embedding Retrieval setup section covering the
  feature flag, one-time model download, index lifecycle, worktree exclusion,
  and reranker-fallback troubleshooting.
Tests:
- tests/embeddings_state_e2e.rs (gated): integration coverage for the
  embedding_state CozoDB helpers: table creation idempotency, mark_stale
  insertion + dedup, upsert_fresh state transitions, list_orphans detection,
  delete_state_rows, count_by_state, lookup_usearch_key round-trip. Uses
  init_db + tempfile; no model downloads required, so it runs to completion
  wherever the feature is compiled in.
Unit tests for text_blob, usearch, traversal rules, and pipeline
worktree-filter already live alongside their source under #[cfg(test)].