Feat/embedding retrieve rerank traverse by thientu · Pull Request #63 · FreePeak/LeanKG

thientu · 2026-06-30T10:43:47Z

Summary

Type of Change

Testing

Unit tests pass (cargo test)
Integration tests pass
Manual verification

Checklist

Code follows project conventions
Self-review completed
Documentation updated (if needed)
No new warnings or errors

Breaking Changes

Related Issues

Additional Context

Unified plan that scopes the deferred embedding channel from hybrid-retrieval-reranking.md (Phase 4) and ontology-semantic-search-mvp.md (Future Enhancements). Locks 4 design decisions: all-Rust stack (fastembed + usearch + ort), embed text blob + ontology description + aliases, adaptive traversal per seed type, incremental embed via embedding_state table, lazy-download with embed --init, ANN-only reranker fallback (option A), worktree exclusion default-on.

…d + usearch

Drops the originally planned ort dep: fastembed-rs covers both embedding inference and the bge-reranker-v2-m3 cross-encoder via TextRerank natively, so a separate ONNX runtime binding is redundant. Default build unchanged; new deps are optional behind the embeddings feature.

Adds src/embeddings/ behind the embeddings feature:
- text_blob: classify + build text blobs for code, ontology, doc nodes. No code bodies; deterministic u64 usearch key + content_hash from SHA-256.
- state: embedding_state CozoDB table (qualified_name, usearch_key, content_hash, state, embedded_at). mark_stale / list_stale / list_orphans / upsert_fresh / delete_state_rows / count_by_state. Indexer hook calls mark_stale_for_qualified_names after every insert_elements batch so the next embed run is incremental.
- models: fastembed wrappers for Embedder (BGESmallENV15, 384-dim) and Reranker (BGERerankerV2M3). cache_dir under dirs::cache_dir/leankg/models. init_models for embed --init pre-download.
- index: usearch HNSW wrapper, cosine + f32, file persistence.
- build: orchestrates incremental vs full rebuild, batched embed, orphan reaping, writes embeddings.usearch + embeddings.meta.json.

Wiring: schema.rs creates the embedding_state table when the feature is compiled in; indexer marks touched elements stale; lib.rs exports the module. Default builds (no feature) are unaffected.

Phases 2-6 (retrieval pipeline, adaptive traversal, MCP tool, CLI, tests) stack on top in subsequent commits.
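To illustrate the incremental lifecycle described for the state helpers, here is a minimal in-memory sketch. It is not the CozoDB-backed implementation from this PR: the `StateTable` type, its field layout, and the method signatures are assumptions made for illustration, and only the helper names (mark_stale, list_stale, upsert_fresh, list_orphans, delete_state_rows) come from the commit message.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Lifecycle of one row in the hypothetical in-memory state table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EmbedState {
    Stale,
    Fresh,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct StateRow {
    pub content_hash: String,
    pub state: EmbedState,
}

/// Stand-in for the embedding_state table, keyed by qualified_name.
/// (Assumption: the real table lives in CozoDB and has more columns.)
#[derive(Default)]
pub struct StateTable {
    rows: BTreeMap<String, StateRow>,
}

impl StateTable {
    /// Indexer hook: every touched element becomes stale.
    /// Marking the same name twice leaves a single row (dedup).
    pub fn mark_stale(&mut self, names: &[&str]) {
        for name in names {
            self.rows
                .entry((*name).to_string())
                .and_modify(|row| row.state = EmbedState::Stale)
                .or_insert(StateRow {
                    content_hash: String::new(),
                    state: EmbedState::Stale,
                });
        }
    }

    /// Names that the next incremental embed run has to process.
    pub fn list_stale(&self) -> Vec<String> {
        self.rows
            .iter()
            .filter(|(_, row)| row.state == EmbedState::Stale)
            .map(|(name, _)| name.clone())
            .collect()
    }

    /// Record a successful embed: store the new hash and flip to Fresh.
    pub fn upsert_fresh(&mut self, name: &str, content_hash: &str) {
        self.rows.insert(
            name.to_string(),
            StateRow {
                content_hash: content_hash.to_string(),
                state: EmbedState::Fresh,
            },
        );
    }

    /// Rows whose element no longer exists in the graph.
    pub fn list_orphans(&self, live: &BTreeSet<String>) -> Vec<String> {
        self.rows
            .keys()
            .filter(|name| !live.contains(*name))
            .cloned()
            .collect()
    }

    /// Orphan reaping: drop the given rows.
    pub fn delete_state_rows(&mut self, names: &[String]) {
        for name in names {
            self.rows.remove(name);
        }
    }
}
```

Under these assumptions, an incremental run would embed only what `list_stale` returns, call `upsert_fresh` per element, then reap whatever `list_orphans` reports against the set of live qualified names.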

Adds src/retrieval/ behind the embeddings feature:
- ann: Stage 2 wrapper. Embeds the query via fastembed, runs usearch top-K search, returns raw (key, distance) pairs. No CozoDB access.
- rerank: Stage 3 wrapper. Loads fastembed TextRerank (bge-reranker-v2-m3). Any load or inference failure is non-fatal: the stage degrades to ANN-order pass-through and tags the result as Fallback (Q4 option A).
- pipeline: orchestrates Stage 2 → worktree/env filter → Stage 3. Returns RetrievalResult { seeds, reranker_status, candidate counts, stale flag }.

Q2 worktree filter defaults on (.worktrees/, .claude/worktrees/, .opencode/worktrees/). env filter defaults to 'local'. Element lookup is O(n) per query for now, which is fine under 50k nodes; it can be optimized into a batched Datalog lookup later.

Phases 3-6 stack on top.
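The filter-then-rerank flow with a non-fatal reranker can be sketched as follows. This is a simplified illustration, not the pipeline from this PR: `Candidate`, `in_worktree`, `rerank_or_fallback`, and `retrieve` are hypothetical names, the reranker is modelled as a closure returning one score per candidate, and treating a higher score as more relevant is an assumption. Only the three worktree directories and the Fallback behaviour come from the commit message.

```rust
/// Directory prefixes excluded by default, as listed in the commit message.
const WORKTREE_DIRS: [&str; 3] = [
    ".worktrees/",
    ".claude/worktrees/",
    ".opencode/worktrees/",
];

/// Hypothetical candidate shape: one ANN hit resolved to a file path.
#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub file_path: String,
    pub distance: f32,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RerankerStatus {
    Applied,
    Fallback,
}

/// True when the path sits under one of the worktree directories,
/// either at the repository root or nested below another directory.
pub fn in_worktree(path: &str) -> bool {
    WORKTREE_DIRS.iter().any(|dir| {
        let nested = format!("/{}", dir);
        path.starts_with(*dir) || path.contains(nested.as_str())
    })
}

/// Stage 3 with graceful degradation: if scoring fails (or returns the
/// wrong number of scores), keep the ANN order and report Fallback.
pub fn rerank_or_fallback<F>(
    cands: Vec<Candidate>,
    top_n: usize,
    score: F,
) -> (Vec<Candidate>, RerankerStatus)
where
    F: FnOnce(&[Candidate]) -> Result<Vec<f32>, String>,
{
    match score(&cands) {
        Ok(scores) if scores.len() == cands.len() => {
            let mut paired: Vec<(Candidate, f32)> =
                cands.into_iter().zip(scores).collect();
            // Assumption: higher score means more relevant.
            paired.sort_by(|a, b| b.1.total_cmp(&a.1));
            let seeds = paired.into_iter().take(top_n).map(|(c, _)| c).collect();
            (seeds, RerankerStatus::Applied)
        }
        _ => (
            cands.into_iter().take(top_n).collect(),
            RerankerStatus::Fallback,
        ),
    }
}

/// Stage 2 output -> worktree filter -> Stage 3.
pub fn retrieve<F>(
    ann_hits: Vec<Candidate>,
    include_worktrees: bool,
    top_n: usize,
    score: F,
) -> (Vec<Candidate>, RerankerStatus)
where
    F: FnOnce(&[Candidate]) -> Result<Vec<f32>, String>,
{
    let filtered: Vec<Candidate> = ann_hits
        .into_iter()
        .filter(|c| include_worktrees || !in_worktree(&c.file_path))
        .collect();
    rerank_or_fallback(filtered, top_n, score)
}
```

Passing the scorer as a closure keeps the sketch free of model downloads; a caller would hand in whatever wraps the real reranker and still get ANN-ordered seeds back if it errors.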

Adds traverse_seeds to src/graph/traversal.rs. BFS from each seed with per-element-type rules:
- workflow seeds: 2 hops, has_step/next_step/branches_to/implemented_by/entry_point_of/step_in_process/has_failure_mode, fanout 20
- workflow_step/decision_point/failure_mode: 2 hops, fanout 15
- domain_entity/service/api_endpoint/data_store: 1 hop, fanout 15
- known_issue/playbook/team_knowledge: 1 hop, fanout 10
- function/class: 1 hop (calls/imports/tested_by/etc), fanout 10
- file/module: 1 hop, fanout 10
- unknown / doc: 1 hop, documented_by/documents_concept, fanout 5

Global cap 60 traversed neighbors. Bidirectional: walks both outgoing (get_relationships) and incoming (get_relationships_for_target) edges.

Function is feature-independent: takes plain (qualified_name, element_type) tuples so it stays reusable without the embeddings feature. The retrieval pipeline / MCP handler will adapt their Seed list to this shape in Phase 4.
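The rule table above maps naturally onto a match expression plus a bounded BFS. The sketch below is an approximation under stated assumptions, not the code in src/graph/traversal.rs: the function here takes a pre-merged adjacency map (outgoing and incoming neighbors already combined by the caller), ignores the per-type edge filters, and interprets fanout as a per-node limit on expanded neighbors. `TraversalRule` and `rule_for` are hypothetical names.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TraversalRule {
    pub max_hops: u8,
    pub fanout: usize,
}

/// Global cap on traversed neighbors, from the commit message.
pub const GLOBAL_CAP: usize = 60;

/// Hop and fanout limits per seed element type (edge filters omitted).
pub fn rule_for(element_type: &str) -> TraversalRule {
    let (max_hops, fanout) = match element_type {
        "workflow" => (2, 20),
        "workflow_step" | "decision_point" | "failure_mode" => (2, 15),
        "domain_entity" | "service" | "api_endpoint" | "data_store" => (1, 15),
        "known_issue" | "playbook" | "team_knowledge" => (1, 10),
        "function" | "class" => (1, 10),
        "file" | "module" => (1, 10),
        _ => (1, 5), // unknown / doc
    };
    TraversalRule { max_hops, fanout }
}

/// BFS from each (qualified_name, element_type) seed over a merged
/// adjacency map. Returns newly discovered neighbors in visit order.
pub fn traverse_seeds(
    seeds: &[(&str, &str)],
    neighbors: &HashMap<&str, Vec<&str>>,
) -> Vec<String> {
    // Seeds themselves are never reported as traversed neighbors.
    let mut seen: HashSet<String> =
        seeds.iter().map(|(name, _)| name.to_string()).collect();
    let mut out: Vec<String> = Vec::new();

    for (seed, element_type) in seeds {
        let rule = rule_for(element_type);
        let mut queue: VecDeque<(String, u8)> = VecDeque::new();
        queue.push_back((seed.to_string(), 0));

        while let Some((node, depth)) = queue.pop_front() {
            if depth >= rule.max_hops {
                continue;
            }
            if let Some(next) = neighbors.get(node.as_str()) {
                for n in next.iter().take(rule.fanout) {
                    if out.len() >= GLOBAL_CAP {
                        return out;
                    }
                    // insert() returns false for already-visited nodes.
                    if seen.insert(n.to_string()) {
                        out.push(n.to_string());
                        queue.push_back((n.to_string(), depth + 1));
                    }
                }
            }
        }
    }
    out
}
```

Because the function only needs plain string tuples and an adjacency lookup, it has no dependency on the embedding types, which matches the feature-independence goal stated above.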

Adds the embedding-backed retrieval tool, gated by the embeddings feature so default builds neither advertise nor dispatch it.

Tool schema (src/mcp/tools.rs): query + env + top_k (default 50) + rerank_top_n (default 10) + traverse (default true) + include_worktrees (default false, Q2 worktree filter) + debug (default false) + project.

Handler (src/mcp/handler.rs::kg_semantic_context):
- 404s with a clear error if .leankg/embeddings.usearch is missing
- Initializes SemanticRetrievalPipeline (loads usearch + fastembed models)
- Calls pipeline.retrieve with the Q2 worktree filter and Q4 fallback already baked into the pipeline
- Stage 4 adaptive traversal via traverse_seeds when traverse=true
- Q3 stale-embeddings detection (embeddings_are_stale): compares embeddings.meta.json.built_at vs leankg.db mtime, conservative default
- Response shape mirrors kg_context: query, env, seeds[], traversed[]. Diagnostics only included when debug=true (reranker status, candidate counts, latency per stage, edges).
- Token budget enforcement happens in execute_tool like every other tool.
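The stale-embeddings check compares two timestamps, which can be sketched as below. The signature is an assumption (the PR only names embeddings_are_stale), and so is the reading of "conservative default" as "report stale whenever either timestamp is unavailable". `file_mtime` is a hypothetical helper.

```rust
use std::path::Path;
use std::time::SystemTime;

/// Last-modified time of a file, or None if it cannot be read.
pub fn file_mtime(path: &Path) -> Option<SystemTime> {
    std::fs::metadata(path).and_then(|m| m.modified()).ok()
}

/// Embeddings are stale when the database changed after the index was
/// built. Assumption: if either timestamp is missing, report stale.
pub fn embeddings_are_stale(
    built_at: Option<SystemTime>,
    db_mtime: Option<SystemTime>,
) -> bool {
    match (built_at, db_mtime) {
        (Some(built), Some(db)) => db > built,
        _ => true,
    }
}
```

In this sketch the caller would parse built_at out of embeddings.meta.json and pass `file_mtime` of leankg.db as the second argument; parsing the metadata file is left out to keep the example dependency-free.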

Adds two gated subcommands to mirror the MCP tool from a terminal:
- embed [--init|--full] [--batch-size N] [--project PATH]
    --init: pre-download embedding + reranker models to cache (no build)
    default: incremental build (stale-only)
    --full: full rebuild (recovery / model swap)
- semantic-context QUERY [--env local] [--top-k 50] [--rerank-top-n 10] [--no-traverse] [--include-worktrees] [--debug] [--project PATH]

Both wired through main.rs helpers run_embed / run_semantic_context. main.rs now declares mod embeddings / mod retrieval behind the same feature gate so the binary can use them. Default builds (no feature) skip the variants entirely; --help will not list them.
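How the semantic-context flags map onto the tool defaults can be shown with a small hand-rolled parser. The PR does not show its argument parser, so everything here (`SemanticContextArgs`, `parse_semantic_context`, the error strings) is a hypothetical std-only stand-in; only the flag names and default values are taken from the commit message, and --project is omitted for brevity.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct SemanticContextArgs {
    pub query: String,
    pub env: String,
    pub top_k: usize,
    pub rerank_top_n: usize,
    pub traverse: bool,
    pub include_worktrees: bool,
    pub debug: bool,
}

/// Fetch the value that must follow a flag such as --env.
fn value<'a, 'b>(
    it: &mut std::slice::Iter<'b, &'a str>,
    flag: &str,
) -> Result<&'a str, String> {
    it.next()
        .copied()
        .ok_or_else(|| format!("{} needs a value", flag))
}

/// Fetch and parse a numeric flag value.
fn number<'a, 'b>(
    it: &mut std::slice::Iter<'b, &'a str>,
    flag: &str,
) -> Result<usize, String> {
    value(it, flag)?
        .parse::<usize>()
        .map_err(|e| format!("{}: {}", flag, e))
}

/// Parse the arguments that follow the `semantic-context` subcommand.
pub fn parse_semantic_context(argv: &[&str]) -> Result<SemanticContextArgs, String> {
    // Defaults as documented: env local, top-k 50, rerank-top-n 10,
    // traversal on, worktrees excluded, debug off.
    let mut out = SemanticContextArgs {
        query: String::new(),
        env: "local".to_string(),
        top_k: 50,
        rerank_top_n: 10,
        traverse: true,
        include_worktrees: false,
        debug: false,
    };
    let mut query: Option<String> = None;
    let mut it = argv.iter();

    while let Some(arg) = it.next() {
        match *arg {
            "--no-traverse" => out.traverse = false,
            "--include-worktrees" => out.include_worktrees = true,
            "--debug" => out.debug = true,
            "--env" => out.env = value(&mut it, "--env")?.to_string(),
            "--top-k" => out.top_k = number(&mut it, "--top-k")?,
            "--rerank-top-n" => out.rerank_top_n = number(&mut it, "--rerank-top-n")?,
            other if other.starts_with("--") => {
                return Err(format!("unknown flag {}", other));
            }
            other => query = Some(other.to_string()),
        }
    }

    out.query = query.ok_or_else(|| "QUERY is required".to_string())?;
    Ok(out)
}
```

Note that --no-traverse and --include-worktrees are negations of the defaults, so the CLI and the MCP tool schema end up with the same effective defaults when no flags are passed.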

Docs:
- docs/mcp-tools.md: new Semantic Retrieval section documenting kg_semantic_context (schema, args, response shape, fallback behavior).
- docs/mcp-setup.md: new Embedding Retrieval setup section covering the feature flag, one-time model download, index lifecycle, worktree exclusion, and reranker-fallback troubleshooting.

Tests:
- tests/embeddings_state_e2e.rs (gated): integration coverage for the embedding_state CozoDB helpers: table creation idempotency, mark_stale insertion + dedup, upsert_fresh state transitions, list_orphans detection, delete_state_rows, count_by_state, lookup_usearch_key round-trip. Uses init_db + tempfile; no model downloads required, so it runs to completion wherever the feature is compiled in.

Unit tests for text_blob, usearch, traversal rules, and pipeline worktree-filter already live alongside their source under #[cfg(test)].

thientu added 8 commits June 30, 2026 08:31

thientu wants to merge 8 commits into main from feat/embedding-retrieve-rerank-traverse
