Feat/embedding retrieve rerank traverse#63

Draft
thientu wants to merge 8 commits into main from feat/embedding-retrieve-rerank-traverse

thientu commented Jun 30, 2026

Summary

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation update
  • Refactoring
  • Chore

Testing

  • Unit tests pass (cargo test)
  • Integration tests pass
  • Manual verification

Checklist

  • Code follows project conventions
  • Self-review completed
  • Documentation updated (if needed)
  • No new warnings or errors

Breaking Changes

Related Issues

Additional Context

thientu added 8 commits June 30, 2026 08:31
Unified plan that scopes the deferred embedding channel from
hybrid-retrieval-reranking.md (Phase 4) and ontology-semantic-search-mvp.md
(Future Enhancements). Locks 4 design decisions: all-Rust stack
(fastembed + usearch + ort), embed text blob + ontology description +
aliases, adaptive traversal per seed type, incremental embed via
embedding_state table, lazy-download with embed --init, ANN-only
reranker fallback (option A), worktree exclusion default-on.

…d + usearch

Drops the originally-planned ort dep: fastembed-rs covers both embedding
inference and the bge-reranker-v2-m3 cross-encoder via TextRerank natively,
so a separate ONNX runtime binding is redundant. Default build unchanged;
new deps are optional behind the embeddings feature.

Adds src/embeddings/ behind the embeddings feature:

- text_blob: classify + build text blobs for code, ontology, doc nodes.
  No code bodies; deterministic u64 usearch key + content_hash from SHA-256.
- state: embedding_state CozoDB table (qualified_name, usearch_key,
  content_hash, state, embedded_at). mark_stale / list_stale / list_orphans
  / upsert_fresh / delete_state_rows / count_by_state. Indexer hook calls
  mark_stale_for_qualified_names after every insert_elements batch so the
  next embed run is incremental.
- models: fastembed wrappers for Embedder (BGESmallENV15, 384-dim) and
  Reranker (BGERerankerV2M3). cache_dir under dirs::cache_dir/leankg/models.
  init_models for embed --init pre-download.
- index: usearch HNSW wrapper, cosine + f32, file persistence.
- build: orchestrates incremental vs full rebuild, batched embed, orphan
  reaping, writes embeddings.usearch + embeddings.meta.json.
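
For reviewers, the embedding_state lifecycle listed above can be sketched with an in-memory map standing in for the CozoDB table. This is a minimal illustration, not the code in this PR: the helper names come from the commit message, while the `EmbeddingState` and `StateRow` types, the `Option` fields, and the `live_names` parameter are assumptions made for the example.

```rust
use std::collections::{HashMap, HashSet};

/// Lifecycle state of one embedding row (assumed two-state model).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EmbedState {
    Stale,
    Fresh,
}

/// One row of the hypothetical in-memory stand-in for `embedding_state`.
#[derive(Debug, Clone)]
struct StateRow {
    usearch_key: Option<u64>,
    content_hash: Option<String>,
    state: EmbedState,
}

/// Rows keyed by qualified_name.
#[derive(Default)]
struct EmbeddingState {
    rows: HashMap<String, StateRow>,
}

impl EmbeddingState {
    /// Mark names stale. Repeated names are deduplicated by the map,
    /// and an existing row keeps its usearch key.
    fn mark_stale(&mut self, names: &[&str]) {
        for name in names {
            self.rows
                .entry(name.to_string())
                .and_modify(|row| row.state = EmbedState::Stale)
                .or_insert(StateRow {
                    usearch_key: None,
                    content_hash: None,
                    state: EmbedState::Stale,
                });
        }
    }

    /// Record a successful embed for one element.
    fn upsert_fresh(&mut self, name: &str, usearch_key: u64, content_hash: &str) {
        self.rows.insert(
            name.to_string(),
            StateRow {
                usearch_key: Some(usearch_key),
                content_hash: Some(content_hash.to_string()),
                state: EmbedState::Fresh,
            },
        );
    }

    /// Names the next incremental run has to re-embed, sorted for determinism.
    fn list_stale(&self) -> Vec<String> {
        let mut out: Vec<String> = self
            .rows
            .iter()
            .filter(|(_, row)| row.state == EmbedState::Stale)
            .map(|(name, _)| name.clone())
            .collect();
        out.sort();
        out
    }

    /// Rows whose element is no longer present among the live graph names.
    fn list_orphans(&self, live_names: &[&str]) -> Vec<String> {
        let live: HashSet<&str> = live_names.iter().copied().collect();
        let mut out: Vec<String> = self
            .rows
            .keys()
            .filter(|name| !live.contains(name.as_str()))
            .cloned()
            .collect();
        out.sort();
        out
    }

    fn lookup_usearch_key(&self, name: &str) -> Option<u64> {
        self.rows.get(name).and_then(|row| row.usearch_key)
    }

    fn count_by_state(&self, state: EmbedState) -> usize {
        self.rows.values().filter(|row| row.state == state).count()
    }
}
```

In this sketch an incremental run would embed only what `list_stale` returns, call `upsert_fresh` for each result, and reap whatever `list_orphans` reports. That is the same shape as the build step described above.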

Wiring: schema.rs creates the embedding_state table when the feature is
compiled in; indexer marks touched elements stale; lib.rs exports the
module. Default builds (no feature) are unaffected.

Phases 2-6 (retrieval pipeline, adaptive traversal, MCP tool, CLI,
tests) stack on top in subsequent commits.

Adds src/retrieval/ behind the embeddings feature:

- ann: Stage 2 wrapper. Embeds the query via fastembed, runs usearch
  top-K search, returns raw (key, distance) pairs. No CozoDB access.
- rerank: Stage 3 wrapper. Loads fastembed TextRerank (bge-reranker-v2-m3).
  Any load or inference failure is non-fatal — the stage degrades to
  ANN-order pass-through and tags the result as Fallback (Q4 option A).
- pipeline: orchestrates Stage 2 → worktree/env filter → Stage 3. Returns
  RetrievalResult { seeds, reranker_status, candidate counts, stale flag }.
  Q2 worktree filter defaults on (.worktrees/, .claude/worktrees/,
  .opencode/worktrees/). env filter defaults to 'local'.
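
The worktree filter and the ANN-order fallback described above can be illustrated roughly as follows. This is a sketch under stated assumptions: `Candidate`, `RerankerStatus`, and `retrieve_sketch` are hypothetical names, the reranker is modeled as a closure returning one score per candidate, and the env filter is left out. Only the three worktree path markers and the "degrade to ANN order, tag as Fallback" behavior come from the commit message.

```rust
/// Stage 3 outcome tag (the commit message names the `Fallback` case).
#[derive(Debug, PartialEq)]
enum RerankerStatus {
    Ok,
    Fallback,
}

/// Hypothetical candidate shape: one ANN hit resolved to a graph element.
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    qualified_name: String,
    file_path: String,
    distance: f32,
}

/// Path fragments that mark a worktree checkout (taken from the commit text).
const WORKTREE_MARKERS: [&str; 3] =
    [".worktrees/", ".claude/worktrees/", ".opencode/worktrees/"];

fn in_worktree(path: &str) -> bool {
    WORKTREE_MARKERS.iter().any(|marker| path.contains(marker))
}

/// Filter the ANN hits, then rerank. `ann_hits` is assumed to arrive in ANN
/// order (closest first). Any reranker error, or a score list of the wrong
/// length, degrades to ANN-order pass-through.
fn retrieve_sketch<F>(
    ann_hits: Vec<Candidate>,
    include_worktrees: bool,
    rerank_top_n: usize,
    rerank: F,
) -> (Vec<Candidate>, RerankerStatus)
where
    F: FnOnce(&[Candidate]) -> Result<Vec<f32>, String>,
{
    let filtered: Vec<Candidate> = ann_hits
        .into_iter()
        .filter(|c| include_worktrees || !in_worktree(&c.file_path))
        .collect();

    match rerank(&filtered) {
        Ok(scores) if scores.len() == filtered.len() => {
            let mut scored: Vec<(f32, Candidate)> =
                scores.into_iter().zip(filtered).collect();
            // Higher score first; the stable sort keeps ANN order on ties.
            scored.sort_by(|a, b| b.0.total_cmp(&a.0));
            let seeds = scored
                .into_iter()
                .take(rerank_top_n)
                .map(|(_, c)| c)
                .collect();
            (seeds, RerankerStatus::Ok)
        }
        _ => {
            let seeds = filtered.into_iter().take(rerank_top_n).collect();
            (seeds, RerankerStatus::Fallback)
        }
    }
}
```

Modeling the reranker as a fallible closure keeps the failure path easy to exercise without any model download: passing a closure that returns `Err` yields the filtered ANN order with `RerankerStatus::Fallback`.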

Element lookup is O(n) per query for now, which is fine under 50k nodes;
a batched Datalog lookup can replace it later. Phases 3-6 stack on top.

Adds traverse_seeds to src/graph/traversal.rs. BFS from each seed with
per-element-type rules:

- workflow seeds: 2 hops, has_step/next_step/branches_to/implemented_by/
  entry_point_of/step_in_process/has_failure_mode, fanout 20
- workflow_step/decision_point/failure_mode: 2 hops, fanout 15
- domain_entity/service/api_endpoint/data_store: 1 hop, fanout 15
- known_issue/playbook/team_knowledge: 1 hop, fanout 10
- function/class: 1 hop (calls/imports/tested_by/etc), fanout 10
- file/module: 1 hop, fanout 10
- unknown / doc: 1 hop, documented_by/documents_concept, fanout 5
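
A rough sketch of how the per-type hop and fanout rules above could drive a bounded BFS, together with the global cap of 60 that this commit also sets. It is an illustration rather than the PR's implementation: the graph is a plain adjacency map assumed to already merge outgoing and incoming edges, relationship-type filtering is omitted, fanout is interpreted as "neighbors considered per expanded node", and `Rule`, `rule_for`, `Graph`, and `traverse_seeds_sketch` are hypothetical names.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Global cap on traversed neighbors (value from the commit message).
const GLOBAL_CAP: usize = 60;

#[derive(Clone, Copy)]
struct Rule {
    max_hops: usize,
    fanout: usize,
}

/// Hop and fanout limits per element type, following the list above.
fn rule_for(element_type: &str) -> Rule {
    match element_type {
        "workflow" => Rule { max_hops: 2, fanout: 20 },
        "workflow_step" | "decision_point" | "failure_mode" => {
            Rule { max_hops: 2, fanout: 15 }
        }
        "domain_entity" | "service" | "api_endpoint" | "data_store" => {
            Rule { max_hops: 1, fanout: 15 }
        }
        "known_issue" | "playbook" | "team_knowledge" | "function" | "class"
        | "file" | "module" => Rule { max_hops: 1, fanout: 10 },
        // Unknown and doc elements get the most conservative rule.
        _ => Rule { max_hops: 1, fanout: 5 },
    }
}

/// Adjacency map: qualified_name -> neighbor names (assumed bidirectional).
type Graph = HashMap<String, Vec<String>>;

/// BFS from each (qualified_name, element_type) seed under that seed's rule.
/// Returns neighbors in discovery order, excluding the seeds themselves.
fn traverse_seeds_sketch(graph: &Graph, seeds: &[(&str, &str)]) -> Vec<String> {
    let mut seen: HashSet<String> =
        seeds.iter().map(|(name, _)| name.to_string()).collect();
    let mut out: Vec<String> = Vec::new();

    for (seed, element_type) in seeds {
        let rule = rule_for(element_type);
        let mut queue = VecDeque::from([(seed.to_string(), 0usize)]);

        while let Some((node, depth)) = queue.pop_front() {
            if depth == rule.max_hops {
                continue; // hop budget for this seed is spent
            }
            if let Some(neighbors) = graph.get(&node) {
                for next in neighbors.iter().take(rule.fanout) {
                    if out.len() >= GLOBAL_CAP {
                        return out;
                    }
                    if seen.insert(next.clone()) {
                        out.push(next.clone());
                        queue.push_back((next.clone(), depth + 1));
                    }
                }
            }
        }
    }
    out
}
```

Taking plain `(qualified_name, element_type)` tuples mirrors the feature-independent signature this commit describes, so a sketch like this can be tested without the embeddings feature.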

Global cap 60 traversed neighbors. Bidirectional: walks both outgoing
(get_relationships) and incoming (get_relationships_for_target) edges.
Function is feature-independent: takes plain (qualified_name, element_type)
tuples so it stays reusable without the embeddings feature. The retrieval
pipeline / MCP handler will adapt their Seed list to this shape in Phase 4.

Adds the embedding-backed retrieval tool, gated by the embeddings feature
so default builds neither advertise nor dispatch it.

Tool schema (src/mcp/tools.rs): query + env + top_k (default 50) +
rerank_top_n (default 10) + traverse (default true) + include_worktrees
(default false, Q2 worktree filter) + debug (default false) + project.
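
To make the schema defaults concrete, the arguments could be modeled along these lines. The struct and constructor are hypothetical (the PR defines the schema in src/mcp/tools.rs, whose exact types are not shown here). The field names and default values follow the schema text above, and the `local` env default is taken from the pipeline commit.

```rust
/// Hypothetical typed view of the tool arguments.
#[derive(Debug, Clone, PartialEq)]
struct SemanticContextArgs {
    query: String,
    env: String,
    top_k: usize,
    rerank_top_n: usize,
    traverse: bool,
    include_worktrees: bool,
    debug: bool,
    project: Option<String>,
}

impl SemanticContextArgs {
    /// Build arguments for a query with every optional field at its default.
    fn new(query: &str) -> Self {
        SemanticContextArgs {
            query: query.to_string(),
            env: "local".to_string(), // env filter default
            top_k: 50,                // ANN candidates fetched in Stage 2
            rerank_top_n: 10,         // seeds kept after Stage 3
            traverse: true,           // run Stage 4 adaptive traversal
            include_worktrees: false, // worktree exclusion is on by default
            debug: false,             // diagnostics omitted unless requested
            project: None,
        }
    }
}
```

A caller that only supplies `query` therefore gets 50 ANN candidates, 10 reranked seeds, traversal on, and worktree paths excluded.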

Handler (src/mcp/handler.rs::kg_semantic_context):
- 404s with a clear error if .leankg/embeddings.usearch is missing
- Initializes SemanticRetrievalPipeline (loads usearch + fastembed models)
- Calls pipeline.retrieve with the Q2 worktree filter and Q4 fallback
  already baked into the pipeline
- Stage 4 adaptive traversal via traverse_seeds when traverse=true
- Q3 stale-embeddings detection (embeddings_are_stale): compares
  embeddings.meta.json.built_at vs leankg.db mtime, conservative default
- Response shape mirrors kg_context: query, env, seeds[], traversed[].
  Diagnostics only included when debug=true (reranker status, candidate
  counts, latency per stage, edges).
- Token budget enforcement happens in execute_tool like every other tool.

Adds two gated subcommands to mirror the MCP tool from a terminal:

- embed [--init|--full] [--batch-size N] [--project PATH]
    --init: pre-download embedding + reranker models to cache (no build)
    default: incremental build (stale-only)
    --full: full rebuild (recovery / model swap)

- semantic-context QUERY [--env local] [--top-k 50] [--rerank-top-n 10]
                       [--no-traverse] [--include-worktrees] [--debug]
                       [--project PATH]
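
The three `embed` modes above can be sketched as a small hand-rolled flag parser. This is only an illustration using the standard library: how the PR actually parses arguments is not shown here, and `EmbedMode`, `EmbedArgs`, and `parse_embed_args` are hypothetical names. Treating `--init` and `--full` as mutually exclusive is inferred from the `[--init|--full]` usage line.

```rust
/// Which kind of embed run was requested.
#[derive(Debug, PartialEq)]
enum EmbedMode {
    Init,        // pre-download models, no build
    Incremental, // default: stale-only build
    Full,        // full rebuild
}

#[derive(Debug, PartialEq)]
struct EmbedArgs {
    mode: EmbedMode,
    batch_size: Option<usize>,
    project: Option<String>,
}

/// Parse the flags that follow `embed` on the command line.
fn parse_embed_args(args: &[&str]) -> Result<EmbedArgs, String> {
    let mut parsed = EmbedArgs {
        mode: EmbedMode::Incremental,
        batch_size: None,
        project: None,
    };
    let mut it = args.iter();

    while let Some(arg) = it.next() {
        match *arg {
            "--init" | "--full" => {
                if parsed.mode != EmbedMode::Incremental {
                    return Err("only one of --init / --full may be given".to_string());
                }
                parsed.mode = if *arg == "--init" {
                    EmbedMode::Init
                } else {
                    EmbedMode::Full
                };
            }
            "--batch-size" => {
                let raw = it.next().ok_or("--batch-size needs a value")?;
                let n = raw
                    .parse::<usize>()
                    .map_err(|e| format!("bad --batch-size: {}", e))?;
                parsed.batch_size = Some(n);
            }
            "--project" => {
                let path = it.next().ok_or("--project needs a value")?;
                parsed.project = Some(path.to_string());
            }
            other => return Err(format!("unknown flag: {}", other)),
        }
    }
    Ok(parsed)
}
```

With no flags the sketch resolves to the incremental, stale-only mode, matching the default described in the usage text.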

Both wired through main.rs helpers run_embed / run_semantic_context.
main.rs now declares mod embeddings / mod retrieval behind the same
feature gate so the binary can use them. Default builds (no feature)
skip the variants entirely; --help will not list them.

Docs:
- docs/mcp-tools.md: new Semantic Retrieval section documenting
  kg_semantic_context (schema, args, response shape, fallback behavior).
- docs/mcp-setup.md: new Embedding Retrieval setup section covering the
  feature flag, one-time model download, index lifecycle, worktree
  exclusion, and reranker-fallback troubleshooting.

Tests:
- tests/embeddings_state_e2e.rs (gated): integration coverage for the
  embedding_state CozoDB helpers — table creation idempotency,
  mark_stale insertion + dedup, upsert_fresh state transitions,
  list_orphans detection, delete_state_rows, count_by_state,
  lookup_usearch_key round-trip. Uses init_db + tempfile; no model
  downloads required, so it runs to completion wherever the feature is
  compiled in.

Unit tests for text_blob, usearch, traversal rules, and pipeline
worktree-filter already live alongside their source under #[cfg(test)].