feat: keep embeddings fresh after builds and purge orphaned vectors #599

Open
SHudici wants to merge 1 commit into tirth8205:main from SHudici:feat/embedding-auto-refresh

SHudici commented Jul 3, 2026

Problem

Embeddings are write-once: nothing refreshes them after a build, so semantic search silently decays as the code changes, and vectors for deleted or renamed nodes keep surfacing as ghost results.

Fix

  • EmbeddingStore.purge_orphans() deletes vectors whose qualified_name no longer exists in nodes (both tables share one SQLite file). embed_all_nodes() calls it, so a manual crg embed also cleans up ghosts. Guarded by a sqlite_master check so it is a no-op on databases without a nodes table.
  • refresh_embeddings(graph_store) runs as a post-build step on all three postprocess paths: tools/build.py::_run_postprocess (the MCP build_or_update_graph tool — skipped at postprocess="minimal", like the other enrichment steps), run_postprocess (the MCP re-run tool, with a new embeddings: bool = True flag exposed through run_postprocess_tool), and the shared postprocessing.run_post_processing (CLI watch mode, eval runner).
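
For reviewers who want the shape of the purge without opening the diff, here is a minimal standalone sketch of the idea. It is not the code in this PR: the `embeddings` table name and the free-function signature are assumptions made for illustration, while `nodes`, `qualified_name`, and the `sqlite_master` guard come from the description above.

```python
import sqlite3


def purge_orphans(conn: sqlite3.Connection) -> int:
    """Delete embedding rows whose qualified_name has no matching node.

    Returns the number of rows removed. The "embeddings" table name is
    an assumption for this sketch; the real schema may differ.
    """
    # Guard: databases without a nodes table are left untouched.
    has_nodes = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = 'nodes'"
    ).fetchone()
    if has_nodes is None:
        return 0

    # NOT EXISTS (rather than NOT IN) stays correct even if a node row
    # were ever to carry a NULL qualified_name.
    cur = conn.execute(
        "DELETE FROM embeddings "
        "WHERE NOT EXISTS ("
        "  SELECT 1 FROM nodes"
        "  WHERE nodes.qualified_name = embeddings.qualified_name"
        ")"
    )
    conn.commit()
    return cur.rowcount
```

Because both tables live in the same SQLite file, the purge is a single statement and needs no cross-database bookkeeping.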

The refresh is opt-in by construction — a build can never trigger a surprise full re-embed or cloud spend:

  1. It only acts when the graph already contains embeddings (the user ran embed at least once).
  2. It only acts with the exact provider identity that produced them, resolved from the per-row provider tag (local:<model>, google:<model>, openai:<model>@<host>, ...). If the tagged provider cannot be resolved (missing extras, missing env vars) or resolves to a different model/endpoint, the refresh logs and skips.
  3. The hash-incremental embed_nodes() keeps the refresh cheap: only new or changed nodes are re-encoded, and orphans are purged in the same pass.
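
The first two conditions amount to a small gate in front of the embed call. The sketch below is illustrative only: `Provider`, `should_refresh`, and the `resolve` callback are hypothetical names, and treating a graph with more than one distinct provider tag as a skip is an assumption of this sketch, not something the PR states.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Provider:
    # Identity string in the same format as the per-row tag,
    # e.g. "local:<model>" or "openai:<model>@<host>".
    identity: str


def should_refresh(
    stored_tags: set[str],
    resolve: Callable[[str], Optional[Provider]],
) -> tuple[bool, str]:
    """Decide whether a post-build refresh may run, and say why not."""
    if not stored_tags:
        # The user never ran embed, so the build must not start embedding.
        return False, "never embedded"
    if len(stored_tags) != 1:
        # Assumption for this sketch: mixed tags are treated as ambiguous.
        return False, "mixed provider tags"

    tag = next(iter(stored_tags))
    provider = resolve(tag)
    if provider is None:
        # Missing extras or env vars: nothing to refresh with.
        return False, "provider unresolvable"
    if provider.identity != tag:
        # A different model or endpoint would force a full re-embed.
        return False, "provider identity mismatch"
    return True, "ok"
```

Every skip path returns before any provider call is made, which is what keeps a plain build from incurring cloud spend.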

Any provider or transport error is downgraded to a build warning, matching the existing step-isolation contract in postprocessing.py.
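
A rough sketch of what that isolation looks like at a wiring point, assuming a dict-shaped build result. `run_refresh_step` and the `warnings` key are hypothetical; the `embeddings_refreshed`/`embeddings_purged` keys and the `embedded`/`purged` stats mirror the names used in the tests described below.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def run_refresh_step(refresh: Callable[[], Optional[dict]], result: dict) -> dict:
    """Run the embedding refresh without letting it fail the build."""
    try:
        stats = refresh()
    except Exception as exc:  # provider or transport error
        logger.warning("embedding refresh failed: %s", exc)
        # The "warnings" key is an assumption about the result shape.
        result.setdefault("warnings", []).append(
            f"embedding refresh failed: {exc}"
        )
        return result

    # A None return means the refresh was skipped: add no keys.
    if stats is not None:
        result["embeddings_refreshed"] = stats.get("embedded", 0)
        result["embeddings_purged"] = stats.get("purged", 0)
    return result
```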

Testing

New tests cover:

  • orphan purging, with and without a nodes table;
  • the skip cases: never embedded, legacy or unresolvable provider tag, and provider identity mismatch;
  • the embed-new-plus-purge happy path ({"embedded": 1, "purged": 1});
  • postprocessing-step isolation (a refresh failure becomes a warning, not a failed build).

The build-tool path has its own tests: full postprocess invokes the refresh and surfaces embeddings_refreshed/embeddings_purged counts, minimal skips it, a None return adds no keys, a provider error degrades to a build warning, and the run_postprocess embeddings flag is honored.

Full suite: 1412 passed / 0 failed.

🤖 Generated with Claude Code

Embeddings were write-once: nothing refreshed them after a build, so
semantic search silently decayed as code changed, and vectors for
deleted or renamed nodes kept surfacing as ghosts.

Two additions:

- EmbeddingStore.purge_orphans() deletes vectors whose qualified_name
  no longer exists in the nodes table (both tables share one SQLite
  file). embed_all_nodes() now calls it, so a manual embed also cleans
  up ghosts.
- refresh_embeddings(graph_store) runs after every build-time
  post-process: the shared run_post_processing() pipeline (watch mode,
  eval runner) gains it as a fifth step, and the tool pipeline used by
  the CLI build command and the MCP build_or_update_graph /
  run_postprocess tools invokes it at the "full" postprocess level.
  run_postprocess grows an embeddings flag (default True) alongside
  its existing flows/communities/fts knobs.

The refresh is opt-in by construction: it only acts when the graph
already contains embeddings (the user ran embed at least once), and
only with the exact provider identity that produced them (stored in
the per-row provider tag). If the tagged provider cannot be resolved
(missing extras or env vars) or resolves to a different
model/endpoint, the refresh is skipped, so a build can never trigger
a surprise full re-embed or cloud spend. The hash-incremental
embed_nodes() keeps the refresh cheap: only new or changed nodes are
re-encoded.

Every wiring point downgrades provider or transport errors to a build
warning, matching the existing step-isolation contract.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>