feat: keep embeddings fresh after builds and purge orphaned vectors #599
Open
SHudici wants to merge 1 commit into
Embeddings were write-once: nothing refreshed them after a build, so semantic search silently decayed as code changed, and vectors for deleted or renamed nodes kept surfacing as ghosts.

Two additions:

- EmbeddingStore.purge_orphans() deletes vectors whose qualified_name no longer exists in the nodes table (both tables share one SQLite file). embed_all_nodes() now calls it, so a manual embed also cleans up ghosts.
- refresh_embeddings(graph_store) runs after every build-time post-process: the shared run_post_processing() pipeline (watch mode, eval runner) gains it as a fifth step, and the tool pipeline used by the CLI build command and the MCP build_or_update_graph / run_postprocess tools invokes it at the "full" postprocess level. run_postprocess grows an embeddings flag (default True) alongside its existing flows/communities/fts knobs.

The refresh is opt-in by construction: it only acts when the graph already contains embeddings (the user ran embed at least once), and only with the exact provider identity that produced them (stored in the per-row provider tag). If the tagged provider cannot be resolved (missing extras or env vars) or resolves to a different model/endpoint, the refresh is skipped, so a build can never trigger a surprise full re-embed or cloud spend. The hash-incremental embed_nodes() keeps the refresh cheap: only new or changed nodes are re-encoded.

Every wiring point downgrades provider or transport errors to a build warning, matching the existing step-isolation contract.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
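For readers who want a concrete picture of the orphan purge, here is a minimal sketch using the standard `sqlite3` module. It is not the PR's implementation: the `embeddings` table name and the standalone `purge_orphans(conn)` signature are assumptions made for illustration. Only the `nodes` table, the `qualified_name` column, and the `sqlite_master` guard come from the description.

```python
import sqlite3


def purge_orphans(conn: sqlite3.Connection) -> int:
    """Delete embedding rows whose qualified_name has no matching node.

    Hypothetical stand-in for EmbeddingStore.purge_orphans(); the
    `embeddings` table name is an assumption, not taken from the PR.
    """
    # Guard: databases without a nodes table are left untouched.
    has_nodes = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = 'nodes'"
    ).fetchone()
    if has_nodes is None:
        return 0

    # NOT EXISTS (rather than NOT IN) stays correct even if nodes
    # ever contains a NULL qualified_name.
    cur = conn.execute(
        "DELETE FROM embeddings "
        "WHERE NOT EXISTS ("
        "  SELECT 1 FROM nodes"
        "  WHERE nodes.qualified_name = embeddings.qualified_name"
        ")"
    )
    conn.commit()
    return cur.rowcount  # number of ghost vectors removed


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory database with one live and one ghost vector."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE embeddings (qualified_name TEXT PRIMARY KEY, vector BLOB)"
    )
    conn.executemany(
        "INSERT INTO embeddings VALUES (?, ?)",
        [("pkg.alive", b"\x00"), ("pkg.deleted", b"\x01")],
    )
    conn.commit()
    return conn
```

Because both tables live in the same SQLite file, the purge is a single statement with no cross-database coordination, which is what makes it cheap enough to run on every embed.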
Problem
Embeddings are write-once: nothing refreshes them after a build, so semantic search silently decays as the code changes, and vectors for deleted or renamed nodes keep surfacing as ghost results.
Fix
- `EmbeddingStore.purge_orphans()` deletes vectors whose `qualified_name` no longer exists in `nodes` (both tables share one SQLite file). `embed_all_nodes()` calls it, so a manual `crg embed` also cleans up ghosts. It is guarded by a `sqlite_master` check, so it is a no-op on databases without a `nodes` table.
- `refresh_embeddings(graph_store)` runs as a post-build step on all three postprocess paths:
  - `tools/build.py::_run_postprocess` (the MCP `build_or_update_graph` tool; skipped at `postprocess="minimal"`, like the other enrichment steps),
  - `run_postprocess` (the MCP re-run tool, with a new `embeddings: bool = True` flag exposed through `run_postprocess_tool`), and
  - the shared `postprocessing.run_post_processing` (CLI watch mode, eval runner).

The refresh is opt-in by construction, so a build can never trigger a surprise full re-embed or cloud spend:

- It only acts when the graph already contains embeddings (the user ran `embed` at least once).
- It only runs with the exact provider identity that produced them, stored in the per-row provider tag (`local:<model>`, `google:<model>`, `openai:<model>@<host>`, ...). If the tagged provider cannot be resolved (missing extras, missing env vars) or resolves to a different model/endpoint, the refresh logs and skips.
- The hash-incremental `embed_nodes()` keeps the refresh cheap: only new or changed nodes are re-encoded, and orphans are purged in the same pass.

Any provider or transport error is downgraded to a build warning, matching the existing step-isolation contract in `postprocessing.py`.

Testing
New tests cover: orphan purging (with and without a `nodes` table), never-embedded → skip, legacy/unresolvable provider tag → skip, provider identity mismatch → skip, and the embed-new-plus-purge happy path (`{"embedded": 1, "purged": 1}`), plus postprocessing-step isolation (refresh failure → warning, not a failed build). The build-tool path has its own tests: full postprocess invokes the refresh and surfaces `embeddings_refreshed` / `embeddings_purged` counts, minimal skips it, a `None` return adds no keys, a provider error degrades to a build warning, and the `run_postprocess` `embeddings` flag is honored.

Full suite: 1412 passed / 0 failed.

🤖 Generated with Claude Code