feat: keep embeddings fresh after builds and purge orphaned vectors by SHudici · Pull Request #599 · tirth8205/code-review-graph

SHudici · 2026-07-03T10:16:29Z

Problem

Embeddings are write-once: nothing refreshes them after a build, so semantic search silently decays as the code changes, and vectors for deleted or renamed nodes keep surfacing as ghost results.

Fix

- EmbeddingStore.purge_orphans() deletes vectors whose qualified_name no longer exists in nodes (both tables share one SQLite file). embed_all_nodes() calls it, so a manual crg embed also cleans up ghosts. Guarded by a sqlite_master check so it is a no-op on databases without a nodes table.
- refresh_embeddings(graph_store) runs as a post-build step on all three postprocess paths: tools/build.py::_run_postprocess (the MCP build_or_update_graph tool; skipped at postprocess="minimal", like the other enrichment steps), run_postprocess (the MCP re-run tool, with a new embeddings: bool = True flag exposed through run_postprocess_tool), and the shared postprocessing.run_post_processing (CLI watch mode, eval runner).
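
For reviewers who want the shape of the purge without opening the diff, here is a minimal sketch of an orphan purge guarded by a sqlite_master check. It is not the code in this PR: the embeddings table name, the column layout, and the standalone function signature are assumptions made for illustration.

```python
import sqlite3


def purge_orphans(conn: sqlite3.Connection) -> int:
    """Delete embedding rows whose qualified_name is missing from nodes.

    Assumed schema: embeddings(qualified_name, ...) and
    nodes(qualified_name, ...) living in the same SQLite file.
    Returns the number of rows deleted.
    """
    # Guard: on a database without a nodes table there is nothing to
    # compare against, so the purge is a no-op.
    has_nodes = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = 'nodes'"
    ).fetchone()
    if has_nodes is None:
        return 0

    # NOT EXISTS (rather than NOT IN) stays correct even if nodes
    # happens to contain NULL qualified_name values.
    cur = conn.execute(
        "DELETE FROM embeddings "
        "WHERE NOT EXISTS ("
        "  SELECT 1 FROM nodes"
        "  WHERE nodes.qualified_name = embeddings.qualified_name"
        ")"
    )
    conn.commit()
    return cur.rowcount
```

Returning the deleted-row count makes it straightforward to surface a "purged" figure to the caller.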

The refresh is opt-in by construction — a build can never trigger a surprise full re-embed or cloud spend:

- It only acts when the graph already contains embeddings (the user ran embed at least once).
- It only acts with the exact provider identity that produced them, resolved from the per-row provider tag (local:<model>, google:<model>, openai:<model>@<host>, ...). If the tagged provider cannot be resolved (missing extras, missing env vars) or resolves to a different model/endpoint, the refresh logs and skips.
- The hash-incremental embed_nodes() keeps the refresh cheap: only new or changed nodes are re-encoded, and orphans are purged in the same pass.
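
The gating rules above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: the Provider class, the resolve callback, and the decision to skip when rows carry more than one distinct tag are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Provider:
    """Hypothetical stand-in for a resolved embedding provider."""

    tag: str  # identity string, e.g. "local:<model>"


def pick_refresh_provider(
    stored_tags: set[str],
    resolve: Callable[[str], Optional[Provider]],
) -> Optional[Provider]:
    """Return the provider to refresh with, or None to skip the refresh."""
    # Never embedded: the user has not opted in, so do nothing.
    if not stored_tags:
        return None
    # Assumption: rows tagged by several providers are ambiguous, so skip.
    if len(stored_tags) != 1:
        return None
    (tag,) = stored_tags
    # resolve() returns None when extras or env vars are missing.
    provider = resolve(tag)
    if provider is None:
        return None
    # Resolved to a different model or endpoint: skip rather than re-embed.
    if provider.tag != tag:
        return None
    return provider
```

Every skip path returns None, so a caller can treat None as "leave the existing vectors untouched".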

Any provider or transport error is downgraded to a build warning, matching the existing step-isolation contract in postprocessing.py.
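
A minimal sketch of that isolation, assuming a dict-shaped build result with a "warnings" list (both the wrapper function and that key are hypothetical; only the embeddings_refreshed / embeddings_purged keys and the {"embedded", "purged"} counts come from the tests described below):

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def run_refresh_step(
    refresh: Callable[[], Optional[dict]],
    result: dict,
) -> dict:
    """Run the embedding refresh without letting it fail the build."""
    try:
        counts = refresh()
    except Exception as exc:  # provider or transport error
        # Downgrade to a warning; the rest of the build result is kept.
        logger.warning("embedding refresh skipped: %s", exc)
        result.setdefault("warnings", []).append(
            f"embedding refresh skipped: {exc}"
        )
        return result
    # A None return means the refresh was skipped: add no keys.
    if counts is not None:
        result["embeddings_refreshed"] = counts["embedded"]
        result["embeddings_purged"] = counts["purged"]
    return result
```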

Testing

New tests cover: orphan purging (with and without a nodes table), never-embedded → skip, legacy/unresolvable provider tag → skip, provider identity mismatch → skip, and the embed-new-plus-purge happy path ({"embedded": 1, "purged": 1}), plus postprocessing-step isolation (refresh failure → warning, not a failed build). The build-tool path has its own tests: full postprocess invokes the refresh and surfaces embeddings_refreshed/embeddings_purged counts, minimal skips it, a None return adds no keys, a provider error degrades to a build warning, and the run_postprocess embeddings flag is honored. Full suite: 1412 passed / 0 failed.

🤖 Generated with Claude Code

Embeddings were write-once: nothing refreshed them after a build, so semantic search silently decayed as code changed, and vectors for deleted or renamed nodes kept surfacing as ghosts.

Two additions:

- EmbeddingStore.purge_orphans() deletes vectors whose qualified_name no longer exists in the nodes table (both tables share one SQLite file). embed_all_nodes() now calls it, so a manual embed also cleans up ghosts.
- refresh_embeddings(graph_store) runs after every build-time post-process: the shared run_post_processing() pipeline (watch mode, eval runner) gains it as a fifth step, and the tool pipeline used by the CLI build command and the MCP build_or_update_graph / run_postprocess tools invokes it at the "full" postprocess level. run_postprocess grows an embeddings flag (default True) alongside its existing flows/communities/fts knobs.

The refresh is opt-in by construction: it only acts when the graph already contains embeddings (the user ran embed at least once), and only with the exact provider identity that produced them (stored in the per-row provider tag). If the tagged provider cannot be resolved (missing extras or env vars) or resolves to a different model/endpoint, the refresh is skipped, so a build can never trigger a surprise full re-embed or cloud spend. The hash-incremental embed_nodes() keeps the refresh cheap: only new or changed nodes are re-encoded.

Every wiring point downgrades provider or transport errors to a build warning, matching the existing step-isolation contract.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

SHudici wants to merge 1 commit into tirth8205:main from SHudici:feat/embedding-auto-refresh
