fix(workspace): scope projectPrefixStore.ListDocuments to this project #263
Open
kryptt wants to merge 1 commit into
projectPrefixStore.ListDocuments was a pure passthrough to the
underlying shared vector store, returning every document across every
project in the workspace. The indexer then computed its scan-vs-stored
diff against that workspace-wide set, treated every other project's
document as "no longer on disk", and ran RemoveFile against each one.
The removes silently no-op'd — projectPrefixStore.DeleteByFile re-adds
the project prefix to the already-prefixed path, producing a
doubly-prefixed key that never matches a stored row — so DB integrity
was preserved, but two side effects leaked through:
* Misleading stats: every per-project scan reported the entire
  workspace's document count under "files removed", e.g.

      Initial scan complete: 0 files indexed, 0 chunks created,
      8173 files removed, 24 skipped (took 1.866s)

  when only 24 of the workspace's 8173 docs are in this project and
  nothing was actually removed.
* Wasted work: thousands of Postgres roundtrips per scan as the
  indexer fires no-op deletes for every other project's file.
Fix: ListDocuments now filters to entries whose path begins with this
project's prefix and strips the prefix before returning, so the
indexer's relative-path bookkeeping lines up with what its scanner
enumerates. The delete-loop's RemoveFile call re-adds the prefix via
the existing wrapper, so legitimate removes still target the right
rows.
Test plan: TestProjectPrefixStore_PassThroughAndGetChunks updated to
feed a mixed-workspace listing and assert (a) only this project's
entries come back, and (b) the prefix is stripped. The previous
assertion (len(docs) == 2 against ["a", "b"]) was checking the buggy
passthrough behaviour and would fail under the fix; both pieces now
align on the new contract.
Problem
`projectPrefixStore.ListDocuments` in `cli/watch.go` is a pure passthrough to the underlying shared vector store. In workspace mode the shared store holds every project's documents, so the indexer's scan-vs-stored diff (in `indexer.IndexAllWithBatchProgress`) compares this project's freshly-scanned files against the whole workspace's document set, treats every other project's document as "no longer present on disk", and runs `RemoveFile` against each one.

The removes are silent no-ops: `projectPrefixStore.DeleteByFile` re-adds the project prefix to the already-prefixed key, producing a doubly-prefixed path that never matches a stored row, so the database stays consistent. But two visible problems leak through:

Misleading stats. Every per-project scan reports the workspace's total document count under "files removed":
```
Initial scan complete: 0 files indexed, 0 chunks created,
8173 files removed, 24 skipped (took 1.866s)
```
when this project contains 24 of those 8173 documents and nothing is actually being removed.
Wasted work. Each scan makes thousands of Postgres roundtrips as the indexer fires no-op `DeleteByFile` + `DeleteDocument` calls for every other project's file. With multiple workspace restarts (e.g. service restarts, container redeploys) the noise compounds.

I noticed this while debugging a separate issue in a 7-project workspace; every scan was emitting "8173 files removed" with no underlying change.
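For reviewers who want to see the failure mode in isolation, here is a minimal, self-contained sketch. It is not the code from `cli/watch.go` or the indexer: `staleFiles` and `prefixedKey` are hypothetical stand-ins for the scan-vs-stored diff and the prefixing delete wrapper, and the paths are made up. It only illustrates how an unscoped listing makes prefixed paths look stale, and why the resulting delete key matches no stored row.

```go
package main

import "fmt"

// staleFiles returns every stored path that the scanner did not report.
// Hypothetical stand-in for the indexer's scan-vs-stored diff.
func staleFiles(stored, scanned []string) []string {
	seen := make(map[string]bool, len(scanned))
	for _, p := range scanned {
		seen[p] = true
	}
	var stale []string
	for _, p := range stored {
		if !seen[p] {
			stale = append(stale, p)
		}
	}
	return stale
}

// prefixedKey mimics a delete wrapper that unconditionally prepends
// the project prefix to whatever path it is handed (assumed behaviour).
func prefixedKey(prefix, path string) string {
	return prefix + path
}

func main() {
	const prefix = "ws/proj/"

	// Unscoped listing: the shared store returns workspace-wide, prefixed paths.
	stored := []string{"ws/proj/main.go", "ws/other/main.go", "ws/other/util.go"}
	// The scanner enumerates project-relative paths.
	scanned := []string{"main.go"}

	rows := make(map[string]bool, len(stored))
	for _, p := range stored {
		rows[p] = true
	}

	for _, p := range staleFiles(stored, scanned) {
		key := prefixedKey(prefix, p)
		// The doubly-prefixed key never matches a stored row, so the delete is a no-op.
		fmt.Printf("remove %q -> delete key %q, row exists: %t\n", p, key, rows[key])
	}
}
```

In this sketch every stored path is flagged, because none of the prefixed keys equals a project-relative scanned path, yet no delete key hits an existing row.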
Fix
`ListDocuments` now filters to entries whose path begins with this project's `workspaceName/projectName/` prefix and strips the prefix before returning, so the indexer's relative-path bookkeeping aligns with what its scanner enumerates. Legitimate deletions still target the right rows because the surrounding `DeleteByFile`/`DeleteDocument` wrappers re-add the prefix.

Test plan
`TestProjectPrefixStore_PassThroughAndGetChunks` was updated to feed a mixed-workspace listing:

```go
listDocumentsResult: []string{
"ws/proj/main.go",
"ws/proj/sub/file.go",
"ws/other/main.go", // different project — filtered
"different-ws/proj/main.go", // different workspace — filtered
"unprefixed.go", // no prefix — filtered
},
```
The assertion now requires exactly `["main.go", "sub/file.go"]` back: both filtered to this project and stripped of the prefix. The original test asserted the buggy passthrough behaviour (`len(docs) == 2` against `["a", "b"]`) and would fail under the fix; both pieces now align on the new contract.

`go test ./... -count=1` is clean.
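As a rough illustration of the new contract, the filter-and-strip step can be sketched on its own against the same mixed-workspace listing. This is not the code in this PR: `scopeToProject` is a hypothetical helper name, and the real `ListDocuments` presumably also deals with a context, errors, and the underlying store, all of which are omitted here.

```go
package main

import (
	"fmt"
	"strings"
)

// scopeToProject keeps only the paths that start with prefix and returns
// them with the prefix removed. Hypothetical helper, for illustration only.
func scopeToProject(paths []string, prefix string) []string {
	scoped := make([]string, 0, len(paths))
	for _, p := range paths {
		if strings.HasPrefix(p, prefix) {
			scoped = append(scoped, strings.TrimPrefix(p, prefix))
		}
	}
	return scoped
}

func main() {
	listing := []string{
		"ws/proj/main.go",
		"ws/proj/sub/file.go",
		"ws/other/main.go",          // different project: filtered
		"different-ws/proj/main.go", // different workspace: filtered
		"unprefixed.go",             // no prefix: filtered
	}
	fmt.Println(scopeToProject(listing, "ws/proj/")) // prints [main.go sub/file.go]
}
```

Matching on the full prefix including the trailing slash is what keeps a sibling project whose name merely starts with the same characters (for example a hypothetical `ws/proj2/`) out of the result.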