fix(trace): invalidate symbol cache when the extractor version changes#264
Open
kryptt wants to merge 1 commit into
Open
fix(trace): invalidate symbol cache when the extractor version changes#264kryptt wants to merge 1 commit into
kryptt wants to merge 1 commit into
Conversation
The symbol-extraction dedup in cli/watch.go's runInitialScan skipped any
file whose stored content hash matched the freshly-scanned hash. That
behaviour was almost right — but it ignored a second axis: the
extractor itself might have changed since the symbols were persisted.
A release that adds a new tree-sitter grammar, broadens regex patterns,
or fixes an extraction bug will produce a different symbol set for the
same file content, but the existing dedup never re-ran the extractor on
files whose content was stable across the upgrade.
Effect: users who pulled in better symbol extraction (e.g. landing a
new language) saw zero improvement on their existing indexes until they
manually invalidated symbols.gob or modified every source file. The
upgrade was completely silent.
Fix:
* Add Version() string to the SymbolExtractor interface. Implementations
return an opaque token (suggested convention: short slugs like
"regex-v1", "treesitter-v2-lua") that gets bumped manually whenever
the extractor's output for the same input would change.
* Persist the extractor version alongside the file content hash. New
GOBSymbolStore field FileExtractorVersions stores per-file versions;
SaveFileWithSignature writes both fingerprints; GetFileExtractorVersion
reads them back.
* runInitialScan's dedup now requires BOTH the content hash AND the
extractor version to match before skipping. When either drifts (or
when the cache predates this field entirely — older gob files
decode with a nil FileExtractorVersions map), the file is
re-extracted exactly once and the cache updates to the new version.
* The fsnotify-driven single-file update path in handleFileEvent
persists via SaveFileWithSignature too, so the cache stays
consistent regardless of which code path produced the symbols.
Backwards-compatible decoding: older symbols.gob files load cleanly with
FileExtractorVersions == nil; the dedup treats this as "version
unknown" and re-extracts on first scan, which is the intended one-time
upgrade behaviour.
Tests:
* Existing TestRunInitialScan_SkipsSymbolExtractionWhenContentHashMatches
now seeds via SaveFileWithSignature so the cache hits cleanly.
* New TestRunInitialScan_ReExtractsWhenExtractorVersionDiffers proves
a stale extractor version invalidates a content-matching cache.
* New TestRunInitialScan_ReExtractsWhenExtractorVersionIsMissing proves
the legacy gob file path (no FileExtractorVersions at all) also
re-extracts.
* Mock SymbolExtractors in tests grow Version().
go test ./... clean.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The symbol-extraction dedup in
runInitialScanskips any file whose stored content hash matches the freshly-scanned hash:```go
// cli/watch.go (upstream main)
if existingHash, ok := symbolStore.GetFileContentHash(fileInfo.Path); ok && existingHash == fileInfo.Hash {
continue
}
```
That's right about file content but ignores a second axis: the extractor itself may have changed since the symbols were persisted. A release that:
patterns.go,will produce a different symbol set for the same file content. But the dedup never re-runs the extractor on files whose source didn't change across the upgrade. Result: the new extraction quality is completely invisible to existing users until they manually delete
symbols.gobor touch every source file.I noticed this debugging an emacs workspace where the elisp extractor was substantially upgraded — the binary was redeployed, the watcher restarted, and the symbol index stayed exactly as it was. Same shape across other languages: any change to
extractor_ts.go'swalkNodeForSymbolsswitch orpatterns.go's regex tables benefits only files modified after the upgrade.Solution
Pair each persisted symbol entry with the extractor signature that produced it. The dedup then requires both the content hash and the extractor version to match before skipping.
Interface change
```go
type SymbolExtractor interface {
// … existing methods …
}
```
`RegexExtractor` returns `"regex-v1"`; `TreeSitterExtractor` returns `"treesitter-v1"`. The convention is documented inline: bump the constant manually when symbol output changes meaningfully. Future releases that add a language or fix an extractor bump to `regex-v2`, `treesitter-v2-lua`, etc.
Storage change
`GOBSymbolStore` grows a parallel `FileExtractorVersions map[string]string` next to the existing `FileContentHashes`. A new `SaveFileWithSignature` writes both fingerprints atomically; `GetFileExtractorVersion` reads them back. The legacy `SaveFileWithContentHash` stays for backwards-compat — old call sites keep working but the dedup will treat their entries as "version unknown" and re-extract once.
`gob` decoding is backwards compatible: older symbols.gob files without the new field decode cleanly with `FileExtractorVersions == nil`. The dedup treats this as a cache miss → one-time re-extraction → cache updates to the current version. Subsequent scans skip cleanly.
Dedup change
```go
existingHash, hashOK := symbolStore.GetFileContentHash(fileInfo.Path)
existingVersion, versionOK := symbolStore.GetFileExtractorVersion(fileInfo.Path)
if hashOK && versionOK && existingHash == fileInfo.Hash && existingVersion == extractor.Version() {
continue
}
```
Same logic flows into the fsnotify single-file update path (`handleFileEvent`) so the cache stays consistent across both startup-scan and live-edit flows.
Test plan
`go test ./...` is clean across the entire module.
Migration notes