Skip to content

fix(trace): invalidate symbol cache when the extractor version changes#264

Open
kryptt wants to merge 1 commit into
yoanbernabeu:mainfrom
kryptt:pr/fix-symbol-dedup-extractor-version
Open

fix(trace): invalidate symbol cache when the extractor version changes#264
kryptt wants to merge 1 commit into
yoanbernabeu:mainfrom
kryptt:pr/fix-symbol-dedup-extractor-version

Conversation

@kryptt

@kryptt kryptt commented May 29, 2026

Copy link
Copy Markdown

Problem

The symbol-extraction dedup in runInitialScan skips any file whose stored content hash matches the freshly-scanned hash:

```go
// cli/watch.go (upstream main)
if existingHash, ok := symbolStore.GetFileContentHash(fileInfo.Path); ok && existingHash == fileInfo.Hash {
continue
}
```

That's right about file content but ignores a second axis: the extractor itself may have changed since the symbols were persisted. A release that:

  • ships a new tree-sitter grammar (e.g. lands C++ symbol coverage where there was none),
  • broadens the regex patterns in patterns.go,
  • fixes a bug that was silently swallowing certain definitions,

will produce a different symbol set for the same file content. But the dedup never re-runs the extractor on files whose source didn't change across the upgrade. Result: the new extraction quality is completely invisible to existing users until they manually delete symbols.gob or touch every source file.

I noticed this debugging an emacs workspace where the elisp extractor was substantially upgraded — the binary was redeployed, the watcher restarted, and the symbol index stayed exactly as it was. Same shape across other languages: any change to extractor_ts.go's walkNodeForSymbols switch or patterns.go's regex tables benefits only files modified after the upgrade.

Solution

Pair each persisted symbol entry with the extractor signature that produced it. The dedup then requires both the content hash and the extractor version to match before skipping.

Interface change

```go
type SymbolExtractor interface {
// … existing methods …

// Version returns an opaque token that changes whenever this
// extractor's output for the same input would change.
Version() string

}
```

`RegexExtractor` returns `"regex-v1"`; `TreeSitterExtractor` returns `"treesitter-v1"`. The convention is documented inline: bump the constant manually when symbol output changes meaningfully. Future releases that add a language or fix an extractor bump to `regex-v2`, `treesitter-v2-lua`, etc.

Storage change

`GOBSymbolStore` grows a parallel `FileExtractorVersions map[string]string` next to the existing `FileContentHashes`. A new `SaveFileWithSignature` writes both fingerprints atomically; `GetFileExtractorVersion` reads them back. The legacy `SaveFileWithContentHash` stays for backwards-compat — old call sites keep working but the dedup will treat their entries as "version unknown" and re-extract once.

`gob` decoding is backwards compatible: older symbols.gob files without the new field decode cleanly with `FileExtractorVersions == nil`. The dedup treats this as a cache miss → one-time re-extraction → cache updates to the current version. Subsequent scans skip cleanly.

Dedup change

```go
existingHash, hashOK := symbolStore.GetFileContentHash(fileInfo.Path)
existingVersion, versionOK := symbolStore.GetFileExtractorVersion(fileInfo.Path)
if hashOK && versionOK && existingHash == fileInfo.Hash && existingVersion == extractor.Version() {
continue
}
```

Same logic flows into the fsnotify single-file update path (`handleFileEvent`) so the cache stays consistent across both startup-scan and live-edit flows.

Test plan

  • Existing `TestRunInitialScan_SkipsSymbolExtractionWhenContentHashMatches` is updated to seed via `SaveFileWithSignature` (matching the new contract). The original `SaveFileWithContentHash` seed now models the legacy-cache case and would correctly trigger a re-extract.
  • New `TestRunInitialScan_ReExtractsWhenExtractorVersionDiffers` seeds a stale `"regex-vOLD"` and asserts the sentinel symbol is invalidated + the source's real symbol is re-extracted.
  • New `TestRunInitialScan_ReExtractsWhenExtractorVersionIsMissing` seeds via the legacy `SaveFileWithContentHash` (simulating an old gob file with no extractor-version metadata) and asserts the same invalidate-and-re-extract behaviour.
  • Mock `SymbolExtractor` implementations in tests grow `Version()`.

`go test ./...` is clean across the entire module.

Migration notes

  • First scan after upgrade re-extracts every file that's still in the symbol store. For a workspace with, say, 100k files this is a one-time cost on the order of a few minutes (extraction is fast; no embedding involved). Subsequent scans dedup cleanly.
  • Bump the version string in `extractor.go` / `extractor_ts.go` when shipping changes to symbol extraction that should invalidate cached output. The bump is the canonical "this release wants symbol re-extraction" signal.

The symbol-extraction dedup in cli/watch.go's runInitialScan skipped any
file whose stored content hash matched the freshly-scanned hash. That
behaviour was almost right — but it ignored a second axis: the
extractor itself might have changed since the symbols were persisted.
A release that adds a new tree-sitter grammar, broadens regex patterns,
or fixes an extraction bug will produce a different symbol set for the
same file content, but the existing dedup never re-ran the extractor on
files whose content was stable across the upgrade.

Effect: users who pulled in better symbol extraction (e.g. landing a
new language) saw zero improvement on their existing indexes until they
manually invalidated symbols.gob or modified every source file. The
upgrade was completely silent.

Fix:

  * Add Version() string to the SymbolExtractor interface. Implementations
    return an opaque token (suggested convention: short slugs like
    "regex-v1", "treesitter-v2-lua") that gets bumped manually whenever
    the extractor's output for the same input would change.
  * Persist the extractor version alongside the file content hash. New
    GOBSymbolStore field FileExtractorVersions stores per-file versions;
    SaveFileWithSignature writes both fingerprints; GetFileExtractorVersion
    reads them back.
  * runInitialScan's dedup now requires BOTH the content hash AND the
    extractor version to match before skipping. When either drifts (or
    when the cache predates this field entirely — older gob files
    decode with a nil FileExtractorVersions map), the file is
    re-extracted exactly once and the cache updates to the new version.
  * The fsnotify-driven single-file update path in handleFileEvent
    persists via SaveFileWithSignature too, so the cache stays
    consistent regardless of which code path produced the symbols.

Backwards-compatible decoding: older symbols.gob files load cleanly with
FileExtractorVersions == nil; the dedup treats this as "version
unknown" and re-extracts on first scan, which is the intended one-time
upgrade behaviour.

Tests:
  * Existing TestRunInitialScan_SkipsSymbolExtractionWhenContentHashMatches
    now seeds via SaveFileWithSignature so the cache hits cleanly.
  * New TestRunInitialScan_ReExtractsWhenExtractorVersionDiffers proves
    a stale extractor version invalidates a content-matching cache.
  * New TestRunInitialScan_ReExtractsWhenExtractorVersionIsMissing proves
    the legacy gob file path (no FileExtractorVersions at all) also
    re-extracts.
  * Mock SymbolExtractors in tests grow Version().

go test ./... clean.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant