Skip to content

QueryResource: don't resurrect disposed cache entries in next/error#5317

Open
jwatzman wants to merge 1 commit into
facebook:mainfrom
jwatzman:cache-slot
Open

QueryResource: don't resurrect disposed cache entries in next/error#5317
jwatzman wants to merge 1 commit into
facebook:mainfrom
jwatzman:cache-slot

Conversation

@jwatzman

Copy link
Copy Markdown

When a QueryResource cache entry is disposed while its fetch is still in flight, the next and error callbacks in _fetchAndSaveQuery call _getOrCreateCacheEntry, which silently recreates the entry in the LRU under the same cacheIdentifier. The resurrected entry has refcount 0 — no permanent retain was ever attached to it by QueryResource.retain, and no useEffect cleanup will ever fire. It outlives navigation, sits in the LRU until evicted by capacity, and gets reused on a later mount that happens to compute the same cacheIdentifier.

Reusing the zombie skips prepareWithIdentifier's _fetchAndSaveQuery path entirely: no environment.check, no fetch. By the time the zombie is reused, the store records uniquely retained by this query have likely been freed — released to the GC buffer when the original entry was disposed, then reclaimed by a later GC sweep triggered by an unrelated operation. The fragment reader walks into the data, hits a missing pageInfo or @connection-key tracker record (or any record selected only by this query), and returns undefined for the affected field. The application sees useLazyLoadQuery return a data ref whose fields are unexpectedly undefined, with no fetch, no error, and no suspension to recover.

Several scenarios can dispose an entry while its fetch is in flight:

  • React StrictMode, where the mount/cleanup/mount cycle of useLazyLoadQueryNode synchronously disposes the cache entry created during the initial render, then the maybeHiddenOrFastRefresh path triggers a forceUpdate that creates a new entry under a different cacheBreaker. The original fetch is still alive on the deduped observable; when it lands, the disposed entry is resurrected alongside the new one being updated. This is the most reliable trigger and is what made the bug observable to us.

  • Navigating away mid-fetch. A user clicks into a route that cache-misses, the fetch starts, the user navigates away before it completes. The route unmounts, the cache entry is disposed. When the fetch lands, the disposed entry is resurrected.

  • Concurrent rendering discarding a suspended subtree. A component throws a Promise from useLazyLoadQuery; React suspends. A parent transitions away before the fetch completes; the suspended render never commits. The temporary retain expires after TEMPORARY_RETAIN_DURATION_MS (5 minutes) and the entry is disposed. The fetch — started synchronously inside _fetchAndSaveQuery before the component suspended — continues to completion and resurrects the entry.

  • Offscreen / fast refresh, which produce similar dispose-while- fetching cycles. The maybeHiddenOrFastRefresh comment in useLazyLoadQueryNode already calls these out as the rationale for its forceUpdate path.

In all of these, the dispose path in createCacheEntry's SuspenseResource only unsubscribes the network observable for live queries; for regular queries the subscription stays alive precisely so late payloads can land. That is what allows the next/error callbacks to fire on a disposed entry. Cancelling the subscription on dispose is not the right fix — the fetch is deduped via fetchQueryDeduped and may also be attached to a live cache entry sharing the same fetch identifier (most obviously the post-forceUpdate entry in the StrictMode case), which still needs to receive its next.

The fix is to make next and error bail when the cache entry has been disposed, while still allowing them to create the entry on a first-time synchronous fire. The store still receives the data via the normalizer (execute.normalize.*), so any live entry for the same fetch is updated correctly; only the zombie write is skipped.

A naive this._cache.get(cacheIdentifier) == null → bail check conflates two different states: "the entry has been disposed" (the zombie case we want to suppress) and "the entry has not yet been created" (a synchronous observable that fires next/error during the .subscribe() call, before the post-subscribe creation block at the bottom of _fetchAndSaveQuery runs). The latter happens with cached or mocked transports — network-only with a synchronous fetch observable is the canonical case, and the test suite exercises it directly. Bailing there would drop the only payload the entry will ever see, leaving the post-subscribe block to cache a networkPromise that never resolves; the component then suspends indefinitely on success, or sees a Promise instead of the Error on failure.

To distinguish the two states, _fetchAndSaveQuery tracks a local entryHasBeenCreated flag, flipped to true at each site that populates the LRU for this cacheIdentifier (the shouldAllowRender block, the first-fire branch inside next/error, and the post-subscribe networkPromise creation). The callbacks bail only when the entry is missing and was previously created. First-time synchronous fires fall through to _getOrCreateCacheEntry exactly as before.

For an error arriving after the entry has been disposed, the error is dropped instead of being cached for later re-throw. This is correct: the entry has no useEffect retainers, so there is no consumer to read a cached error. A fresh mount with the same cacheIdentifier will trigger a new fetch and observe the error (or success) on its own.

The StrictMode case is easy to reproduce and likely accounts for most of the dev-time pain. The prod-side triggers (mid-fetch navigation, suspended-and-discarded renders) are observed in the wild but rare: they leave a zombie behind only if the user later revisits a mount that computes the same cacheIdentifier, and the records the query uniquely retains have been GC'd in the meantime. Most apps don't notice because most records are co-retained by multiple operations and survive GC regardless. Apps with at least one query that uniquely retains some records (e.g. a @connection(key: ...) selection whose records aren't selected by any other live query — common in chat-style features) will eventually hit it.

When a QueryResource cache entry is disposed while its fetch is still
in flight, the `next` and `error` callbacks in `_fetchAndSaveQuery`
call `_getOrCreateCacheEntry`, which silently recreates the entry in
the LRU under the same cacheIdentifier. The resurrected entry has
refcount 0 — no permanent retain was ever attached to it by
`QueryResource.retain`, and no useEffect cleanup will ever fire. It
outlives navigation, sits in the LRU until evicted by capacity, and
gets reused on a later mount that happens to compute the same
cacheIdentifier.

Reusing the zombie skips `prepareWithIdentifier`'s `_fetchAndSaveQuery`
path entirely: no `environment.check`, no fetch. By the time the
zombie is reused, the store records uniquely retained by this query
have likely been freed — released to the GC buffer when the original
entry was disposed, then reclaimed by a later GC sweep triggered by an
unrelated operation. The fragment reader walks into the data, hits a
missing pageInfo or `@connection`-key tracker record (or any record
selected only by this query), and returns `undefined` for the affected
field. The application sees `useLazyLoadQuery` return a data ref
whose fields are unexpectedly undefined, with no fetch, no error, and
no suspension to recover.

Several scenarios can dispose an entry while its fetch is in flight:

- **React StrictMode**, where the mount/cleanup/mount cycle of
  `useLazyLoadQueryNode` synchronously disposes the cache entry
  created during the initial render, then the `maybeHiddenOrFastRefresh`
  path triggers a `forceUpdate` that creates a new entry under a
  different cacheBreaker. The original fetch is still alive on the
  deduped observable; when it lands, the disposed entry is resurrected
  alongside the new one being updated. This is the most reliable
  trigger and is what made the bug observable to us.

- **Navigating away mid-fetch**. A user clicks into a route that
  cache-misses, the fetch starts, the user navigates away before it
  completes. The route unmounts, the cache entry is disposed. When
  the fetch lands, the disposed entry is resurrected.

- **Concurrent rendering discarding a suspended subtree**. A component
  throws a Promise from `useLazyLoadQuery`; React suspends. A parent
  transitions away before the fetch completes; the suspended render
  never commits. The temporary retain expires after
  `TEMPORARY_RETAIN_DURATION_MS` (5 minutes) and the entry is
  disposed. The fetch — started synchronously inside
  `_fetchAndSaveQuery` *before* the component suspended — continues
  to completion and resurrects the entry.

- **Offscreen / fast refresh**, which produce similar dispose-while-
  fetching cycles. The `maybeHiddenOrFastRefresh` comment in
  `useLazyLoadQueryNode` already calls these out as the rationale for
  its forceUpdate path.

In all of these, the dispose path in `createCacheEntry`'s
`SuspenseResource` only unsubscribes the network observable for live
queries; for regular queries the subscription stays alive precisely so
late payloads can land. That is what allows the `next`/`error`
callbacks to fire on a disposed entry. Cancelling the subscription on
dispose is not the right fix — the fetch is deduped via
`fetchQueryDeduped` and may also be attached to a *live* cache entry
sharing the same fetch identifier (most obviously the post-`forceUpdate`
entry in the StrictMode case), which still needs to receive its
`next`.

The fix is to make `next` and `error` bail when the cache entry has
been disposed, while still allowing them to *create* the entry on a
first-time synchronous fire. The store still receives the data via
the normalizer (`execute.normalize.*`), so any live entry for the same
fetch is updated correctly; only the zombie write is skipped.

A naive `this._cache.get(cacheIdentifier) == null → bail` check
conflates two different states: "the entry has been disposed" (the
zombie case we want to suppress) and "the entry has not yet been
created" (a synchronous observable that fires `next`/`error` *during*
the `.subscribe()` call, before the post-subscribe creation block at
the bottom of `_fetchAndSaveQuery` runs). The latter happens with
cached or mocked transports — `network-only` with a synchronous fetch
observable is the canonical case, and the test suite exercises it
directly. Bailing there would drop the only payload the entry will
ever see, leaving the post-subscribe block to cache a `networkPromise`
that never resolves; the component then suspends indefinitely on
success, or sees a Promise instead of the Error on failure.

To distinguish the two states, `_fetchAndSaveQuery` tracks a local
`entryHasBeenCreated` flag, flipped to true at each site that
populates the LRU for this `cacheIdentifier` (the `shouldAllowRender`
block, the first-fire branch inside `next`/`error`, and the
post-subscribe `networkPromise` creation). The callbacks bail only
when the entry is missing *and* was previously created. First-time
synchronous fires fall through to `_getOrCreateCacheEntry` exactly as
before.

For an `error` arriving after the entry has been disposed, the error
is dropped instead of being cached for later re-throw. This is
correct: the entry has no useEffect retainers, so there is no consumer
to read a cached error. A fresh mount with the same cacheIdentifier
will trigger a new fetch and observe the error (or success) on its
own.

The StrictMode case is easy to reproduce and likely accounts for most
of the dev-time pain. The prod-side triggers (mid-fetch navigation,
suspended-and-discarded renders) are observed in the wild but rare:
they leave a zombie behind only if the user later revisits a mount
that computes the same cacheIdentifier, *and* the records the query
uniquely retains have been GC'd in the meantime. Most apps don't
notice because most records are co-retained by multiple operations and
survive GC regardless. Apps with at least one query that uniquely
retains some records (e.g. a `@connection(key: ...)` selection whose
records aren't selected by any other live query — common in chat-style
features) will eventually hit it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@meta-cla meta-cla Bot added the CLA Signed label Jun 11, 2026
@jwatzman

Copy link
Copy Markdown
Author

This is the result of a debugging session with Claude, guided by me, trying to find the source of a bug in a real production app. Unfortunately due to the nature, it's kind of hard to distil down the real example beyond the test case here.

This PR was written by Claude in its entirety. I have read it and think it makes sense, but am not really in a position to carefully review it. I don't know what your policy on AI contributions is -- I will take no offence if you close this without reading it due to the AI nature.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant