fix(core): reclaim abandoned init locks to prevent isolate-poisoning deadlock#1405
Conversation
…deadlock If the request owning a cold-isolate init was cancelled mid-await (client disconnect during migrations), workerd tears down its continuation and the finally that releases the init guard never runs. The stuck guard (runtimeInitializing boolean / shared dbInitPromise) then deadlocked every subsequent request in the isolate until platform wall-limit (524 at 100s), poisoning the isolate until eviction. Replace both guards with a reclaimable lock: waiters poll (never awaiting a cross-request promise), a stale owner is reclaimed after a 15s deadline, and waiters give up with an error after 30s instead of hanging. Closes #1274 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
🦋 Changeset detectedLatest commit: 90711fd The changes in this PR will be included in the next version bump. This PR includes changesets to release 14 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
Deploying with
|
| Status | Name | Latest Commit | Updated (UTC) |
|---|---|---|---|
| ✅ Deployment successful! View logs |
docs | 6f8ab6d | Jun 11 2026, 03:18 PM |
@emdash-cms/admin
@emdash-cms/auth
@emdash-cms/auth-atproto
@emdash-cms/blocks
@emdash-cms/cloudflare
@emdash-cms/contentful-to-portable-text
emdash
create-emdash
@emdash-cms/gutenberg-to-portable-text
@emdash-cms/plugin-cli
@emdash-cms/plugin-types
@emdash-cms/registry-client
@emdash-cms/registry-lexicons
@emdash-cms/sandbox-workerd
@emdash-cms/x402
@emdash-cms/plugin-ai-moderation
@emdash-cms/plugin-atproto
@emdash-cms/plugin-audit-log
@emdash-cms/plugin-color
@emdash-cms/plugin-embeds
@emdash-cms/plugin-field-kit
@emdash-cms/plugin-forms
@emdash-cms/plugin-webhook-notifier
commit: |
Deploying with
|
| Status | Name | Latest Commit | Updated (UTC) |
|---|---|---|---|
| ✅ Deployment successful! View logs |
emdash-playground | 6f8ab6d | Jun 11 2026, 03:20 PM |
Deploying with
|
| Status | Name | Latest Commit | Updated (UTC) |
|---|---|---|---|
| ✅ Deployment successful! View logs |
emdash-demo-cache | 6f8ab6d | Jun 11 2026, 03:19 PM |
… anchoring, globalThis singletons - Derive the db init deadline from MIGRATION_RACE_WAIT_MS (now exported) so a healthy owner blocking in waitForConcurrentMigrator is never presumed dead: DB_INIT_DEADLINE_MS = race wait + 20s. The outer runtime lock uses DB_INIT_DEADLINE_MS + 15s so nested deadlines compose. - Default maxWaitMs is now deadlineMs + 15s headroom so a waiter always survives long enough to reclaim a dead owner before giving up. - Anchor the in-flight init promise via after()/waitUntil: a request cancelled mid-init no longer abandons the work — init completes, populates the cache, and releases the lock, preventing the poisoning instead of recovering from it. Reclaim remains as the net. - Move the runtime singleton + lock and db cache + lock onto globalThis behind Symbol.for keys (same reasoning as the setup-verified flag): bundler-duplicated SSR chunks would otherwise get independent locks and caches — multiple runtimes and cron schedulers per isolate. - New tests: live-but-slow owner reclaimed at deadline (init runs at most twice, cache converges) and anchor receives a rejection-swallowed copy. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
Review addressed in 85bc8b4: (a) Deadline composition — (b) waitUntil anchoring — (c) globalThis singletons — runtime instance + lock and db cache + lock now live behind (d) Slow-live-owner test — added: owner slower than the deadline, waiter reclaims; asserts init runs at most twice, both callers resolve, cache converges, and a third caller is served from cache. Plus an anchor test. (e) Date.now() guard comment added (clock only advances across I/O in workerd — the sleep is load-bearing). Confirmed the doInit path degrades gracefully on a maxWait throw: the existing Also worth stating for the record (per review): reclaim-induced concurrent migration lands in the already-handled Suites: 2,641 unit + 1,039 integration pass; core typecheck and lint clean. |
There was a problem hiding this comment.
Pull request overview
Fixes a workerd isolate “poisoning” deadlock during cold initialization by replacing non-reclaimable init guards (boolean/shared promise) with a reclaimable, polling-based init lock that can be anchored via after()/waitUntil and reclaimed after a deadline.
Changes:
- Add
initWithLock/createInitLockutility (with unit tests) to support reclaimable, polling-based init. - Update runtime DB initialization and Astro middleware runtime initialization to use the new init lock and store holders on
globalThisviaSymbol.for. - Export
MIGRATION_RACE_WAIT_MSfor deadline derivation and add a changeset entry.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/core/src/utils/init-lock.ts | Introduces reclaimable init lock helper used to avoid isolate-poisoning deadlocks. |
| packages/core/src/emdash-runtime.ts | Switches DB singleton initialization to initWithLock; adds globalThis-backed db holder + derived deadline constant. |
| packages/core/src/astro/middleware.ts | Switches runtime singleton initialization to initWithLock; adds globalThis-backed runtime holder + derived deadline. |
| packages/core/src/database/migrations/runner.ts | Exports MIGRATION_RACE_WAIT_MS for use in init-lock deadline derivation. |
| packages/core/tests/unit/utils/init-lock.test.ts | Adds unit coverage for reclaim, timeout, anchoring, and waiter behaviors. |
| packages/core/tests/unit/astro/middleware-prerender.test.ts | Updates mocks/reset logic to account for new exported deadline + global singleton key. |
| .changeset/init-lock-reclaim.md | Publishes patch note describing the deadlock fix. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| export interface InitLock { | ||
| /** Epoch ms when the current owner claimed the lock, or null when free. */ | ||
| ownerStartedAt: number | null; | ||
| } | ||
|
|
||
| export function createInitLock(): InitLock { | ||
| return { ownerStartedAt: null }; | ||
| } |
| const ownerStartedAt = lock.ownerStartedAt; | ||
| if (ownerStartedAt === null || Date.now() - ownerStartedAt > deadlineMs) { | ||
| // Free, or the owner has been gone past the deadline — claim it. | ||
| // Synchronous between awaits, so two waiters can't both claim. | ||
| lock.ownerStartedAt = Date.now(); | ||
| try { | ||
| const initPromise = init(); | ||
| options?.anchor?.( | ||
| initPromise.then( | ||
| () => undefined, | ||
| () => undefined, | ||
| ), | ||
| ); | ||
| return await initPromise; | ||
| } finally { | ||
| // If this request dies mid-init unanchored this never runs; | ||
| // the next waiter reclaims after deadlineMs instead. | ||
| lock.ownerStartedAt = null; | ||
| } | ||
| } |
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
There was a problem hiding this comment.
This is a well-scoped bug fix that correctly diagnoses and addresses a real production issue: workerd cancels can strand an init lock, poisoning the isolate. The reclaimable-polling-lock approach is the right one for this runtime, and moving the singletons to globalThis behind Symbol.for follows the repo's established pattern. Tests are thorough and cover the dead-owner reclaim path.
I found one correctness bug in the lock implementation and a test that happens to pass without proving the property it advertises:
-
initWithLockfinally clobbers reclaimed locks. Thefinallyblock unconditionally setslock.ownerStartedAt = null. If the original owner was live-but-slow and a waiter reclaimed the lock, the original owner'sfinallywipes out the new owner's claim. A third waiter that arrives while the reclaimer is still running can then claim the lock and start a third init. For the runtime lock that means duplicate cron schedulers / sandbox runners; for the DB lock it means duplicate Kysely instances and competing migrators. The fix is a claim-generation counter: only the current owner may clear the lock. -
The "at most twice" test doesn't actually prove the property. It passes because the reclaiming waiter's init is instantaneous, so the original owner's harmless clobber happens after the lock was already released. A third waiter that arrives while both the original owner and the reclaimer are still in flight would trigger a third init, but the test never exercises that overlap.
-
Minor:
setupFlagStoreis used lexically before its declaration inmiddleware.ts. Not a runtime bug (the function is called after module evaluation), but it's confusing ordering; moving the variable abovegetRuntimeHolderwould make it consistent with the rest of the file.
The changeset is correct, the PR type is appropriate, and the approach is sound once the clobber race is closed.
| if (ownerStartedAt === null || Date.now() - ownerStartedAt > deadlineMs) { | ||
| // Free, or the owner has been gone past the deadline — claim it. | ||
| // Synchronous between awaits, so two waiters can't both claim. | ||
| lock.ownerStartedAt = Date.now(); |
There was a problem hiding this comment.
[needs fixing] The finally block unconditionally sets lock.ownerStartedAt = null. If the original owner was live-but-slow and a waiter reclaimed the lock in the meantime, the original owner's finally clobbers the new owner's claim. A third waiter that arrives while the reclaimer is still running can then claim the lock and start a third concurrent init — for the runtime lock that means duplicate cron schedulers and sandbox runners.
The fix is to track claim ownership so only the current owner may clear the lock:
| lock.ownerStartedAt = Date.now(); | |
| export interface InitLock { | |
| ownerStartedAt: number | null; | |
| generation: number; | |
| } | |
| export function createInitLock(): InitLock { | |
| return { ownerStartedAt: null, generation: 0 }; | |
| } |
And in initWithLock:
| lock.ownerStartedAt = Date.now(); | |
| if (ownerStartedAt === null || Date.now() - ownerStartedAt > deadlineMs) { | |
| const myGeneration = lock.generation + 1; | |
| lock.generation = myGeneration; | |
| lock.ownerStartedAt = Date.now(); | |
| try { | |
| const initPromise = init(); | |
| options?.anchor?.( | |
| initPromise.then( | |
| () => undefined, | |
| () => undefined, | |
| ), | |
| ); | |
| return await initPromise; | |
| } finally { | |
| if (lock.generation === myGeneration) { | |
| lock.ownerStartedAt = null; | |
| } | |
| } | |
| } |
|
|
||
| const owner = initWithLock( | ||
| lock, | ||
| () => cache, |
There was a problem hiding this comment.
[suggestion] This test happens to pass because the reclaiming waiter's init is instantaneous, so the original owner's finally runs after the lock was already released. It does not prove the "at most twice" property it advertises. To actually validate the claim, add a third waiter that arrives while both the original owner and the reclaimer are still in flight, and assert that it waits (or reclaims only after the reclaimer's deadline) rather than triggering initCalls === 3.
| lock: InitLock; | ||
| } | ||
|
|
||
| function getRuntimeHolder(): RuntimeHolder { |
There was a problem hiding this comment.
[suggestion] setupFlagStore is declared on line 112, below this function that uses it. In JavaScript the function is hoisted and isn't called during module evaluation, so there's no runtime TDZ bug, but it's confusing ordering. Move setupFlagStore (and the SETUP_VERIFIED_KEY declaration next to it) above getRuntimeHolder so the variable is lexically declared before its first use.
A slow owner that finished (or failed) after a waiter had reclaimed the lock would unconditionally clear ownerStartedAt in its finally, releasing the reclaimer's claim and admitting a third concurrent init. Each claim now bumps a generation counter and release is gated on still being the current owner. Also: normalize init via Promise.resolve().then(init) so a synchronous throw still attaches the anchor before rejecting; move setupFlagStore above its first lexical use; add a regression test where the stale owner fails while the reclaimer is mid-init and a third caller arrives in the clobber window (fails with 3 inits before the fix, 2 after). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
Both review findings addressed in 90711fd: Clobber race (emdashbot #1, Copilot both comments) — confirmed real, and the suggested fix is what shipped: Test gap (emdashbot #2) — you were right that the "at most twice" test passed by timing accident: the genuinely dangerous interleaving needs the stale owner to fail slowly (success populates the cache, which already shields later callers). Added the regression test with the full overlap: owner A rejects at t≈400 after waiter B reclaimed at t≈300, third caller C arrives at t≈440 in the clobber window while B is mid-init. Before the fix it fails with 3 inits; after, C waits for B and Also picked up Copilot's secondary suggestion — init is now normalized via 9/9 lock tests, 2,642 unit + 1,039 integration pass; typecheck/lint/format clean. |
|
/review |
| async () => { | ||
| const deps = buildDependencies(config); | ||
| const runtime = await EmDashRuntime.create(deps, initTimings); | ||
| holder.instance = runtime; | ||
| return runtime; | ||
| }, |
| holder.cache.set(cacheKey, db); | ||
| return db; |
There was a problem hiding this comment.
Reviewed against 90711fd. The core fix is sound — the reclaimable polling lock is the right shape for workerd's cancellation semantics, the generation counter closes the clobber race that emdashbot and Copilot flagged, and the deadline composition (MIGRATION_RACE_WAIT_MS < DB_INIT_DEADLINE_MS < RUNTIME_INIT_DEADLINE_MS, with maxWaitMs = deadlineMs + 15s so a waiter always outlives the deadline) is internally consistent. Anchoring init via after() is the right belt to the reclaim's suspenders. Tests cover the production deadlock, the clobber race (with an actual interleaved C-arrives-during-B-init regression test), and the sync-throw path.
A few residual concerns worth flagging — none block the fix:
-
holder.instance/holder.cachecache clobber from a live-but-slow owner finishing after a reclaimer. Thegenerationguard protects the lock release, but the cache write insideinit(e.g.holder.instance = runtimeinmiddleware.ts:223,holder.cache.set(cacheKey, db)inemdash-runtime.ts:1369) is not guarded. A slow owner that legitimately finishes after a reclaimer succeeds will overwrite the reclaimer's cached value with its own runtime/db — and nowholder.instancepoints to an instance whosecronScheduler(NodeCronScheduler with real timers, on the Node host) was never the one in active use, and the orphan reclaimer's scheduler continues to fire. On the Workers target this is benign (PiggybackScheduler ticks viatickCron()on the active runtime only, orphans are quiet), but on Node it's a real timer leak per reclaim event. ExistingrunMigrationsrace recovery viawaitForConcurrentMigratormeans the db side won't double-apply schema, but it does end up with two distinctKyselyinstances racing to be "the cached one". -
The new clobber-race test asserts cache convergence but not which value wins. In
tests/unit/utils/init-lock.test.ts, the live-slow-owner test assertsthird === cache— butcachewas last written by the slow owner ("slow-owner"), not the reclaimer. The test is correctly proving "init runs at most twice and the cache is non-null", but a future reader might mistake it for "the reclaimer's value wins". Worth a one-line comment that the cache-clobber behavior is intentional (and the next-best alternative is the generation-gated cache write in (1)). -
after()registration is a microtask, not synchronous.waitUntilReadyis a module-scope dynamic import; on workerd the resolution is effectively instantaneous, but thewaitUntilReady.then(waitUntil => waitUntil(promise))handoff inafter()is still one microtask away. If the owning request is cancelled in the narrow window betweenafter(() => initPromise)being called and that microtask running, thewaitUntilregistration may not actually land. Workers' microtask scheduling normally drains pending microtasks before teardown, so this is likely fine, but it's hard to verify without a real workerd integration test — and the reclaim path covers it as a backstop, which is the point. No action needed; just naming the assumption.
Approach and risk assessment are correct, and the fix degrades gracefully (waiter throws after maxWaitMs → existing catch in doInit logs and leaves locals.emdash unset → routes return NOT_CONFIGURED). Ship it; consider (1) and (2) as follow-ups.
| async () => { | ||
| const deps = buildDependencies(config); | ||
| const runtime = await EmDashRuntime.create(deps, initTimings); | ||
| holder.instance = runtime; |
There was a problem hiding this comment.
Cache write is not generation-gated. If a live-but-slow owner reaches this assignment after a waiter reclaimed and ran its own init, the slow owner's runtime overwrites the reclaimer's. Future requests then read an instance whose cronScheduler was never adopted by the active code path, while the reclaimer's scheduler keeps firing — a NodeCronScheduler timer leak per reclaim event on the Node host (benign on Workers, where PiggybackScheduler is dormant without tickCron()). Consider gating this write on the lock generation the same way the release is, e.g. expose the claim from initWithLock so init can check if (lock.generation === myClaim) holder.instance = runtime. Not a deadlock — the reclaimer's caller still gets the correct runtime via the return value — but it admits an orphan-scheduler leak that the rest of the fix is otherwise careful to avoid.
| } finally { | ||
| dbInitPromise = null; | ||
| } | ||
| holder.cache.set(cacheKey, db); |
There was a problem hiding this comment.
Same concern as the runtime holder write in middleware.ts — this holder.cache.set runs inside init, ungated by the lock generation. A slow-but-live owner finishing after a reclaimer replaces the reclaimer's cached Kysely with its own. Both Kyselys are functionally valid (D1 sessions / dialect-level state is per-instance, not shared), so this isn't correctness-breaking, but it does mean two runMigrations will run concurrently on the same physical DB (UNIQUE-constraint race is handled, so that part is safe) and the "cached" Kysely identity will flip mid-isolate. Worth either gating on the generation or documenting that the second migrator path is the intended degradation.
| }, | ||
| { deadlineMs: 100, pollMs: 10 }, | ||
| ); | ||
| expect(third).toBe(cache); |
There was a problem hiding this comment.
Subtle: this asserts third === cache, but cache was last written by the slow owner ("slow-owner") clobbering the reclaimer's "reclaimer" value. The test correctly proves "init runs at most twice and the cache is populated" — but a future reader might read it as "the reclaimer's value wins". It does not. Worth a comment that the slow-owner cache write is intentionally last-writer-wins (and that holder-side cache writes in middleware/runtime inherit that behavior). Without the comment this test will mislead anyone trying to understand which runtime/db survives a reclaim race.
|
Review posted. Summary:
The fix itself is correct and well-tested. The remaining concerns are non-blocking edge cases on the recovery path (live-but-slow owner finishing after a reclaim), most of which are benign on the Workers deployment target and only meaningful on Node hosts. |
A reclaimed-but-live init that finished late could overwrite the reclaimer's published runtime/db, flipping the singleton back to a stale instance and orphaning the active cron scheduler (a timer leak on Node). initWithLock now passes an isCurrentClaim predicate to init; both call sites publish only while still the current owner, and a superseded runtime stops its own cron scheduler instead of leaking it. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
Follow-up round addressed in 6f8ab6d — all three reviewers converged on the publication side of the same race, and you're right that the cache write needed the same gating as the release: Generation-gated publication — The orphan-scheduler leak (ask-bonk) — went one better than suppressing the write: a superseded runtime now calls its own Db cache identity flip (ask-bonk/Copilot) — same gating on Test readability (ask-bonk) — added the requested comment: the existing convergence test's inits write the cache ungated, so it demonstrates last-writer-wins explicitly ( 10/10 lock tests, 2,643 unit + 1,039 integration, typecheck/lint/format clean. |
What does this PR do?
Fixes an isolate-poisoning deadlock in runtime and database initialization on workerd.
Mechanism:
getRuntime()(middleware) andgetDatabase()(runtime) guard cold-isolate initialization with single-owner state — aruntimeInitializingboolean and a shareddbInitPromiserespectively. If the request that owns the init is cancelled mid-await(client disconnects during cold migrations on contended D1), workerd tears down its continuation and thefinallythat releases the guard never runs. The guard is stuck forever: every subsequent request in that isolate spins in the wait path (or awaits a never-settling promise) until the platform kills it at the wall limit — observed in production as 524s at exactly ~100s, withcf-cache-status: BYPASS, zero bytes, and no Server-Timing. The isolate stays poisoned until eviction, and the CDN cache's request coalescing amplifies each poisoned fill to every concurrent visitor of that URL.External reproduction during diagnosis: probes that aborted in-flight cold requests drove the hang rate on a test deployment from ~17% to 44% of uncached requests — each abort during a cold init manufactured another poisoned isolate. This is self-reinforcing in production: slow cold init → user abandons → poisoned isolate → more 524s.
Fix: both sites now use a reclaimable init lock (
packages/core/src/utils/init-lock.ts):TDD: the test simulating the abandoned owner (init never settles, cleanup never runs) was written first and fails against the old boolean/shared-promise shape.
Closes #1274
Type of change
Checklist
pnpm typecheckpasses (packages/core; the repo-wide run fails on a pre-existing unbuilt-workspace error inregistry-client, unrelated to this change)pnpm lintpasses (zero new diagnostics vs. the pre-existing baseline)pnpm testpasses — core unit (174 files / 2,639 tests) and integration (70 files / 1,039 tests) suitespnpm formathas been runAI-generated code disclosure
Screenshots / test output
New suite:
packages/core/tests/unit/utils/init-lock.test.ts(6 tests) — cached short-circuit, single init shared by concurrent waiters, stale-owner reclaim after deadline (the production deadlock), lock release on init failure, bounded max wait, waiter picks up owner's cached value.Production fingerprint this fixes (test deployment, cache-busted requests): 524 at ~100,0xx ms,
cf-cache-status: BYPASS, 0 bytes, no Server-Timing; bimodal isolates (healthy ones serve ~300–500ms misses throughout).🤖 Generated with Claude Code
Try this PR
Open a fresh playground →
A full working EmDash site, deployed from this branch. Each visit gets its own session-scoped sandbox: no login needed and no shared state. Try the admin, edit content, hit the public site.
Tracks
fix/init-lock-poisoned-isolate. Updated automatically when the playground redeploys.