Skip to content

fix(core): prevent isolate-wide hang from a stalled session.get on Workers (#1274)#1538

Open
swissky wants to merge 2 commits into
emdash-cms:mainfrom
swissky:fix/session-get-hang
Open

fix(core): prevent isolate-wide hang from a stalled session.get on Workers (#1274)#1538
swissky wants to merge 2 commits into
emdash-cms:mainfrom
swissky:fix/session-get-hang

Conversation

@swissky

@swissky swissky commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

Fixes logged-in pages (admin and public) hanging indefinitely on Cloudflare Workers (#1274).

A request cancelled mid-session.get("user") (client disconnect / context teardown) can leave the underlying session-store read as a promise that never settles — it neither resolves nor rejects. Awaiting it directly hangs the request, and because the stalled promise is shared at the isolate level, every later session-bearing request on that isolate hangs too — the exact 0-CPU / multi-minute-wall / canceled / no-child-spans signature reported in the issue. It's reliably reproducible on fresh isolates right after a deploy, and a surrounding try/catch can't help because the promise never rejects.

handlePluginRouteAuth, handlePublicRouteAuth and handlePasskeyAuth now resolve the session user through a shared resolveSessionUser() helper that:

  1. Anchors the read with after() so a cancelled request still drives it to completion — the promise settles and the isolate is not poisoned for subsequent requests (prevents the hang rather than merely surviving it); and
  2. Applies a fail-closed timeout backstop — a still-stalled read resolves to undefined, and every caller already treats the absence of a session user as unauthenticated (anonymous on public routes, 401/redirect on protected ones). It can only ever drop privileges for that one request, never grant them.

This mirrors the reclaimable-cache / after() pattern already used in core for settings/secrets.

Closes #1274.

Type of change

  • Bug fix

Checklist

  • I have read CONTRIBUTING.md
  • pnpm typecheck passes (emdash package, workspace deps built)
  • pnpm lint passes (oxlint --type-aware --deny-warnings)
  • pnpm format has been run (oxfmt + prettier)
  • Tests: no new tests — the trigger is a cancellation/timing race in the workerd session layer that's hard to reproduce deterministically in a unit test; verified instead with live before/after Workers traces (below). Happy to add a test in whatever shape you prefer.
  • User-visible strings: none changed
  • Changeset added (emdash patch)

AI-generated code disclosure

  • This PR includes AI-generated code — model/tool: Cursor + Claude Opus 4.8

Verification — production, Cloudflare Workers, EmDash 0.21.0

Captured with wrangler tail (data anonymised — paths/timings/outcomes only).

Before — logged-in requests wedge:

request wall cpu outcome
GET /_emdash/admin 7,551 ms 0 ms canceled
GET /_emdash/admin/content/… 7,710 ms 0 ms canceled

After this change — fresh post-deploy isolate, same conditions, 26/26 admin+API requests ok:

request wall cpu outcome
GET /_emdash/admin 84 ms 15 ms ok
GET /_emdash/api/dashboard 452 ms 21 ms ok
GET /_emdash/api/content/… ~200 ms 2–14 ms ok
login · passkey verify · manifest · plugins · users · trash · authors <500 ms <100 ms ok

0 canceled, 0 hangs.

…rkers (emdash-cms#1274)

A request cancelled mid-`session.get("user")` on Cloudflare Workers can leave
the session-store read as a promise that never settles, poisoning the isolate
so every later session-bearing request hangs (≈0 CPU, multi-minute wall,
`canceled`) — reliably reproducible on fresh isolates right after a deploy.

The auth middleware now resolves the session user via a helper that:
  1. anchors the read with `after()` so a cancelled request still drives it to
     completion (the promise settles; the isolate is not poisoned), and
  2. applies a fail-closed timeout backstop — a still-stalled read degrades to
     "unauthenticated for this request" instead of hanging.
@changeset-bot

changeset-bot Bot commented Jun 18, 2026

Copy link
Copy Markdown

🦋 Changeset detected

Latest commit: 8dd4180

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 16 packages
Name Type
emdash Patch
@emdash-cms/cloudflare Patch
@emdash-cms/sandbox-workerd Patch
@emdash-cms/fixture-perf-site Patch
@emdash-cms/perf-demo-site Patch
@emdash-cms/cache-demo-site Patch
@emdash-cms/do-demo-site Patch
@emdash-cms/do-solo-demo-site Patch
@emdash-cms/admin Patch
@emdash-cms/auth Patch
@emdash-cms/blocks Patch
@emdash-cms/gutenberg-to-portable-text Patch
@emdash-cms/x402 Patch
create-emdash Patch
@emdash-cms/auth-atproto Patch
@emdash-cms/plugin-embeds Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@pkg-pr-new

pkg-pr-new Bot commented Jun 18, 2026

Copy link
Copy Markdown

Open in StackBlitz

@emdash-cms/admin

npm i https://pkg.pr.new/@emdash-cms/admin@1538

@emdash-cms/auth

npm i https://pkg.pr.new/@emdash-cms/auth@1538

@emdash-cms/auth-atproto

npm i https://pkg.pr.new/@emdash-cms/auth-atproto@1538

@emdash-cms/blocks

npm i https://pkg.pr.new/@emdash-cms/blocks@1538

@emdash-cms/cloudflare

npm i https://pkg.pr.new/@emdash-cms/cloudflare@1538

@emdash-cms/contentful-to-portable-text

npm i https://pkg.pr.new/@emdash-cms/contentful-to-portable-text@1538

emdash

npm i https://pkg.pr.new/emdash@1538

create-emdash

npm i https://pkg.pr.new/create-emdash@1538

@emdash-cms/gutenberg-to-portable-text

npm i https://pkg.pr.new/@emdash-cms/gutenberg-to-portable-text@1538

@emdash-cms/plugin-cli

npm i https://pkg.pr.new/@emdash-cms/plugin-cli@1538

@emdash-cms/plugin-types

npm i https://pkg.pr.new/@emdash-cms/plugin-types@1538

@emdash-cms/registry-client

npm i https://pkg.pr.new/@emdash-cms/registry-client@1538

@emdash-cms/registry-lexicons

npm i https://pkg.pr.new/@emdash-cms/registry-lexicons@1538

@emdash-cms/sandbox-workerd

npm i https://pkg.pr.new/@emdash-cms/sandbox-workerd@1538

@emdash-cms/x402

npm i https://pkg.pr.new/@emdash-cms/x402@1538

@emdash-cms/plugin-ai-moderation

npm i https://pkg.pr.new/@emdash-cms/plugin-ai-moderation@1538

@emdash-cms/plugin-atproto

npm i https://pkg.pr.new/@emdash-cms/plugin-atproto@1538

@emdash-cms/plugin-audit-log

npm i https://pkg.pr.new/@emdash-cms/plugin-audit-log@1538

@emdash-cms/plugin-color

npm i https://pkg.pr.new/@emdash-cms/plugin-color@1538

@emdash-cms/plugin-embeds

npm i https://pkg.pr.new/@emdash-cms/plugin-embeds@1538

@emdash-cms/plugin-field-kit

npm i https://pkg.pr.new/@emdash-cms/plugin-field-kit@1538

@emdash-cms/plugin-forms

npm i https://pkg.pr.new/@emdash-cms/plugin-forms@1538

@emdash-cms/plugin-webhook-notifier

npm i https://pkg.pr.new/@emdash-cms/plugin-webhook-notifier@1538

commit: 8dd4180

@emdashbot emdashbot Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approach: sound, but under-applied. Anchoring the session read with after()/waitUntil so a cancelled request still drives it to settlement — preventing the isolate poisoning rather than merely surviving it — plus a fail-closed timeout backstop, is the right pattern and mirrors the existing single-flight/reclaimable-cache + after() usage in emdash-runtime.ts, settings, and secrets. The fail-closed property holds: a stalled or rejecting read resolves to undefined, and every caller treats absence-of-user as unauthenticated, so the helper can only drop privileges, never grant them. The mechanics are correct — read is .catch-wrapped, the after callback handles both settle arms, and clearTimeout runs in finally (no timer leak, no unhandled rejection, no double read).

The headline problem is coverage. This PR hardens the three session.get("user") sites in auth.ts, but two structurally identical bare awaits are left untouched:

  • packages/core/src/astro/middleware.ts:419 — the main middleware, registered before auth (in integration/index.ts, all order: "pre" in registration order: emdash/middleware → redirect → setup → emdash/middleware/auth). So on a session-bearing request like GET /_emdash/admin this is the first session read, before the hardened auth middleware ever runs. By this PR's own theory, a client disconnect during that await leaves a never-settling promise that hangs the request here and poisons the isolate — the auth-middleware fix never executes. The fix as written doesn't fully prevent the hang it set out to prevent.
  • packages/core/src/astro/routes/api/snapshot.ts:34 — this route is in PUBLIC_API_EXACT, so the auth middleware return next()s in the isPublicApiRoute branch before any session read; this manual read is the endpoint's only session lookup, and its surrounding try/catch can't catch a never-settling promise (as the PR itself notes).

resolveSessionUser is currently module-local to auth.ts; extracting it to a shared module and applying it at both remaining sites closes the gap (the main middleware can keep its !hasSessionCookie short-circuit, which is the #733 KV-read-miss optimization).

Tests. None added. The workerd cancellation race itself isn't unit-testable, but the helper's two safety behaviors are: the timeout backstop (pass a session whose get("user") returns new Promise(() => {}) and assert resolveSessionUser resolves to undefined within the timeout instead of hanging) and fail-closed-on-reject. The infrastructure already exists — tests/unit/astro/middleware-prerender.test.ts mocks session.get and asserts on it for the #733 regression. Flagged below.

Checked: after()/clearTimeout/Promise.race mechanics; the fail-closed security property; import path and ordering; changeset (correct emdash/patch); every session.get("user") call site repo-wide (these two are the misses). Static review only — did not run typecheck/tests/lint.

Verdict: comment — the approach is right and the hardened paths are correct, but the same hole remains in the main middleware (the first session read on the request) and the snapshot route, and the testable backstop lacks a test.


Findings

  • [needs fixing] packages/core/src/astro/middleware.ts:418-419

    Line 419 still does a bare await context.session?.get("user") with no after() anchor and no timeout backstop — exactly the pattern this PR fixes in auth.ts. Crucially, this middleware runs first: in integration/index.ts it's registered before emdash/middleware/auth (both order: "pre", in registration order), so for a session-bearing request like GET /_emdash/admin this is the first session read, before the hardened auth middleware runs at all.

    By this PR's own theory (an unanchored read cancelled mid-flight leaves a never-settling promise that hangs the request and poisons the isolate), this read is equally susceptible: a client disconnect during this await stalls it here, the request hangs before auth.ts executes, and the isolate is poisoned. The fix as written does not fully prevent the hang it set out to prevent.

    resolveSessionUser is module-local to auth.ts; extract it to a shared module and use it here (keeping the !hasSessionCookie short-circuit, which is the #733 KV-read-miss optimization):

    		const sessionUser =
    			context.isPrerendered || !hasSessionCookie
    				? null
    				: await resolveSessionUser(context.session);
    

    (Requires importing the extracted helper; resolveSessionUser already accepts session | undefined and returns undefined when absent, so passing context.session directly is safe.)

  • [needs fixing] packages/core/src/astro/routes/api/snapshot.ts:34

    Same unprotected pattern: a bare await session.get("user"). The surrounding try/catch cannot help — as this PR notes, a stalled read neither resolves nor rejects, so the catch never fires and this request hangs (and can poison the isolate).

    This route is in PUBLIC_API_EXACT, so the auth middleware return next()s in the isPublicApiRoute branch before doing any session read. That makes this manual read the endpoint's only session lookup — the PR's resolveSessionUser never runs here. Lower traffic than the main middleware, but the same hole. Use the shared helper once it's extracted:

    			const sessionUser = await resolveSessionUser(session);
    
  • [needs fixing] packages/core/src/astro/middleware/auth.ts:65

    AGENTS.md requires TDD for bugs ("A bug without a reproducing test is not fixed"), and this helper adds new, load-bearing behavior with no test. The workerd cancellation race itself isn't unit-testable, but the helper's two safety behaviors are:

    1. Timeout backstop — pass a session whose get("user") returns a never-settling promise (new Promise(() => {})) and assert resolveSessionUser resolves to undefined within the timeout instead of hanging the test.
    2. Fail-closed on reject — pass a get that rejects and assert undefined.

    The infrastructure already exists: tests/unit/astro/middleware-prerender.test.ts mocks session.get and asserts on it for the #733 regression, so this is straightforward. Extracting/exporting resolveSessionUser (which the two findings above require for reuse in middleware.ts and snapshot.ts) also makes it directly unit-testable in isolation. The live Workers traces are good evidence the fix works in production, but they don't guard the backstop logic against a future regression.

@swissky

swissky commented Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

Can i get my own Emdashbot 😅 he is great

Addresses review feedback on emdash-cms#1538 — the guard was under-applied. Two
structurally identical bare `await session.get("user")` reads were left
unhardened:

  - middleware.ts: the main middleware runs *before* the auth middleware
    (both order: "pre"), so on a session-bearing request this is the *first*
    session read — it could hang and poison the isolate before the auth fix
    ever runs.
  - routes/api/snapshot.ts: a PUBLIC_API_EXACT route the auth middleware skips,
    so its manual read is the only session lookup.

Extract the guard to a shared `astro/session-user.ts` (`resolveSessionUser`)
and apply it at all three sites (main middleware keeps its `!hasSessionCookie`
short-circuit, the emdash-cms#733 KV-read-miss optimization). Add unit tests for the two
testable safety behaviors: timeout backstop on a never-settling read, and
fail-closed on a rejecting read.
@github-actions github-actions Bot added review/needs-rereview Author pushed changes since the last review and removed review/needs-review No maintainer or bot review yet labels Jun 18, 2026
@swissky

swissky commented Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

Great review — you're right, the guard was under-applied. Pushed a follow-up (8dd4180):

  1. middleware.ts:419 — agreed, this is the first session read and runs before the auth middleware, so the original fix wasn't actually covering the path it set out to. Now uses the shared helper, keeping the !hasSessionCookie short-circuit (the Avoid Astro session KV reads for anonymous public requests without session cookie #733 KV-read-miss optimization).
  2. snapshot.ts:34 — same treatment, switched to the helper.
  3. Extraction + tests — pulled resolveSessionUser into a shared astro/session-user.ts so all three sites import it, and added tests/unit/astro/session-user.test.ts for the two testable behaviors: a never-settling read resolves to undefined within the timeout instead of hanging, and a rejecting read fails closed to undefined.

Local checks pass: oxlint --type-aware --deny-warnings, tsgo typecheck, and the new tests (4/4).

@ascorbic

Copy link
Copy Markdown
Collaborator

btw, this is emdashbot if you do want one! https://github.com/emdash-cms/emdash/tree/main/infra/flue-review

@swissky

swissky commented Jun 19, 2026

Copy link
Copy Markdown
Contributor Author

I took #1538 for a proper spin on throwaway, isolated infra to check whether it actually clears the admin hang we keep hitting on Cloudflare Workers — and wanted to share what came out, including one gap and one unrelated bug.

Setup

What the cold path looks like

A cold isolate spends real time in getRuntime() before locals.emdash (incl. .db) is attached:

server-timing: rt;dur≈500 "Runtime init", rt.db;dur≈466 "DB init + migrations"

Warm, the same header is rt;dur=0. When several authenticated admin requests hit the same cold isolate, they race that init, and I saw two failure modes:

  1. The Session KV lookup hangs indefinitely, blocking entire page response for logged-in users #1274 hang. wrangler tail: outcome: canceled, cpuTime: 0ms, wallTime 100–680s, on both public pages and admin/API routes. 34 such canceled requests in a single window.
  2. NOT_CONFIGURED 500. getRuntime() throws transiently during the race, the middleware's catch leaves locals.emdash unset, and requireDb(emdash?.db) returns 500. Fast: cpuTime 2–4ms, wallTime 20–350ms. This is Intermittent "EmDash is not initialized" (NOT_CONFIGURED 500 / 503) on Cloudflare Workers admin under load — per-request emdash.db missing on some isolates #1462.

The #1538 timeout helps with (1) — it stops the unbounded hang — but on a fully wedged cold isolate (cpuTime: 0, no JS scheduled) the Promise.race can't always fire, and when the request does get through it then lands on (2). So with #1538 alone, under Smart Placement + cold isolates, the admin still doesn't come up — the symptom just shifts from "hangs forever" to "fast 500 / NOT_CONFIGURED".

Sanity checks that line up with that reading:

  • Smart Placement off, or once the isolate is warm (rt;dur=0) → admin loads fine. Prod mostly survives because steady traffic + a keep-warm cron keep isolates warm; the incident that made us disable placement was a cold burst.

What actually cleared it

Wrapping the getRuntime() call in the middleware in a small retry/wait — i.e. wait for init to finish instead of failing the request — removed both symptoms:

let runtime;
for (let attempt = 0; attempt < 4; attempt++) {
  try {
    runtime = await getRuntime(config, initSubTimings);
    if (runtime?.db) break;
  } catch (err) {
    if (attempt === 3) throw err;
  }
  await new Promise((r) => setTimeout(r, 75 * (attempt + 1)));
}

With this plus the #1538 session timeout, Smart Placement on, on a freshly deployed (cold) Worker:

  • ~150-request concurrent burst across /, /_emdash/admin/, /_emdash/api/{manifest,dashboard,auth/me}: 30×200, 30×302, 90×401, 0×500, 0 canceled, slowest 2.9s (the cold init — but it completes).
  • Authenticated admin loaded cleanly straight after the cold deploy; reloading the content list repeatedly → all 200/302, no NOT_CONFIGURED.
  • Full logout → clear cookies → fresh login cycle works (this was the exact thing that used to wedge).
  • cf-placement: local-ZRH throughout, so this is genuinely with Smart Placement on, not by turning it off.

So my takeaway: #1538 is necessary (it kills the unbounded hang) but on Workers it isn't sufficient on its own — the per-request emdash.db cold-init race (#1462) needs the runtime init to be awaited/retried. This also matches the "Expected" in #1462 ("the request should wait for initialization to complete"). Happy to open a separate PR for the getRuntime() retry against #1462, or fold it in here — whatever you prefer.

Separate bug: /authors 500

Independent of all the above: GET /_emdash/api/content/<collection>/authors returns 500 deterministically, even when emdash.db is healthy and the admin is otherwise fully usable (it only breaks the author-filter dropdown). Notes:

  • Happens for collections with authors and without them (e.g. an empty pages collection), so it's not the NULL-author data case.
  • The underlying SELECT DISTINCT author_id ... WHERE deleted_at IS NULL AND author_id IS NOT NULL runs fine by hand against the same D1.
  • console.error("Content authors error:", …) never surfaced in wrangler tail, so I couldn't capture the actual exception remotely.

Flagging it here because it shows up in the same traces and might be worth a look alongside #1462. Glad to provide tail JSON or test against a patch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/core cla: signed review/needs-rereview Author pushed changes since the last review size/M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Session KV lookup hangs indefinitely, blocking entire page response for logged-in users

2 participants