
IOCP: reusing a completion while it is still queued corrupts the loop (null-result panic + orphaned completions) — the IOCP counterpart of #169/#224 #227

Description

@nanasess

TL;DR

On the IOCP backend, re-initializing/re-arming a Completion while it is still linked in self.completions corrupts the event loop. Loop.timer's c.* = .{...} resets the whole struct, clearing both c.result and the intrusive c.next pointer, which produces two symptoms from one root cause:

  1. Null-result panic: after completions.pop() returns c, the drain loop unwraps c.result.? ("Completion queue items MUST have a result set") on the cleared result → attempt to use null value.
  2. Orphaned completions: because c.next was also cleared, popping c sets head = null, permanently dropping every completion queued after c (their callbacks never fire → silent stalls). Same defect as #169 ("[bug] v.next is not proper cleared for re-enqueue a completion between different queues").

This is the IOCP counterpart of the multi-queue-membership class that #169 describes and that #224 is fixing for kqueue — but no open PR touches iocp.zig.

Verified mechanism (against 34fa508)

  • queue.Intrusive.pop does self.head = next.next then next.next = null (src/queue.zig); push asserts v.next == null.
  • Completion.next: ?*Completion = null — a default field, so c.* = .{...} in Loop.timer resets it to null.
  • So if c is in self.completions as [head=c → D → E] and any path re-initializes it via loop.timer(c, ...) (or the .active branch of timer_reset), then on the next drain:
    • c.result → null → panic at the c.result.? unwrap in the completions loop (src/backend/iocp.zig).
    • c.next → null → head becomes null and D, E are orphaned.
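
The mechanism above can be reproduced in isolation. This is a minimal sketch, assuming a simplified Completion and Queue that only mirror the push/pop behavior described in the bullets; they are stand-ins written for illustration, not xev's actual types, and the u32 result is a placeholder.

```zig
const std = @import("std");

// Simplified stand-in for xev's Completion: only the two fields that matter here.
const Completion = struct {
    next: ?*Completion = null,
    result: ?u32 = null,
};

// Simplified intrusive FIFO modeled on the push/pop behavior described above.
const Queue = struct {
    head: ?*Completion = null,
    tail: ?*Completion = null,

    fn push(self: *Queue, v: *Completion) void {
        std.debug.assert(v.next == null);
        if (self.tail) |tail| {
            tail.next = v;
            self.tail = v;
        } else {
            self.head = v;
            self.tail = v;
        }
    }

    fn pop(self: *Queue) ?*Completion {
        const next = self.head orelse return null;
        if (self.head == self.tail) self.tail = null;
        self.head = next.next;
        next.next = null;
        return next;
    }
};

test "re-initializing a queued completion orphans its successors" {
    var c: Completion = .{ .result = 1 };
    var d: Completion = .{ .result = 2 };
    var e: Completion = .{ .result = 3 };

    var q: Queue = .{};
    q.push(&c);
    q.push(&d);
    q.push(&e);

    // Stand-in for `c.* = .{...}` in Loop.timer while c is still linked.
    c = .{};

    const popped = q.pop().?;
    try std.testing.expect(popped == &c);
    // The result was cleared, so a `c.result.?` unwrap would fail here.
    try std.testing.expect(popped.result == null);
    // head is now null: d and e can no longer be reached through the queue.
    try std.testing.expect(q.pop() == null);
}
```

Run with zig test. The real Completion carries far more state, but only next and result are needed to show both symptoms.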

Secondary hazard: double-cancellation

A canceled timer is pushed to self.completions with its state kept .active (so the active count can be decremented on processing). If timer_reset runs on it while it is still pending there, it sees .active and schedules a second cancellation; the second stop_completion then removes the timer from self.timers where it no longer exists (heap corruption) and pushes it to the queue again.

How it was hit

A Ghostty-based Windows terminal (ghostinthewsl) on IOCP: the cursor-blink timer (renderer thread) re-arms/resets/cancels the same timer completion from several events (blink re-arm, focus change, content-driven reset), so it can be re-initialized while still queued. Symptoms matched exactly — intermittent crashes, and on an older related port, input/scroll freezing preceding the crash (the orphaned-completion stall). Windows/IOCP is WIP per the README.

Relation to #169 / #170 / #224

What I ran in production (a mitigation, not a fix)

As a pragmatic guard in a fork, I skip a popped entry whose result is null, before touching its reused state:

diff --git a/src/backend/iocp.zig b/src/backend/iocp.zig
@@ -259,6 +259,17 @@ pub const Loop = struct {
 
             // Process the completions we already have completed.
             while (self.completions.pop()) |c| {
+                // Mitigation (not a fix): a stale entry whose result was cleared
+                // because the completion was re-initialized (c.* = .{...}) while
+                // still linked here. Skip it before touching the reused state.
+                // Prevents the c.result.? panic; does NOT fix the orphaned-
+                // completion / double-cancellation corruption from reuse-while-queued.
+                if (c.result == null) continue;
+
                 const c_active = c.flags.state == .active;

Full patch (also carries a parity guard in kqueue, which I'd defer to #224): https://patch-diff.githubusercontent.com/raw/nanasess/libxev/pull/1.patch

Why this mitigation is empirically sufficient for this case

With a single repeating timer, the queued completion typically has no successors, so clearing its next orphans nothing — only the result-null panic remains, which the guard skips. A -Doptimize=ReleaseFast build has run ~1 week with zero crashes (previously it crashed within minutes–hours). Note that in ReleaseFast neither the c.result.? safety check nor assert(v.next == null) fires, so the corruption would be silent there; the explicit if (c.result == null) guard still takes effect. The general case (a queued completion with successors) is not mitigated — that needs the ownership fix.
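
To make the limit of the guard concrete, here is a small model of the drain loop with the same null-result check. It is a sketch under assumptions: Completion, Queue, and drain are simplified stand-ins written for this illustration (not xev's code), and "processed" just counts entries that would have reached their callback.

```zig
const std = @import("std");

// Simplified stand-ins, not xev's actual types.
const Completion = struct {
    next: ?*Completion = null,
    result: ?u32 = null,
};

const Queue = struct {
    head: ?*Completion = null,
    tail: ?*Completion = null,

    fn push(self: *Queue, v: *Completion) void {
        std.debug.assert(v.next == null);
        if (self.tail) |tail| {
            tail.next = v;
            self.tail = v;
        } else {
            self.head = v;
            self.tail = v;
        }
    }

    fn pop(self: *Queue) ?*Completion {
        const next = self.head orelse return null;
        if (self.head == self.tail) self.tail = null;
        self.head = next.next;
        next.next = null;
        return next;
    }
};

// Hypothetical drain loop carrying the guard from the patch.
// Returns how many completions would have been processed.
fn drain(q: *Queue) usize {
    var processed: usize = 0;
    while (q.pop()) |c| {
        if (c.result == null) continue; // the mitigation: skip stale entries
        processed += 1;
    }
    return processed;
}

test "guard suffices when the reset completion has no successors" {
    var c: Completion = .{ .result = 1 };
    var q: Queue = .{};
    q.push(&c);
    c = .{}; // re-initialized while queued

    try std.testing.expectEqual(@as(usize, 0), drain(&q));

    // The queue is still consistent: a fresh push is processed normally.
    c.result = 7;
    q.push(&c);
    try std.testing.expectEqual(@as(usize, 1), drain(&q));
}

test "guard does not help when the reset completion has successors" {
    var c: Completion = .{ .result = 1 };
    var d: Completion = .{ .result = 2 };
    var e: Completion = .{ .result = 3 };
    var q: Queue = .{};
    q.push(&c);
    q.push(&d);
    q.push(&e);
    c = .{}; // re-initialized while queued

    // d and e have results but are never reached.
    try std.testing.expectEqual(@as(usize, 0), drain(&q));

    // In this model the stale tail also swallows later pushes.
    var f: Completion = .{ .result = 4 };
    q.push(&f);
    try std.testing.expectEqual(@as(usize, 0), drain(&q));
}
```

The second test suggests that, at least in this model, the damage is not limited to the entries queued at the time of the reset: head stays null while tail still points into the orphaned chain, so later pushes are lost as well.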

Suggested direction

Consistent with #169 / #170 / #224: prevent a completion from being re-initialized/re-armed while it is a member of an intrusive queue (or dequeue it safely first), rather than clearing next / result after the fact. The correct fix touches the loop's ownership model, so I'm leaving the design to you.
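
One possible shape of "dequeue it safely first" is sketched below. Everything in it is an assumption for illustration: the queued flag, Queue.remove, and rearm do not exist in xev, and a real completion can sit in more than one kind of queue, which this single-queue model ignores.

```zig
const std = @import("std");

const Completion = struct {
    next: ?*Completion = null,
    result: ?u32 = null,
    // Hypothetical field (not in xev): true while linked in the queue.
    queued: bool = false,
};

const Queue = struct {
    head: ?*Completion = null,
    tail: ?*Completion = null,

    fn push(self: *Queue, v: *Completion) void {
        std.debug.assert(v.next == null and !v.queued);
        v.queued = true;
        if (self.tail) |tail| {
            tail.next = v;
            self.tail = v;
        } else {
            self.head = v;
            self.tail = v;
        }
    }

    fn pop(self: *Queue) ?*Completion {
        const next = self.head orelse return null;
        if (self.head == self.tail) self.tail = null;
        self.head = next.next;
        next.next = null;
        next.queued = false;
        return next;
    }

    // Hypothetical O(n) unlink of v from wherever it sits in the queue.
    fn remove(self: *Queue, v: *Completion) void {
        var prev: ?*Completion = null;
        var it = self.head;
        while (it) |cur| : ({
            prev = cur;
            it = cur.next;
        }) {
            if (cur != v) continue;
            if (prev) |p| {
                p.next = cur.next;
            } else {
                self.head = cur.next;
            }
            if (self.tail == cur) self.tail = prev;
            cur.next = null;
            cur.queued = false;
            return;
        }
    }
};

// Hypothetical re-arm helper: unlink first, then reset the struct.
fn rearm(q: *Queue, c: *Completion) void {
    if (c.queued) q.remove(c);
    c.* = .{};
}

test "re-arming a queued completion keeps its successors reachable" {
    var c: Completion = .{ .result = 1 };
    var d: Completion = .{ .result = 2 };
    var e: Completion = .{ .result = 3 };
    var q: Queue = .{};
    q.push(&c);
    q.push(&d);
    q.push(&e);

    rearm(&q, &c);

    try std.testing.expect(q.pop().? == &d);
    try std.testing.expect(q.pop().? == &e);
    try std.testing.expect(q.pop() == null);
}
```

The linear scan is only there to keep the sketch short. Whether membership should be tracked with a flag, a state enum, or by refusing the re-arm outright is exactly the ownership question left to the maintainers.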

Disclosure / environment

  • I'm not fluent in Zig, and I'm not confident this patch is the correct fix — it's a mitigation I verified empirically, not something I can vouch for at the internals level. This was diagnosed and written with AI assistance. I'm providing it as a reference patch only (not a PR) and leaving the proper fix to you.
  • Windows 11 + WSL2, zig 0.15.2, -Dtarget=x86_64-windows-gnu. Minidumps / symbolized stacks available on request.
