Windows CI times out (35m) when Kubo MFS files.rm hangs in community.delete() cleanup #101

@Rinse12

Summary

Windows CI jobs (test-pkc-js-node-local (windows-latest)) intermittently hit the 35-minute workflow timeout and get cancelled. The root cause is Kubo's files.rm hanging during community.delete() cleanup, which stalls the whole test job (and effectively all interaction with the in-process Kubo node) until the workflow timeout cancels it.

This appears to be an upstream Kubo bug, tracked as ipfs/kubo#10842 (MFS bug: 'ipfs files rm' hanging).

Most recent occurrence

Evidence

From the per-test stderr log for test/node/community/incoming.community.address.rejection.community.test.ts (the test file is unrelated to PR #99; it is a pre-existing test that happened to be the unlucky one in this run):

  • 09:49:13 — test starts
  • 09:51:56 — community stop completes; afterAll calls community.delete()
  • 09:53:56 — first MFS rm attempt times out:
PKCError: Timed out removing MFS paths. We may need to nuke the whole MFS directory and republish everything
Details:
  {
    "toDeleteMfsPaths": ["/12D3KooW..."],
    "kuboRpcUrl": "http://localhost:15001/api/v0"
  }
    at RetryOperation._fn (src/util.ts:954:34)
    at removeMfsFilesSafely (src/util.ts:936:12)
    at deleteCommunity (src/runtime/node/community/local-community/lifecycle.ts:458:35)
  • removeMfsFilesSafely (in src/util.ts) then retries 3 more times, each with a pTimeout of 120s plus exponential backoff, which adds roughly 8 more minutes of hang per delete
  • Workflow timeout hits before the test suite can move on
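
The retry arithmetic in the timeline can be sketched as a small calculation. This is only an illustration, not code from src/util.ts: the helper name worstCaseMs and the RetryBudget shape are hypothetical, and the backoff parameters (1s initial delay, factor 2) are assumptions, since the issue only says the backoff is exponential. The retry count (3) and per-attempt timeout (120000 ms) are the values quoted in this issue.

```typescript
// Hypothetical helper: estimates how long one cleanup call can block in the worst case.
interface RetryBudget {
  retries: number;       // retries after the first attempt (this issue cites a default of 3)
  timeoutMs: number;     // per-attempt timeout (this issue cites 120000)
  minBackoffMs: number;  // ASSUMPTION: delay before the first retry
  backoffFactor: number; // ASSUMPTION: exponential growth factor between retries
}

function worstCaseMs(b: RetryBudget): number {
  const attempts = b.retries + 1;
  // Every attempt runs until its timeout fires.
  let total = attempts * b.timeoutMs;
  // One backoff wait sits between each pair of consecutive attempts.
  for (let i = 0; i < b.retries; i++) {
    total += b.minBackoffMs * Math.pow(b.backoffFactor, i);
  }
  return total;
}

const current = worstCaseMs({ retries: 3, timeoutMs: 120_000, minBackoffMs: 1_000, backoffFactor: 2 });
const reduced = worstCaseMs({ retries: 1, timeoutMs: 30_000, minBackoffMs: 1_000, backoffFactor: 2 });

console.log(`current budget: ${(current / 60_000).toFixed(1)} min per delete`);
console.log(`reduced budget: ${(reduced / 60_000).toFixed(1)} min per delete`);
```

Under these assumptions a single hanging delete costs about 8 minutes in total, so a few affected test files are enough to exhaust a 35-minute job; the reduced budget is just one example of what a shorter cleanup path might look like.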

Why it's not the PR

  • The test file is not touched on the PR branch and uses createSubWithNoChallenge, so the challenge-result-shape work doesn't intersect with it
  • Same code path on master completes in ~26 minutes (see run 26288400419)
  • The hang is in the Kubo MFS RPC and has only been seen on Windows, independent of pkc-js logic

Mitigation options

Pick one (or more) depending on appetite:

  1. Wait on upstream: track ipfs/kubo#10842 (MFS bug: 'ipfs files rm' hanging) and bump the pinned Kubo version once a fix lands
  2. Reduce retries in removeMfsFilesSafely — 4 attempts × 120s is a lot of dead time during teardown; dropping inputNumOfRetries (default 3) or milliseconds (currently 120000) for the cleanup path would surface the failure faster and let the rest of the suite continue
  3. Increase the Windows job timeout-minutes — masks the symptom but won't fix the underlying hang
  4. Fall back to nuking the whole MFS directory when the targeted rm fails enough times (the error message itself hints at this)
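
Options 2 and 4 could be combined roughly as follows. This is a minimal sketch under stated assumptions, not the existing removeMfsFilesSafely: every name here (RmFn, withTimeout, removeWithFallback) is hypothetical, and rm stands in for whatever call actually issues the Kubo MFS removal. It also assumes that removing the parent directory succeeds where the targeted rm hangs, which this issue does not confirm.

```typescript
// Hypothetical stand-in for the call that removes MFS paths through the Kubo RPC.
type RmFn = (paths: string[]) => Promise<void>;

// Rejects if `p` does not settle within `ms`. The underlying call is NOT cancelled.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
    p.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Tries the targeted removal a bounded number of times, then removes `fallbackRoot` instead.
async function removeWithFallback(
  rm: RmFn,
  targetPaths: string[],
  fallbackRoot: string,
  opts: { attempts: number; timeoutMs: number },
): Promise<"targeted" | "fallback"> {
  for (let i = 0; i < opts.attempts; i++) {
    try {
      await withTimeout(rm(targetPaths), opts.timeoutMs);
      return "targeted";
    } catch {
      // Ignore and retry; a real version would log the error and back off here.
    }
  }
  // Targeted removal kept failing or hanging: remove the whole directory.
  await withTimeout(rm([fallbackRoot]), opts.timeoutMs);
  return "fallback";
}
```

With attempts: 2 and timeoutMs: 30000, a hanging delete would give up on the targeted path after about a minute instead of eight. As the error message notes, removing the whole MFS directory means everything has to be republished, so the fallback is probably only appropriate for test teardown or as a last resort.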

Upstream reference

ipfs/kubo#10842 (MFS bug: 'ipfs files rm' hanging)
