First-spawn chmod races mcpb extraction: ENOENT on bundle install #270

@mgoldsborough

Symptom

First-time install of a sizeable Python .mcpb bundle from the mpak registry fails to start with an ENOENT on a vendored dylib path. Restart succeeds — the file IS at that path on retry. Observed during Phase 2c validation, installing @nimblebraininc/synapse-research@0.3.0 (270 MB packed):

[api] [workspace-runtime] Failed to start synapse-research in ws_mat:
  ENOENT: no such file or directory, chmod
  '/Users/.../apps/cache/nimblebraininc-synapse-research/deps/PIL/.dylibs/libwebp.7.dylib'

A post-failure check confirms the file is at that path (Mach-O 64-bit dynamically linked shared library arm64, 515712 bytes). It simply had not landed yet when chmod ran.

Hypothesis

The platform's mpak.bundleCache.loadBundle(name) call (src/bundles/lifecycle.ts:161) returns control to startBundleSource before the .mcpb extraction has fully fsync'd. The downstream chmod walk (wherever it lives — exec-bit prep for the bundle's Python binary?) then walks the deps tree and hits a file whose directory entry exists but whose contents (or symlink target) haven't fully settled.

Two possible failure shapes:

  1. Async extraction race: extract returns when the in-progress writes are queued but not yet visible to a sibling process / async stat.
  2. Dangling symlink in a Pillow .dylibs/ directory: the macOS framework dylibs in Pillow's .dylibs/ are sometimes symlinks. chmod follows symlinks by default (O_NOFOLLOW is an open() flag and does not apply here), so if the walk calls chmod on a link whose target isn't extracted yet, ENOENT fires.

The retry-succeeds behaviour points at (1) more strongly — by the time the user restarts, the FS has caught up.
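To tell the two failure shapes apart at the moment of failure, the path from the error could be classified with a no-follow stat and then a following stat. The sketch below is only illustrative: `classifyPath` and the `PathState` names are hypothetical and not part of the platform or mpak, and it uses only Node's built-in `fs` module.

```typescript
import { lstatSync, statSync } from "node:fs";
import type { Stats } from "node:fs";

// Hypothetical helper, not an existing platform or mpak API.
type PathState = "missing" | "dangling-symlink" | "symlink" | "present";

function errCode(err: unknown): string | undefined {
  return (err as { code?: string }).code;
}

export function classifyPath(path: string): PathState {
  let link: Stats;
  try {
    // lstat does not follow symlinks: ENOENT here means there is
    // no directory entry at all, which is consistent with shape (1).
    link = lstatSync(path);
  } catch (err) {
    if (errCode(err) === "ENOENT") return "missing";
    throw err;
  }
  if (!link.isSymbolicLink()) return "present";
  try {
    // stat follows the link: ENOENT here means the entry exists but
    // its target does not, which is consistent with shape (2).
    statSync(path);
    return "symlink";
  } catch (err) {
    if (errCode(err) === "ENOENT") return "dangling-symlink";
    throw err;
  }
}
```

Logging this result for the failing path inside the chmod walk's error handler would show `missing` for an extraction race and `dangling-symlink` for the symlink case. Because the file settles shortly after, the check is only meaningful if it runs at failure time, not after a restart.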

Reproduce

  1. Wipe ~/.nimblebrain/apps/cache/nimblebraininc-synapse-research/ (or any sizable Python bundle's cache dir).
  2. Start bun run dev with a workspace that lists the bundle by registry name.
  3. Observe the ENOENT during initial bundle start.
  4. Ctrl-C, restart bun run dev, bundle starts normally.

Suggested fix

  1. await the extraction settle before returning from mpak.bundleCache.loadBundle. The cleanest place; if the cache layer guarantees "all files written + fsynced when this promise resolves," the chmod race goes away. Worth verifying mpak's contract here.
  2. Or, retry-with-backoff in the chmod walk for ENOENT specifically. Lower confidence — masks the real bug; risks hiding genuine missing-file cases.
  3. Or, skip chmod on broken symlinks if failure shape (2) from the hypothesis is the actual cause: lstat each entry, skip symlinks (or use a no-follow chmod), and wrap the call in try/catch.

Option (1) is the principled fix.
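For reference, the symlink-skipping variant in option (3) might look roughly like the sketch below. It is written under stated assumptions: the real chmod walk has not been located, so `addExecBits` is a hypothetical stand-in, and the guess that the walk adds execute bits comes from the hypothesis above rather than from the code. It does nothing for the extraction race in failure shape (1).

```typescript
import { chmodSync, lstatSync, readdirSync } from "node:fs";
import { join } from "node:path";

// Hypothetical stand-in for the platform's chmod walk.
// Adds execute bits to regular files and never calls chmod on a symlink.
// Returns the symlink paths it skipped so callers can log them.
export function addExecBits(root: string): string[] {
  const skipped: string[] = [];
  // withFileTypes reports the entry's own type, so a symlink is seen
  // as a symlink whether or not its target exists yet.
  for (const entry of readdirSync(root, { withFileTypes: true })) {
    const path = join(root, entry.name);
    if (entry.isSymbolicLink()) {
      skipped.push(path);
      continue;
    }
    if (entry.isDirectory()) {
      skipped.push(...addExecBits(path));
      continue;
    }
    if (!entry.isFile()) continue;
    // Preserve the existing permission bits and add execute for all.
    const mode = lstatSync(path).mode & 0o777;
    chmodSync(path, mode | 0o111);
  }
  return skipped;
}
```

Skipping symlinks is safe for permission prep because the link's target is either a regular file elsewhere in the tree, which the walk reaches on its own, or outside the bundle, where the walk arguably should not be changing modes.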

Workaround

Currently: bundle authors / dev users restart the platform once after first install. Bad UX for first-time install of any registry bundle large enough for the race to fire.

Context

Surfaced during Phase 2c production-validation of @nimblebraininc/synapse-research@0.3.0 — the bug only fires on first install (sizable Python bundles with vendored Pillow / numpy / similar deps). Closing the production bug end-to-end required one restart workaround. Worth fixing before this becomes the first impression for every new user installing a Python bundle.
