Track upstream mlx-swift distributed-backend status (jaccl deviation) #193

Description

@anupsv

Why this issue exists

provider-swift depends on the jaccl distributed backend + Cmlx library product from our Layr-Labs/mlx-swift fork. Upstream ml-explore/mlx-swift has not yet enabled distributed support — the original Package.swift even says // do not build distributed support (yet). We're tracking ahead of upstream because the d-inference cluster work (commits d1266de3, 64b8c2e1, a5049fb7) already depends on it.

This issue exists so the deviation is documented and revisitable rather than silent.

Current state

  • Our fork enables: jaccl backend + mlx-c/mlx/c/distributed*.cpp wrappers + Cmlx library product (see Layr-Labs/mlx-swift#2)
  • Other backends (mpi, ring, nccl) remain excluded
  • d-inference uses jaccl only for small collective-op synchronization between rank 0 / rank 1; large activation / token transfers go over plain TCP + AES-256-GCM (ThunderboltLink in provider-swift/Sources/ProviderCore/P2P/)
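For readers unfamiliar with the sealing step on the non-jaccl path, here is a minimal sketch of AES-256-GCM framing with Apple's CryptoKit. It is an illustration only: `sealFrame`, `openFrame`, and `SealedFrameError` are hypothetical names, not symbols from provider-swift, and the actual wire format and key exchange used by ThunderboltLink / EncryptedPipelineInference are not described in this issue.

```swift
import CryptoKit
import Foundation

// Hypothetical helper names for illustration; not taken from provider-swift.
enum SealedFrameError: Error {
    case missingCombinedRepresentation
}

/// Encrypts and authenticates one payload with AES-GCM.
/// CryptoKit generates a fresh random 12-byte nonce per call; the returned
/// frame is laid out as nonce (12 bytes) + ciphertext + tag (16 bytes).
func sealFrame(_ plaintext: Data, using key: SymmetricKey) throws -> Data {
    let box = try AES.GCM.seal(plaintext, using: key)
    // `combined` is nil only when the nonce is not 12 bytes long.
    guard let combined = box.combined else {
        throw SealedFrameError.missingCombinedRepresentation
    }
    return combined
}

/// Verifies the tag and decrypts a frame produced by `sealFrame`.
/// Throws if the frame was tampered with or the key does not match.
func openFrame(_ frame: Data, using key: SymmetricKey) throws -> Data {
    let box = try AES.GCM.SealedBox(combined: frame)
    return try AES.GCM.open(box, using: key)
}
```

A caller would create a 256-bit key with `SymmetricKey(size: .bits256)`, send the result of `sealFrame` over the TCP link, and pass the received bytes to `openFrame` on the other rank. How the two Macs agree on the key is outside the scope of this sketch.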

Upstream work to track

  • ml-explore/mlx-swift#371 — "Add distributed communication framework for multi-device tensor parallelism." Open since 2026-03-15. This is the upstream PR that, when merged, lifts the "(yet)" caveat. It adds full Swift bindings (DistributedGroup, MLXDistributed.allSum / .send / .recv, plus sharded NN layers).
    • If/when merged: rebase our fork onto upstream, drop any local Swift bindings in provider-swift that overlap with upstream's MLXDistributed namespace, and migrate MLXDistributed.swift to use upstream symbols.
    • If upstream goes a different direction (different API surface, different backend selection mechanism, etc.): we may need to refactor the rank-coordination layer in ClusterSession / ClusterPeer.

Known upstream jaccl bugs

Tracked here so we notice if they bite us:

  • ml-explore/mlx#3149 — JACCL point-to-point send/recv with varying shape produces wrong data or hangs. Low risk for our setup because activation transfer doesn't use jaccl, but watch if we ever route variable-length tensors through jaccl.
  • ml-explore/mlx#3467 — "Changing queue pair to RTR failed with errno 22" on Apple Thunderbolt RDMA, GID-selection regression introduced in #3412. Affects connection setup. If two-Mac smoke tests fail with RTR errors, this is the most likely cause.
  • ml-explore/mlx#3442 — backend="any" selects ring singleton instead of JACCL. Not affecting us; we invoke jaccl explicitly.
  • ml-explore/mlx#3162 — AppleThunderboltRDMA hard limit of 100 MRs caps JACCL multi-node scaling. Not affecting us; we're 2-Mac.

What this means operationally

  • The libs/mlx-swift submodule pointer in d-inference should track our fork's main (which will incorporate Layr-Labs/mlx-swift#2 once merged), not upstream's main.
  • When upstream #371 lands, file a separate issue to plan the migration and drop our local divergence.
  • If we ever consider migrating activation transfer onto jaccl (away from TCP + AES-GCM), re-evaluate #3149 + #3467 first.
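As a rough illustration of the submodule rule above, a SwiftPM manifest that consumes the fork through the submodule checkout might look like the sketch below. Everything except the `Cmlx` product name and the `ProviderCore` target name is an assumption: the relative path, the tools version, the platform floor, and the package identity are guesses, not copied from the real provider-swift manifest.

```swift
// swift-tools-version:5.9
// Hypothetical manifest sketch; not the actual provider-swift Package.swift.
import PackageDescription

let package = Package(
    name: "provider-swift",
    platforms: [.macOS(.v14)],  // assumed deployment target
    dependencies: [
        // Assumed layout: the fork is checked out as the libs/mlx-swift
        // submodule one level above this package, so the dependency follows
        // whatever commit the submodule pointer records.
        .package(path: "../libs/mlx-swift")
    ],
    targets: [
        .target(
            name: "ProviderCore",
            dependencies: [
                // Cmlx is the library product our fork exposes.
                .product(name: "Cmlx", package: "mlx-swift")
            ]
        )
    ]
)
```

With a path-based dependency like this, moving the submodule pointer is the only step needed to pick up a new fork revision; no version pin in the manifest has to change.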

Related code

  • provider-swift/Sources/ProviderCore/P2P/MLXDistributed.swift — Swift wrappers for jaccl C API
  • provider-swift/Sources/ProviderCore/P2P/ClusterSession.swift / ClusterPeer.swift — rank-coordination layer that uses jaccl collective ops
  • provider-swift/Sources/ProviderCore/P2P/ThunderboltLink.swift — plain TCP transport for activation tensors (the path that does NOT use jaccl)
  • provider-swift/Sources/ProviderCore/P2P/EncryptedPipelineInference.swift — AES-256-GCM sealing layer over ThunderboltLink
