Why this issue exists
provider-swift depends on the jaccl distributed backend + Cmlx library product from our Layr-Labs/mlx-swift fork. Upstream ml-explore/mlx-swift has not yet enabled distributed support — the original Package.swift even says // do not build distributed support (yet). We're tracking ahead of upstream because the d-inference cluster work (commits d1266de3, 64b8c2e1, a5049fb7) already depends on it.
This issue exists so the deviation is documented and revisitable rather than silent.
Current state
- Our fork enables: jaccl backend + mlx-c/mlx/c/distributed*.cpp wrappers + Cmlx library product (see Layr-Labs/mlx-swift#2)
- Other backends (mpi, ring, nccl) remain excluded
- d-inference uses jaccl only for small collective-op synchronization between rank 0 / rank 1; large activation / token transfers go over plain TCP + AES-256-GCM (ThunderboltLink in provider-swift/Sources/ProviderCore/P2P/)
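For readers unfamiliar with the non-jaccl path, here is a minimal sketch of what AES-256-GCM sealing of an activation payload can look like with Apple's CryptoKit. It is not the code in EncryptedPipelineInference.swift: the function names sealActivation / openActivation, the error type, and the framing (CryptoKit's combined nonce + ciphertext + tag layout) are assumptions made for illustration, and key exchange between the two ranks is out of scope.

```swift
import CryptoKit
import Foundation

// Hypothetical helpers for illustration; not the provider-swift API.
enum SealError: Error { case missingCombinedForm }

/// Seals a payload with AES-GCM and returns nonce || ciphertext || tag.
func sealActivation(_ payload: Data, using key: SymmetricKey) throws -> Data {
    // CryptoKit generates a fresh random 12-byte nonce when none is supplied.
    let box = try AES.GCM.seal(payload, using: key)
    // `combined` is nil only for non-default nonce sizes.
    guard let combined = box.combined else { throw SealError.missingCombinedForm }
    return combined
}

/// Authenticates and decrypts a frame produced by `sealActivation`.
func openActivation(_ frame: Data, using key: SymmetricKey) throws -> Data {
    let box = try AES.GCM.SealedBox(combined: frame)
    return try AES.GCM.open(box, using: key)  // throws if the tag does not verify
}

// Round trip with a 256-bit key and a stand-in 1 KiB payload.
let key = SymmetricKey(size: .bits256)
let activation = Data((0..<1024).map { UInt8($0 % 256) })
let frame = try sealActivation(activation, using: key)
let roundTripped = try openActivation(frame, using: key)
```

Each sealed frame carries 28 bytes of overhead (12-byte nonce plus 16-byte tag), which is negligible for large activation tensors but is one reason small synchronization messages are a different trade-off.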
Upstream work to track
- ml-explore/mlx-swift#371: "Add distributed communication framework for multi-device tensor parallelism." Open since 2026-03-15. This is the upstream PR that, when merged, lifts the "(yet)" caveat. It adds full Swift bindings (DistributedGroup, MLXDistributed.allSum / .send / .recv, plus sharded NN layers).
- If/when merged: rebase our fork onto upstream, drop any local Swift bindings in provider-swift that overlap with upstream's MLXDistributed namespace, and migrate MLXDistributed.swift to use upstream symbols.
- If upstream goes a different direction (different API surface, different backend selection mechanism, etc.): we may need to refactor the rank-coordination layer in ClusterSession / ClusterPeer.
Known upstream jaccl bugs
Tracked here so we notice if they bite us:
- ml-explore/mlx#3149: JACCL point-to-point send/recv with varying shape produces wrong data or hangs. Low risk for our setup because activation transfer doesn't use jaccl, but watch if we ever route variable-length tensors through jaccl.
- ml-explore/mlx#3467: "Changing queue pair to RTR failed with errno 22" on Apple Thunderbolt RDMA, a GID-selection regression introduced in #3412. Affects connection setup. If two-Mac smoke tests fail with RTR errors, this is the most likely cause.
- ml-explore/mlx#3442: backend="any" selects ring singleton instead of JACCL. Not affecting us; we invoke jaccl explicitly.
- ml-explore/mlx#3162: AppleThunderboltRDMA hard limit of 100 MRs caps JACCL multi-node scaling. Not affecting us; we're 2-Mac.
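To make the #3467 check less dependent on someone remembering the error text, a smoke-test harness could scan captured output for that signature. The sketch below is only an illustration: the helper name rtrFailureLines and the sample log lines are made up, and it assumes the error string appears verbatim in whatever the smoke test captures.

```swift
import Foundation

// Error text quoted in ml-explore/mlx#3467.
let rtrSignature = "Changing queue pair to RTR failed with errno 22"

/// Hypothetical triage helper: returns the log lines that contain the RTR signature.
func rtrFailureLines(in log: String) -> [String] {
    log.split(separator: "\n")
        .map(String.init)
        .filter { $0.contains(rtrSignature) }
}

// Made-up sample output; real smoke-test logs will be formatted differently.
let sampleLog = """
[rank 0] connecting to peer
[rank 0] Changing queue pair to RTR failed with errno 22
[rank 1] timed out waiting for peer
"""

let hits = rtrFailureLines(in: sampleLog)
if !hits.isEmpty {
    print("Likely ml-explore/mlx#3467: \(hits.count) matching line(s)")
}
```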
What this means operationally
- The libs/mlx-swift submodule pointer in d-inference should track our fork's main (which will incorporate Layr-Labs/mlx-swift#2 once merged), not upstream's main.
- When upstream #371 lands, file a separate issue to plan the migration and drop our local divergence.
- If we ever consider migrating activation transfer onto jaccl (away from TCP + AES-GCM), re-evaluate #3149 + #3467 first.
Related code
provider-swift/Sources/ProviderCore/P2P/MLXDistributed.swift — Swift wrappers for jaccl C API
provider-swift/Sources/ProviderCore/P2P/ClusterSession.swift / ClusterPeer.swift — rank-coordination layer that uses jaccl collective ops
provider-swift/Sources/ProviderCore/P2P/ThunderboltLink.swift — plain TCP transport for activation tensors (the path that does NOT use jaccl)
provider-swift/Sources/ProviderCore/P2P/EncryptedPipelineInference.swift — AES-256-GCM sealing layer over ThunderboltLink