Track upstream mlx-swift distributed-backend status (jaccl deviation) #193

## Why this issue exists

`provider-swift` depends on the jaccl distributed backend and the `Cmlx` library product from our [Layr-Labs/mlx-swift](https://github.com/Layr-Labs/mlx-swift) fork. Upstream [ml-explore/mlx-swift](https://github.com/ml-explore/mlx-swift) **has not yet enabled distributed support**; upstream's `Package.swift` still carries the comment `// do not build distributed support (yet)`. We're running ahead of upstream because the d-inference cluster work (commits `d1266de3`, `64b8c2e1`, `a5049fb7`) already depends on it.

This issue exists so the deviation is documented and revisitable rather than silent.

## Current state

- Our fork enables: jaccl backend + `mlx-c/mlx/c/distributed*.cpp` wrappers + `Cmlx` library product (see [Layr-Labs/mlx-swift#2](https://github.com/Layr-Labs/mlx-swift/pull/2))
- Other backends (mpi, ring, nccl) remain excluded
- d-inference uses jaccl **only for small collective-op synchronization** between rank 0 / rank 1; large activation / token transfers go over plain TCP + AES-256-GCM (`ThunderboltLink` in `provider-swift/Sources/ProviderCore/P2P/`)
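
The split in the last bullet can be pictured as a small routing rule. This is a minimal sketch for illustration, not the code in `provider-swift`: the `Payload` and `Transport` types and the `transport(for:)` function are hypothetical names that do not appear in the repository.

```swift
/// Hypothetical classification of what rank 0 and rank 1 exchange.
enum Payload {
    case syncSignal                    // small collective-op synchronization
    case activations(byteCount: Int)   // large activation tensors
    case tokens(count: Int)            // token transfers
}

/// Hypothetical labels for the two paths named in this issue.
enum Transport: Equatable {
    case jacclCollective   // jaccl collective ops
    case sealedTCP         // plain TCP + AES-256-GCM (ThunderboltLink)
}

/// Routes by purpose rather than by size: only synchronization uses jaccl.
func transport(for payload: Payload) -> Transport {
    switch payload {
    case .syncSignal:
        return .jacclCollective
    case .activations, .tokens:
        return .sealedTCP
    }
}
```

Keeping the decision in one exhaustive `switch` means that adding a new payload kind forces an explicit choice of path at compile time.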

## Upstream work to track

- **[ml-explore/mlx-swift#371](https://github.com/ml-explore/mlx-swift/pull/371)** — "Add distributed communication framework for multi-device tensor parallelism." Open since 2026-03-15. This is the upstream PR that, when merged, lifts the "(yet)" caveat. It adds full Swift bindings (`DistributedGroup`, `MLXDistributed.allSum` / `.send` / `.recv`, plus sharded NN layers).
  - **If/when merged:** rebase our fork onto upstream, drop any local Swift bindings in `provider-swift` that overlap with upstream's `MLXDistributed` namespace, and migrate `MLXDistributed.swift` to use upstream symbols.
  - **If upstream goes a different direction** (different API surface, different backend selection mechanism, etc.): we may need to refactor the rank-coordination layer in `ClusterSession` / `ClusterPeer`.
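
One way to keep either outcome cheap is to have the rank-coordination layer depend on a narrow protocol rather than on concrete bindings, so a migration touches a single adapter. The sketch below assumes that design. `CollectiveOps`, `LoopbackCollective`, and `allRanksReady` are hypothetical names, and nothing here reflects the actual signatures in `MLXDistributed.swift` or in upstream PR #371.

```swift
/// Hypothetical seam between rank coordination and whichever bindings are in use.
protocol CollectiveOps {
    var rank: Int { get }
    var size: Int { get }
    /// Element-wise sum of `values` across all ranks.
    func allSum(_ values: [Float]) throws -> [Float]
}

/// Single-process stand-in for tests: with one rank, the sum equals the input.
struct LoopbackCollective: CollectiveOps {
    let rank = 0
    let size = 1
    func allSum(_ values: [Float]) throws -> [Float] { values }
}

/// Example caller: every rank contributes 1.0, and the group is ready
/// when the reduced value equals the group size.
func allRanksReady(using ops: CollectiveOps) throws -> Bool {
    let total = try ops.allSum([1.0])
    return total.first == Float(ops.size)
}
```

With a seam like this, an adapter over the fork's jaccl wrappers and a later adapter over upstream symbols could both conform to the same protocol, leaving callers unchanged.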

## Known upstream jaccl bugs

Tracked here so we notice if they bite us:

- **[ml-explore/mlx#3149](https://github.com/ml-explore/mlx/issues/3149)** — JACCL point-to-point send/recv with varying shape produces wrong data or hangs. *Low risk for our shape because activation transfer doesn't use jaccl, but watch if we ever route variable-length tensors through jaccl.*
- **[ml-explore/mlx#3467](https://github.com/ml-explore/mlx/issues/3467)** — "Changing queue pair to RTR failed with errno 22" on Apple Thunderbolt RDMA, GID-selection regression introduced in #3412. *Affects connection setup. If two-Mac smoke tests fail with RTR errors, this is the most likely cause.*
- **[ml-explore/mlx#3442](https://github.com/ml-explore/mlx/issues/3442)** — `backend="any"` selects ring singleton instead of JACCL. *Not affecting us; we invoke jaccl explicitly.*
- **[ml-explore/mlx#3162](https://github.com/ml-explore/mlx/issues/3162)** — AppleThunderboltRDMA hard limit of 100 MRs caps JACCL multi-node scaling. *Not affecting us; we're 2-Mac.*
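
To make the #3467 failure easier to spot in two-Mac smoke tests, a log triage helper could match the error text quoted above. This is a sketch only: `KnownJacclIssue` and both functions are hypothetical names, and the assumption that the error string reaches the smoke-test log verbatim has not been checked against real output.

```swift
import Foundation

/// Upstream issues from this list that leave a recognizable trace in logs.
enum KnownJacclIssue: String {
    case rtrSetupFailure = "ml-explore/mlx#3467"
}

/// Returns the matching known issue for one log line, if any.
/// The substring is the error text quoted in ml-explore/mlx#3467.
func knownIssue(inLogLine line: String) -> KnownJacclIssue? {
    if line.contains("Changing queue pair to RTR failed with errno 22") {
        return .rtrSetupFailure
    }
    return nil
}

/// Scans a whole log and collects every known issue it mentions.
func knownIssues(inLog log: String) -> [KnownJacclIssue] {
    log.split(separator: "\n").compactMap { knownIssue(inLogLine: String($0)) }
}
```

A smoke-test harness could call `knownIssues(inLog:)` on captured output and print the `rawValue` of each hit, which points straight at the upstream issue.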

## What this means operationally

- The `libs/mlx-swift` submodule pointer in d-inference should track our fork's `main` (which will incorporate [Layr-Labs/mlx-swift#2](https://github.com/Layr-Labs/mlx-swift/pull/2) once merged), not upstream's `main`.
- When upstream [#371](https://github.com/ml-explore/mlx-swift/pull/371) lands, file a separate issue to plan the migration and drop our local divergence.
- If we ever consider migrating activation transfer onto jaccl (away from TCP + AES-GCM), re-evaluate #3149 + #3467 first.
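
The first point could be guarded by a small check that reads `.gitmodules` and confirms the `libs/mlx-swift` submodule points at the fork. This is a minimal sketch under stated assumptions: `submoduleURL(forPath:inGitmodules:)` and `tracksFork(_:)` are hypothetical helpers, the parser handles only simple `key = value` lines, and it verifies the remote only, not which commit is pinned.

```swift
import Foundation

/// Returns the `url` recorded for the submodule at `target` in .gitmodules text.
func submoduleURL(forPath target: String, inGitmodules text: String) -> String? {
    var path: String?
    var url: String?
    // The trailing "[" sentinel flushes the final section through the same branch.
    for rawLine in text.components(separatedBy: "\n") + ["["] {
        let line = rawLine.trimmingCharacters(in: .whitespacesAndNewlines)
        if line.hasPrefix("[") {
            // A new section header closes the previous section.
            if path == target, let url = url { return url }
            path = nil
            url = nil
            continue
        }
        // Split on the first "=" only, so URLs containing "=" stay intact.
        let parts = line.split(separator: "=", maxSplits: 1).map {
            $0.trimmingCharacters(in: .whitespaces)
        }
        guard parts.count == 2 else { continue }
        if parts[0] == "path" { path = parts[1] }
        if parts[0] == "url" { url = parts[1] }
    }
    return nil
}

/// True when the libs/mlx-swift submodule points at the Layr-Labs fork.
func tracksFork(_ gitmodules: String) -> Bool {
    guard let url = submoduleURL(forPath: "libs/mlx-swift", inGitmodules: gitmodules) else {
        return false
    }
    return url.contains("Layr-Labs/mlx-swift")
}
```

Matching on `Layr-Labs/mlx-swift` rather than a full URL lets both HTTPS and SSH remote forms pass.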

## Related code

- `provider-swift/Sources/ProviderCore/P2P/MLXDistributed.swift` — Swift wrappers for jaccl C API
- `provider-swift/Sources/ProviderCore/P2P/ClusterSession.swift` / `ClusterPeer.swift` — rank-coordination layer that uses jaccl collective ops
- `provider-swift/Sources/ProviderCore/P2P/ThunderboltLink.swift` — plain TCP transport for activation tensors (the path that does NOT use jaccl)
- `provider-swift/Sources/ProviderCore/P2P/EncryptedPipelineInference.swift` — AES-256-GCM sealing layer over `ThunderboltLink`
