Provider MLX continuous-batching crash: [concatenate] shape mismatch merging concurrent requests (Qwen3.5 hybrid model)

## Summary

Under concurrent requests to a single model, the Swift provider's MLX continuous-batching engine crashes with a fatal `[concatenate]` shape-mismatch, killing the provider process. All in-flight requests then fail with `502 provider_error: "inference failed after N attempt(s): provider disconnected"`. It is **intermittent** — `master` passes the same test, but it recurs across PR branches.

## The crash

```
MLX/ErrorHandler.swift:345: Fatal error: [concatenate] All the input array dimensions
must match exactly except for the concatenation axis. However, the provided shapes are
(2,8,24,64), (1,8,24,192), and the concatenation axis is -1.
  at libs/mlx-swift/Source/Cmlx/mlx-c/mlx/c/ops.cpp:677
```

The two operands are `[batch, kv_heads, seq, head_dim]`:
- `(2, 8, 24, 64)`  — batch 2, head_dim **64**
- `(1, 8, 24, 192)` — batch 1, head_dim **192**

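Concatenating on axis `-1` requires every *other* axis to match, so the **batch dims (2 vs 1)** are what violate the rule. The last axis (64 vs 192) is the concatenation axis itself and is allowed to differ, but a 64-vs-192 split still suggests the operands come from differently shaped caches. Either way, two KV-cache tensors of incompatible geometry are being merged in the same `concatenated` call.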
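To make the rule concrete, here is a minimal shape-only sketch in plain Swift (no MLX dependency). `mismatchedAxes` is a hypothetical helper written for this issue, not an MLX API; it reports which non-concatenation axes disagree for the two shapes in the crash message.

```swift
/// Hypothetical helper: returns the axes, other than the concatenation axis,
/// on which two equal-rank shapes disagree. An empty result means the
/// concatenation would be legal.
func mismatchedAxes(_ a: [Int], _ b: [Int], concatAxis: Int) -> [Int] {
    precondition(a.count == b.count, "shapes must have the same rank")
    // Normalise a negative axis, e.g. -1 means the last axis.
    let axis = concatAxis < 0 ? concatAxis + a.count : concatAxis
    return a.indices.filter { $0 != axis && a[$0] != b[$0] }
}

let lhs = [2, 8, 24, 64]
let rhs = [1, 8, 24, 192]

// On axis -1 only the batch axis (0) is in violation.
print(mismatchedAxes(lhs, rhs, concatAxis: -1)) // [0]
// On axis 0 the last axis (3) would be the one in violation.
print(mismatchedAxes(lhs, rhs, concatAxis: 0))  // [3]
```

Whichever axis the failing call uses, these two shapes cannot be concatenated, which is consistent with the operands coming from different cache geometries or different batch states.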

## Why this is a real model-shape collision (not just a transient)

`mlx-community/Qwen3.5-0.8B-MLX-4bit` is a **hybrid architecture**. The provider's own KV estimator classifies its layers into distinct geometries (`BatchScheduler+KVEstimation.swift`):
- `linear_attention` → GatedDeltaNet **recurrent-state** layers (`recurrentLayerTypes`, ~line 39-41)
- full-attention layers with `head_dim` vs `global_head_dim` and a `sliding_window_pattern` (lines 86-99)

So `head_dim 64` vs `192` corresponds to genuinely different per-layer cache/state shapes. When two concurrent requests at **different stages** (one mid-prefill, one mid-decode — hence batch 1 vs 2) are merged into one `GenerationBatch`, the per-layer cache merge concatenates tensors whose item-shapes don't line up.

## Suspected location (hypothesis — needs confirmation)

The batch-merge path lives in the **`libs/mlx-swift-lm` submodule** (Layr-Labs fork), not in `d-inference` directly:

- `Libraries/MLXLMCommon/ContinuousBatching/Scheduler.swift` — `mergeIntoGenBatch(_:)` merges a newly-prefilled generation into the running batch.
- `Libraries/MLXLMCommon/GenerationBatch.swift` — `extend(_ other:)` per-layer merge.
- `Libraries/MLXLMCommon/KVCache.swift` — `ArraysCache.extend(other:)` → `concatenateOptional(...)` builds zero-padded operands from `itemShape = shape.dropFirst()`, taking the shape from whichever operand is non-nil. If the two operands have different item-shapes (different head_dim / recurrent-state dims across layer types), the resulting `MLX.concatenated([lhs, rhs])` mismatches.

> Note: `concatenateOptional` calls `MLX.concatenated([...])` with **no axis** (defaults to axis 0), while the crash reports **axis -1** — so the exact failing call may be a *different* concat in the attention/cache-update path rather than this one. The mechanism (incompatible per-layer geometries merged across requests at different prefill/decode stages) is the same regardless of which concat trips first. The submodule code should be the focus of the fix.

There appears to be **no guard** that two requests are shape-compatible (same per-layer cache geometry, same stage) before they're merged into a shared batched forward pass.
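The hypothesised behaviour can be modelled at the shape level. The sketch below is an assumption-laden model of the description above, not the submodule's actual code: `operandShapes` and its parameters are hypothetical names, and shapes are plain `[Int]` arrays of the form `[batch] + itemShape`.

```swift
/// Shape-only model (hypothetical) of an axis-0 merge in which a missing
/// operand is replaced by zeros. Returns the shapes of the two operands that
/// would be handed to the concatenation, or nil when neither side has a cache.
func operandShapes(lhs: [Int]?, lhsBatch: Int,
                   rhs: [Int]?, rhsBatch: Int) -> (lhs: [Int], rhs: [Int])? {
    // The item-shape is taken from whichever operand is non-nil.
    guard let real = lhs ?? rhs else { return nil }
    let itemShape = Array(real.dropFirst())
    // A missing side becomes a zero operand with its own batch size.
    return (lhs ?? ([lhsBatch] + itemShape), rhs ?? ([rhsBatch] + itemShape))
}

// One side missing: the zero operand copies the real item-shape, so they agree.
if let s = operandShapes(lhs: nil, lhsBatch: 2, rhs: [1, 8, 24, 64], rhsBatch: 1) {
    print(s.lhs, s.rhs) // [2, 8, 24, 64] [1, 8, 24, 64]
}

// Both sides present with different item-shapes: nothing reconciles them,
// so the mismatch reaches the concatenation unchanged.
if let s = operandShapes(lhs: [2, 8, 24, 64], lhsBatch: 2,
                         rhs: [1, 8, 24, 192], rhsBatch: 1) {
    print(s.lhs, s.rhs) // [2, 8, 24, 64] [1, 8, 24, 192]
}
```

In this model the padding path is safe by construction; the unguarded case is the one where both operands already exist and disagree.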

## Reproduction

- Test: `TestProfile_SingleProviderNonStreaming` (`e2e/profile_test.go`), which fires **3 concurrent non-streaming requests** at a single Qwen3.5 provider.
- Observed: provider emits the fatal error mid-batch, disconnects; `errors (10/10) … status 502 … provider disconnected`; the test's `require.Greater(SuccessCount, 0)` trips with `"0" is not greater than "0"`.
- Intermittent: the failure appears to depend on timing, specifically on whether request N's prefill completes and merges while requests 1..N-1 are mid-decode.

## Evidence it's flaky, not a branch regression

- **`master` ran the identical test and passed** (run `27508483291`, `TestProfile_SingleProviderNonStreaming` PASS in 22.1s).
- Recurs on multiple unrelated PR branches (`fix/coordinator-plaintext-body-cap`, `feat/oom-protection-vlm-media-cap`, `worktree-provider-trust-reliability`, `devin/*`) whose diffs don't touch the inference/KV path.
- Confirmed crash line on run `27510260770` / job `81308924460`.

## Impact

- **Provider availability**: a single crash takes down the provider process and 502s every co-tenant in-flight request — exactly the multi-tenant blast radius the coordinator's pre-content failover and cancellation logic try to contain. A retry lands on the same (now-dead) provider until the registry evicts it.
- **CI flakiness**: red E2E runs on PRs that didn't touch inference, eroding signal.

## Suggested fix directions

1. **Guard the batch merge**: before merging a generation into a running `GenerationBatch`, assert per-layer cache item-shapes are identical; if not, keep it in a separate batch instead of concatenating. (Same model → same architecture → shapes *should* match, so a mismatch indicates a real merge-ordering/stage bug to fix, not just to skip.)
2. **Fix `concatenateOptional` item-shape derivation** so it never builds a zero operand whose item-shape disagrees with the real operand (don't take `itemShape` from "whichever is non-nil"); and add a precondition that both operands' non-concat axes match.
3. **Fail soft, not fatal**: a per-request inference error should reject *that* request (or split the batch), not `fatalError` the whole provider process. A guard that throws a Swift error here would convert a fleet-impacting crash into a single 4xx/5xx.
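Directions 1 and 3 could be combined roughly as follows. This is a sketch under stated assumptions, not the submodule's API: `MergeIncompatible` and `checkMergeable` are hypothetical names, per-layer caches are represented only by their shapes (`[batch] + itemShape`, with `nil` for a layer that has no cache yet), and the example shapes are illustrative.

```swift
/// Hypothetical error describing the first layer whose cache geometry differs.
struct MergeIncompatible: Error, Equatable {
    let layer: Int
    let running: [Int]
    let incoming: [Int]
}

/// Throws instead of crashing when any layer's item-shape (everything after
/// the batch axis) differs between the running batch and the incoming one.
func checkMergeable(running: [[Int]?], incoming: [[Int]?]) throws {
    precondition(running.count == incoming.count, "same model implies same layer count")
    for (layer, pair) in zip(running, incoming).enumerated() {
        // A layer with no cache on either side has nothing to compare.
        guard let r = pair.0, let n = pair.1 else { continue }
        if Array(r.dropFirst()) != Array(n.dropFirst()) {
            throw MergeIncompatible(layer: layer, running: r, incoming: n)
        }
    }
}

// Illustrative shapes: layer 1 disagrees on its last axis.
let running: [[Int]?] = [[2, 8, 24, 64], [2, 8, 24, 192]]
let incoming: [[Int]?] = [[1, 8, 24, 64], [1, 8, 24, 64]]

do {
    try checkMergeable(running: running, incoming: incoming)
    print("merge into running batch")
} catch let e as MergeIncompatible {
    // Fail soft: keep the request in its own batch, or reject only this request.
    print("keep separate: layer \(e.layer)") // keep separate: layer 1
} catch {
    print("unexpected error: \(error)")
}
```

Throwing a typed error lets the scheduler decide between splitting the batch and failing a single request, and the recorded layer index and shapes would also help confirm which merge path is actually at fault.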

## Notes

- The crashing concat is in the **`libs/mlx-swift-lm` submodule** — a fix likely lands there and is then pulled into `d-inference` via a submodule bump.
- Filing this so the batching crash gets a real fix rather than being masked by CI retries; not a blocker for coordinator-only PRs whose other E2E tests pass.
