Provider MLX continuous-batching crash: [concatenate] shape mismatch merging concurrent requests (Qwen3.5 hybrid model) #344

Description

@anupsv

Summary

Under concurrent requests to a single model, the Swift provider's MLX continuous-batching engine crashes with a fatal [concatenate] shape mismatch, killing the provider process. All in-flight requests then fail with 502 provider_error: "inference failed after N attempt(s): provider disconnected". The crash is intermittent: master passes the same test, but the failure recurs across PR branches.

The crash

MLX/ErrorHandler.swift:345: Fatal error: [concatenate] All the input array dimensions
must match exactly except for the concatenation axis. However, the provided shapes are
(2,8,24,64), (1,8,24,192), and the concatenation axis is -1.
  at libs/mlx-swift/Source/Cmlx/mlx-c/mlx/c/ops.cpp:677

The two operands are [batch, kv_heads, seq, head_dim]:

  • (2, 8, 24, 64) — batch 2, head_dim 64
  • (1, 8, 24, 192) — batch 1, head_dim 192

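Concatenating on axis -1 requires every other axis to match, but the batch dims differ (2 vs 1). The last dims also differ (64 vs 192); on its own that is permitted, because axis -1 is the concatenation axis, but it shows the two operands come from differently shaped tensors. Two KV-cache tensors of incompatible geometry are being merged in the same concatenate call.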
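As a quick check of the rule, the sketch below lists which non-concatenation axes disagree for the two shapes in the crash log. It works on plain shape arrays rather than MLX arrays, and mismatchedAxes is a hypothetical helper written for this issue, not an mlx-swift API.

```swift
// Returns the axes, other than the concatenation axis, on which two shapes disagree.
// A concatenate is only valid when the returned list is empty.
func mismatchedAxes(_ lhs: [Int], _ rhs: [Int], concatAxis: Int) -> [Int] {
    precondition(lhs.count == rhs.count, "shapes must have the same rank")
    let rank = lhs.count
    // Normalize a negative axis such as -1 to its non-negative index.
    let axis = concatAxis < 0 ? concatAxis + rank : concatAxis
    precondition((0..<rank).contains(axis), "axis out of range")
    return (0..<rank).filter { $0 != axis && lhs[$0] != rhs[$0] }
}

// Shapes copied from the fatal error message.
let crashLhs = [2, 8, 24, 64]
let crashRhs = [1, 8, 24, 192]

print(mismatchedAxes(crashLhs, crashRhs, concatAxis: -1)) // [0]
print(mismatchedAxes(crashLhs, crashRhs, concatAxis: 0))  // [3]
```

On axis -1 only index 0 (the batch axis) is reported, since the last axis is the one being concatenated. On axis 0 the same pair would instead fail on index 3.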

Why this is a real model-shape collision (not just a transient)

mlx-community/Qwen3.5-0.8B-MLX-4bit is a hybrid architecture. The provider's own KV estimator classifies its layers into distinct geometries (BatchScheduler+KVEstimation.swift):

  • linear_attention → GatedDeltaNet recurrent-state layers (recurrentLayerTypes, ~line 39-41)
  • full-attention layers with head_dim vs global_head_dim and a sliding_window_pattern (lines 86-99)

So head_dim 64 vs 192 corresponds to genuinely different per-layer cache/state shapes. When two concurrent requests at different stages (one mid-prefill, one mid-decode — hence batch 1 vs 2) are merged into one GenerationBatch, the per-layer cache merge concatenates tensors whose item-shapes don't line up.

Suspected location (hypothesis — needs confirmation)

The batch-merge path lives in the libs/mlx-swift-lm submodule (Layr-Labs fork), not in d-inference directly:

  • Libraries/MLXLMCommon/ContinuousBatching/Scheduler.swift: mergeIntoGenBatch(_:) merges a newly-prefilled generation into the running batch.
  • Libraries/MLXLMCommon/GenerationBatch.swift: extend(_ other:) performs the per-layer merge.
  • Libraries/MLXLMCommon/KVCache.swift: ArraysCache.extend(other:) → concatenateOptional(...) builds zero-padded operands from itemShape = shape.dropFirst(), taking the shape from whichever operand is non-nil. If the two operands have different item-shapes (different head_dim / recurrent-state dims across layer types), the resulting MLX.concatenated([lhs, rhs]) call fails with a shape mismatch.

Note: concatenateOptional calls MLX.concatenated([...]) with no axis (defaults to axis 0), while the crash reports axis -1 — so the exact failing call may be a different concat in the attention/cache-update path rather than this one. The mechanism (incompatible per-layer geometries merged across requests at different prefill/decode stages) is the same regardless of which concat trips first. The submodule code should be the focus of the fix.

There appears to be no guard that two requests are shape-compatible (same per-layer cache geometry, same stage) before they're merged into a shared batched forward pass.
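A minimal sketch of what such a guard could look like, assuming each layer's cache can be summarized by its tensor shape. Shape, itemShape, and incompatibleLayers are hypothetical names for illustration and do not exist in mlx-swift-lm; the example shapes are made up. The sketch treats the whole item-shape as significant, whereas a real KV cache may need to exempt the sequence axis if requests of different lengths are padded before merging.

```swift
// Hypothetical stand-in for one layer's cache tensor shape,
// e.g. [batch, kv_heads, seq, head_dim] or [batch, ...recurrent state dims].
typealias Shape = [Int]

// The item-shape is everything after the leading batch axis.
func itemShape(_ shape: Shape) -> Shape {
    Array(shape.dropFirst())
}

// Returns the indices of layers whose item-shapes differ between the
// running batch and the incoming generation. Empty means safe to merge.
func incompatibleLayers(running: [Shape], incoming: [Shape]) -> [Int] {
    guard running.count == incoming.count else {
        // Different layer counts: flag every index of the longer list.
        return Array(0..<max(running.count, incoming.count))
    }
    return running.indices.filter { itemShape(running[$0]) != itemShape(incoming[$0]) }
}

// Illustrative shapes only: layer 0 agrees, layer 1 does not.
let runningLayers: [Shape] = [[2, 8, 24, 64], [2, 16, 128]]
let incomingLayers: [Shape] = [[1, 8, 24, 64], [1, 16, 192]]

let bad = incompatibleLayers(running: runningLayers, incoming: incomingLayers)
print(bad.isEmpty ? "merge" : "keep separate, mismatched layers: \(bad)")
```

A scheduler could run a check like this before merging and route the incoming generation to its own batch when the list is non-empty, while logging the mismatch so the underlying ordering bug remains visible.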

Reproduction

  • Test: TestProfile_SingleProviderNonStreaming (e2e/profile_test.go), which fires 3 concurrent non-streaming requests at a single Qwen3.5 provider.
  • Observed: provider emits the fatal error mid-batch, disconnects; errors (10/10) … status 502 … provider disconnected; the test's require.Greater(SuccessCount, 0) trips with "0" is not greater than "0".
  • Intermittent: timing-dependent on whether request N's prefill completes and merges while requests 1..N-1 are mid-decode.

Evidence it's flaky, not a branch regression

  • master ran the identical test and passed (run 27508483291, TestProfile_SingleProviderNonStreaming PASS in 22.1s).
  • Recurs on multiple unrelated PR branches (fix/coordinator-plaintext-body-cap, feat/oom-protection-vlm-media-cap, worktree-provider-trust-reliability, devin/*) whose diffs don't touch the inference/KV path.
  • Confirmed crash line on run 27510260770 / job 81308924460.

Impact

  • Provider availability: a single crash takes down the provider process and 502s every co-tenant in-flight request — exactly the multi-tenant blast radius the coordinator's pre-content failover and cancellation logic try to contain. A retry lands on the same (now-dead) provider until the registry evicts it.
  • CI flakiness: red E2E runs on PRs that didn't touch inference, eroding signal.

Suggested fix directions

  1. Guard the batch merge: before merging a generation into a running GenerationBatch, assert per-layer cache item-shapes are identical; if not, keep it in a separate batch instead of concatenating. (Same model → same architecture → shapes should match, so a mismatch indicates a real merge-ordering/stage bug to fix, not just to skip.)
  2. Fix concatenateOptional item-shape derivation so it never builds a zero operand whose item-shape disagrees with the real operand (don't take itemShape from "whichever is non-nil"); and add a precondition that both operands' non-concat axes match.
  3. Fail soft, not fatal: a per-request inference error should reject that request (or split the batch), not fatalError the whole provider process. A guard that throws a Swift error here would convert a fleet-impacting crash into a single 4xx/5xx.
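The fail-soft direction could be sketched as follows, assuming the merge concatenates along the batch axis (axis 0). CacheShapeMismatch, checkedBatchConcatShape, MergeOutcome, and tryMerge are hypothetical names for illustration; the sketch operates on plain shape arrays and does not call into MLX.

```swift
// Hypothetical error type carrying enough context to log the bad merge.
struct CacheShapeMismatch: Error, Equatable {
    let layer: Int
    let lhs: [Int]
    let rhs: [Int]
}

// Computes the shape of a batch-axis concatenation, throwing (instead of
// trapping) when any non-batch axis disagrees.
func checkedBatchConcatShape(_ lhs: [Int], _ rhs: [Int], layer: Int) throws -> [Int] {
    guard lhs.count == rhs.count, !lhs.isEmpty,
          lhs.dropFirst().elementsEqual(rhs.dropFirst()) else {
        throw CacheShapeMismatch(layer: layer, lhs: lhs, rhs: rhs)
    }
    return [lhs[0] + rhs[0]] + Array(lhs.dropFirst())
}

enum MergeOutcome: Equatable {
    case merged([Int])
    case rejected(layer: Int)
}

// The caller turns the thrown error into a per-request rejection,
// leaving the rest of the running batch untouched.
func tryMerge(_ lhs: [Int], _ rhs: [Int], layer: Int) -> MergeOutcome {
    do {
        return .merged(try checkedBatchConcatShape(lhs, rhs, layer: layer))
    } catch let mismatch as CacheShapeMismatch {
        return .rejected(layer: mismatch.layer)
    } catch {
        return .rejected(layer: layer)
    }
}

print(tryMerge([2, 16, 128], [1, 16, 128], layer: 0)) // merged([3, 16, 128])
print(tryMerge([2, 16, 128], [1, 16, 192], layer: 7)) // rejected(layer: 7)
```

The point of the design is that the check runs before the concatenate is issued, so the process never reaches the fatal error handler and only the offending request is rejected.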

Notes

  • The crashing concat is in the libs/mlx-swift-lm submodule — a fix likely lands there and is then pulled into d-inference via a submodule bump.
  • Filing this so the batching crash gets a real fix rather than being masked by CI retries; not a blocker for coordinator-only PRs whose other E2E tests pass.


Labels: bug
