
perf(server): reduce MoE expert-compute IPC overhead #388

Draft
weicj wants to merge 3 commits into Luce-Org:main from weicj:perf-moe-expert-compute-ipc

@weicj weicj commented Jun 15, 2026

Collaborator

Summary

This PR reduces cross-backend MoE expert-compute IPC overhead by batching prefill remote-expert work instead of inheriting the local hot-stack safety slice granularity. In the 4248 prompt / 128 completion checks below, the batched path cuts IPC calls by about 91-92% and payload by about 48-97%.

Because the best request shape depends on remote backend compute and memory headroom, auto defaults to the batched path while explicit stream keeps the conservative small-request path available.

Changes

  • Batch remote MoE expert compute over the prefill chunk instead of inheriting the local hot-stack safety slice size.
  • Send backend-local expert ids for new MoE expert-compute IPC commands while keeping compatibility with the older command shape.
  • Add typed input payload support for MoE expert-compute IPC prefill (f32, f16, bf16).
  • Add DFLASH_MOE_EXPERT_COMPUTE_IPC_MODE=auto|batched|stream; auto uses the batched path and explicit stream keeps the conservative path available.
  • Keep DFLASH_MOE_EXPERT_COMPUTE_IPC_TRANSPORT=stream|shared|auto as the payload transport selector, separate from execution granularity.
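
The two selectors above are independent: one picks execution granularity, the other picks how the payload travels. A minimal sketch of resolving them could look like the following. The function names, the treatment of an unset variable as `auto`, and raising on unknown values are assumptions made for illustration and are not taken from the PR's code; only the variable names and accepted values come from the list above.

```python
import os

VALID_MODES = ("auto", "batched", "stream")
VALID_TRANSPORTS = ("stream", "shared", "auto")


def resolve_ipc_mode(env=None):
    """Return the effective execution granularity: 'batched' or 'stream'."""
    env = os.environ if env is None else env
    # Assumption: an unset variable behaves like 'auto'.
    raw = env.get("DFLASH_MOE_EXPERT_COMPUTE_IPC_MODE", "auto").strip().lower()
    if raw not in VALID_MODES:
        # Assumption: unknown values are rejected rather than ignored.
        raise ValueError(f"unsupported IPC mode: {raw!r}")
    # Per the PR description, 'auto' uses the batched path.
    return "stream" if raw == "stream" else "batched"


def resolve_ipc_transport(env=None):
    """Return the requested payload transport, independent of the mode."""
    env = os.environ if env is None else env
    # Assumption: an unset variable behaves like 'auto'.
    raw = env.get("DFLASH_MOE_EXPERT_COMPUTE_IPC_TRANSPORT", "auto").strip().lower()
    if raw not in VALID_TRANSPORTS:
        raise ValueError(f"unsupported IPC transport: {raw!r}")
    return raw
```

Keeping the two resolvers separate mirrors the split described in the list: changing the transport never changes how many remote calls are issued, and vice versa.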

Notes

Mode behavior:

  • auto / batched: batches remote expert compute over the prefill chunk, capped by DFLASH_MOE_EXPERT_COMPUTE_IPC_BATCH_CAPACITY.
  • stream: follows the small hot-stack safety slices, typically up to 4 prefill tokens per remote expert-compute call on this path.
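
To make the granularity difference concrete, here is a toy sketch of how one prefill chunk might be split into remote expert-compute calls under each mode. The function name and the capacity value in the usage lines are hypothetical; the slice size of 4 is the "typically up to 4 prefill tokens" figure quoted above. The sketch ignores layers, expert routing, and decode-phase calls, so it is not expected to reproduce the measured call totals.

```python
def slice_prefill_chunk(chunk_tokens, mode, batch_capacity, stream_slice=4):
    """Yield (start, end) token ranges, one per remote expert-compute call."""
    if mode not in ("auto", "batched", "stream"):
        raise ValueError(f"unsupported mode: {mode!r}")
    if chunk_tokens < 0 or batch_capacity < 1 or stream_slice < 1:
        raise ValueError("sizes must be positive")
    # 'stream' follows the small hot-stack slices; 'auto' and 'batched'
    # cover the chunk in steps capped by the batch capacity.
    step = stream_slice if mode == "stream" else batch_capacity
    for start in range(0, chunk_tokens, step):
        yield start, min(start + step, chunk_tokens)


# Hypothetical numbers: a 512-token chunk with a batch capacity of 256.
stream_calls = len(list(slice_prefill_chunk(512, "stream", 256)))
batched_calls = len(list(slice_prefill_chunk(512, "auto", 256)))
print(stream_calls, batched_calls)  # 128 2
```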

Empirical cases, 4248 prompt / 128 completion prefill check:

  • Dual Pro VII: HIP -> IPC -> HIP.
  • Pro VII + P4: HIP -> IPC -> CUDA.

| Path | Metric | stream | auto / batched | Change |
| --- | --- | --- | --- | --- |
| HIP -> IPC -> HIP | IPC calls | 29042 | 2657 | -90.9% |
| HIP -> IPC -> HIP | IPC payload | 612.430 MiB | 317.346 MiB | -48.2% |
| HIP -> IPC -> CUDA | IPC calls | 29020 | 2280 | -92.1% |
| HIP -> IPC -> CUDA | IPC payload | 612.370 MiB | 17.943 MiB | -97.1% |

On dual Pro VII, batched mode cut prefill IPC traffic substantially and reduced prefill time from 40.93s to 27.74s (-32.2%), while decode throughput stayed effectively flat (18.6 -> 18.5 tok/s). Batching is still a policy tradeoff rather than a universal win: if the remote backend is too weak, the larger requests can expose the remote-compute ceiling and increase prefill time, as the Pro VII + P4 check shows (41.64s to 81.20s, +95.0%). This is why the PR keeps explicit stream mode available instead of removing the conservative path.
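
The percentages quoted in these results are plain relative changes, (after - before) / before. The short check below recomputes them from the stream and batched figures reported above; the helper name is made up for illustration.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (label, stream value, auto/batched value) as reported in this PR.
reported = [
    ("HIP->HIP IPC calls", 29042, 2657),
    ("HIP->HIP IPC payload (MiB)", 612.430, 317.346),
    ("HIP->CUDA IPC calls", 29020, 2280),
    ("HIP->CUDA IPC payload (MiB)", 612.370, 17.943),
    ("Dual Pro VII prefill (s)", 40.93, 27.74),
    ("Pro VII + P4 prefill (s)", 41.64, 81.20),
]

for label, before, after in reported:
    print(f"{label}: {percent_change(before, after):+.1f}%")
```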
