perf(server): reduce MoE expert-compute IPC overhead by weicj · Pull Request #388 · Luce-Org/lucebox-hub

weicj · 2026-06-15T05:45:02Z

Summary

This PR reduces cross-backend MoE expert-compute IPC overhead by batching prefill remote-expert work instead of inheriting the local hot-stack safety slice granularity. In the 4248 prompt / 128 completion checks below, the batched path cuts IPC calls by about 91-92% and payload by about 48-97%.

Because the best request shape depends on remote backend compute and memory headroom, auto defaults to the batched path while explicit stream keeps the conservative small-request path available.

Changes

- Batch remote MoE expert compute over the prefill chunk instead of inheriting the local hot-stack safety slice size.
- Send backend-local expert ids for new MoE expert-compute IPC commands while keeping compatibility with the older command shape.
- Add typed input payload support for MoE expert-compute IPC prefill (`f32`, `f16`, `bf16`).
- Add `DFLASH_MOE_EXPERT_COMPUTE_IPC_MODE=auto|batched|stream`; `auto` uses the batched path and explicit `stream` keeps the conservative path available.
- Keep `DFLASH_MOE_EXPERT_COMPUTE_IPC_TRANSPORT=stream|shared|auto` as the payload transport selector, separate from execution granularity.
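
A minimal sketch of how the two selectors could be resolved independently, assuming they are read as plain environment variables. Only the variable names and their accepted values come from this PR; the function names, the treatment of an unset mode as `auto`, and the error handling are illustrative assumptions, not the server's actual code.

```python
import os

MODE_VAR = "DFLASH_MOE_EXPERT_COMPUTE_IPC_MODE"
TRANSPORT_VAR = "DFLASH_MOE_EXPERT_COMPUTE_IPC_TRANSPORT"

MODES = ("auto", "batched", "stream")
TRANSPORTS = ("stream", "shared", "auto")


def resolve_granularity(env=None):
    """Map the mode selector to an execution granularity.

    Assumption: an unset variable behaves like "auto".
    Per the PR, "auto" uses the batched path.
    """
    env = os.environ if env is None else env
    mode = env.get(MODE_VAR, "auto").strip().lower()
    if mode not in MODES:
        raise ValueError(f"{MODE_VAR} must be one of {MODES}, got {mode!r}")
    return "stream" if mode == "stream" else "batched"


def resolve_transport(env=None):
    """Return the requested payload transport, or None when unset.

    The PR does not state the default, so none is assumed here.
    """
    env = os.environ if env is None else env
    raw = env.get(TRANSPORT_VAR)
    if raw is None:
        return None
    transport = raw.strip().lower()
    if transport not in TRANSPORTS:
        raise ValueError(
            f"{TRANSPORT_VAR} must be one of {TRANSPORTS}, got {transport!r}"
        )
    return transport
```

Keeping the two functions separate mirrors the split described above: `stream` as a mode value controls request granularity, while `stream` as a transport value controls how the payload is carried.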

Notes

Mode behavior:

- `auto` / `batched`: batches remote expert compute over the prefill chunk, capped by `DFLASH_MOE_EXPERT_COMPUTE_IPC_BATCH_CAPACITY`.
- `stream`: follows the small hot-stack safety slices, typically up to 4 prefill tokens per remote expert-compute call on this path.
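
The difference in request granularity can be sketched as a token-slicing plan. This is an illustration only: `plan_requests` and its parameters are hypothetical, the batch capacity is assumed to be measured in tokens, and the real call counts also depend on factors this sketch ignores, such as how many layers and experts are routed to the remote backend.

```python
def plan_requests(n_tokens, mode, prefill_chunk, batch_capacity, stream_slice=4):
    """Return (start, length) token spans, one per remote expert-compute call.

    Hypothetical helper. In "stream" mode each call covers at most
    `stream_slice` tokens; in "auto"/"batched" mode each call covers the
    prefill chunk, capped by `batch_capacity` (assumed to be in tokens).
    """
    if mode not in ("auto", "batched", "stream"):
        raise ValueError(f"unknown mode: {mode!r}")
    if prefill_chunk <= 0 or batch_capacity <= 0 or stream_slice <= 0:
        raise ValueError("chunk, capacity and slice sizes must be positive")

    if mode == "stream":
        step = stream_slice
    else:
        step = min(prefill_chunk, batch_capacity)

    spans = []
    # Requests never cross a prefill-chunk boundary.
    for chunk_start in range(0, n_tokens, prefill_chunk):
        chunk_end = min(chunk_start + prefill_chunk, n_tokens)
        for start in range(chunk_start, chunk_end, step):
            spans.append((start, min(step, chunk_end - start)))
    return spans


# Example with made-up sizes: 10 prefill tokens, chunks of 8.
stream_plan = plan_requests(10, "stream", prefill_chunk=8, batch_capacity=16)
batched_plan = plan_requests(10, "batched", prefill_chunk=8, batch_capacity=16)
```

With these made-up sizes, `stream_plan` has three spans and `batched_plan` has two; the gap widens as the prefill chunk grows relative to the 4-token slice.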

Empirical cases, 4248 prompt / 128 completion prefill check:

- Dual Pro VII: HIP -> IPC -> HIP.
- Pro VII + P4: HIP -> IPC -> CUDA.

| Path | Metric | `stream` | `auto` / `batched` | Change |
| --- | --- | --- | --- | --- |
| HIP -> IPC -> HIP | IPC calls | `29042` | `2657` | `-90.9%` |
| HIP -> IPC -> HIP | IPC payload | `612.430 MiB` | `317.346 MiB` | `-48.2%` |
| HIP -> IPC -> CUDA | IPC calls | `29020` | `2280` | `-92.1%` |
| HIP -> IPC -> CUDA | IPC payload | `612.370 MiB` | `17.943 MiB` | `-97.1%` |

On dual Pro VII, batched mode cut prefill IPC traffic and reduced prefill time from 40.93s to 27.74s (-32.2%), while decode throughput stayed effectively flat (18.6 -> 18.5 tok/s). This is still a policy tradeoff rather than a universal win: if the remote backend is too weak, batched mode can expose the remote-compute ceiling and increase prefill time, as the Pro VII + P4 check shows (41.64s -> 81.20s, +95.0%). This is why the PR keeps explicit stream mode available instead of removing the conservative path.
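
The percentages quoted in the table and in the paragraph above are plain relative changes from the `stream` value to the `auto` / `batched` value. The short check below recomputes them from the reported numbers; the helper name `pct_change` is only for illustration.

```python
def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (stream, batched) pairs copied from the measurements reported above.
measurements = {
    "HIP->HIP IPC calls": (29042, 2657),
    "HIP->HIP IPC payload (MiB)": (612.430, 317.346),
    "HIP->CUDA IPC calls": (29020, 2280),
    "HIP->CUDA IPC payload (MiB)": (612.370, 17.943),
    "Dual Pro VII prefill (s)": (40.93, 27.74),
    "Pro VII + P4 prefill (s)": (41.64, 81.20),
}

changes = {
    name: round(pct_change(before, after), 1)
    for name, (before, after) in measurements.items()
}

for name, change in changes.items():
    print(f"{name}: {change:+.1f}%")
```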

weicj added 3 commits June 12, 2026 20:29

feat(server): add cross-backend MoE expert compute foundation

25c4260

fix(server): address MoE expert compute review feedback

e4e0d8d

perf(server): reduce MoE expert-compute IPC overhead

ce4ec06

weicj wants to merge 3 commits into Luce-Org:main from weicj:perf-moe-expert-compute-ipc
