Commit bb29e36

docs: document --stream-experts + --draft-model auto-cap strategy (Issue #72)
Three targeted README updates:

1. SSD Expert Streaming 'Important finding' callout (line 245):
   - Changed from blanket 'counterproductive / excluded' statement to explain the fan-out problem (5x I/O at default 4 draft tokens) and document the auto-cap-to-1 mitigation (2x I/O, net positive at >=50% acceptance)
2. Usage code block (line 274):
   - Added a '--stream-experts + --draft-model' example showing that num-draft-tokens is auto-capped to 1 at startup
3. CLI options table (line 407):
   - Updated --draft-model and --num-draft-tokens rows to mention the auto-cap behavior when combined with --stream-experts
1 parent 3f6bad5 commit bb29e36
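
Reviewer sketch: the rule this commit documents can be restated as two small shell helpers. This is only an illustration of the documented behavior, not SwiftLM's implementation; the function names `effective_draft_tokens` and `verify_positions` are hypothetical. It assumes the cap applies whenever both flags are active and more than one draft token was requested, and it treats N+1 as the worst-case I/O multiplier (no expert shared between positions).

```shell
#!/bin/sh
# Hypothetical helpers; not part of SwiftLM. They only restate the documented rule.

# effective_draft_tokens STREAM_EXPERTS HAS_DRAFT_MODEL REQUESTED
# Prints the draft-token count after the documented cap: 1 when both
# --stream-experts and --draft-model are active, otherwise the requested value.
effective_draft_tokens() {
  if [ "$1" = "true" ] && [ "$2" = "true" ] && [ "$3" -gt 1 ]; then
    echo 1
  else
    echo "$3"
  fi
}

# verify_positions N
# Prints N+1: the number of token positions in one verify pass, which is the
# worst-case SSD I/O multiplier when no two positions share an expert.
verify_positions() {
  echo $(( $1 + 1 ))
}

n=$(effective_draft_tokens true true 4)
echo "draft tokens: $n, verify positions: $(verify_positions "$n")"
# prints "draft tokens: 1, verify positions: 2"
```

With the default of 4 draft tokens and no cap, `verify_positions 4` gives 5, which matches the 5x fan-out named in the commit message.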

1 file changed

Lines changed: 17 additions & 4 deletions

README.md

@@ -242,7 +242,11 @@ SwiftLM implements a **rewritten SSD expert streaming pipeline** (engineered by
 
 A novel aspect of this architecture is the **dual-model speculative decoding** pattern: a small draft model (e.g. Qwen3.5-9B at 73 tok/s) runs **entirely in RAM** while the large MoE model (e.g. 122B) streams experts from SSD. The draft model generates candidate tokens at high speed, and the main model verifies them in bulk — dramatically reducing the number of SSD-bound generation rounds needed.
 
-> **Important finding:** Speculative decoding is **counterproductive for SSD-streaming MoE** specifically. The verify pass sends N+1 tokens, each routing to *different* experts — SSD I/O scales with the *union* of all positions' expert selections. Speculative decoding is therefore routed exclusively to **in-RAM models**.
+> **Performance note:** Combining `--stream-experts` with `--draft-model` requires care. The verify pass sends N+1 tokens simultaneously, each routing to *different* experts — SSD I/O scales with the *union* of all positions' expert selections. At the default `--num-draft-tokens 4` this creates a **5× I/O fan-out** that regresses throughput below solo SSD streaming.
+>
+> **Auto-cap strategy (Issue #72 fix):** SwiftLM automatically caps `--num-draft-tokens` to **1** when both flags are active. With 1 draft token the verify pass covers only 2 positions (2× fan-out). If the draft model's acceptance rate is ≥ 50% — typical for same-family models — the net throughput is still positive despite the 2× I/O overhead. A startup advisory is printed when the cap fires.
+>
+> For maximum throughput: use `--stream-experts` alone (no draft model).
 
 ### Optimization Techniques
 
@@ -271,11 +275,20 @@ SWIFTLM_TOP_K=6 SwiftLM --port 8002 \
 SWIFTLM_TOP_K=4 SwiftLM --port 8002 \
 --model <path>/Qwen3.5-122B-A10B-4bit --stream-experts
 
-# With speculative decoding (in-RAM models only):
+# With speculative decoding (in-RAM models only — both models fit in RAM):
 SwiftLM --port 8002 \
 --model <path>/Qwen3.5-27B-4bit \
 --draft-model <path>/Qwen3.5-9B-4bit \
 --num-draft-tokens 4
+
+# With SSD streaming + draft model (auto-cap mode):
+# SwiftLM automatically caps --num-draft-tokens to 1 to minimise the
+# verify-pass I/O fan-out. Net positive if draft acceptance rate ≥ 50%.
+SwiftLM --port 8002 \
+--model <path>/Qwen3.5-122B-A10B-4bit \
+--stream-experts \
+--draft-model <path>/Qwen3.5-9B-4bit
+# ↑ num-draft-tokens is auto-capped to 1 at startup
 ```
 
 ---
@@ -404,8 +417,8 @@ curl http://localhost:5413/v1/chat/completions \
 | `--gpu-layers` | `model_default`| Restrict the amount of layers allocated to GPU hardware |
 | `--stream-experts` | `false` | Enable SSD expert streaming for MoE models (10x speedup) |
 | `--turbo-kv` | `false` | Enable TurboQuant 3-bit KV cache compression (activates after 2048 tokens, server-wide) |
-| `--draft-model` | (none) | Draft model path/ID for speculative decoding (in-RAM models only) |
-| `--num-draft-tokens` | `4` | Number of draft tokens per speculation round |
+| `--draft-model` | (none) | Draft model path/ID for speculative decoding. When used with `--stream-experts`, `--num-draft-tokens` is auto-capped to 1 to minimise SSD I/O fan-out (see performance note above). |
+| `--num-draft-tokens` | `4` | Tokens per speculation round. Auto-capped to 1 when combined with `--stream-experts`. |
 
 ## 🔧 Per-Request API Parameters
 