L2 adapter: integrate vllm-project/router as dataplane for multi-replica and PD-disagg trials

# L2 adapter: integrate `vllm-project/router` as the dataplane for multi-replica and PD-disagg trials

> Full source-of-truth reference for this ticket: `docs/research/raw/08-vllm-router-dataplane.md` (628 LoC, all technical detail, cross-references). This ticket is the implementation tracker; read the note for the WHY.

**Thesis anchors.** C1 (engine surface has slack), C4 (PD-disagg Pareto-dominates), C6 (hybrid policy required), C9 (reference replica required). P1 (three layers), P3 (layers are adapters), P4 (cross-layer stale-signal), P8 (reference replica), P9 (typed failure), P10 (frozen/mutable boundary).

**Cross-refs.**
- `docs/research/references/00-hypothesis-seed.md` §4.2 — `Router policy` row now present in the L2 axis table.
- `docs/research/raw/references-L1-engine-config.md` §"Scope note — routing policy is not L1".
- `docs/research/raw/08-vllm-router-dataplane.md` — full source note.

---

## 1. WHAT — the problem in one paragraph

`src/autoinfer/target/basilica.py` and `src/autoinfer/harness/driver.py` are stubs today. Their implicit shape is "one worker URL per trial, driver hits it directly." That shape **silently produces wrong numbers** the moment L2 trials go to (a) more than one replica, or (b) prefill/decode disaggregation. The missing component is the **request dataplane**: a router that dispatches requests to workers under a named policy. vLLM's official reference implementation is [`vllm-project/router`](https://github.com/vllm-project/router) (Rust + PyO3); PrimeIntellect maintains a fork with production fixes we likely want ([`PrimeIntellect-ai/router`](https://github.com/PrimeIntellect-ai/router), +24 commits). This ticket tracks adding the router to autoinfer's L2 adapter so multi-replica and PD-disagg trials are measured through a realistic dispatch path.

## 2. WHY — what breaks without it

Concrete failure modes of the direct-to-worker status quo, ordered by severity:

1. **PD-disaggregation is literally unrunnable** without a dispatcher. The NIXL / NCCL connector protocol requires a component that wraps the request with bootstrap metadata (`bootstrap_host`, `bootstrap_port`, `bootstrap_room`), pairs a prefill leg with a decode leg, and merges their outputs. `src/routers/http/vllm_pd_router.rs` *is* that component. Any C4 evidence we produce without it is incomparable to DistServe / Mooncake headline numbers (7.4× and 59–498%, both assume router-mediated paths).
2. **Prefix-cache-hit rate becomes arrival-order dependent**, not policy dependent. Without `cache_aware` dispatching, two workers with identical engine configs will see different prompt-prefix distributions determined by the driver's concurrency pattern, not by any defensible routing choice. The resulting Pareto frontier is noise.
3. **Policy-independent claims about L1**. Any "this engine config beats vLLM defaults on tokens/s at 8 replicas" statement is under-specified without stating the router policy — switching from `round_robin` to `cache_aware` on the same engine config can swamp single-flag L1 wins (see LMCache PD bench 2025-04-29 for magnitude).
4. **Circuit-breaker state leaks across trials**. Workers that trip open in trial K start trial K+1 in `Open` state; policies filter them out via `get_healthy_worker_indices`. Without explicit router lifecycle, trial K+1 runs against a quietly reduced pool.
5. **Phantom load accumulation**. Upstream has a known bug in `cache_aware` load tracking (double-decrement on retry) that manifests only after hours of streaming traffic — workers silently get locked out with negative load. Fixed in PrimeIntellect fork PR #23; unmerged upstream at snapshot time (2026-04-23). Running benchmarks on upstream will produce drift we cannot explain.

## 3. HOW — implementation plan

### 3.1 Schema (`src/autoinfer/layers/l2_topology/surface.py`)

Add a `RouterConfig` sub-model (Pydantic; typed per P11):

```python
from typing import Literal

from pydantic import BaseModel, Field


class RouterConfig(BaseModel):
    policy: Literal["cache_aware", "power_of_two", "consistent_hash", "round_robin", "random"] = "round_robin"
    intra_node_data_parallel_size: int = Field(default=1, ge=1, le=8)
    pd_disagg: bool = False
    connector: Literal["nixl", "nccl"] | None = None          # non-None only if pd_disagg
    prefill_policy: Literal[...] | None = None                # non-None only if pd_disagg
    decode_policy: Literal[...] | None = None                 # non-None only if pd_disagg
    # Held constant across trials (not in the search space):
    cb_failure_threshold: int = 5
    cb_timeout_duration_secs: int = 30
    retry_max_retries: int = 3
```

Assertions (fail-fast, per CLAUDE.md "assert early, fail/error/return-fast"):

- `pd_disagg` ⇒ `connector is not None and prefill_policy is not None and decode_policy is not None`.
- `not pd_disagg` ⇒ `connector is None and prefill_policy is None and decode_policy is None`.
- `intra_node_data_parallel_size` must match the engine config's `--data-parallel-size`.
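
The three assertions above can be sketched as construction-time checks. This is a minimal illustration using a stdlib dataclass so it runs without dependencies; in the real schema the same checks would presumably live in a Pydantic validator on `RouterConfig`. The names `RouterConfigSketch` and `check_matches_engine` are hypothetical, and the allowed values for `prefill_policy` / `decode_policy` are deliberately not validated because the schema leaves them elided.

```python
from dataclasses import dataclass
from typing import Optional

POLICIES = ("cache_aware", "power_of_two", "consistent_hash", "round_robin", "random")
CONNECTORS = ("nixl", "nccl")


@dataclass(frozen=True)
class RouterConfigSketch:
    """Hypothetical stand-in for RouterConfig; only the searched fields."""
    policy: str = "round_robin"
    intra_node_data_parallel_size: int = 1
    pd_disagg: bool = False
    connector: Optional[str] = None
    prefill_policy: Optional[str] = None
    decode_policy: Optional[str] = None

    def __post_init__(self) -> None:
        if self.policy not in POLICIES:
            raise ValueError(f"unknown policy: {self.policy!r}")
        if not 1 <= self.intra_node_data_parallel_size <= 8:
            raise ValueError("intra_node_data_parallel_size must be in [1, 8]")
        pd_fields = (self.connector, self.prefill_policy, self.decode_policy)
        if self.pd_disagg:
            # pd_disagg implies all three PD fields are set.
            if any(f is None for f in pd_fields):
                raise ValueError("pd_disagg requires connector, prefill_policy, decode_policy")
            if self.connector not in CONNECTORS:
                raise ValueError(f"unknown connector: {self.connector!r}")
        elif any(f is not None for f in pd_fields):
            # not pd_disagg implies all three PD fields are unset.
            raise ValueError("connector/prefill_policy/decode_policy require pd_disagg")


def check_matches_engine(cfg: RouterConfigSketch, engine_data_parallel_size: int) -> None:
    """Fail fast when the router and engine disagree on data-parallel size."""
    if cfg.intra_node_data_parallel_size != engine_data_parallel_size:
        raise ValueError(
            f"router dp={cfg.intra_node_data_parallel_size} "
            f"!= engine --data-parallel-size={engine_data_parallel_size}"
        )
```

Raising `ValueError` rather than using bare `assert` keeps the checks active under `python -O`; either form satisfies the fail-fast intent.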

### 3.2 Router lifecycle (`src/autoinfer/target/basilica.py`)

Deterministic per-trial process:

1. Provision N workers on Basilica (existing pattern, already stubbed).
2. **Launch `vllm-router` in front** with CLI flags derived from `RouterConfig`. Pick a free port by binding a socket to port 0 and reading the assigned port back: `sock.bind(("127.0.0.1", 0)); port = sock.getsockname()[1]`. Example (regular mode):
   ```bash
   vllm-router \
     --worker-urls $URL1 $URL2 ... \
     --policy cache_aware \
     --intra-node-data-parallel-size 1 \
     --host 127.0.0.1 --port $ROUTER_PORT \
     --prometheus-host 127.0.0.1 --prometheus-port $PROM_PORT
   ```
   PD-disagg:
   ```bash
   vllm-router \
     --vllm-pd-disaggregation \
     --prefill $P1 $P2 --decode $D1 $D2 $D3 $D4 \
     --prefill-policy consistent_hash --decode-policy round_robin \
     --host 127.0.0.1 --port $ROUTER_PORT
   ```
3. Health-probe `GET $ROUTER_URL/liveness` (auth-exempt per fork PR #10).
4. Return `$ROUTER_URL` to the driver; never return per-worker URLs.
5. On `teardown()`: SIGTERM the router, wait, SIGKILL on timeout.
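
The launch and teardown steps above can be sketched roughly as follows. This is an illustration under assumptions, not the adapter's actual code: the helper names (`find_free_port`, `build_router_argv`, `build_pd_router_argv`, `stop_router`) are hypothetical, and only the CLI flags already shown in the examples above are used.

```python
import socket
import subprocess
from typing import Sequence


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


def build_router_argv(worker_urls: Sequence[str], policy: str, dp_size: int,
                      port: int, host: str = "127.0.0.1") -> list[str]:
    """Regular (non-PD) mode command line."""
    return [
        "vllm-router",
        "--worker-urls", *worker_urls,
        "--policy", policy,
        "--intra-node-data-parallel-size", str(dp_size),
        "--host", host, "--port", str(port),
    ]


def build_pd_router_argv(prefill_urls: Sequence[str], decode_urls: Sequence[str],
                         prefill_policy: str, decode_policy: str,
                         port: int, host: str = "127.0.0.1") -> list[str]:
    """PD-disaggregated mode command line."""
    return [
        "vllm-router",
        "--vllm-pd-disaggregation",
        "--prefill", *prefill_urls,
        "--decode", *decode_urls,
        "--prefill-policy", prefill_policy,
        "--decode-policy", decode_policy,
        "--host", host, "--port", str(port),
    ]


def stop_router(proc: subprocess.Popen, grace_secs: float = 10.0) -> int:
    """SIGTERM, wait up to grace_secs, then SIGKILL; returns the exit code."""
    proc.terminate()
    try:
        return proc.wait(timeout=grace_secs)
    except subprocess.TimeoutExpired:
        proc.kill()
        return proc.wait()
```

The port returned by `find_free_port` is released before the router binds it, so another process can take it in between; a launch that loses that race is a natural fit for `RouterStartupFailed("port_in_use")`.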

Binary provenance: pin to PrimeIntellect fork until phantom-load (#23) and circuit-breaker (#24) fixes land upstream. Turn off JWT (`--jwt-public-key-path` unset) and API-key auth (`--api-key-validation-urls` unset) — trials are internal.

### 3.3 Driver (`src/autoinfer/harness/driver.py`)

`vllm bench serve` already takes a single URL; the cosmetic change is to point it at `$ROUTER_URL`. The non-cosmetic changes:

- **Inject `X-Session-ID: autoinfer-{trial_id}-{request_index}` on every request** so `consistent_hash` is reproducible. The router's hash-key priority is `X-Session-ID > X-User-ID > Authorization > client IP > body-hash`; without an explicit session header, the effective key is client IP (all requests go to one worker under the default bench harness).
- **Make the prompt stream deterministic per trial**. Seed the request-generator from `trial_id` so policy A/B is comparable on byte-identical inputs.
- Pass router Prometheus scrape URL back to the ledger writer so per-policy decision counts are persisted.
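
A minimal sketch of the first two points, with hypothetical helper names. `effective_hash_key` is a Python paraphrase of the priority order stated above, not the router's code; the exact header matching and the body-hash function (SHA-256 here) are assumptions made for illustration.

```python
import hashlib
import random
from typing import Mapping, Optional, Sequence


def session_headers(trial_id: str, request_index: int) -> dict[str, str]:
    """Per-request session header so consistent_hash has a reproducible key."""
    return {"X-Session-ID": f"autoinfer-{trial_id}-{request_index}"}


def trial_seed(trial_id: str) -> int:
    """Stable seed from trial_id (built-in hash() is salted per process)."""
    digest = hashlib.sha256(trial_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def prompt_stream(trial_id: str, corpus: Sequence[str], n_requests: int) -> list[str]:
    """Same trial_id and corpus always yield the same prompt sequence."""
    rng = random.Random(trial_seed(trial_id))
    return [rng.choice(corpus) for _ in range(n_requests)]


def effective_hash_key(headers: Mapping[str, str], client_ip: Optional[str],
                       body: bytes) -> str:
    """X-Session-ID > X-User-ID > Authorization > client IP > body-hash."""
    for name in ("X-Session-ID", "X-User-ID", "Authorization"):
        value = headers.get(name)
        if value:
            return value
    if client_ip:
        return client_ip
    return hashlib.sha256(body).hexdigest()
```

With no headers set, every request from the bench harness resolves to the same client IP and therefore the same key, which is the single-worker collapse described above.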

### 3.4 Failure typing (`src/autoinfer/harness/failure.py`) — per P9

New typed failure records:

- `RouterStartupFailed(reason: Literal["binary_missing", "port_in_use", "pem_invalid", "flag_invalid"])`.
- `RouterPoolExhausted(retries: int, pool_size: int, open_circuits: int)` — aggregate saturation, not per-worker failure. **Critical**: if we leave this untyped, the surrogate attributes it to the engine config and wastes iterations tuning irrelevant knobs.
- `PDConnectorTimeout(leg: Literal["prefill", "decode"], timeout_secs: int, bootstrap_room: int)` — NIXL/NCCL handoff timeout.
- `RouterPolicyInvariant(counter: str)` — e.g. negative load counter observed (sentinel that the phantom-load bug re-emerged; should never fire on fork).
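
One possible shape for these records, shown as frozen dataclasses; the real module may use Pydantic or a shared `FailureRecord` base instead. The `ROUTER_FAILURE_TYPES` tuple and `is_router_originated` helper are hypothetical, and which records count as router-originated is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class RouterStartupFailed:
    reason: Literal["binary_missing", "port_in_use", "pem_invalid", "flag_invalid"]


@dataclass(frozen=True)
class RouterPoolExhausted:
    retries: int
    pool_size: int
    open_circuits: int


@dataclass(frozen=True)
class PDConnectorTimeout:
    leg: Literal["prefill", "decode"]
    timeout_secs: int
    bootstrap_room: int


@dataclass(frozen=True)
class RouterPolicyInvariant:
    counter: str


# Assumption: all four records describe the dataplane, not the engine config.
ROUTER_FAILURE_TYPES = (
    RouterStartupFailed,
    RouterPoolExhausted,
    PDConnectorTimeout,
    RouterPolicyInvariant,
)


def is_router_originated(failure: object) -> bool:
    """True when a failure should carry no signal about engine axes."""
    return isinstance(failure, ROUTER_FAILURE_TYPES)
```

A surrogate could call `is_router_originated` before updating its engine-axis model, so that a saturated pool does not look like a bad engine configuration.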

### 3.5 Ledger (`src/autoinfer/harness/ledger.py`) — per P4

Every trial row where `replicas > 1` or `pd_disagg` persists the full `RouterConfig` alongside the engine args. Without this, comparing two rows with the same engine config but different router policies silently mis-attributes throughput deltas.

Schema addition:
```python
class LedgerRow(BaseModel):
    ...
    router_config: RouterConfig | None = None   # None only when replicas == 1 and not pd_disagg
```

### 3.6 Cross-layer stale-signal (`src/autoinfer/controller/stale.py`) — per P4

When any axis of `RouterConfig` changes for a given engine config, call `Ledger.mark_stale()` on prior rows with that engine config. Reason: the *effective* workload each worker sees depends on the policy; the old rows' throughput and tail numbers are no longer comparable.
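
The rule can be sketched with an in-memory ledger. Only the name `mark_stale` comes from this ticket; its arguments, the `Row` shape, and the choice to also mark rows that have no router config are assumptions of the illustration.

```python
from dataclasses import dataclass, field
from typing import Hashable, Optional


@dataclass
class Row:
    engine_key: Hashable            # stands in for the full engine args
    router_key: Optional[Hashable]  # stands in for the full RouterConfig
    stale: bool = False


@dataclass
class LedgerSketch:
    rows: list[Row] = field(default_factory=list)

    def mark_stale(self, engine_key: Hashable, new_router_key: Optional[Hashable]) -> int:
        """Mark prior rows with the same engine config but a different router config."""
        marked = 0
        for row in self.rows:
            if (row.engine_key == engine_key
                    and row.router_key != new_router_key
                    and not row.stale):
                row.stale = True
                marked += 1
        return marked

    def append(self, engine_key: Hashable, router_key: Optional[Hashable]) -> Row:
        """Invalidate incomparable rows first, then record the new trial."""
        self.mark_stale(engine_key, router_key)
        row = Row(engine_key, router_key)
        self.rows.append(row)
        return row
```

Rows for other engine configs, and rows measured under the same router config, stay untouched.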

### 3.7 Fairness hazards the driver must defend against

Each one is a real trap confirmed from source:

1. **Warm vs cold `cache_aware` tree** — router restart gives cold tree; without restart the tree carries prior-trial placements. Recommend: **restart router per trial**, warm with `N = 100` priming requests outside the measured window, declare the regime in the ledger.
2. **Round-robin `AtomicUsize` carryover** — the counter is process-lifetime. `reset()` exists on the trait but no control endpoint exposes it. Restart is the safer lever.
3. **Circuit-breaker carryover** — any worker that tripped in trial K starts trial K+1 in `Open` for `cb_timeout_duration_secs`. Either wait, or restart.
4. **`power_of_two` load signal** — falls back to `Worker::load()` (local in-flight counter) if nothing pushes into `cached_loads`. Different from vLLM's actual queue depth. Declare which signal is in effect.
5. **PD connector vs fabric** — NIXL handoff latency is fabric-dominated; headline DistServe/Mooncake numbers assume NVLink. On Basilica TCP paths, PD-disagg *may be worse* than collocated.
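
To make hazard 4 concrete, here is a generic power-of-two-choices selection in Python. It follows the textbook technique (sample two workers, take the less loaded), not the router's Rust code, and the fallback condition is simplified. The function name and the returned signal label are hypothetical.

```python
import random
from typing import Mapping, Optional, Sequence


def pick_power_of_two(workers: Sequence[str],
                      local_in_flight: Mapping[str, int],
                      cached_loads: Optional[Mapping[str, int]],
                      rng: random.Random) -> tuple[str, str]:
    """Return (chosen worker, name of the load signal that decided it)."""
    first, second = rng.sample(list(workers), 2)
    if cached_loads is not None and first in cached_loads and second in cached_loads:
        loads, signal = cached_loads, "cached_loads"
    else:
        # Nothing pushed external loads: fall back to the local in-flight counter.
        loads, signal = local_in_flight, "local_in_flight"
    chosen = first if loads[first] <= loads[second] else second
    return chosen, signal
```

The two signals can disagree on the same pair of workers, which is why the ledger should record which one was in effect for a trial.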

## 4. Acceptance criteria

Trial-level:
- [ ] A trial with `replicas=2, router.policy="cache_aware"` and a trial with the same topology + `router.policy="round_robin"` on the same driver trace produce **measurably different** TTFT P50/P99 in the ledger. (If not, either the policy is inactive or the driver is not session-headered.)
- [ ] A PD-disagg trial (`prefill=2, decode=6, connector="nixl"`) runs end-to-end through the router and returns a quality-gated row against the FP16 reference replica (P8).
- [ ] Policy-change events emit a `mark_stale()` call on prior ledger rows with the same engine config (P4).

Negative / no-regression:
- [ ] Iteration-zero L1 single-replica trials (§8 of `00-hypothesis-seed.md`) are **unaffected** — no router is launched, no new dependency surfaces.
- [ ] `RouterPoolExhausted` is emitted (not silently mis-classified as a worker failure) when the pool is saturated.

Observability:
- [ ] Router Prometheus scrape URL is persisted in the ledger; `vllm_router_run_requests_total` and per-policy decision counts are queryable post-trial.
- [ ] Phantom-load sentinel (`Worker::load() < 0`) never fires under streaming workloads for >24h (regression guard for #23).
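
A rough sketch of how the two observability checks could be evaluated post-trial. `sum_metric` is a simplified reader for the Prometheus text exposition format (it does not handle every escaping case), and `negative_load_workers` assumes per-worker loads have already been collected into a mapping; both names are hypothetical.

```python
from typing import Mapping


def sum_metric(exposition_text: str, metric_name: str) -> float:
    """Sum every sample of metric_name across its label sets."""
    total = 0.0
    for raw in exposition_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name = line[: line.index("{")]
            tail = line[line.rindex("}") + 1:].split()
        else:
            name, *tail = line.split()
        if name == metric_name and tail:
            total += float(tail[0])  # tail[1], if present, is a timestamp
    return total


def negative_load_workers(loads: Mapping[str, int]) -> list[str]:
    """Workers whose load counter went negative (phantom-load sentinel)."""
    return sorted(url for url, load in loads.items() if load < 0)
```

A non-empty result from `negative_load_workers` is the condition that would be reported as `RouterPolicyInvariant`.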

## 5. Decisions to make (please comment)

1. **Binary provenance** — pin to PrimeIntellect fork (has #23, #24, LoRA fixes) or wait for upstream merges? Fork carries JWT code we don't use but can disable. Recommend fork.
2. **Per-trial router restart vs long-lived router** — restart is safer for A/B fairness but adds ~1s per trial (binary startup, pod health wait). Acceptable given wall-clock ~minutes per trial.
3. **Router-policy parameter tuning** — `cache_threshold`, `balance_abs_threshold`, `balance_rel_threshold`, `eviction_interval_secs`, `max_tree_size` are `cache_aware` internals. Treat as constants (recommended) or expose as sub-axes? Recommend constants for iteration-one; add if evidence emerges that they matter.
4. **`intra-node-data-parallel-size`** — search axis or derived from engine config's `--data-parallel-size`? The two must match; derive.

## 6. Out of scope here

- **LoRA-aware dispatch** (PrimeIntellect fork only, fix in PR #15) — only becomes relevant if an L2 multi-adapter axis opens. Track separately.
- **gRPC router path** (feature-gated in `Cargo.toml`, `src/routers/grpc/`) — HTTP is the reference path; gRPC is not on the critical path.
- **Full `llm-d` integration** — heavier dependency, separate ticket if we need the resilience-operator surface.
- **Router-policy parameter auto-tuning** — see decision #3.
- **Tokenizer-side-effect dedup** — router bundles `tokenizers`, `tiktoken-rs`, `minijinja`, `hf-hub` for server-side chat-template rendering; we should not rely on that path in trials (determinism hazard) but also not disable it (requires a source edit).

## 7. Risk register

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Router process OOMs mid-trial under `cache_aware` tree growth | Low | Trial lost | `max_tree_size=10000` (default); watchdog on router RSS; evict on boundary. |
| PD-disagg NIXL connector version skew between router and vLLM workers | Medium | Silent decode-side corruption | Pin vLLM and router versions together in the trial artefact; validate via quality gate (P8) — divergence would flag this. |
| Phantom-load bug re-emerges or new equivalent ships | Low (fork) / High (upstream) | Slow drift, invalid numbers | Pin fork version; regression test: synthetic 24h streaming run, assert no worker ends with `load() < 0`. |
| Driver hits `RouterPoolExhausted` → surrogate misattributes to engine config | High if untyped | Wasted iterations | Typed `FailureRecord` (§3.4); surrogate treats router-originated failures as non-informative on engine axes. |
| Fork carries auth code we don't use; CVE surface | Low | Ops overhead | Keep JWT/APIKey off via unset flags; track fork release security advisories. |

## 8. Prior art / references

Internal:
- `docs/research/raw/08-vllm-router-dataplane.md` — full technical note (structure, code internals, PR-by-PR diff).
- `docs/research/references/00-hypothesis-seed.md` §4.2 — L2 axis table.
- `docs/research/raw/references-L1-engine-config.md` §"Scope note — routing policy is not L1".
- `docs/research/raw/06-cloudflare-omni-gpu-multiplexing.md` — orthogonal "one GPU, many models" axis (reminder that tenancy is L2-adjacent).
- `docs/research/raw/07-vllm-v1-architecture.md` — vLLM V1 scheduler is downstream of the router.

External (all confirmed accessible 2026-04-23):
- Router source, upstream: https://github.com/vllm-project/router
- Router source, PrimeIntellect fork: https://github.com/PrimeIntellect-ai/router
- Load-balancing docs: https://github.com/vllm-project/router/blob/main/docs/load_balancing/README.md
- DistServe (PD-disagg, 7.4× headline): https://arxiv.org/abs/2401.09670
- Mooncake (KV-cache disagg): https://arxiv.org/abs/2407.00079
- Splitwise: https://arxiv.org/abs/2311.18677
- LMCache PD bench (2025-04-29): https://blog.lmcache.ai/2025-04-29-pdbench/
- vLLM disaggregated prefilling: https://docs.vllm.ai/en/latest/features/disagg_prefill/
- Mitzenmacher, *The Power of Two Choices in Randomized Load Balancing*: https://www.eecs.harvard.edu/~michaelm/postscripts/tpds2001.pdf
- SGLang RadixAttention (dispatcher-level prefix tree origin): https://arxiv.org/abs/2312.07104
- Ketama (consistent-hash virtual-node convention): https://github.com/RJ/ketama
- Facebook mcrouter (MurmurHash64A source): https://github.com/facebook/mcrouter

Fork PRs bearing on this ticket:
- #1 Model-aware routing: https://github.com/PrimeIntellect-ai/router/pull/1
- #6 JWT auth + per-run usage metrics: https://github.com/PrimeIntellect-ai/router/pull/6
- #12 JWT model-scope enforcement: https://github.com/PrimeIntellect-ai/router/pull/12
- #15 LoRA adapter routing fix: https://github.com/PrimeIntellect-ai/router/pull/15
- #23 Phantom load accumulation fix: https://github.com/PrimeIntellect-ai/router/pull/23
- #24 vLLM 500-as-400 for circuit breaker: https://github.com/PrimeIntellect-ai/router/pull/24

