L2 adapter: integrate vllm-project/router as dataplane for multi-replica and PD-disagg trials #3

@epappas

L2 adapter: integrate vllm-project/router as the dataplane for multi-replica and PD-disagg trials

Full source-of-truth reference for this ticket: docs/research/raw/08-vllm-router-dataplane.md (628 LoC, all technical detail, cross-references). This ticket is the implementation tracker; read the note for the WHY.

Thesis anchors. C1 (engine surface has slack), C4 (PD-disagg Pareto-dominates), C6 (hybrid policy required), C9 (reference replica required). P1 (three layers), P3 (layers are adapters), P4 (cross-layer stale-signal), P8 (reference replica), P9 (typed failure), P10 (frozen/mutable boundary).

Cross-refs.

  • docs/research/references/00-hypothesis-seed.md §4.2 — Router policy row now present in the L2 axis table.
  • docs/research/raw/references-L1-engine-config.md §"Scope note — routing policy is not L1".
  • docs/research/raw/08-vllm-router-dataplane.md — full source note.

1. WHAT — the problem in one paragraph

src/autoinfer/target/basilica.py and src/autoinfer/harness/driver.py are stubs today. Their implicit shape is "one worker URL per trial, driver hits it directly." That shape silently produces wrong numbers the moment L2 trials go to (a) more than one replica, or (b) prefill/decode disaggregation. The missing component is the request dataplane: a router that dispatches requests to workers under a named policy. vLLM's official reference implementation is vllm-project/router (Rust + PyO3); PrimeIntellect maintains a fork with production fixes we likely want (PrimeIntellect-ai/router, +24 commits). This ticket tracks adding the router to autoinfer's L2 adapter so multi-replica and PD-disagg trials are measured through a realistic dispatch path.

2. WHY — what breaks without it

Concrete failure modes of the direct-to-worker status quo, ordered by severity:

  1. PD-disaggregation is literally unrunnable without a dispatcher. The NIXL / NCCL connector protocol requires a component that wraps the request with bootstrap metadata (bootstrap_host, bootstrap_port, bootstrap_room), pairs a prefill leg with a decode leg, and merges their outputs. src/routers/http/vllm_pd_router.rs is that component. Any C4 evidence we produce without it is incomparable to DistServe / Mooncake headline numbers (7.4× and 59–498%, both assume router-mediated paths).
  2. Prefix-cache-hit rate becomes arrival-order dependent, not policy dependent. Without cache_aware dispatching, two workers with identical engine configs will see different prompt-prefix distributions determined by the driver's concurrency pattern, not by any defensible routing choice. The resulting Pareto frontier is noise.
  3. Policy-independent claims about L1. Any "this engine config beats vLLM defaults on tokens/s at 8 replicas" statement is under-specified without stating the router policy — switching from round_robin to cache_aware on the same engine config can swamp single-flag L1 wins (see LMCache PD bench 2025-04-29 for magnitude).
  4. Circuit-breaker state leaks across trials. Workers that trip open in trial K start trial K+1 in Open state; policies filter them out via get_healthy_worker_indices. Without explicit router lifecycle, trial K+1 runs against a quietly reduced pool.
  5. Phantom load accumulation. Upstream has a known bug in cache_aware load tracking (double-decrement on retry) that manifests only after hours of streaming traffic — workers silently get locked out with negative load. Fixed in PrimeIntellect fork PR #23; unmerged upstream at snapshot time (2026-04-23). Running benchmarks on upstream will produce drift we cannot explain.
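To make failure mode 5 concrete, the following is a toy Python model of a double-decrement on retry. It is not the router's Rust code: `LoadCounter`, `RequestGuard`, `naive_retry` and `guarded_retry` are hypothetical names, and the idempotent guard is one generic remedy, not a description of what fork PR #23 changes.

```python
class LoadCounter:
    """Toy in-flight counter standing in for a worker's load; illustrative only."""

    def __init__(self) -> None:
        self.load = 0

    def acquire(self) -> None:
        self.load += 1

    def release(self) -> None:
        self.load -= 1


class RequestGuard:
    """Holds one slot and gives it back at most once."""

    def __init__(self, counter: LoadCounter) -> None:
        self._counter = counter
        self._released = False
        counter.acquire()

    def release(self) -> None:
        if not self._released:
            self._released = True
            self._counter.release()


def naive_retry(counter: LoadCounter) -> int:
    """One request, two cleanup paths, each decrementing the counter."""
    counter.acquire()
    counter.release()  # the failed attempt cleans up
    counter.release()  # the retry path cleans up the same request again
    return counter.load


def guarded_retry(counter: LoadCounter) -> int:
    """Same two cleanup paths, but the guard makes the second one a no-op."""
    guard = RequestGuard(counter)
    guard.release()
    guard.release()
    return counter.load
```

In the naive path the counter ends below zero after a single retried request, which is the kind of state a negative-load sentinel would catch.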

3. HOW — implementation plan

3.1 Schema (src/autoinfer/layers/l2_topology/surface.py)

Add a RouterConfig sub-model (Pydantic; typed per P11):

from typing import Literal

from pydantic import BaseModel, Field

class RouterConfig(BaseModel):
    # Search-space axes:
    policy: Literal["cache_aware", "power_of_two", "consistent_hash", "round_robin", "random"] = "round_robin"
    intra_node_data_parallel_size: int = Field(default=1, ge=1, le=8)
    pd_disagg: bool = False
    connector: Literal["nixl", "nccl"] | None = None          # non-None only if pd_disagg
    # The Literal arguments for the two PD-leg policies are elided in this ticket.
    prefill_policy: Literal[...] | None = None                # non-None only if pd_disagg
    decode_policy: Literal[...] | None = None                 # non-None only if pd_disagg
    # Held constant across trials (not in the search space):
    cb_failure_threshold: int = 5
    cb_timeout_duration_secs: int = 30
    retry_max_retries: int = 3

Assertions (fail-fast, per CLAUDE.md "assert early, fail/error/return-fast"):

  • pd_disagg ⇒ connector is not None and prefill_policy is not None and decode_policy is not None.
  • not pd_disagg ⇒ connector is None and prefill_policy is None and decode_policy is None.
  • intra_node_data_parallel_size must match the engine config's --data-parallel-size.
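The three assertions above can be sketched as a plain function that a Pydantic validator (or the trial builder) could call. This is a minimal sketch: `router_config_violations` and the `engine_data_parallel_size` argument are hypothetical names, and returning a list of messages rather than raising is a choice made here so that all violations are reported at once.

```python
from typing import Optional


def router_config_violations(
    pd_disagg: bool,
    connector: Optional[str],
    prefill_policy: Optional[str],
    decode_policy: Optional[str],
    intra_node_data_parallel_size: int,
    engine_data_parallel_size: int,
) -> list[str]:
    """Return every violated invariant; an empty list means the config is valid."""
    pd_fields = {
        "connector": connector,
        "prefill_policy": prefill_policy,
        "decode_policy": decode_policy,
    }
    problems: list[str] = []
    for name, value in pd_fields.items():
        # pd_disagg implies all three PD fields are set ...
        if pd_disagg and value is None:
            problems.append(f"{name} must be set when pd_disagg is true")
        # ... and its absence implies all three are unset.
        if not pd_disagg and value is not None:
            problems.append(f"{name} must be None when pd_disagg is false")
    if intra_node_data_parallel_size != engine_data_parallel_size:
        problems.append(
            "intra_node_data_parallel_size must equal the engine's --data-parallel-size"
        )
    return problems
```

A fail-fast caller would raise on the first non-empty result, for example `if problems: raise ValueError("; ".join(problems))`.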

3.2 Router lifecycle (src/autoinfer/target/basilica.py)

Deterministic per-trial process:

  1. Provision N workers on Basilica (existing pattern, already stubbed).
  2. Launch vllm-router in front of the workers with CLI flags derived from RouterConfig. Pick a free port via sock.bind(('', 0)); sock.getsockname()[1]. Example (regular mode):
    vllm-router \
      --worker-urls $URL1 $URL2 ... \
      --policy cache_aware \
      --intra-node-data-parallel-size 1 \
      --host 127.0.0.1 --port $ROUTER_PORT \
      --prometheus-host 127.0.0.1 --prometheus-port $PROM_PORT
    PD-disagg:
    vllm-router \
      --vllm-pd-disaggregation \
      --prefill $P1 $P2 --decode $D1 $D2 $D3 $D4 \
      --prefill-policy consistent_hash --decode-policy round_robin \
      --host 127.0.0.1 --port $ROUTER_PORT
  3. Health-probe GET $ROUTER_URL/liveness (auth-exempt per fork PR #10).
  4. Return $ROUTER_URL to the driver; never return per-worker URLs.
  5. On teardown(): SIGTERM the router, wait, SIGKILL on timeout.
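Steps 2 and 4 can be sketched as two small helpers. This is a sketch under assumptions: `pick_free_port` and `build_router_argv` are hypothetical names, the only flags used are the ones shown in the two examples above, and PD mode is inferred from whether prefill/decode URLs are passed. The port is released before the router binds it, so a race is still possible; that case would surface as a router startup failure.

```python
import socket
from typing import Optional, Sequence


def pick_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


def build_router_argv(
    port: int,
    policy: str = "round_robin",
    worker_urls: Sequence[str] = (),
    intra_node_data_parallel_size: int = 1,
    prefill_urls: Sequence[str] = (),
    decode_urls: Sequence[str] = (),
    prefill_policy: Optional[str] = None,
    decode_policy: Optional[str] = None,
    prometheus_port: Optional[int] = None,
) -> list[str]:
    """Build the vllm-router command line for regular or PD-disagg mode."""
    argv = ["vllm-router"]
    if prefill_urls or decode_urls:
        if not (prefill_urls and decode_urls and prefill_policy and decode_policy):
            raise ValueError("PD-disagg needs prefill and decode URLs and both policies")
        argv += ["--vllm-pd-disaggregation"]
        argv += ["--prefill", *prefill_urls, "--decode", *decode_urls]
        argv += ["--prefill-policy", prefill_policy, "--decode-policy", decode_policy]
    else:
        if not worker_urls:
            raise ValueError("regular mode needs at least one worker URL")
        argv += ["--worker-urls", *worker_urls]
        argv += ["--policy", policy]
        argv += ["--intra-node-data-parallel-size", str(intra_node_data_parallel_size)]
    argv += ["--host", "127.0.0.1", "--port", str(port)]
    if prometheus_port is not None:
        argv += ["--prometheus-host", "127.0.0.1", "--prometheus-port", str(prometheus_port)]
    return argv
```

Returning an argv list (rather than a shell string) keeps worker URLs out of shell quoting and lets the exact command be stored in the trial artefact.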

Binary provenance: pin to PrimeIntellect fork until phantom-load (#23) and circuit-breaker (#24) fixes land upstream. Turn off JWT (--jwt-public-key-path unset) and API-key auth (--api-key-validation-urls unset) — trials are internal.

3.3 Driver (src/autoinfer/harness/driver.py)

vllm bench serve already takes a single URL; the cosmetic change is to point it at $ROUTER_URL. The non-cosmetic changes:

  • Inject X-Session-ID: autoinfer-{trial_id}-{request_index} on every request so consistent_hash is reproducible. The router's hash-key priority is X-Session-ID > X-User-ID > Authorization > client IP > body-hash; without an explicit session header, the effective key is client IP (all requests go to one worker under the default bench harness).
  • Make the prompt stream deterministic per trial. Seed the request-generator from trial_id so policy A/B is comparable on byte-identical inputs.
  • Pass router Prometheus scrape URL back to the ledger writer so per-policy decision counts are persisted.
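The first two bullets can be sketched with the standard library alone. `session_headers`, `trial_rng` and `sample_prompts` are hypothetical helper names, and sampling from an in-memory corpus is a stand-in for whatever request generator the driver ends up using. The seed is derived with SHA-256 because Python's built-in `hash()` on strings is salted per process. Two runs only see identical prompts if they pass the same seed key.

```python
import hashlib
import random


def session_headers(trial_id: str, request_index: int) -> dict[str, str]:
    """Explicit session header so the router's hash key never falls back to client IP."""
    return {"X-Session-ID": f"autoinfer-{trial_id}-{request_index}"}


def trial_rng(seed_key: str) -> random.Random:
    """Deterministic RNG derived from a string key, stable across processes."""
    digest = hashlib.sha256(seed_key.encode("utf-8")).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


def sample_prompts(seed_key: str, corpus: list[str], n: int) -> list[str]:
    """Draw n prompts from corpus in an order fixed by seed_key."""
    rng = trial_rng(seed_key)
    return [rng.choice(corpus) for _ in range(n)]
```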

3.4 Failure typing (src/autoinfer/harness/failure.py) — per P9

New typed failure records:

  • RouterStartupFailed(reason: Literal["binary_missing", "port_in_use", "pem_invalid", "flag_invalid"]).
  • RouterPoolExhausted(retries: int, pool_size: int, open_circuits: int) — aggregate saturation, not per-worker failure. Critical: if we leave this untyped, the surrogate attributes it to the engine config and wastes iterations tuning irrelevant knobs.
  • PDConnectorTimeout(leg: Literal["prefill", "decode"], timeout_secs: int, bootstrap_room: int) — NIXL/NCCL handoff timeout.
  • RouterPolicyInvariant(counter: str) — e.g. negative load counter observed (sentinel that the phantom-load bug re-emerged; should never fire on fork).
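One way to shape these records is as frozen dataclasses; the project may prefer Pydantic models instead. The field names follow the list above. The `informs_engine_axes` helper is hypothetical, and treating all four records as router-originated (and therefore uninformative about engine config) is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class RouterStartupFailed:
    reason: Literal["binary_missing", "port_in_use", "pem_invalid", "flag_invalid"]


@dataclass(frozen=True)
class RouterPoolExhausted:
    retries: int
    pool_size: int
    open_circuits: int


@dataclass(frozen=True)
class PDConnectorTimeout:
    leg: Literal["prefill", "decode"]
    timeout_secs: int
    bootstrap_room: int


@dataclass(frozen=True)
class RouterPolicyInvariant:
    counter: str


ROUTER_FAILURES = (
    RouterStartupFailed,
    RouterPoolExhausted,
    PDConnectorTimeout,
    RouterPolicyInvariant,
)


def informs_engine_axes(failure: object) -> bool:
    """False for router-originated failures, so a surrogate can skip them on engine axes."""
    return not isinstance(failure, ROUTER_FAILURES)
```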

3.5 Ledger (src/autoinfer/harness/ledger.py) — per P4

Every trial row where replicas > 1 or pd_disagg persists the full RouterConfig alongside the engine args. Without this, comparing two rows with the same engine config but different router policies silently mis-attributes throughput deltas.

Schema addition:

class LedgerRow(BaseModel):
    ...
    router_config: RouterConfig | None = None   # None only when replicas == 1 and not pd_disagg

3.6 Cross-layer stale-signal (src/autoinfer/controller/stale.py) — per P4

When any axis of RouterConfig changes for a given engine config, call Ledger.mark_stale() on prior rows with that engine config. Reason: the effective workload each worker sees depends on the policy; the old rows' throughput and tail numbers are no longer comparable.
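A minimal in-memory sketch of that rule follows. `Row`, `InMemoryLedger`, `engine_key` and `router_key` are hypothetical; the keys stand for canonical hashes of the engine args and of RouterConfig, and the real `mark_stale()` signature may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Row:
    engine_key: str            # hypothetical: canonical hash of the engine args
    router_key: Optional[str]  # hypothetical: canonical hash of RouterConfig, None if no router
    stale: bool = False


class InMemoryLedger:
    def __init__(self) -> None:
        self.rows: list[Row] = []

    def mark_stale(self, engine_key: str, current_router_key: Optional[str]) -> int:
        """Flag rows for engine_key that were measured under a different router config."""
        flagged = 0
        for row in self.rows:
            changed = row.router_key != current_router_key
            if row.engine_key == engine_key and changed and not row.stale:
                row.stale = True
                flagged += 1
        return flagged

    def append(self, row: Row) -> int:
        """Add a row, first marking older rows whose router config no longer matches."""
        flagged = self.mark_stale(row.engine_key, row.router_key)
        self.rows.append(row)
        return flagged
```

Rows for other engine configs, and rows measured under the same router config, are left untouched.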

3.7 Fairness hazards the driver must defend against

Each one is a real trap confirmed from source:

  1. Warm vs cold cache_aware tree — router restart gives cold tree; without restart the tree carries prior-trial placements. Recommend: restart router per trial, warm with N = 100 priming requests outside the measured window, declare the regime in the ledger.
  2. Round-robin AtomicUsize carryover — the counter is process-lifetime. reset() exists on the trait but no control endpoint exposes it. Restart is the safer lever.
  3. Circuit-breaker carryover — any worker that tripped in trial K starts trial K+1 in Open for cb_timeout_duration_secs. Either wait, or restart.
  4. power_of_two load signal — falls back to Worker::load() (local in-flight counter) if nothing pushes into cached_loads. Different from vLLM's actual queue depth. Declare which signal is in effect.
  5. PD connector vs fabric — NIXL handoff latency is fabric-dominated; headline DistServe/Mooncake numbers assume NVLink. On Basilica TCP paths, PD-disagg may be worse than collocated.
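Hazards 1 to 3 all point at the same lever: a fresh router process per trial. A sketch of that lifecycle as a context manager, covering the SIGTERM, wait, SIGKILL teardown, is shown below. `fresh_process` and `grace_secs` are hypothetical names; priming requests and the measured window would both run inside the `with` block.

```python
import contextlib
import subprocess
from typing import Iterator, Sequence


@contextlib.contextmanager
def fresh_process(
    argv: Sequence[str], grace_secs: float = 5.0
) -> Iterator[subprocess.Popen]:
    """Start a process for one trial and guarantee it is gone afterwards."""
    proc = subprocess.Popen(list(argv))
    try:
        yield proc
    finally:
        if proc.poll() is None:
            proc.terminate()  # SIGTERM on POSIX
            try:
                proc.wait(timeout=grace_secs)
            except subprocess.TimeoutExpired:
                proc.kill()  # SIGKILL on POSIX
                proc.wait()
```

Because teardown sits in `finally`, the process is also stopped when the trial body raises, so no policy counters, cache tree, or circuit-breaker state survive into the next trial.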

4. Acceptance criteria

Trial-level:

  • A trial with replicas=2, router.policy="cache_aware" and a trial with the same topology + router.policy="round_robin" on the same driver trace produce measurably different TTFT P50/P99 in the ledger. (If not, either the policy is inactive or the driver is not session-headered.)
  • A PD-disagg trial (prefill=2, decode=6, connector="nixl") runs end-to-end through the router and returns a quality-gated row against the FP16 reference replica (P8).
  • Policy-change events emit a mark_stale() call on prior ledger rows with the same engine config (P4).
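"Measurably different" in the first criterion needs an operational definition. One possible reading, sketched here with the standard library, compares P50 and P99 under a relative tolerance. `ttft_summary`, `differs` and the 5% default are hypothetical and not taken from the ticket; a real check would also want repeated runs to estimate noise.

```python
import statistics


def ttft_summary(samples_ms: list[float]) -> dict[str, float]:
    """P50 and P99 of a list of TTFT samples, in the samples' own unit."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p99": cuts[98]}


def differs(a_ms: list[float], b_ms: list[float], rel_tol: float = 0.05) -> bool:
    """True if P50 or P99 differ by more than rel_tol relative to the larger value."""
    sa, sb = ttft_summary(a_ms), ttft_summary(b_ms)
    return any(
        abs(sa[k] - sb[k]) > rel_tol * max(sa[k], sb[k]) for k in ("p50", "p99")
    )
```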

Negative / no-regression:

  • Iteration-zero L1 single-replica trials (§8 of 00-hypothesis-seed.md) are unaffected — no router is launched, no new dependency surfaces.
  • RouterPoolExhausted is emitted (not silently mis-classified as a worker failure) when the pool is saturated.

Observability:

5. Decisions to make (please comment)

  1. Binary provenance — pin to PrimeIntellect fork (has #23, #24, LoRA fixes) or wait for upstream merges? Fork carries JWT code we don't use but can disable. Recommend fork.
  2. Per-trial router restart vs long-lived router — restart is safer for A/B fairness but adds ~1s per trial (binary startup, pod health wait). Acceptable given wall-clock ~minutes per trial.
  3. Router-policy parameter tuning: cache_threshold, balance_abs_threshold, balance_rel_threshold, eviction_interval_secs, max_tree_size are cache_aware internals. Treat as constants (recommended) or expose as sub-axes? Recommend constants for iteration-one; add if evidence emerges that they matter.
  4. intra-node-data-parallel-size — search axis or derived from engine config's --data-parallel-size? The two must match; derive.

6. Out of scope here

7. Risk register

| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| Router process OOMs mid-trial under cache_aware tree growth | Low | Trial lost | max_tree_size=10000 (default); watchdog on router RSS; evict on boundary. |
| PD-disagg NIXL connector version skew between router and vLLM workers | Medium | Silent decode-side corruption | Pin vLLM and router versions together in the trial artefact; validate via quality gate (P8) — divergence would flag this. |
| Phantom-load bug re-emerges or new equivalent ships | Low (fork) / High (upstream) | Slow drift, invalid numbers | Pin fork version; regression test: synthetic 24h streaming run, assert no worker ends with load() < 0. |
| Driver hits RouterPoolExhausted → surrogate misattributes to engine config | High if untyped | Wasted iterations | Typed FailureRecord (§3.4); surrogate treats router-originated failures as non-informative on engine axes. |
| Fork carries auth code we don't use; CVE surface | Low | Ops overhead | Keep JWT/APIKey off via unset flags; track fork release security advisories. |

8. Prior art / references

Internal:

  • docs/research/raw/08-vllm-router-dataplane.md — full technical note (structure, code internals, PR-by-PR diff).
  • docs/research/references/00-hypothesis-seed.md §4.2 — L2 axis table.
  • docs/research/raw/references-L1-engine-config.md §"Scope note — routing policy is not L1".
  • docs/research/raw/06-cloudflare-omni-gpu-multiplexing.md — orthogonal "one GPU, many models" axis (reminder that tenancy is L2-adjacent).
  • docs/research/raw/07-vllm-v1-architecture.md — vLLM V1 scheduler is downstream of the router.

External (all confirmed accessible 2026-04-23):

Fork PRs bearing on this ticket:
