L2 adapter: integrate vllm-project/router as the dataplane for multi-replica and PD-disagg trials
Full source-of-truth reference for this ticket: docs/research/raw/08-vllm-router-dataplane.md (628 LoC, all technical detail, cross-references). This ticket is the implementation tracker; read the note for the WHY.
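Thesis anchors. C1 (engine surface has slack), C4 (PD-disagg Pareto-dominates), C6 (hybrid policy required), C9 (reference replica required). P1 (three layers), P3 (layers are adapters), P4 (cross-layer stale-signal), P8 (reference replica), P9 (typed failure), P10 (frozen/mutable boundary).
Cross-refs.
- docs/research/references/00-hypothesis-seed.md §4.2: Router policy row now present in the L2 axis table.
- docs/research/raw/references-L1-engine-config.md §"Scope note — routing policy is not L1".
- docs/research/raw/08-vllm-router-dataplane.md: full source note.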
1. WHAT — the problem in one paragraph
src/autoinfer/target/basilica.py and src/autoinfer/harness/driver.py are stubs today. Their implicit shape is "one worker URL per trial, driver hits it directly." That shape silently produces wrong numbers the moment L2 trials go to (a) more than one replica, or (b) prefill/decode disaggregation. The missing component is the request dataplane: a router that dispatches requests to workers under a named policy. vLLM's official reference implementation is vllm-project/router (Rust + PyO3); PrimeIntellect maintains a fork with production fixes we likely want (PrimeIntellect-ai/router, +24 commits). This ticket tracks adding the router to autoinfer's L2 adapter so multi-replica and PD-disagg trials are measured through a realistic dispatch path.
2. WHY — what breaks without it
Concrete failure modes of the direct-to-worker status quo, ordered by severity:
1. PD-disaggregation cannot run at all without a dispatcher. The NIXL / NCCL connector protocol requires a component that wraps the request with bootstrap metadata (bootstrap_host, bootstrap_port, bootstrap_room), pairs a prefill leg with a decode leg, and merges their outputs. src/routers/http/vllm_pd_router.rs is that component. Any C4 evidence we produce without it is incomparable to DistServe / Mooncake headline numbers (7.4× and 59–498%, both assume router-mediated paths).
2. Prefix-cache-hit rate becomes arrival-order dependent, not policy dependent. Without cache_aware dispatching, two workers with identical engine configs will see different prompt-prefix distributions determined by the driver's concurrency pattern, not by any defensible routing choice. The resulting Pareto frontier is noise.
3. Policy-independent claims about L1 are under-specified. Any "this engine config beats vLLM defaults on tokens/s at 8 replicas" statement must name the router policy: switching from round_robin to cache_aware on the same engine config can swamp single-flag L1 wins (see LMCache PD bench 2025-04-29 for magnitude).
4. Circuit-breaker state leaks across trials. Workers that trip open in trial K start trial K+1 in Open state; policies filter them out via get_healthy_worker_indices. Without explicit router lifecycle, trial K+1 runs against a quietly reduced pool.
5. Phantom load accumulation. Upstream has a known bug in cache_aware load tracking (double-decrement on retry) that manifests only after hours of streaming traffic: workers silently get locked out with negative load. Fixed in PrimeIntellect fork PR #23; unmerged upstream at snapshot time (2026-04-23). Running benchmarks on upstream will produce drift we cannot explain.
3. HOW — implementation plan
3.1 Schema (src/autoinfer/layers/l2_topology/surface.py)
Add a RouterConfig sub-model (Pydantic; typed per P11):

    from typing import Literal

    from pydantic import BaseModel, Field


    class RouterConfig(BaseModel):
        policy: Literal["cache_aware", "power_of_two", "consistent_hash", "round_robin", "random"] = "round_robin"
        intra_node_data_parallel_size: int = Field(default=1, ge=1, le=8)
        pd_disagg: bool = False
        connector: Literal["nixl", "nccl"] | None = None  # non-None only if pd_disagg
        prefill_policy: Literal[...] | None = None  # non-None only if pd_disagg
        decode_policy: Literal[...] | None = None  # non-None only if pd_disagg
        # Held constant across trials (not in the search space):
        cb_failure_threshold: int = 5
        cb_timeout_duration_secs: int = 30
        retry_max_retries: int = 3
Assertions (fail-fast, per CLAUDE.md "assert early, fail/error/return-fast"):
pd_disagg ⇒ connector is not None and prefill_policy is not None and decode_policy is not None.
not pd_disagg ⇒ connector is None and prefill_policy is None and decode_policy is None.
intra_node_data_parallel_size must match the engine config's --data-parallel-size.
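The three assertions above can be sketched with a plain dataclass. This is a stdlib stand-in, not the Pydantic model the ticket calls for; `RouterConfigSketch` and `check_matches_engine` are hypothetical names, and the policy set for the prefill/decode legs is left unchecked because the ticket elides it.

```python
from dataclasses import dataclass
from typing import Optional

POLICIES = {"cache_aware", "power_of_two", "consistent_hash", "round_robin", "random"}
CONNECTORS = {"nixl", "nccl"}


@dataclass(frozen=True)
class RouterConfigSketch:
    """Hypothetical stand-in for RouterConfig that fails fast on bad combinations."""

    policy: str = "round_robin"
    intra_node_data_parallel_size: int = 1
    pd_disagg: bool = False
    connector: Optional[str] = None
    prefill_policy: Optional[str] = None
    decode_policy: Optional[str] = None

    def __post_init__(self) -> None:
        if self.policy not in POLICIES:
            raise ValueError(f"unknown policy: {self.policy!r}")
        if not 1 <= self.intra_node_data_parallel_size <= 8:
            raise ValueError("intra_node_data_parallel_size must be in [1, 8]")
        if self.connector is not None and self.connector not in CONNECTORS:
            raise ValueError(f"unknown connector: {self.connector!r}")
        pd_fields = (self.connector, self.prefill_policy, self.decode_policy)
        # pd_disagg => all three PD fields are set.
        if self.pd_disagg and any(f is None for f in pd_fields):
            raise ValueError("pd_disagg requires connector, prefill_policy and decode_policy")
        # not pd_disagg => all three PD fields are unset.
        if not self.pd_disagg and any(f is not None for f in pd_fields):
            raise ValueError("connector/prefill_policy/decode_policy require pd_disagg")


def check_matches_engine(cfg: RouterConfigSketch, engine_data_parallel_size: int) -> None:
    """Cross-check against the engine config's --data-parallel-size value."""
    if cfg.intra_node_data_parallel_size != engine_data_parallel_size:
        raise ValueError("intra_node_data_parallel_size does not match the engine config")
```

In the Pydantic model the same checks would presumably live in a model-level validator, so an invalid combination is rejected at construction time and never reaches the ledger.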
3.2 Router lifecycle (src/autoinfer/target/basilica.py)
Deterministic per-trial process:
1. Provision N workers on Basilica (existing pattern, already stubbed).
2. Launch vllm-router in front with CLI flags derived from RouterConfig. Pick a free port by binding a socket to port 0 and reading sock.getsockname()[1]. Example (regular mode):
3. Wait for GET $ROUTER_URL/liveness to succeed (auth-exempt per fork PR #10).
4. Return $ROUTER_URL to the driver; never return per-worker URLs.
5. On teardown(): SIGTERM the router, wait, SIGKILL on timeout.
Binary provenance: pin to the PrimeIntellect fork until the phantom-load (#23) and circuit-breaker (#24) fixes land upstream. Turn off JWT (--jwt-public-key-path unset) and API-key auth (--api-key-validation-urls unset), since trials are internal.
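The port-selection and teardown steps can be sketched with the standard library alone. `pick_free_port` and `stop_process` are hypothetical helper names, and the demo spawns a sleeping Python process as a stand-in for the router binary, because the real vllm-router command line is not reproduced here.

```python
import socket
import subprocess
import sys


def pick_free_port() -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("", 0))
        return sock.getsockname()[1]


def stop_process(proc: subprocess.Popen, grace_secs: float = 10.0) -> int:
    """Terminate (SIGTERM on POSIX), wait, then kill (SIGKILL on POSIX) on timeout."""
    if proc.poll() is None:
        proc.terminate()
        try:
            proc.wait(timeout=grace_secs)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
    return proc.returncode


if __name__ == "__main__":
    port = pick_free_port()
    # Stand-in for the router process; a real launch would pass flags
    # derived from RouterConfig plus the chosen port.
    proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print("port", port, "exit code", stop_process(proc, grace_secs=5.0))
```

The port is released before the child binds it, so another process can take it in between. A launcher built this way would need to treat a bind failure as retryable, not as a trial failure.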
3.3 Driver (src/autoinfer/harness/driver.py)
vllm bench serve already takes a single URL, so the cosmetic change is to point it at $ROUTER_URL. The non-cosmetic changes:
Inject X-Session-ID: autoinfer-{trial_id}-{request_index} on every request so consistent_hash is reproducible. The router's hash-key priority is X-Session-ID > X-User-ID > Authorization > client IP > body-hash; without an explicit session header, the effective key is client IP (all requests go to one worker under the default bench harness).
Make the prompt stream deterministic per trial. Seed the request-generator from trial_id so policy A/B is comparable on byte-identical inputs.
Pass router Prometheus scrape URL back to the ledger writer so per-policy decision counts are persisted.
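A minimal sketch of the first two driver changes, assuming prompts are drawn from a fixed pool. The header format comes from the ticket; `session_headers`, `trial_seed` and `prompt_stream` are hypothetical names. The seed is derived with SHA-256 because Python's built-in `hash()` on strings is salted per process and would not reproduce across runs.

```python
import hashlib
import random
from typing import Dict, List


def session_headers(trial_id: str, request_index: int) -> Dict[str, str]:
    """Explicit session key so consistent_hash does not fall back to client IP."""
    return {"X-Session-ID": f"autoinfer-{trial_id}-{request_index}"}


def trial_seed(trial_id: str) -> int:
    """Stable 64-bit seed derived from the trial id."""
    digest = hashlib.sha256(trial_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def prompt_stream(trial_id: str, prompt_pool: List[str], n_requests: int) -> List[str]:
    """Deterministic prompt sequence: same trial_id and pool give the same stream."""
    rng = random.Random(trial_seed(trial_id))
    return [rng.choice(prompt_pool) for _ in range(n_requests)]
```

For a policy A/B comparison on byte-identical inputs, both arms would need to share the seed input (for example a common trace id) and not use two distinct trial ids. Which identifier plays that role is left open here.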
3.4 Failure typing (src/autoinfer/harness/failure.py) — per P9
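New typed failure records:
- RouterStartupFailed(reason: Literal["binary_missing", "port_in_use", "pem_invalid", "flag_invalid"]).
- RouterPoolExhausted(retries: int, pool_size: int, open_circuits: int): aggregate saturation, not per-worker failure. Critical: if we leave this untyped, the surrogate attributes it to the engine config and wastes iterations tuning irrelevant knobs.
- PDConnectorTimeout(leg: Literal["prefill", "decode"], timeout_secs: int, bootstrap_room: int): NIXL/NCCL handoff timeout.
- RouterPolicyInvariant(counter: str): e.g. negative load counter observed (sentinel that the phantom-load bug re-emerged; should never fire on fork).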
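The router failure records named in this ticket could be expressed as frozen dataclasses. This is a sketch, not the project's FailureRecord hierarchy: the field names follow the signatures in the ticket, while `RouterFailure` and `informative_on_engine_axes` are hypothetical, and treating all four records as router-originated is an assumption.

```python
from dataclasses import dataclass
from typing import Literal, Union


@dataclass(frozen=True)
class RouterStartupFailed:
    reason: Literal["binary_missing", "port_in_use", "pem_invalid", "flag_invalid"]


@dataclass(frozen=True)
class RouterPoolExhausted:
    retries: int
    pool_size: int
    open_circuits: int


@dataclass(frozen=True)
class PDConnectorTimeout:
    leg: Literal["prefill", "decode"]
    timeout_secs: int
    bootstrap_room: int


@dataclass(frozen=True)
class RouterPolicyInvariant:
    counter: str


RouterFailure = Union[
    RouterStartupFailed, RouterPoolExhausted, PDConnectorTimeout, RouterPolicyInvariant
]


def informative_on_engine_axes(failure: object) -> bool:
    """Router-originated failures should not update the surrogate's engine axes."""
    router_types = (
        RouterStartupFailed,
        RouterPoolExhausted,
        PDConnectorTimeout,
        RouterPolicyInvariant,
    )
    return not isinstance(failure, router_types)
```

Keeping the records as distinct types lets the surrogate filter on type alone, with no parsing of error strings.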
3.5 Ledger (src/autoinfer/harness/ledger.py) — per P4
Every trial row where replicas > 1 or pd_disagg persists the full RouterConfig alongside the engine args. Without this, comparing two rows with the same engine config but different router policies silently mis-attributes throughput deltas.
Schema addition:

    class LedgerRow(BaseModel):
        ...
        router_config: RouterConfig | None = None  # None only when replicas == 1 and not pd_disagg
3.6 Cross-layer stale-signal (src/autoinfer/controller/stale.py) — per P4
When any axis of RouterConfig changes for a given engine config, call Ledger.mark_stale() on prior rows with that engine config. Reason: the effective workload each worker sees depends on the policy, so the old rows' throughput and tail numbers are no longer comparable.
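The stale-signal rule can be sketched over an in-memory list of rows. `Row`, `engine_key`, `router_key` and `mark_stale_on_router_change` are hypothetical names chosen for illustration; the real Ledger.mark_stale() signature is not shown in this ticket.

```python
from dataclasses import dataclass
from typing import Hashable, List, Optional


@dataclass
class Row:
    engine_key: Hashable            # identifies the engine config
    router_key: Optional[Hashable]  # identifies the RouterConfig, None if no router
    stale: bool = False


def mark_stale_on_router_change(
    rows: List[Row], engine_key: Hashable, new_router_key: Optional[Hashable]
) -> int:
    """Mark prior rows for this engine config whose router config differs.

    Returns the number of rows newly marked stale.
    """
    marked = 0
    for row in rows:
        same_engine = row.engine_key == engine_key
        router_changed = row.router_key != new_router_key
        if same_engine and router_changed and not row.stale:
            row.stale = True
            marked += 1
    return marked
```

Rows recorded under the same router config stay valid, so repeating a trial with unchanged settings does not invalidate earlier measurements.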
3.7 Fairness hazards the driver must defend against
Each one is a real trap confirmed from source:
Warm vs cold cache_aware tree — router restart gives cold tree; without restart the tree carries prior-trial placements. Recommend: restart router per trial, warm with N = 100 priming requests outside the measured window, declare the regime in the ledger.
Round-robin AtomicUsize carryover — the counter is process-lifetime. reset() exists on the trait but no control endpoint exposes it. Restart is the safer lever.
Circuit-breaker carryover — any worker that tripped in trial K starts trial K+1 in Open for cb_timeout_duration_secs. Either wait, or restart.
power_of_two load signal — falls back to Worker::load() (local in-flight counter) if nothing pushes into cached_loads. Different from vLLM's actual queue depth. Declare which signal is in effect.
PD connector vs fabric — NIXL handoff latency is fabric-dominated; headline DistServe/Mooncake numbers assume NVLink. On Basilica TCP paths, PD-disagg may be worse than collocated.
4. Acceptance criteria
Trial-level:
A trial with replicas=2, router.policy="cache_aware" and a trial with the same topology + router.policy="round_robin" on the same driver trace produce measurably different TTFT P50/P99 in the ledger. (If not, either the policy is inactive or the driver is not session-headered.)
A PD-disagg trial (prefill=2, decode=6, connector="nixl") runs end-to-end through the router and returns a quality-gated row against the FP16 reference replica (P8).
Policy-change events emit a mark_stale() call on prior ledger rows with the same engine config (P4).
Negative / no-regression:
Iteration-zero L1 single-replica trials (§8 of 00-hypothesis-seed.md) are unaffected — no router is launched, no new dependency surfaces.
RouterPoolExhausted is emitted (not silently mis-classified as a worker failure) when the pool is saturated.
Observability:
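- Router Prometheus scrape URL is persisted in the ledger; vllm_router_run_requests_total and per-policy decision counts are queryable post-trial.
- The negative-load sentinel (Worker::load() < 0) never fires under streaming workloads for >24h (regression guard for #23).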
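Reading a counter back from a scrape can be sketched as a small parser for the Prometheus text exposition format. The metric name comes from the ticket; `sum_counter` is a hypothetical helper, and the `policy` label in the sample is an assumption, not a confirmed label of the router's metrics.

```python
import re

_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def sum_counter(exposition_text: str, metric_name: str) -> float:
    """Sum all samples of one metric across label sets in a text-format scrape."""
    total = 0.0
    for raw in exposition_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _NAME.match(line)
        if match is None or match.group(0) != metric_name:
            continue
        rest = line[match.end():]
        if rest.startswith("{"):
            close = rest.rfind("}")
            if close == -1:
                continue  # malformed label block
            rest = rest[close + 1:]
        tokens = rest.split()
        if tokens:
            total += float(tokens[0])  # an optional timestamp may follow the value
    return total


SAMPLE = """\
# TYPE vllm_router_run_requests_total counter
vllm_router_run_requests_total{policy="round_robin"} 120
vllm_router_run_requests_total{policy="cache_aware"} 80
vllm_router_other_total 5
"""

if __name__ == "__main__":
    print(sum_counter(SAMPLE, "vllm_router_run_requests_total"))
```

A production harness would more likely use an existing Prometheus client parser; the point here is only that the persisted scrape URL is enough to recover per-trial counts after the fact.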
5. Decisions to make (please comment)
- Per-trial router restart vs long-lived router: restart is safer for A/B fairness but adds ~1s per trial (binary startup, pod health wait). Acceptable given wall-clock ~minutes per trial.
- Router-policy parameter tuning: cache_threshold, balance_abs_threshold, balance_rel_threshold, eviction_interval_secs, max_tree_size are cache_aware internals. Treat as constants (recommended) or expose as sub-axes? Recommend constants for iteration-one; add if evidence emerges that they matter.
- intra-node-data-parallel-size: search axis or derived from engine config's --data-parallel-size? The two must match; derive.
6. Out of scope here
- gRPC path (Cargo.toml, src/routers/grpc/): HTTP is the reference path; gRPC is not on the critical path.
- llm-d integration: heavier dependency, separate ticket if we need the resilience-operator surface.
- Tokenizer-side-effect dedup: the router bundles tokenizers, tiktoken-rs, minijinja, hf-hub for server-side chat-template rendering; we should not rely on that path in trials (determinism hazard) but also not disable it (requires a source edit).
7. Risk register
| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| Router process OOMs mid-trial under cache_aware tree growth | Low | Trial lost | max_tree_size=10000 (default); watchdog on router RSS; evict on boundary. |
| PD-disagg NIXL connector version skew between router and vLLM workers | Medium | Silent decode-side corruption | Pin vLLM and router versions together in the trial artefact; validate via quality gate (P8), where divergence would flag this. |
| Phantom-load bug re-emerges or new equivalent ships | Low (fork) / High (upstream) | Slow drift, invalid numbers | Pin fork version; regression test: synthetic 24h streaming run, assert no worker ends with load() < 0. |
| Driver hits RouterPoolExhausted → surrogate misattributes to engine config | High if untyped | Wasted iterations | Typed FailureRecord (§3.4); surrogate treats router-originated failures as non-informative on engine axes. |
| Fork carries auth code we don't use; CVE surface | Low | Ops overhead | Keep JWT/APIKey off via unset flags; track fork release security advisories. |
8. Prior art / references
Internal:
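- docs/research/raw/08-vllm-router-dataplane.md: full technical note (structure, code internals, PR-by-PR diff).
- docs/research/references/00-hypothesis-seed.md §4.2: L2 axis table.
- docs/research/raw/references-L1-engine-config.md §"Scope note — routing policy is not L1".
- docs/research/raw/06-cloudflare-omni-gpu-multiplexing.md: orthogonal "one GPU, many models" axis (reminder that tenancy is L2-adjacent).
- docs/research/raw/07-vllm-v1-architecture.md: vLLM V1 scheduler is downstream of the router.
External (all confirmed accessible 2026-04-23):
Fork PRs bearing on this ticket: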