
Load-test findings: requests shed at per-account admission guards before routing while fleet sits idle (+ KV/size-aware routing gaps) #342

@devin-ai-integration

Summary

Over a multi-day, sustained dual-model load test against prod (api.darkbloom.dev, models gpt-oss-20b and gemma-4-26b), the router itself never shed a single request and output quality never degraded. Every rejection under load came from two per-account admission guards that run before provider selection, while the fleet had idle, capable providers, an empty queue, and 100% token budget free. This is the "machines are available but work isn't being done" symptom: the request is rejected before it ever reaches routing.

This issue documents (1) the two pre-router shedding mechanisms, (2) the KV/size-aware admission gaps that concentrate long requests onto too few warm slots, and (3) a memory-gate mismatch — all code-grounded with file:line. It also lists what is healthy, and explicitly separates out test-harness artifacts that are not product bugs.

Methodology: external OpenAI-compatible API client (key + base URL only). Capacity/warm-cold sampled from the public GET /v1/models/capacity every 3s; each error correlated with the fleet snapshot at the instant it fired. No coordinator code was changed to produce these findings.
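The correlation step in the methodology can be sketched as a lookup of the latest capacity sample taken at or before each error. This is a minimal sketch, not the harness used for the test: the `Snapshot` fields, `demoSnapshots`, and the timestamps are hypothetical stand-ins, and the real `/v1/models/capacity` response schema is not reproduced here.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Snapshot is a hypothetical, simplified view of one capacity sample.
// Field names are assumptions, not the real /v1/models/capacity schema.
type Snapshot struct {
	At    time.Time
	Warm  int
	Cold  int
	Queue int
}

// snapshotAt returns the most recent snapshot taken at or before t.
// snaps must be sorted by At ascending; ok is false if t precedes every sample.
func snapshotAt(snaps []Snapshot, t time.Time) (s Snapshot, ok bool) {
	// Index of the first snapshot strictly after t.
	i := sort.Search(len(snaps), func(i int) bool { return snaps[i].At.After(t) })
	if i == 0 {
		return Snapshot{}, false
	}
	return snaps[i-1], true
}

// demoSnapshots builds three made-up samples spaced 3s apart (arbitrary base time).
func demoSnapshots() []Snapshot {
	base := time.Date(2025, 1, 1, 0, 0, 0, 0, time.UTC)
	return []Snapshot{
		{At: base, Warm: 10, Cold: 44, Queue: 0},
		{At: base.Add(3 * time.Second), Warm: 12, Cold: 42, Queue: 0},
		{At: base.Add(6 * time.Second), Warm: 15, Cold: 39, Queue: 0},
	}
}

func main() {
	snaps := demoSnapshots()
	errAt := snaps[0].At.Add(4 * time.Second) // an error observed between two samples
	if s, ok := snapshotAt(snaps, errAt); ok {
		fmt.Printf("error at +4s: warm=%d cold=%d queue=%d\n", s.Warm, s.Cold, s.Queue)
	}
}
```

With a 3s sampling interval, the matched snapshot is at most about 3s older than the error, which is the resolution behind the "fleet was idle at the instant it fired" claims.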


What's healthy (baseline — so this is balanced)

  • Routing never shed: zero dispatch_503 (provider_error, "inference failed after N attempts") and zero capacity_429 ("all providers at capacity") across every burst and the entire sustained run.
  • Median routing is excellent: TTFB (first streamed token = routing + prefill) p50 ≈ 2s on both models throughout, even under continuous load.
  • Decode throughput is flat: ~60–74 tok/s/stream (gpt-oss), ~40–60 tok/s/stream (gemma) even on multi-minute 32k generations; cluster aggregate scaled to ~1,090–1,459 tok/s under burst.
  • No quality variance from routing: benchmark scores were stable across the run, with ~70 runs for each gemma vision benchmark and fewer (6–27) for gpt-oss (e.g. MMStar 0.44–0.46, gpt-oss MMLU 0.90–0.93). See Appendix B.

Issue 1 (highest priority for the OpenRouter listing): per-account billing-debit returns 503 under concurrency

Symptom. Under concurrency from a single API key, a fraction of requests get 503 {"code":"service_unavailable"} ("service temporarily unavailable"), while capacity shows idle cold providers, empty queue, 100% budget. It scales with request rate, not fleet load, and appears even at low concurrency (2–3 per 15-min stage at conc 3–4).

Root cause (pre-router). The pre-flight balance reservation is the only producer of that body:

  • coordinator/api/consumer.go:1452-1461 → s.ledger.Charge(consumerKey, reservedMicroUSD, ...); on a non-balance DB error it calls s.writeServiceUnavailable(w, model) (consumer.go:939). The same pattern repeats at consumer.go:3937-3943 and :4195.
  • s.ledger.Charge → store.Debit is a single-row UPDATE balances ... WHERE account_id=$1 under a short timeout: coordinator/store/postgres.go:1781 (Debit), SQL at :1792.

All traffic from one key → one account_id row → row-lock serialization + connection-pool pressure under concurrency → some debits exceed the timeout → DB error → 503. This runs before routing, so idle providers never get a chance.

Why this is the #1 risk for OpenRouter. OpenRouter funnels all its traffic through a single service account → a single balance row, so it is maximally exposed to this contention regardless of how big the provider fleet is.

Evidence. Burst B (max_tokens=512, 80 concurrent): 52 of 272 requests returned 503 and none returned 429; at every 503 the fleet had 33–44 cold idle providers, 60–65 routable, queue 0/10, budget 100%, and warm rose 10→32 (the router was actively warming). Burst matrix (Appendix A): the billing_503 count tracks concurrency at short max_tokens (512: 0→2→3→38 across conc 8/16/32/64) and vanishes at 4k/32k, where each worker fires ~1 long request.

Suggested fixes (pick one+): retry-on-serialization-failure inside Debit with small backoff; shard/partition the balance row (or use an append-only ledger + periodic settle instead of a hot single-row UPDATE); raise the Debit statement timeout and size the pool for expected per-account concurrency; treat reservation as best-effort/async for trusted service-tier accounts.
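The first suggested fix (retry inside Debit with a small backoff) could look roughly like the sketch below. Everything here is illustrative: `errTransient`, `retryDebit`, and `runDemo` are hypothetical names, the real store's error types and timeout handling are not shown in this issue, and the attempt count and base delay would need tuning against the Debit statement timeout.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errTransient stands in for a retryable DB error (lock timeout, serialization
// failure). It is a placeholder, not the store's actual error type.
var errTransient = errors.New("transient debit failure")

// retryDebit calls debit up to attempts times, sleeping base, 2*base, 4*base, ...
// between tries. Non-transient errors and context cancellation return immediately.
func retryDebit(ctx context.Context, attempts int, base time.Duration, debit func(context.Context) error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = debit(ctx); err == nil || !errors.Is(err, errTransient) {
			return err
		}
		if i == attempts-1 {
			break // out of attempts: surface the last transient error
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(base << i):
		}
	}
	return err
}

// runDemo retries a debit that fails `failures` times, then succeeds,
// and reports how many calls were made.
func runDemo(failures, attempts int) (int, error) {
	calls := 0
	debit := func(context.Context) error {
		calls++
		if calls <= failures {
			return errTransient
		}
		return nil
	}
	err := retryDebit(context.Background(), attempts, time.Millisecond, debit)
	return calls, err
}

func main() {
	calls, err := runDemo(2, 4)
	fmt.Println(calls, err) // 3 <nil>
}
```

Retrying only hides contention up to a point; if every concurrent request targets the same balance row, the sharded or append-only ledger options remove the hot row instead of waiting on it.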

Note: CLAUDE.md → Common Pitfalls still says "Postgres store exists but is not used in production yet." That is stale — the infra table lists Prod DB = AWS RDS PostgreSQL, and this 503 (only producible by the Postgres Debit path, never by the in-memory store) was observed live in prod. Worth correcting the doc.


Issue 2: output-token rate limit charges the full bounded max_tokens upfront

Symptom. At high max_tokens the request 429s with output_tokens rate limit exceeded while the fleet is idle. Reasoning models that request a large budget but rarely consume it are penalized.

Root cause (pre-router). coordinator/api/server.go:376-381 (applyTokenRateLimit) admits "using the upfront input estimate and the bounded max_tokens" — i.e. it charges the full worst-case output budget against the per-account OTPM bucket before the request runs. Buckets (coordinator/ratelimit/config.go): consumer OTPM 500_000/burst 64_000 (:61-62); service OTPM 5_000_000/burst 512_000 (:67-68). At 32k max_tokens the consumer burst holds <2 requests; on the service tier the wall appears right at ~524k aggregate (conc16 × 32k).
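The burst arithmetic above can be checked with a few lines. This sketch uses the bucket sizes quoted from coordinator/ratelimit/config.go and ignores the input-token estimate for simplicity; `upfrontFit` is a hypothetical helper, not a function in the codebase.

```go
package main

import "fmt"

// Burst sizes quoted above from coordinator/ratelimit/config.go.
const (
	consumerBurst = 64_000
	serviceBurst  = 512_000
)

// upfrontFit reports how many requests fit in a full burst bucket when each
// request is charged its whole max_tokens at admission.
func upfrontFit(burst, maxTokens int) int {
	if maxTokens <= 0 {
		return 0
	}
	return burst / maxTokens
}

func main() {
	const maxTokens = 32_768
	fmt.Println(upfrontFit(consumerBurst, maxTokens)) // 1
	fmt.Println(upfrontFit(serviceBurst, maxTokens))  // 15
	fmt.Println(16 * maxTokens)                       // 524288, just over the 512_000 service burst
}
```

So a 16th concurrent 32k request is the first one the service-tier burst cannot cover, regardless of how many tokens the earlier requests actually generate.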

Evidence. 32k×conc8 (~262k tokens) is clean; OTPM-429 only appears at 32k×conc16 (~524k ≈ the 512k service burst). The 8× headroom vs an earlier phase confirms the test key is now on the service tier (good for OpenRouter) — but the upfront full-reservation model still over-charges.

Suggested fixes: charge an expected output length (e.g. EWMA of recent completions, or a fraction of max_tokens) and reconcile actual usage afterward, instead of reserving full worst-case max_tokens; this is the same over-reservation pattern as Issue 3.
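One way to realize the "charge expected, reconcile later" idea is an exponentially weighted moving average of observed completion lengths. The sketch below is an assumption-laden illustration: `outputEstimator`, `reserve`, and `reconcile` are hypothetical names, the smoothing factor is arbitrary, and how the estimate is keyed (per account, per model) is left open.

```go
package main

import "fmt"

// outputEstimator tracks an exponentially weighted moving average (EWMA)
// of completion lengths. Hypothetical type, not part of the coordinator.
type outputEstimator struct {
	alpha float64 // weight of the newest observation, in (0, 1]
	mean  float64
	seen  bool
}

// observe folds one finished request's actual output token count into the average.
func (e *outputEstimator) observe(actual int) {
	if !e.seen {
		e.mean, e.seen = float64(actual), true
		return
	}
	e.mean = e.alpha*float64(actual) + (1-e.alpha)*e.mean
}

// reserve returns the tokens to charge at admission: the rounded EWMA, capped by
// max_tokens, or max_tokens itself before any completion has been observed.
func (e *outputEstimator) reserve(maxTokens int) int {
	if !e.seen || e.mean >= float64(maxTokens) {
		return maxTokens
	}
	return int(e.mean + 0.5)
}

// reconcile returns the correction to apply once the request finishes:
// positive means charge more, negative means refund.
func reconcile(reserved, actual int) int { return actual - reserved }

func main() {
	e := &outputEstimator{alpha: 0.5}
	e.observe(1000)
	e.observe(2000)
	reserved := e.reserve(32_768)
	fmt.Println(reserved, reconcile(reserved, 1200)) // 1500 -300
}
```

The trade-off is that a burst of requests that all run to max_tokens is admitted on an optimistic estimate, so the reconcile step can push the bucket negative; a cap on how far it may go negative is one possible guard.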


Issue 3: coordinator discards the provider's real KV size and reserves worst-case max_tokens (the size/KV-aware routing gap)

What the provider already gives us. Providers compute architecture-exact per-token KV cost and report it every heartbeat: coordinator/protocol/messages.go:208 declares KVBytesPerToken int64 `json:"kv_bytes_per_token,omitempty"` (comment: "provider-side only").

What the coordinator does with it. Nothing. snapshotProviderLocked (coordinator/registry/scheduler.go:661) never captures KVBytesPerToken. The per-request admission freeMemoryAdmits (scheduler.go:723-734) reserves requestTokens = reqPromptTokens + reqMaxTokens (full worst-case max_tokens) against activeTokenBudgetMax; the legacy fallback path uses a flat constant kvCacheBytesPerToken = 400_000 (scheduler.go:39-49, used at :752) regardless of model architecture.

Consequence. Long-budget requests over-reserve memory, so each warm provider admits fewer concurrent requests than it could actually serve → load concentrates onto a few slots → the TTFT/TTFB tail and the gemma stall in Issue 5. Closing this gap is exactly the "route by request size + how much KV cache it needs, not min-2×" behavior we want.

Suggested fixes: (A) reserve expected output length, not full max_tokens; (B) consume the provider-reported kv_bytes_per_token (capture it in snapshotProviderLocked, use it in freeMemoryAdmits) instead of the flat 400k constant.
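Fixes (A) and (B) combine into a single reservation formula: expected tokens times the provider-reported per-token KV cost, falling back to the worst case when either input is missing. The sketch below is illustrative only; `kvReservationBytes` and its parameters are hypothetical, the example numbers are made up, and the real freeMemoryAdmits works against activeTokenBudgetMax rather than raw bytes.

```go
package main

import "fmt"

// fallbackKVBytesPerToken mirrors the flat constant quoted above (scheduler.go:39-49).
const fallbackKVBytesPerToken int64 = 400_000

// kvReservationBytes estimates the KV memory to reserve for one request.
// reportedKV is the provider's kv_bytes_per_token (0 means "not reported");
// expectedOutput is a hypothetical estimate of the completion length.
func kvReservationBytes(promptTokens, maxTokens, expectedOutput int, reportedKV int64) int64 {
	perToken := reportedKV
	if perToken <= 0 {
		perToken = fallbackKVBytesPerToken // provider did not report: keep the flat constant
	}
	out := expectedOutput
	if out <= 0 || out > maxTokens {
		out = maxTokens // no usable estimate: keep the worst case
	}
	return int64(promptTokens+out) * perToken
}

func main() {
	// Made-up numbers: 1k prompt, 32k max_tokens, 2k expected output,
	// and an assumed provider-reported cost of 100 kB per token.
	worst := kvReservationBytes(1_000, 32_768, 0, 0)
	sized := kvReservationBytes(1_000, 32_768, 2_000, 100_000)
	fmt.Println(worst, sized) // 13507200000 300000000
}
```

Under these assumed inputs the sized reservation is about 45× smaller than the worst case, which is the kind of gap that would let one warm slot admit many more concurrent long-budget requests.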


Issue 4: coordinator admission memory gate is looser than the provider's load gate

Root cause. The weights-fit fallback modelFitsHardware (coordinator/registry/scheduler.go:142-150) admits when modelSizeGB * modelMemoryHeadroomFactor <= totalMemoryGB, with modelMemoryHeadroomFactor = 2.0 (scheduler.go:126-133). But the provider's ensureModelLoaded requires estimatedMemoryGb * 3.0 headroom (documented in CLAUDE.md → State Model). So the coordinator can route to a cold node that then fails to load the model and 503s "insufficient memory" → cooldown → reroute.
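The mismatch between the two gates is easy to see with concrete numbers. The sketch below assumes, for illustration, that the coordinator's modelSizeGB and the provider's estimatedMemoryGb are the same value; `fits` and `gateMismatch` are hypothetical helpers and the model/node sizes are invented.

```go
package main

import "fmt"

const (
	coordinatorFactor = 2.0 // modelMemoryHeadroomFactor, per scheduler.go:126-133
	providerFactor    = 3.0 // provider load gate, per CLAUDE.md (State Model)
)

// fits applies the same headroom check both gates use, with a given factor.
func fits(modelGB, totalGB, factor float64) bool {
	return modelGB*factor <= totalGB
}

// gateMismatch reports whether the coordinator would admit a node
// that the provider's load gate then rejects.
func gateMismatch(modelGB, totalGB float64) bool {
	return fits(modelGB, totalGB, coordinatorFactor) && !fits(modelGB, totalGB, providerFactor)
}

func main() {
	// Invented sizes: a 26 GB model on 48, 64, and 96 GB nodes.
	fmt.Println(gateMismatch(26, 48)) // false: both gates reject
	fmt.Println(gateMismatch(26, 64)) // true: coordinator admits, provider rejects
	fmt.Println(gateMismatch(26, 96)) // false: both gates admit
}
```

Any node whose memory falls between 2× and 3× the model size lands in the mismatch band, so the targeted test suggested below should pick a node in that range.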

Status. This path did not fire in the bursts (the Issue 1/2 admission guards shed first), so it's a structural latent risk rather than an observed failure. Worth a targeted test (cold-fleet, large model, force a load) before relying on the ×2 gate.

Suggested fix: align the coordinator fallback factor with the provider's (≈×3, or better, weights + real KV from Issue 3) so the coordinator never admits a node the provider will reject.


Issue 5 (symptom of Issue 3): gemma-4-26b long-request stall on a small running-slot set

Observation (live, this run). While gpt-oss-20b keeps a healthy running set (warm median 13, max 22), gemma-4-26b frequently sits at warm=3 with 56 cold idle and 7–8 active — long requests pile onto a few running slots while dozens of capable nodes stay cold. Result: gemma TTFB/TTFT tail stretches to ~123s under mixed load, and long text-reasoning evals occasionally stall out; gemma vision requests (short outputs) sail through unaffected.

Warm distribution over the run (capacity warm_providers = slots with State=="running" only; an "idle" loaded slot counts as cold here): gemma min=1 / median=8 / max=15; gpt-oss min=0 / median=13 / max=22.

Why. Bigger model (fewer hosts can load it) + worst-case max_tokens reservation (Issue 3) → each warm gemma slot admits fewer concurrent requests → demand concentrates. Smaller per-request reservations (Issue 3 fix) and/or more aggressive pre-warming for gemma would widen the effective slot count.


Appendix A — burst matrix (gpt-oss-20b, on top of the 24/7 loops; every error classified)

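| max_tok | conc | reqs | err% | latency p50 / p95 | clusterTPS (tok/s) | errors |
|---|---|---|---|---|---|---|
| 512 | 8 | 29 | 0% | 11.6s / 14.3s | 303 | none |
| 512 | 16 | 47 | 4.3% | 15.0s / 22.0s | 429 | billing_503 ×2 |
| 512 | 32 | 99 | 3.0% | 17.6s / 21.8s | 843 | billing_503 ×3 |
| 512 | 64 | 182 | 20.9% | 19.5s / 28.6s | 1246 | billing_503 ×38 |
| 4096 | 8 | 8 | 0% | 43s / 46s | 704 | none |
| 4096 | 16 | 16 | 0% | 60s / 69s | 946 | none |
| 4096 | 32 | 32 | 0% | 68s / 89s | 1459 | none |
| 32768 | 4 | 4 | 0% | 82s / 82s | 248 | none |
| 32768 | 8 | 8 | 0% | 120s / 123s | 336 | none |
| 32768 | 16 | 22 | 36.4% | 87s / 119s | 170 | otpm_429 ×6, client ×2 |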

Error taxonomy: billing_503 = Issue 1; otpm_429 = Issue 2; dispatch_503/capacity_429 = genuine router/fleet path = never observed.
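The taxonomy can be expressed as a small classifier over HTTP status and response body. This is a sketch of how a harness might label errors, not the harness actually used: the matched substrings are the ones quoted in this issue, `classify` is a hypothetical name, and treating status -1 as a client-side timeout follows Appendix C.

```go
package main

import (
	"fmt"
	"strings"
)

// classify maps a response to the taxonomy labels used in Appendix A.
// status -1 denotes a client-side failure with no HTTP response.
func classify(status int, body string) string {
	switch {
	case status == -1:
		return "client"
	case status == 503 && strings.Contains(body, "service_unavailable"):
		return "billing_503" // Issue 1: pre-router balance reservation
	case status == 503 && strings.Contains(body, "inference failed after"):
		return "dispatch_503" // genuine router/fleet path
	case status == 429 && strings.Contains(body, "output_tokens rate limit exceeded"):
		return "otpm_429" // Issue 2: per-account OTPM bucket
	case status == 429 && strings.Contains(body, "all providers at capacity"):
		return "capacity_429" // genuine router/fleet path
	default:
		return "other"
	}
}

func main() {
	fmt.Println(classify(503, `{"code":"service_unavailable"}`))        // billing_503
	fmt.Println(classify(429, "output_tokens rate limit exceeded"))     // otpm_429
	fmt.Println(classify(503, "inference failed after 3 attempts"))     // dispatch_503
}
```

Keeping an explicit "other" bucket matters for triage: any error that lands there is unclassified and should be inspected rather than silently attributed to one of the known issues.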

Appendix B — benchmark quality (stable across the run; routing adds no variance)

| model | benchmark | mean | n |
|---|---|---|---|
| gemma-4-26b | MMStar (vision) | 0.456 | 73 |
| gemma-4-26b | MathVista (vision) | 0.618 | 77 |
| gemma-4-26b | MMMU (vision) | 0.745 | 77 |
| gpt-oss-20b | MMLU | ~0.92 | 6 |
| gpt-oss-20b | AgentDojo security / utility | 0.554 / 0.153 | 27 |
| gpt-oss-20b | SuperGPQA | ~0.51 | 7 |

(gpt-oss IFEval 0.70 strict / 0.72 loose; GPQA-Diamond ~0.581. AgentDojo utility 0.15 is a model-capability signal, not a network issue.)

Appendix C — test-harness artifacts (NOT product bugs; listed to avoid muddying triage)

  • OpenBench math 404 model_not_found: the math task's scorer uses a grader_model defaulting to an OpenAI GPT-4 model not hosted on darkbloom; pass -T grader_model=<a darkbloom model> (an openbench/inspect config flag). Not a coordinator bug.
  • Docker disk exhaustion on the test box: SWE-bench/terminal-bench container images fill local disk (vfs leak) — a property of my test harness, not the network.
  • Client-side timeouts (code: -1): a handful of aiohttp client timeouts on the longest 32k generations; the server returned 0 such errors.

Closing note

A KV/size-aware admission design (Issue 3, Gaps A+B) has been drafted with exact file:line and is ready to turn into a PR on request. The single highest-leverage fix for the OpenRouter launch is Issue 1 (single-account billing-debit contention), since OpenRouter concentrates all traffic on one balance row.
