Summary
Over a multi-day, sustained dual-model load test against prod (api.darkbloom.dev, models gpt-oss-20b and gemma-4-26b), the router itself never shed a single request and output quality never degraded. Every rejection under load came from two per-account admission guards that run before provider selection, while the fleet had idle, capable providers, an empty queue, and 100% token budget free. This is the "machines are available but work isn't being done" symptom: the request is rejected before it ever reaches routing.
This issue documents (1) the two pre-router shedding mechanisms, (2) the KV/size-aware admission gaps that concentrate long requests onto too few warm slots, and (3) a memory-gate mismatch — all code-grounded with file:line. It also lists what is healthy, and explicitly separates out test-harness artifacts that are not product bugs.
Methodology: external OpenAI-compatible API client (key + base URL only). Capacity/warm-cold sampled from the public GET /v1/models/capacity every 3s; each error correlated with the fleet snapshot at the instant it fired. No coordinator code was changed to produce these findings.
What's healthy (baseline — so this is balanced)
- Routing never shed: zero dispatch_503 (provider_error, "inference failed after N attempts") and zero capacity_429 ("all providers at capacity") across every burst and the entire sustained run.
- Median routing is excellent: TTFB (first streamed token = routing + prefill) p50 ≈ 2s on both models throughout, even under continuous load.
- Decode throughput is flat: ~60–74 tok/s/stream (gpt-oss), ~40–60 tok/s/stream (gemma) even on multi-minute 32k generations; cluster aggregate scaled to ~1,090–1,459 tok/s under burst.
- No quality variance from routing: benchmark scores were stable across ~70 runs each (e.g. MMStar 0.44–0.46, gpt-oss MMLU 0.90–0.93). See Appendix B.
Issue 1 (highest priority for the OpenRouter listing): per-account billing-debit returns 503 under concurrency
Symptom. Under concurrency from a single API key, a fraction of requests get 503 {"code":"service_unavailable"} ("service temporarily unavailable"), while capacity shows idle cold providers, empty queue, 100% budget. It scales with request rate, not fleet load, and appears even at low concurrency (2–3 per 15-min stage at conc 3–4).
Root cause (pre-router). The pre-flight balance reservation is the only producer of that body:
coordinator/api/consumer.go:1452-1461 — s.ledger.Charge(consumerKey, reservedMicroUSD, ...); on a non-balance DB error it calls s.writeServiceUnavailable(w, model) (consumer.go:939). (Same pattern repeats at consumer.go:3937-3943 and :4195.)
s.ledger.Charge → store.Debit is a single-row UPDATE balances ... WHERE account_id=$1 under a short timeout — coordinator/store/postgres.go:1781 (Debit), SQL at :1792.
All traffic from one key → one account_id row → row-lock serialization + connection-pool pressure under concurrency → some debits exceed the timeout → DB error → 503. This runs before routing, so idle providers never get a chance.
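To make the contention mechanism concrete, here is a back-of-the-envelope model in Go. It assumes that debits against one balance row arrive at the same instant and run strictly one at a time; the function name, the model, and the durations in the usage note are illustrative assumptions, not measured values or coordinator code.

```go
package main

// timedOutDebits estimates how many of n simultaneous debits against a single
// balance row miss the statement timeout when the row lock serializes them.
// The k-th debit (1-based) finishes at k*perDebitMs, so only the first
// timeoutMs/perDebitMs debits complete in time.
// Hypothetical helper: the name and the queueing model are assumptions.
func timedOutDebits(n, perDebitMs, timeoutMs int) int {
	if n <= 0 || perDebitMs <= 0 || timeoutMs < 0 {
		return 0
	}
	inTime := timeoutMs / perDebitMs
	if inTime >= n {
		return 0
	}
	return n - inTime
}
```

With a made-up 10 ms per debit and a 500 ms timeout, 8 simultaneous debits all finish in time, while 80 simultaneous debits leave 30 past the deadline. In this model the failure count depends only on per-account concurrency, never on fleet size, which is the shape of the symptom described above.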
Why this is the #1 risk for OpenRouter. OpenRouter funnels all its traffic through a single service account → a single balance row, so it is maximally exposed to this contention regardless of how big the provider fleet is.
Evidence. Burst B (max_tokens=512, 80 concurrent): 52 of 272 requests (19.1%) returned 503 and none returned 429; at every 503 the fleet had 33–44 cold idle providers, 60–65 routable, queue 0/10, budget 100%, and warm rose 10→32 (the router was actively warming). Burst matrix (Appendix A): the billing_503 count tracks concurrency at short max_tokens (512: 0→2→3→38 across conc 8/16/32/64) and vanishes at 4k/32k, where each worker fires ~1 long request.
Suggested fixes (pick one+): retry-on-serialization-failure inside Debit with small backoff; shard/partition the balance row (or use an append-only ledger + periodic settle instead of a hot single-row UPDATE); raise the Debit statement timeout and size the pool for expected per-account concurrency; treat reservation as best-effort/async for trusted service-tier accounts.
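The first suggested fix could look roughly like the Go sketch below. It is a minimal illustration, assuming the store can tell retryable errors (serialization failure, lock timeout) apart from real failures; errSerialization and debitWithRetry are hypothetical names, and the actual Debit code may be structured differently.

```go
package main

import (
	"errors"
	"time"
)

// errSerialization stands in for a retryable database error such as a
// serialization failure or lock timeout. Hypothetical sentinel: a real store
// would inspect the driver's error code instead.
var errSerialization = errors.New("serialization failure")

// debitWithRetry calls debit up to maxAttempts times, sleeping with a doubling
// backoff between attempts. It retries only on errSerialization; any other
// error (for example insufficient balance) is returned immediately.
// It returns the number of attempts made and the final error.
func debitWithRetry(debit func() error, maxAttempts int, baseBackoff time.Duration) (attempts int, err error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	backoff := baseBackoff
	for attempts = 1; attempts <= maxAttempts; attempts++ {
		err = debit()
		if err == nil || !errors.Is(err, errSerialization) {
			return attempts, err
		}
		if attempts == maxAttempts {
			break // out of attempts: surface the last retryable error
		}
		time.Sleep(backoff)
		backoff *= 2
	}
	return attempts, err
}
```

The retry budget has to stay well inside the request's own deadline, so a small attempt count and a backoff of a few milliseconds is probably the right order of magnitude. Retries reduce spurious 503s but do not remove the hot row, so this pairs naturally with the sharding or append-only options.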
Note: CLAUDE.md → Common Pitfalls still says "Postgres store exists but is not used in production yet." That is stale — the infra table lists Prod DB = AWS RDS PostgreSQL, and this 503 (only producible by the Postgres Debit path, never by the in-memory store) was observed live in prod. Worth correcting the doc.
Issue 2: output-token rate limit charges the full bounded max_tokens upfront
Symptom. At high max_tokens the request 429s with output_tokens rate limit exceeded while the fleet is idle. Reasoning models that request a large budget but rarely consume it are penalized.
Root cause (pre-router). coordinator/api/server.go:376-381 (applyTokenRateLimit) admits "using the upfront input estimate and the bounded max_tokens" — i.e. it charges the full worst-case output budget against the per-account OTPM bucket before the request runs. Buckets (coordinator/ratelimit/config.go): consumer OTPM 500_000/burst 64_000 (:61-62); service OTPM 5_000_000/burst 512_000 (:67-68). At 32k max_tokens the consumer burst holds <2 requests; on the service tier the wall appears right at ~524k aggregate (conc16 × 32k).
Evidence. 32k×conc8 (~262k tokens) is clean; OTPM-429 only appears at 32k×conc16 (~524k ≈ the 512k service burst). The 8× headroom vs an earlier phase confirms the test key is now on the service tier (good for OpenRouter) — but the upfront full-reservation model still over-charges.
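The burst arithmetic behind these numbers can be checked with a short Go sketch. It is a simplification that ignores bucket refill during the burst and the input-token estimate; worstCaseAdmits is a hypothetical helper, not the coordinator's accounting.

```go
package main

// worstCaseAdmits returns how many simultaneous requests fit in a token-bucket
// burst when each request is charged its full max_tokens upfront.
// Simplified assumption: no refill during the burst and no other charges.
func worstCaseAdmits(burstTokens, maxTokens int) int {
	if burstTokens <= 0 || maxTokens <= 0 {
		return 0
	}
	return burstTokens / maxTokens
}
```

With max_tokens = 32_768, worstCaseAdmits(64_000, 32_768) is 1 for the consumer burst and worstCaseAdmits(512_000, 32_768) is 15 for the service burst. That matches the observation: 8 concurrent 32k requests (262,144 tokens) fit, while 16 (524,288 tokens) exceed the 512,000-token burst.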
Suggested fixes: charge an expected output length (e.g. EWMA of recent completions, or a fraction of max_tokens) and reconcile actual usage afterward, instead of reserving full worst-case max_tokens; this is the same over-reservation pattern as Issue 3.
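One way to sketch the "charge expected, reconcile later" idea in Go is an EWMA over completed output lengths. Everything here is an assumption for illustration: the outputEstimator type, the smoothing factor, and the choice to fall back to full max_tokens until history exists.

```go
package main

// outputEstimator tracks an exponentially weighted moving average (EWMA) of
// completed output lengths. Hypothetical type, not coordinator code.
type outputEstimator struct {
	alpha float64 // weight of the newest sample, in (0, 1]
	mean  float64 // current EWMA of output tokens
	seen  bool    // false until the first completion is observed
}

// observe folds one completed request's actual output length into the EWMA.
func (e *outputEstimator) observe(actualTokens int) {
	if !e.seen {
		e.mean, e.seen = float64(actualTokens), true
		return
	}
	e.mean = e.alpha*float64(actualTokens) + (1-e.alpha)*e.mean
}

// reserve returns the tokens to charge upfront: the EWMA capped at maxTokens,
// or the full maxTokens while there is no history to estimate from.
func (e *outputEstimator) reserve(maxTokens int) int {
	if !e.seen || e.mean >= float64(maxTokens) {
		return maxTokens
	}
	return int(e.mean + 0.5) // round to nearest token
}

// reconcile returns the correction to apply to the bucket after completion:
// positive means charge more, negative means refund.
func reconcile(reserved, actualTokens int) int {
	return actualTokens - reserved
}
```

The estimator would likely need to be keyed per account and per model, since reasoning and non-reasoning workloads have very different output lengths. A plain mean under-reserves for heavy-tailed outputs, so a high percentile or a safety multiplier may be a better upfront charge than the EWMA alone.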
Issue 3: coordinator discards the provider's real KV size and reserves worst-case max_tokens (the size/KV-aware routing gap)
What the provider already gives us. Providers compute architecture-exact per-token KV cost and report it every heartbeat: coordinator/protocol/messages.go:208 declares KVBytesPerToken int64 `json:"kv_bytes_per_token,omitempty"` (comment: "provider-side only").
What the coordinator does with it. Nothing. snapshotProviderLocked (coordinator/registry/scheduler.go:661) never captures KVBytesPerToken. The per-request admission freeMemoryAdmits (scheduler.go:723-734) reserves requestTokens = reqPromptTokens + reqMaxTokens (full worst-case max_tokens) against activeTokenBudgetMax; the legacy fallback path uses a flat constant kvCacheBytesPerToken = 400_000 (scheduler.go:39-49, used at :752) regardless of model architecture.
Consequence. Long-budget requests over-reserve memory, so each warm provider admits fewer concurrent requests than it could actually serve → load concentrates onto a few slots → the TTFT/TTFB tail and the gemma stall in Issue 5. This is exactly the "route by request size + how much KV cache it needs, not min-2×" behavior we want.
Suggested fixes: (A) reserve expected output length, not full max_tokens; (B) consume the provider-reported kv_bytes_per_token (capture it in snapshotProviderLocked, use it in freeMemoryAdmits) instead of the flat 400k constant.
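Fixes (A) and (B) combine into one small calculation, sketched below in Go. kvReservationBytes is a hypothetical helper with an assumed signature; only the 400_000 fallback constant comes from the code cited above, and the per-token cost in the usage note is invented for illustration.

```go
package main

// flatKVBytesPerToken mirrors the flat fallback constant cited above.
const flatKVBytesPerToken int64 = 400_000

// kvReservationBytes estimates the KV-cache bytes to reserve for one request.
// It prefers the provider-reported per-token cost and falls back to the flat
// constant when the provider reported nothing (zero or negative).
// Hypothetical helper: the name and signature are assumptions.
func kvReservationBytes(promptTokens, expectedOutputTokens, reportedKVBytesPerToken int64) int64 {
	perToken := reportedKVBytesPerToken
	if perToken <= 0 {
		perToken = flatKVBytesPerToken
	}
	return (promptTokens + expectedOutputTokens) * perToken
}
```

For a 1,000-token prompt with 2,000 expected output tokens, the flat constant reserves 1.2 GB. If a provider reported, say, 100,000 bytes per token, the same request would reserve 0.3 GB, so the same memory budget would admit four times as many such requests.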
Issue 4: coordinator admission memory gate is looser than the provider's load gate
Root cause. The weights-fit fallback modelFitsHardware (coordinator/registry/scheduler.go:142-150) admits when modelSizeGB * modelMemoryHeadroomFactor <= totalMemoryGB, with modelMemoryHeadroomFactor = 2.0 (scheduler.go:126-133). But the provider's ensureModelLoaded requires estimatedMemoryGb * 3.0 headroom (documented in CLAUDE.md → State Model). So the coordinator can route to a cold node that then fails to load the model and 503s "insufficient memory" → cooldown → reroute.
Status. This path did not fire in the bursts (the Issue 1/2 admission guards shed first), so it's a structural latent risk rather than an observed failure. Worth a targeted test (cold-fleet, large model, force a load) before relying on the ×2 gate.
Suggested fix: align the coordinator fallback factor with the provider's (≈×3, or better, weights + real KV from Issue 3) so the coordinator never admits a node the provider will reject.
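The window in which the two gates disagree can be written down directly. The Go sketch below assumes the provider's memory estimate equals modelSizeGB and is compared against the same total memory figure; both are simplifying assumptions, and the function names are hypothetical.

```go
package main

const (
	coordinatorHeadroomFactor = 2.0 // modelMemoryHeadroomFactor cited above
	providerHeadroomFactor    = 3.0 // provider load-gate factor cited above
)

// coordinatorAdmits mirrors the weights-fit check described above.
func coordinatorAdmits(modelSizeGB, totalMemoryGB float64) bool {
	return modelSizeGB*coordinatorHeadroomFactor <= totalMemoryGB
}

// providerLoads approximates the provider gate. Assumption: the provider's
// estimate equals modelSizeGB and is checked against total memory.
func providerLoads(modelSizeGB, totalMemoryGB float64) bool {
	return modelSizeGB*providerHeadroomFactor <= totalMemoryGB
}

// gateMismatch reports nodes the coordinator would route to but the provider
// would refuse to load on.
func gateMismatch(modelSizeGB, totalMemoryGB float64) bool {
	return coordinatorAdmits(modelSizeGB, totalMemoryGB) &&
		!providerLoads(modelSizeGB, totalMemoryGB)
}
```

For an illustrative 16 GB model, any node with at least 32 GB but less than 48 GB falls in the mismatch window: gateMismatch(16, 36) is true, while gateMismatch(16, 64) and gateMismatch(16, 24) are false. A check like this could seed the targeted cold-fleet test suggested above.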
Issue 5 (symptom of Issue 3): gemma-4-26b long-request stall on a small running-slot set
Observation (live, this run). While gpt-oss-20b keeps a healthy running set (warm median 13, max 22), gemma-4-26b frequently sits at warm=3 with 56 cold idle and 7–8 active — long requests pile onto a few running slots while dozens of capable nodes stay cold. Result: gemma TTFB/TTFT tail stretches to ~123s under mixed load, and long text-reasoning evals occasionally stall out; gemma vision requests (short outputs) sail through unaffected.
Warm distribution over the run (capacity warm_providers = slots with State=="running" only; an "idle" loaded slot counts as cold here): gemma min=1 / median=8 / max=15; gpt-oss min=0 / median=13 / max=22.
Why. Bigger model (fewer hosts can load it) + worst-case max_tokens reservation (Issue 3) → each warm gemma slot admits fewer concurrent requests → demand concentrates. Smaller per-request reservations (Issue 3 fix) and/or more aggressive pre-warming for gemma would widen the effective slot count.
Appendix A — burst matrix (gpt-oss-20b, on top of the 24/7 loops; every error classified)
| max_tok | conc | reqs | err% | latency p50/p95 | clusterTPS | errors |
|---|---|---|---|---|---|---|
| 512 | 8 | 29 | 0% | 11.6s / 14.3s | 303 | none |
| 512 | 16 | 47 | 4.3% | 15.0s / 22.0s | 429 | billing_503 ×2 |
| 512 | 32 | 99 | 3.0% | 17.6s / 21.8s | 843 | billing_503 ×3 |
| 512 | 64 | 182 | 20.9% | 19.5s / 28.6s | 1246 | billing_503 ×38 |
| 4096 | 8 | 8 | 0% | 43s / 46s | 704 | none |
| 4096 | 16 | 16 | 0% | 60s / 69s | 946 | none |
| 4096 | 32 | 32 | 0% | 68s / 89s | 1459 | none |
| 32768 | 4 | 4 | 0% | 82s / 82s | 248 | none |
| 32768 | 8 | 8 | 0% | 120s / 123s | 336 | none |
| 32768 | 16 | 22 | 36.4% | 87s / 119s | 170 | otpm_429 ×6, client ×2 |
Error taxonomy: billing_503 = Issue 1; otpm_429 = Issue 2; dispatch_503/capacity_429 = genuine router/fleet path = never observed.
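For anyone reproducing the matrix, the taxonomy can be applied mechanically. The Go sketch below is a hypothetical harness helper, not the script used for this run; the match strings are the response fragments quoted in this issue, and status -1 is the client-side timeout code listed in Appendix C.

```go
package main

import "strings"

// classifyError maps an HTTP status and response body to the error labels
// used in the burst matrix. Hypothetical helper: real bodies may need
// JSON parsing rather than substring matching.
func classifyError(status int, body string) string {
	switch {
	case status == 503 && strings.Contains(body, "service_unavailable"):
		return "billing_503" // Issue 1: pre-router balance reservation
	case status == 503 && strings.Contains(body, "inference failed after"):
		return "dispatch_503" // genuine router/fleet path
	case status == 429 && strings.Contains(body, "output_tokens rate limit exceeded"):
		return "otpm_429" // Issue 2: upfront output-token charge
	case status == 429 && strings.Contains(body, "all providers at capacity"):
		return "capacity_429" // genuine router/fleet path
	case status == -1:
		return "client" // client-side timeout, never sent by the server
	default:
		return "other"
	}
}
```

Keeping an explicit "other" bucket matters: any response that lands there is unclassified and should be inspected by hand before it is attributed to either admission guard or to the router.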
Appendix B — benchmark quality (stable across the run; routing adds no variance)
| model | benchmark | mean | n |
|---|---|---|---|
| gemma-4-26b | MMStar (vision) | 0.456 | 73 |
| gemma-4-26b | MathVista (vision) | 0.618 | 77 |
| gemma-4-26b | MMMU (vision) | 0.745 | 77 |
| gpt-oss-20b | MMLU | ~0.92 | 6 |
| gpt-oss-20b | AgentDojo security / utility | 0.554 / 0.153 | 27 |
| gpt-oss-20b | SuperGPQA | ~0.51 | 7 |
(gpt-oss IFEval 0.70 strict / 0.72 loose; GPQA-Diamond ~0.581. AgentDojo utility 0.15 is a model-capability signal, not a network issue.)
Appendix C — test-harness artifacts (NOT product bugs; listed to avoid muddying triage)
- OpenBench math 404 model_not_found: the math task's scorer uses a grader_model defaulting to an OpenAI GPT-4 model not hosted on darkbloom; pass -T grader_model=<a darkbloom model> (an openbench/inspect config flag). Not a coordinator bug.
- Docker disk exhaustion on the test box: SWE-bench/terminal-bench container images fill local disk (vfs leak) — a property of my test harness, not the network.
- Client-side timeouts (code: -1): a handful of aiohttp client timeouts on the longest 32k generations; the server returned 0 such errors.
Closing note
A KV/size-aware admission design (Issue 3, Gaps A+B) has been drafted with exact file:line and is ready to turn into a PR on request. The single highest-leverage fix for the OpenRouter launch is Issue 1 (single-account billing-debit contention), since OpenRouter concentrates all traffic on one balance row.