Load-test findings: requests shed at per-account admission guards before routing while fleet sits idle (+ KV/size-aware routing gaps) #342

## Summary

Over a multi-day, sustained dual-model load test against prod (`api.darkbloom.dev`, models `gpt-oss-20b` and `gemma-4-26b`), the **router itself never shed a single request** and **output quality never degraded**. Every rejection under load came from **two per-account admission guards that run *before* provider selection**, while the fleet had idle, capable providers, an empty queue, and 100% token budget free. This is the "machines are available but work isn't being done" symptom: the request is rejected before it ever reaches routing.

This issue documents (1) the two pre-router shedding mechanisms, (2) the KV/size-aware admission gaps that concentrate long requests onto too few warm slots, and (3) a memory-gate mismatch, each grounded in code with file:line references. It also lists what is healthy and explicitly separates out test-harness artifacts that are **not** product bugs.

> Methodology: external OpenAI-compatible API client (key + base URL only). Capacity/warm-cold sampled from the public `GET /v1/models/capacity` every 3s; each error correlated with the fleet snapshot at the instant it fired. No coordinator code was changed to produce these findings.
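
The error-to-snapshot correlation described in the methodology can be sketched as a lookup of the latest capacity sample taken at or before each error. This is a minimal illustration, not the harness code: the `Snapshot` type and its fields are hypothetical, and the real `/v1/models/capacity` response schema is not reproduced here.

```go
package main

import (
	"fmt"
	"sort"
)

// Snapshot is a hypothetical, reduced view of one capacity sample.
type Snapshot struct {
	AtUnix int64 // sample time, seconds since epoch
	Warm   int   // providers counted as warm at that instant
	Cold   int   // providers counted as cold idle at that instant
	Queued int   // queue depth at that instant
}

// snapshotAt returns the latest snapshot taken at or before tUnix.
// snaps must be sorted by AtUnix ascending. ok is false when every
// snapshot is newer than tUnix (or snaps is empty).
func snapshotAt(snaps []Snapshot, tUnix int64) (s Snapshot, ok bool) {
	// Index of the first snapshot strictly after tUnix.
	i := sort.Search(len(snaps), func(i int) bool { return snaps[i].AtUnix > tUnix })
	if i == 0 {
		return Snapshot{}, false
	}
	return snaps[i-1], true
}

func main() {
	// Samples every 3s, as in the methodology; the values are made up.
	snaps := []Snapshot{
		{AtUnix: 0, Warm: 10, Cold: 44},
		{AtUnix: 3, Warm: 12, Cold: 42},
		{AtUnix: 6, Warm: 15, Cold: 39},
	}
	s, ok := snapshotAt(snaps, 4) // an error observed at t=4s
	fmt.Println(ok, s.AtUnix, s.Warm, s.Cold)
}
```

With a 3s sampling interval, the matched snapshot is at most about 3s older than the error it is attached to.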

---

## What's healthy (baseline — so this is balanced)

- **Routing never shed**: zero `dispatch_503` (`provider_error`, "inference failed after N attempts") and zero `capacity_429` ("all providers at capacity") across every burst and the entire sustained run.
- **Median routing is excellent**: TTFB (first streamed token = routing + prefill) p50 ≈ **2s** on both models throughout, even under continuous load.
- **Decode throughput is flat**: ~60–74 tok/s/stream (gpt-oss), ~40–60 tok/s/stream (gemma) even on multi-minute 32k generations; cluster aggregate scaled to ~1,090–1,459 tok/s under burst.
- **No quality variance from routing**: benchmark scores were stable across repeated runs (73–77 runs for each gemma vision benchmark, 6–27 for the gpt-oss benchmarks; e.g. MMStar 0.44–0.46, gpt-oss MMLU 0.90–0.93). See Appendix B.

---

## Issue 1 (highest priority for the OpenRouter listing): per-account billing-debit returns 503 under concurrency

**Symptom.** Under concurrency from a single API key, a fraction of requests get `503 {"code":"service_unavailable"}` ("service temporarily unavailable") while capacity shows idle cold providers, an empty queue, and 100% budget. The 503 rate scales with **request rate**, not fleet load, and appears even at low concurrency (2–3 such 503s per 15-min stage at concurrency 3–4).

**Root cause (pre-router).** The pre-flight balance reservation is the only producer of that body:

- `coordinator/api/consumer.go:1452-1461` — `s.ledger.Charge(consumerKey, reservedMicroUSD, ...)`; on a **non-balance** DB error it calls `s.writeServiceUnavailable(w, model)` (`consumer.go:939`). (Same pattern repeats at `consumer.go:3937-3943` and `:4195`.)
- `s.ledger.Charge` → `store.Debit` is a single-row `UPDATE balances ... WHERE account_id=$1` under a short timeout — `coordinator/store/postgres.go:1781` (`Debit`), SQL at `:1792`.

All traffic from one key → **one `account_id` row** → row-lock serialization + connection-pool pressure under concurrency → some debits exceed the timeout → DB error → 503. This runs *before* routing, so idle providers never get a chance.

**Why this is the #1 risk for OpenRouter.** OpenRouter funnels all its traffic through a **single service account → a single balance row**, so it is maximally exposed to this contention regardless of how big the provider fleet is.

**Evidence.** Burst B (max_tokens=512, 80 concurrent): 52 of 272 requests (19.1%) returned 503 and **none returned 429**. At every 503 the fleet had 33–44 cold idle providers, 60–65 routable, queue 0/10, and budget 100%, and warm *rose* from 10 to 32 (the router was actively warming). Burst matrix (Appendix A): the `billing_503` count tracks concurrency at short max_tokens (512: 0→2→3→38 across conc 8/16/32/64) and vanishes at 4k/32k, where each worker fires ~1 long request.

**Suggested fixes (pick one+):** retry-on-serialization-failure inside `Debit` with small backoff; shard/partition the balance row (or use an append-only ledger + periodic settle instead of a hot single-row UPDATE); raise the `Debit` statement timeout and size the pool for expected per-account concurrency; treat reservation as best-effort/async for trusted service-tier accounts.
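
As an illustration of the first option, here is a minimal sketch of a bounded retry with exponential backoff around a debit call. Everything in it is hypothetical: `retryDebit`, `errTransient`, and `runFlaky` are illustrative names, and how the real `store.Debit` surfaces lock timeouts or serialization failures is an assumption that would need checking against the driver's error types.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errTransient stands in for a retryable database error (lock timeout,
// serialization failure). The real classification is an assumption.
var errTransient = errors.New("transient db error")

// retryDebit calls debit up to attempts times, sleeping base, 2*base,
// 4*base, ... between tries. Only errTransient is retried; any other
// error (for example insufficient balance) is returned immediately.
func retryDebit(ctx context.Context, attempts int, base time.Duration, debit func(context.Context) error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		err = debit(ctx)
		if err == nil || !errors.Is(err, errTransient) {
			return err
		}
		if i == attempts-1 {
			break // no sleep after the final attempt
		}
		select {
		case <-time.After(base << i):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return err
}

// runFlaky simulates a debit that fails transiently for the first
// `failures` calls, then succeeds. It reports success and call count.
func runFlaky(failures, attempts int) (ok bool, calls int) {
	debit := func(context.Context) error {
		calls++
		if calls <= failures {
			return errTransient
		}
		return nil
	}
	err := retryDebit(context.Background(), attempts, time.Millisecond, debit)
	return err == nil, calls
}

func main() {
	ok, calls := runFlaky(2, 4)
	fmt.Println(ok, calls) // prints "true 3"
}
```

A production version would likely add jitter to the backoff and keep the total retry time well inside the request's own deadline, so that a retry never turns a fast 503 into a slow one.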

> Note: `CLAUDE.md` → Common Pitfalls still says *"Postgres store exists but is not used in production yet."* That is **stale** — the infra table lists Prod DB = AWS RDS PostgreSQL, and this 503 (only producible by the Postgres `Debit` path, never by the in-memory store) was observed live in prod. Worth correcting the doc.

---

## Issue 2: output-token rate limit charges the full bounded `max_tokens` upfront

**Symptom.** At high `max_tokens` the request 429s with `output_tokens rate limit exceeded` while the fleet is idle. Reasoning models that request a large budget but rarely consume it are penalized.

**Root cause (pre-router).** `coordinator/api/server.go:376-381` (`applyTokenRateLimit`) admits "using the upfront input estimate and the **bounded `max_tokens`**" — i.e. it charges the full worst-case output budget against the per-account OTPM bucket before the request runs. Buckets (`coordinator/ratelimit/config.go`): consumer OTPM `500_000`/burst `64_000` (`:61-62`); service OTPM `5_000_000`/burst `512_000` (`:67-68`). At 32k max_tokens the consumer burst holds <2 requests; on the service tier the wall appears right at ~524k aggregate (conc16 × 32k).
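
The burst arithmetic above can be checked with a few lines. This sketch uses only the burst sizes quoted from `config.go`; it deliberately ignores bucket refill during the burst and the input-token estimate, so it is a lower bound on what a full bucket admits instantaneously, not a model of the limiter.

```go
package main

import "fmt"

// Burst sizes as quoted above from coordinator/ratelimit/config.go.
const (
	consumerBurst = 64_000
	serviceBurst  = 512_000
)

// maxAdmitted returns how many requests fit in a full bucket when each
// request is charged its entire max_tokens upfront.
func maxAdmitted(burst, maxTokens int) int {
	return burst / maxTokens
}

func main() {
	for _, mt := range []int{512, 4096, 32768} {
		fmt.Printf("max_tokens=%d consumer=%d service=%d\n",
			mt, maxAdmitted(consumerBurst, mt), maxAdmitted(serviceBurst, mt))
	}
	// Prints:
	// max_tokens=512 consumer=125 service=1000
	// max_tokens=4096 consumer=15 service=125
	// max_tokens=32768 consumer=1 service=15
}
```

At 32k the service bucket admits 15 simultaneous requests, which matches the observation that conc 8 is clean and conc 16 starts to 429.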

**Evidence.** 32k×conc8 (~262k tokens) is clean; OTPM-429 only appears at 32k×conc16 (~524k ≈ the 512k service burst). The 8× headroom vs an earlier phase confirms the test key is now on the service tier (good for OpenRouter) — but the *upfront full-reservation* model still over-charges.

**Suggested fixes:** charge an **expected** output length (e.g. EWMA of recent completions, or a fraction of max_tokens) and reconcile actual usage afterward, instead of reserving full worst-case `max_tokens`; this is the same over-reservation pattern as Issue 3.
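
One possible shape for the reserve-then-reconcile idea is sketched below. It is not a proposal for the actual `applyTokenRateLimit` signature: `OutputEstimator`, `Reserve`, `Reconcile`, the smoothing factor, and the safety multiplier are all hypothetical names and values chosen for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// OutputEstimator tracks an exponentially weighted moving average (EWMA)
// of completion lengths. It is not safe for concurrent use as written.
type OutputEstimator struct {
	alpha  float64 // weight of the newest sample, in (0, 1]
	mean   float64
	seeded bool
}

func NewOutputEstimator(alpha float64) *OutputEstimator {
	return &OutputEstimator{alpha: alpha}
}

// Observe folds one completed request's actual output length into the EWMA.
func (e *OutputEstimator) Observe(actualTokens int) {
	x := float64(actualTokens)
	if !e.seeded {
		e.mean, e.seeded = x, true
		return
	}
	e.mean = e.alpha*x + (1-e.alpha)*e.mean
}

// Reserve returns the tokens to charge upfront: the EWMA scaled by safety
// and capped at maxTokens. With no history it falls back to maxTokens,
// which is the current worst-case behavior.
func (e *OutputEstimator) Reserve(maxTokens int, safety float64) int {
	if !e.seeded {
		return maxTokens
	}
	r := int(math.Ceil(e.mean * safety))
	if r > maxTokens {
		return maxTokens
	}
	return r
}

// Reconcile returns the signed correction to apply to the bucket after the
// request completes: positive charges more, negative refunds.
func Reconcile(reserved, actual int) int {
	return actual - reserved
}

func main() {
	est := NewOutputEstimator(0.5)
	est.Observe(1000)
	est.Observe(3000) // EWMA is now 2000
	reserved := est.Reserve(32768, 1.5)
	fmt.Println(reserved, Reconcile(reserved, 2500)) // prints "3000 -500"
}
```

The trade-off is that a request that overruns its reservation is only charged after the fact, so the bucket can briefly go further into debt than under full upfront reservation.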

---

## Issue 3: coordinator discards the provider's real KV size and reserves worst-case `max_tokens` (the size/KV-aware routing gap)

**What the provider already gives us.** Providers compute architecture-exact per-token KV cost and report it every heartbeat: `coordinator/protocol/messages.go:208` — `KVBytesPerToken int64 \`json:"kv_bytes_per_token,omitempty"\`` (comment: *"provider-side only"*).

**What the coordinator does with it.** Nothing. `snapshotProviderLocked` (`coordinator/registry/scheduler.go:661`) never captures `KVBytesPerToken`. The per-request admission `freeMemoryAdmits` (`scheduler.go:723-734`) reserves `requestTokens = reqPromptTokens + reqMaxTokens` (full worst-case max_tokens) against `activeTokenBudgetMax`; the legacy fallback path uses a **flat constant** `kvCacheBytesPerToken = 400_000` (`scheduler.go:39-49`, used at `:752`) regardless of model architecture.

**Consequence.** Long-budget requests over-reserve memory, so each warm provider admits fewer concurrent requests than it could actually serve. Load then concentrates onto a few slots, which produces the TTFT/TTFB tail and the gemma stall in Issue 5. Removing the over-reservation gives exactly the "route by request size + how much KV cache it needs, not min-2×" behavior we want.

**Suggested fixes:** (A) reserve **expected** output length, not full `max_tokens`; (B) consume the provider-reported `kv_bytes_per_token` (capture it in `snapshotProviderLocked`, use it in `freeMemoryAdmits`) instead of the flat 400k constant.
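
To show how much fixes (A) and (B) can change a single reservation, here is a small sketch that works in bytes. It is not the real `freeMemoryAdmits`, which per the text above reserves tokens against `activeTokenBudgetMax` on its main path; `reservationBytes` and `admits` are hypothetical helpers, and the reported per-token value used in the example is illustrative, not a measured figure for either model.

```go
package main

import "fmt"

// fallbackKVBytesPerToken mirrors the flat constant quoted above.
const fallbackKVBytesPerToken int64 = 400_000

// reservationBytes estimates KV memory for one request. reportedKV is the
// provider's kv_bytes_per_token; zero or negative means "not reported",
// in which case the flat fallback is used.
func reservationBytes(promptTokens, outputTokens int, reportedKV int64) int64 {
	perToken := reportedKV
	if perToken <= 0 {
		perToken = fallbackKVBytesPerToken
	}
	return int64(promptTokens+outputTokens) * perToken
}

// admits reports whether the reservation fits in the provider's free bytes.
func admits(freeBytes int64, promptTokens, outputTokens int, reportedKV int64) bool {
	return reservationBytes(promptTokens, outputTokens, reportedKV) <= freeBytes
}

func main() {
	// Current behavior: full max_tokens and the flat constant.
	worst := reservationBytes(1000, 32768, 0)
	// Gaps A+B: expected output (2000, illustrative) and a reported
	// per-token cost (100_000 bytes, illustrative).
	sized := reservationBytes(1000, 2000, 100_000)
	fmt.Println(worst, sized) // prints "13507200000 300000000"

	free := int64(1_000_000_000) // 1 GB free, illustrative
	fmt.Println(admits(free, 1000, 32768, 0), admits(free, 1000, 2000, 100_000))
	// prints "false true"
}
```

Under these illustrative numbers the same request goes from a 13.5 GB reservation to 0.3 GB, which is the difference between one admitted request and many on the same slot.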

---

## Issue 4: coordinator admission memory gate is looser than the provider's load gate

**Root cause.** The weights-fit fallback `modelFitsHardware` (`coordinator/registry/scheduler.go:142-150`) admits when `modelSizeGB * modelMemoryHeadroomFactor <= totalMemoryGB`, with `modelMemoryHeadroomFactor = 2.0` (`scheduler.go:126-133`). But the provider's `ensureModelLoaded` requires `estimatedMemoryGb * 3.0` headroom (documented in `CLAUDE.md` → State Model). So the coordinator can route to a cold node that then fails to load the model and 503s "insufficient memory" → cooldown → reroute.

**Status.** This path did **not** fire in the bursts (the Issue 1/2 admission guards shed first), so it's a structural latent risk rather than an observed failure. Worth a targeted test (cold-fleet, large model, force a load) before relying on the ×2 gate.

**Suggested fix:** align the coordinator fallback factor with the provider's (≈×3, or better, weights + real KV from Issue 3) so the coordinator never admits a node the provider will reject.
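
The mismatch window can be made concrete with a small sketch. It assumes, for illustration only, that both gates compare the same model size against the same total-memory figure; the provider's `ensureModelLoaded` may in practice use a different estimate or available rather than total memory. The sizes below are made up and are not the actual footprints of either model.

```go
package main

import "fmt"

const (
	coordinatorFactor = 2.0 // modelMemoryHeadroomFactor, per the text above
	providerFactor    = 3.0 // provider-side multiplier, per CLAUDE.md as quoted
)

// fits applies a "weights times factor" gate against total memory.
func fits(modelGB, totalGB, factor float64) bool {
	return modelGB*factor <= totalGB
}

// mismatch reports whether the coordinator gate would admit a node that
// the provider gate would then reject.
func mismatch(modelGB, totalGB float64) bool {
	return fits(modelGB, totalGB, coordinatorFactor) &&
		!fits(modelGB, totalGB, providerFactor)
}

func main() {
	// A hypothetical 20 GB model on three hypothetical node sizes.
	for _, total := range []float64{32, 48, 64} {
		fmt.Printf("total=%vGB mismatch=%v\n", total, mismatch(20, total))
	}
	// Prints:
	// total=32GB mismatch=false
	// total=48GB mismatch=true
	// total=64GB mismatch=false
}
```

Under this assumption, any node whose memory is at least 2× but less than 3× the model size falls into the window where the coordinator routes and the provider refuses.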

---

## Issue 5 (symptom of Issue 3): `gemma-4-26b` long-request stall on a small running-slot set

**Observation (live, this run).** While `gpt-oss-20b` keeps a healthy running set (warm median 13, max 22), `gemma-4-26b` frequently sits at **warm=3 with 56 cold idle and 7–8 active** — long requests pile onto a few running slots while dozens of capable nodes stay cold. Result: gemma TTFB/TTFT tail stretches to ~123s under mixed load, and long text-reasoning evals occasionally stall out; gemma **vision** requests (short outputs) sail through unaffected.

Warm distribution over the run (capacity `warm_providers` = slots with `State=="running"` only; an `"idle"` loaded slot counts as cold here): gemma min=1 / median=8 / max=15; gpt-oss min=0 / median=13 / max=22.

**Why.** Bigger model (fewer hosts can load it) + worst-case `max_tokens` reservation (Issue 3) → each warm gemma slot admits fewer concurrent requests → demand concentrates. Smaller per-request reservations (Issue 3 fix) and/or more aggressive pre-warming for gemma would widen the effective slot count.

---

## Appendix A — burst matrix (gpt-oss-20b, on top of the 24/7 loops; every error classified)

| max_tok | conc | reqs | err% | latency p50/p95 | clusterTPS | errors |
|--:|--:|--:|--:|--:|--:|--|
| 512 | 8 | 29 | 0% | 11.6s / 14.3s | 303 | — |
| 512 | 16 | 47 | 4.3% | 15.0s / 22.0s | 429 | billing_503 ×2 |
| 512 | 32 | 99 | 3.0% | 17.6s / 21.8s | 843 | billing_503 ×3 |
| 512 | 64 | 182 | 20.9% | 19.5s / 28.6s | 1246 | billing_503 ×38 |
| 4096 | 8 | 8 | 0% | 43s / 46s | 704 | — |
| 4096 | 16 | 16 | 0% | 60s / 69s | 946 | — |
| 4096 | 32 | 32 | 0% | 68s / 89s | 1459 | — |
| 32768 | 4 | 4 | 0% | 82s / 82s | 248 | — |
| 32768 | 8 | 8 | 0% | 120s / 123s | 336 | — |
| 32768 | 16 | 22 | 36.4% | 87s / 119s | 170 | otpm_429 ×6, client ×2 |

Error taxonomy: `billing_503` = Issue 1; `otpm_429` = Issue 2; `dispatch_503`/`capacity_429` = genuine router/fleet path = **never observed**.
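
For anyone reproducing the classification, the taxonomy can be expressed as a small client-side function over the HTTP status and response body, using the body strings quoted in this issue. This is a harness-side heuristic sketch, not a coordinator API; `classify` is a hypothetical name, and substring matching would need updating if the error bodies change.

```go
package main

import (
	"fmt"
	"strings"
)

// classify maps an HTTP status and response body onto the taxonomy above.
// A status of -1 represents a client-side timeout (see Appendix C).
func classify(status int, body string) string {
	switch {
	case status == -1:
		return "client"
	case status == 503 && strings.Contains(body, "service_unavailable"):
		return "billing_503" // Issue 1
	case status == 503 && strings.Contains(body, "provider_error"):
		return "dispatch_503" // genuine router path
	case status == 429 && strings.Contains(body, "output_tokens rate limit exceeded"):
		return "otpm_429" // Issue 2
	case status == 429 && strings.Contains(body, "all providers at capacity"):
		return "capacity_429" // genuine fleet path
	default:
		return "other"
	}
}

func main() {
	fmt.Println(classify(503, `{"code":"service_unavailable"}`)) // prints "billing_503"
	fmt.Println(classify(429, "output_tokens rate limit exceeded")) // prints "otpm_429"
}
```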

## Appendix B — benchmark quality (stable across the run; routing adds no variance)

| model | benchmark | mean | n |
|---|---|--:|--:|
| gemma-4-26b | MMStar (vision) | 0.456 | 73 |
| gemma-4-26b | MathVista (vision) | 0.618 | 77 |
| gemma-4-26b | MMMU (vision) | 0.745 | 77 |
| gpt-oss-20b | MMLU | ~0.92 | 6 |
| gpt-oss-20b | AgentDojo security / utility | 0.554 / 0.153 | 27 |
| gpt-oss-20b | SuperGPQA | ~0.51 | 7 |

(gpt-oss IFEval 0.70 strict / 0.72 loose; GPQA-Diamond ~0.581. AgentDojo *utility* 0.15 is a model-capability signal, not a network issue.)

## Appendix C — test-harness artifacts (NOT product bugs; listed to avoid muddying triage)

- **OpenBench `math` 404 `model_not_found`**: the `math` task's scorer uses a `grader_model` defaulting to an OpenAI GPT-4 model not hosted on darkbloom; pass `-T grader_model=<a darkbloom model>` (an openbench/inspect config flag). Not a coordinator bug.
- **Docker disk exhaustion on the test box**: SWE-bench/terminal-bench container images fill local disk (vfs leak) — a property of my test harness, not the network.
- **Client-side timeouts** (`code: -1`): a handful of aiohttp client timeouts on the longest 32k generations; the server returned 0 such errors.

---

### Closing note
A KV/size-aware admission design (Issue 3, Gaps A+B) has been drafted with exact file:line and is ready to turn into a PR on request. The single highest-leverage fix for the OpenRouter launch is **Issue 1** (single-account billing-debit contention), since OpenRouter concentrates all traffic on one balance row.

