fix: phantom request accumulation in cache_aware load tracking by JannikSt · Pull Request #23 · PrimeIntellect-ai/router

JannikSt · 2026-04-14T11:56:23Z

Fixes phantom request accumulation that causes workers to get locked out when using cache_aware routing policy.

remove double decrement of load counter on retryable failures
add 300s per-chunk timeout on streaming load tracking task (sends error to client on timeout)
add get_workers() accessor on Router for diagnostics
add 9 tests covering load counter invariants and phantom request detection

Note

Medium Risk
Touches core request routing/load-tracking paths for both retry and streaming responses; mistakes could miscount load or prematurely terminate long-lived streams.

Overview
Fixes phantom request/load-counter drift in the cache_aware policy by removing retry-path load cleanup that could double-decrement the per-worker load counter.

Hardens streaming load tracking by adding an inactivity timeout (300s) in the SSE forwarding task that closes stalled streams and ensures the worker load is decremented on exit.

Adds a Router::get_workers() diagnostic accessor and a new load_tracking_test.rs suite with unit + end-to-end tests asserting load counters return to zero across success, failure, streaming, and repeated-request scenarios.

^{Reviewed by Cursor Bugbot for commit 93b9417. Bugbot is set up for automated code reviews on this repo. Configure here.}

Note

Fix phantom request accumulation in cache-aware load tracking

Removes the manual per-worker load decrement in the retry loop of router.rs, which was causing double-decrements and phantom load accumulation under retryable failures.
Adds a 300-second inactivity timeout to the streaming forwarder task; the stream closes with an error if no data arrives within that window.
Ensures per-worker load is decremented exactly once across all exit paths (normal end, [DONE] detection, or timeout).
Adds a get_workers() method on Router to expose registered workers for diagnostics.
Adds integration and unit tests in tests/load_tracking_test.rs covering load counter underflow, double-decrement drift, and end-to-end load returning to zero after streaming, non-streaming, and failed requests.
Behavioral Change: retryable failures no longer trigger a load decrement at the retry site; only the request completion path adjusts worker load.

^{Macroscope summarized 93b9417.}

- remove double decrement on retryable request failures - add 5min timeout to streaming load tracking spawned task - add 9 tests for phantom request detection

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9b69db8f1f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

macroscopeapp · 2026-04-14T12:04:05Z

Approvability

Verdict: Needs human review

1 blocking correctness issue found. Multiple reviewers have identified that the integration tests always pass trivially due to a downcast bug (get_total_internal_load always returns 0), meaning the phantom request detection tests provide no regression protection. Additionally, the new 300-second streaming timeout exits silently without notifying clients, which could cause them to treat partial output as successful completion.

^{You can customize Macroscope's approvability policy. Learn more.}

…owncast

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 013bb60. Configure here.}

Includes #23 fix for phantom request accumulation in cache_aware load tracking — unblocks cache_aware and power_of_two routing policies that depend on accurate per-worker load counters.

#23 removed the retry-cleanup decrement in route_typed_request, expecting send_typed_request to own load lifecycle. The early-return 500 path (input-validation rewrite branch) only decrements when status is rewritten to 400 — for actual 500s it skips, so each retry attempt's increment leaks into phantom load. Always decrement on this path when load_incremented; fixes the two pre-existing load_tracking_test failures. Also drops a few 'billing' references from the new comments.

* fix: attribute transparent-proxy traffic to run_id Catch-all transparent proxy was bypassing per-run usage counters, so renderer rollouts hitting vLLM /v1/generate (no explicit route on the router) appeared as 0 inference cost on training-run dashboards. Thread run_id through route_transparent and call the existing record_run_request + extract_and_record_usage_buffered helpers. * fix: keep SSE responses streaming in transparent proxy Initial transparent-proxy attribution buffered the entire upstream body when run_id was set, regressing /v1/generate (and any other catch-all SSE endpoint) from incremental token delivery to end-of-request delivery and inflating memory for long generations. Branch on Content-Type: text/event-stream → spawn forwarder + tee SseUsageExtractor (mirrors route_typed_request's streaming path). Non-SSE responses still buffer once for the JSON usage extraction. * fix: decrement load on genuine 500s in send_typed_request early return #23 removed the retry-cleanup decrement in route_typed_request, expecting send_typed_request to own load lifecycle. The early-return 500 path (input-validation rewrite branch) only decrements when status is rewritten to 400 — for actual 500s it skips, so each retry attempt's increment leaks into phantom load. Always decrement on this path when load_incremented; fixes the two pre-existing load_tracking_test failures. Also drops a few 'billing' references from the new comments.

fix: phantom request accumulation in cache_aware load tracking

9b69db8

- remove double decrement on retryable request failures - add 5min timeout to streaming load tracking spawned task - add 9 tests for phantom request detection

chatgpt-codex-connector Bot reviewed Apr 14, 2026

View reviewed changes

Comment thread src/routers/http/router.rs Outdated

macroscopeapp Bot reviewed Apr 14, 2026

View reviewed changes

Comment thread tests/load_tracking_test.rs

Comment thread src/routers/http/router.rs Outdated

cursor Bot reviewed Apr 14, 2026

View reviewed changes

Comment thread tests/load_tracking_test.rs

address review: 60s stream timeout with error propagation, fix test d…

013bb60

…owncast

cursor Bot reviewed Apr 14, 2026

View reviewed changes

Comment thread src/routers/http/router.rs Outdated

bump stream timeout to 300s for thinking models and long prefills

93b9417

JannikSt merged commit 551caba into main Apr 16, 2026
7 checks passed

JannikSt added a commit that referenced this pull request Apr 16, 2026

release: v0.1.22

459397d

Includes #23 fix for phantom request accumulation in cache_aware load tracking — unblocks cache_aware and power_of_two routing policies that depend on accurate per-worker load counters.

JannikSt mentioned this pull request Apr 16, 2026

release: v0.1.22 #25

Merged

JannikSt added a commit that referenced this pull request Apr 16, 2026

release: v0.1.22 (#25)

0a0d49e

Includes #23 fix for phantom request accumulation in cache_aware load tracking — unblocks cache_aware and power_of_two routing policies that depend on accurate per-worker load counters.

epappas mentioned this pull request Apr 23, 2026

L2 adapter: integrate vllm-project/router as dataplane for multi-replica and PD-disagg trials epappas/autoinfer#3

Open

7 tasks

JannikSt mentioned this pull request May 3, 2026

fix: attribute transparent-proxy traffic to run_id #26

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: phantom request accumulation in cache_aware load tracking#23

fix: phantom request accumulation in cache_aware load tracking#23
JannikSt merged 3 commits into
mainfrom
bugfix/phantom-request-load-tracking

JannikSt commented Apr 14, 2026 •

edited by macroscopeapp Bot

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

macroscopeapp Bot commented Apr 14, 2026

Uh oh!

cursor Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

JannikSt commented Apr 14, 2026 • edited by macroscopeapp Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Fix phantom request accumulation in cache-aware load tracking

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

macroscopeapp Bot commented Apr 14, 2026

Approvability

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

JannikSt commented Apr 14, 2026 •

edited by macroscopeapp Bot

Loading