fix: phantom request accumulation in cache_aware load tracking#23
Conversation
- remove double decrement on retryable request failures - add 5min timeout to streaming load tracking spawned task - add 9 tests for phantom request detection
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9b69db8f1f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
ApprovabilityVerdict: Needs human review 1 blocking correctness issue found. Multiple reviewers have identified that the integration tests always pass trivially due to a downcast bug (get_total_internal_load always returns 0), meaning the phantom request detection tests provide no regression protection. Additionally, the new 300-second streaming timeout exits silently without notifying clients, which could cause them to treat partial output as successful completion. You can customize Macroscope's approvability policy. Learn more. |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 013bb60. Configure here.
Includes #23 fix for phantom request accumulation in cache_aware load tracking — unblocks cache_aware and power_of_two routing policies that depend on accurate per-worker load counters.
Includes #23 fix for phantom request accumulation in cache_aware load tracking — unblocks cache_aware and power_of_two routing policies that depend on accurate per-worker load counters.
#23 removed the retry-cleanup decrement in route_typed_request, expecting send_typed_request to own load lifecycle. The early-return 500 path (input-validation rewrite branch) only decrements when status is rewritten to 400 — for actual 500s it skips, so each retry attempt's increment leaks into phantom load. Always decrement on this path when load_incremented; fixes the two pre-existing load_tracking_test failures. Also drops a few 'billing' references from the new comments.
* fix: attribute transparent-proxy traffic to run_id Catch-all transparent proxy was bypassing per-run usage counters, so renderer rollouts hitting vLLM /v1/generate (no explicit route on the router) appeared as 0 inference cost on training-run dashboards. Thread run_id through route_transparent and call the existing record_run_request + extract_and_record_usage_buffered helpers. * fix: keep SSE responses streaming in transparent proxy Initial transparent-proxy attribution buffered the entire upstream body when run_id was set, regressing /v1/generate (and any other catch-all SSE endpoint) from incremental token delivery to end-of-request delivery and inflating memory for long generations. Branch on Content-Type: text/event-stream → spawn forwarder + tee SseUsageExtractor (mirrors route_typed_request's streaming path). Non-SSE responses still buffer once for the JSON usage extraction. * fix: decrement load on genuine 500s in send_typed_request early return #23 removed the retry-cleanup decrement in route_typed_request, expecting send_typed_request to own load lifecycle. The early-return 500 path (input-validation rewrite branch) only decrements when status is rewritten to 400 — for actual 500s it skips, so each retry attempt's increment leaks into phantom load. Always decrement on this path when load_incremented; fixes the two pre-existing load_tracking_test failures. Also drops a few 'billing' references from the new comments.

Fixes phantom request accumulation that causes workers to get locked out when using cache_aware routing policy.
Note
Medium Risk
Touches core request routing/load-tracking paths for both retry and streaming responses; mistakes could miscount load or prematurely terminate long-lived streams.
Overview
Fixes phantom request/load-counter drift in the
cache_awarepolicy by removing retry-path load cleanup that could double-decrement the per-worker load counter.Hardens streaming load tracking by adding an inactivity timeout (300s) in the SSE forwarding task that closes stalled streams and ensures the worker load is decremented on exit.
Adds a
Router::get_workers()diagnostic accessor and a newload_tracking_test.rssuite with unit + end-to-end tests asserting load counters return to zero across success, failure, streaming, and repeated-request scenarios.Reviewed by Cursor Bugbot for commit 93b9417. Bugbot is set up for automated code reviews on this repo. Configure here.
Note
Fix phantom request accumulation in cache-aware load tracking
router.rs, which was causing double-decrements and phantom load accumulation under retryable failures.[DONE]detection, or timeout).get_workers()method onRouterto expose registered workers for diagnostics.tests/load_tracking_test.rscovering load counter underflow, double-decrement drift, and end-to-end load returning to zero after streaming, non-streaming, and failed requests.Macroscope summarized 93b9417.