Skip to content

fix: phantom request accumulation in cache_aware load tracking#23

Merged
JannikSt merged 3 commits into
mainfrom
bugfix/phantom-request-load-tracking
Apr 16, 2026
Merged

fix: phantom request accumulation in cache_aware load tracking#23
JannikSt merged 3 commits into
mainfrom
bugfix/phantom-request-load-tracking

Conversation

@JannikSt

@JannikSt JannikSt commented Apr 14, 2026

Copy link
Copy Markdown
Member

Fixes phantom request accumulation that causes workers to get locked out when using cache_aware routing policy.

  • remove double decrement of load counter on retryable failures
  • add 300s per-chunk timeout on streaming load tracking task (sends error to client on timeout)
  • add get_workers() accessor on Router for diagnostics
  • add 9 tests covering load counter invariants and phantom request detection

Note

Medium Risk
Touches core request routing/load-tracking paths for both retry and streaming responses; mistakes could miscount load or prematurely terminate long-lived streams.

Overview
Fixes phantom request/load-counter drift in the cache_aware policy by removing retry-path load cleanup that could double-decrement the per-worker load counter.

Hardens streaming load tracking by adding an inactivity timeout (300s) in the SSE forwarding task that closes stalled streams and ensures the worker load is decremented on exit.

Adds a Router::get_workers() diagnostic accessor and a new load_tracking_test.rs suite with unit + end-to-end tests asserting load counters return to zero across success, failure, streaming, and repeated-request scenarios.

Reviewed by Cursor Bugbot for commit 93b9417. Bugbot is set up for automated code reviews on this repo. Configure here.

Note

Fix phantom request accumulation in cache-aware load tracking

  • Removes the manual per-worker load decrement in the retry loop of router.rs, which was causing double-decrements and phantom load accumulation under retryable failures.
  • Adds a 300-second inactivity timeout to the streaming forwarder task; the stream closes with an error if no data arrives within that window.
  • Ensures per-worker load is decremented exactly once across all exit paths (normal end, [DONE] detection, or timeout).
  • Adds a get_workers() method on Router to expose registered workers for diagnostics.
  • Adds integration and unit tests in tests/load_tracking_test.rs covering load counter underflow, double-decrement drift, and end-to-end load returning to zero after streaming, non-streaming, and failed requests.
  • Behavioral Change: retryable failures no longer trigger a load decrement at the retry site; only the request completion path adjusts worker load.

Macroscope summarized 93b9417.

- remove double decrement on retryable request failures
- add 5min timeout to streaming load tracking spawned task
- add 9 tests for phantom request detection

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9b69db8f1f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/routers/http/router.rs Outdated
Comment thread tests/load_tracking_test.rs
Comment thread src/routers/http/router.rs Outdated
Comment thread tests/load_tracking_test.rs
@macroscopeapp

macroscopeapp Bot commented Apr 14, 2026

Copy link
Copy Markdown

Approvability

Verdict: Needs human review

1 blocking correctness issue found. Multiple reviewers have identified that the integration tests always pass trivially due to a downcast bug (get_total_internal_load always returns 0), meaning the phantom request detection tests provide no regression protection. Additionally, the new 300-second streaming timeout exits silently without notifying clients, which could cause them to treat partial output as successful completion.

You can customize Macroscope's approvability policy. Learn more.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 013bb60. Configure here.

Comment thread src/routers/http/router.rs Outdated
@JannikSt JannikSt merged commit 551caba into main Apr 16, 2026
7 checks passed
JannikSt added a commit that referenced this pull request Apr 16, 2026
Includes #23 fix for phantom request accumulation in cache_aware load
tracking — unblocks cache_aware and power_of_two routing policies that
depend on accurate per-worker load counters.
@JannikSt JannikSt mentioned this pull request Apr 16, 2026
JannikSt added a commit that referenced this pull request Apr 16, 2026
Includes #23 fix for phantom request accumulation in cache_aware load
tracking — unblocks cache_aware and power_of_two routing policies that
depend on accurate per-worker load counters.
JannikSt added a commit that referenced this pull request May 3, 2026
#23 removed the retry-cleanup decrement in route_typed_request,
expecting send_typed_request to own load lifecycle. The early-return
500 path (input-validation rewrite branch) only decrements when
status is rewritten to 400 — for actual 500s it skips, so each
retry attempt's increment leaks into phantom load. Always decrement
on this path when load_incremented; fixes the two pre-existing
load_tracking_test failures.

Also drops a few 'billing' references from the new comments.
JannikSt added a commit that referenced this pull request May 4, 2026
* fix: attribute transparent-proxy traffic to run_id

Catch-all transparent proxy was bypassing per-run usage counters,
so renderer rollouts hitting vLLM /v1/generate (no explicit route on
the router) appeared as 0 inference cost on training-run dashboards.
Thread run_id through route_transparent and call the existing
record_run_request + extract_and_record_usage_buffered helpers.

* fix: keep SSE responses streaming in transparent proxy

Initial transparent-proxy attribution buffered the entire upstream body
when run_id was set, regressing /v1/generate (and any other catch-all
SSE endpoint) from incremental token delivery to end-of-request
delivery and inflating memory for long generations.

Branch on Content-Type: text/event-stream → spawn forwarder + tee
SseUsageExtractor (mirrors route_typed_request's streaming path).
Non-SSE responses still buffer once for the JSON usage extraction.

* fix: decrement load on genuine 500s in send_typed_request early return

#23 removed the retry-cleanup decrement in route_typed_request,
expecting send_typed_request to own load lifecycle. The early-return
500 path (input-validation rewrite branch) only decrements when
status is rewritten to 400 — for actual 500s it skips, so each
retry attempt's increment leaks into phantom load. Always decrement
on this path when load_incremented; fixes the two pre-existing
load_tracking_test failures.

Also drops a few 'billing' references from the new comments.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant