fix(router): return mid-stream and copy errors from proxyRequest#1049
fix(router): return mid-stream and copy errors from proxyRequest#1049Sanchit2662 wants to merge 1 commit into
Conversation
Signed-off-by: Sanchit2662 <sanchit2662@gmail.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Code Review
This pull request updates error handling in the proxyRequest function to return errors for both streaming and non-streaming responses, ensuring failures are surfaced to the caller. Previously, errors during streaming or response copying were logged but often returned nil. The reviewer noted that returning these errors triggers a retry mechanism in the caller, which is problematic if data has already been sent to the client. Specifically, mid-stream retries can lead to corrupted responses and double-counting of token usage. It is recommended to only allow retries if no data has been written to the downstream writer.
| } | ||
| return true | ||
| }) | ||
| return streamErr |
There was a problem hiding this comment.
Returning streamErr here correctly surfaces the error for metrics and cache management, but it also triggers the retry loop in the proxy function. For mid-stream failures, retrying is problematic for two reasons: 1) The client receives a corrupted response as the new attempt's data is appended to the partial data already sent. 2) Token usage recorded via onUsage (lines 858-860) will be double-counted if the retry succeeds. Retries should only be attempted if no data has been written to the client yet (i.e., !c.Writer.Written()).
| if err != nil { | ||
| klog.Errorf("copy response to downstream failed: %v", err) | ||
| return nil | ||
| return err |
There was a problem hiding this comment.
Returning err here will cause the proxy function to retry the request. If io.Copy has already written data to the downstream writer, this retry will result in a corrupted response. While returning the error is necessary for accurate metrics and to prevent poisoning the prefix cache, the retry logic in the caller should be guarded by a check to see if the response has already started.
There was a problem hiding this comment.
Pull request overview
Fixes error propagation in the router’s proxyRequest() so that mid-stream upstream read failures and non-streaming response copy failures are no longer silently treated as success. This is intended to let the upstream retry loop behave correctly and avoid marking failed pods as successful in post-hooks/prefix-cache.
Changes:
- Streaming: capture non-EOF stream read errors inside the
c.Streamcallback and return them after streaming ends. - Non-streaming: return the
io.Copyerror instead of logging and returningnil.
Comments suppressed due to low confidence (1)
pkg/kthena-router/router/router.go:889
- Returning io.Copy errors here will also trigger proxy() retries, but io.Copy can fail after partially writing to the downstream client (or due to downstream disconnects). Retrying in those cases can send a corrupted/duplicated response and add unnecessary upstream load. Consider only retrying if no bytes were written (io.Copy’s returned byte count == 0 / c.Writer.Written()==false) and classifying downstream write errors as non-retryable.
_, err := io.Copy(c.Writer, ttee)
if err != nil {
klog.Errorf("copy response to downstream failed: %v", err)
return err
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @@ -869,11 +870,13 @@ func proxyRequest( | |||
| if err != nil { | |||
| if err != io.EOF { | |||
| klog.Errorf("error reading stream body: %v", err) | |||
| streamErr = err | |||
| } | |||
| return false | |||
| } | |||
| return true | |||
| }) | |||
| return streamErr | |||
| } else { | |||
| @@ -869,11 +870,13 @@ func proxyRequest( | |||
| if err != nil { | |||
| if err != io.EOF { | |||
| klog.Errorf("error reading stream body: %v", err) | |||
| streamErr = err | |||
| } | |||
| return false | |||
| } | |||
| @@ -882,7 +885,7 @@ func proxyRequest( | |||
| _, err := io.Copy(c.Writer, ttee) | |||
| if err != nil { | |||
| klog.Errorf("copy response to downstream failed: %v", err) | |||
| return nil | |||
| return err | |||
| } | |||
What type of PR is this?
/kind bug
What this PR does / why we need it:
So I was going through the router code to understand how requests get dispatched, and I found something that felt off in proxyRequest in
pkg/kthena-router/router/router.go.When a streaming response from a backend pod breaks mid-way (like if the pod gets killed during a rolling update), the stream read error was being logged but then just... ignored. The function returned nil like nothing went wrong. Same thing for non-streaming responses - if io.Copy failed while copying the response body, it would log the error and return nil.
The problem with that is the retry loop in proxy() only tries the next pod if it gets a non-nil error back. So when a pod dies mid-response, we never retry to a healthy pod, we don't record any error in metrics, and we still call RunPostHooks which writes the dead pod into the prefix cache as if it handled the request successfully. That last part is pretty bad because future requests with similar prompts will keep getting scored toward a pod that's already gone.
The fix is pretty simple honestly. For streaming I added a streamErr variable that captures any non-EOF error inside the c.Stream closure and returns it after the stream finishes. For non-streaming I just changed return nil to return err in the io.Copy error path.
Now mid-stream failures and copy failures actually surface as errors, so the retry logic works as intended, metrics reflect what actually happened, and the prefixcache doesn't get poisoned with dead pod entries.
Does this PR introduce a user-facing change?: