Skip to content

fix(router): return mid-stream and copy errors from proxyRequest#1049

Open
Sanchit2662 wants to merge 1 commit into
volcano-sh:mainfrom
Sanchit2662:fix/proxy-mid-stream-silent-error
Open

fix(router): return mid-stream and copy errors from proxyRequest#1049
Sanchit2662 wants to merge 1 commit into
volcano-sh:mainfrom
Sanchit2662:fix/proxy-mid-stream-silent-error

Conversation

@Sanchit2662
Copy link
Copy Markdown
Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it:

So I was going through the router code to understand how requests get dispatched, and I found something that felt off in proxyRequest in pkg/kthena-router/router/router.go.

When a streaming response from a backend pod breaks mid-way (like if the pod gets killed during a rolling update), the stream read error was being logged but then just... ignored. The function returned nil like nothing went wrong. Same thing for non-streaming responses - if io.Copy failed while copying the response body, it would log the error and return nil.

The problem with that is the retry loop in proxy() only tries the next pod if it gets a non-nil error back. So when a pod dies mid-response, we never retry to a healthy pod, we don't record any error in metrics, and we still call RunPostHooks which writes the dead pod into the prefix cache as if it handled the request successfully. That last part is pretty bad because future requests with similar prompts will keep getting scored toward a pod that's already gone.

The fix is pretty simple honestly. For streaming I added a streamErr variable that captures any non-EOF error inside the c.Stream closure and returns it after the stream finishes. For non-streaming I just changed return nil to return err in the io.Copy error path.

Now mid-stream failures and copy failures actually surface as errors, so the retry logic works as intended, metrics reflect what actually happened, and the prefixcache doesn't get poisoned with dead pod entries.

Does this PR introduce a user-facing change?:

Fixed a bug where broken streaming or non-streaming responses from backend pods were silently treated as successful requests. Mid-stream upstream failures now correctly surface as errors, enabling retries to healthy pods and accurate error metrics.

Signed-off-by: Sanchit2662 <sanchit2662@gmail.com>
Copilot AI review requested due to automatic review settings May 14, 2026 11:05
@volcano-sh-bot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yaozengzeng for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates error handling in the proxyRequest function to return errors for both streaming and non-streaming responses, ensuring failures are surfaced to the caller. Previously, errors during streaming or response copying were logged but often returned nil. The reviewer noted that returning these errors triggers a retry mechanism in the caller, which is problematic if data has already been sent to the client. Specifically, mid-stream retries can lead to corrupted responses and double-counting of token usage. It is recommended to only allow retries if no data has been written to the downstream writer.

}
return true
})
return streamErr
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Returning streamErr here correctly surfaces the error for metrics and cache management, but it also triggers the retry loop in the proxy function. For mid-stream failures, retrying is problematic for two reasons: 1) The client receives a corrupted response as the new attempt's data is appended to the partial data already sent. 2) Token usage recorded via onUsage (lines 858-860) will be double-counted if the retry succeeds. Retries should only be attempted if no data has been written to the client yet (i.e., !c.Writer.Written()).

if err != nil {
klog.Errorf("copy response to downstream failed: %v", err)
return nil
return err
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Returning err here will cause the proxy function to retry the request. If io.Copy has already written data to the downstream writer, this retry will result in a corrupted response. While returning the error is necessary for accurate metrics and to prevent poisoning the prefix cache, the retry logic in the caller should be guarded by a check to see if the response has already started.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes error propagation in the router’s proxyRequest() so that mid-stream upstream read failures and non-streaming response copy failures are no longer silently treated as success. This is intended to let the upstream retry loop behave correctly and avoid marking failed pods as successful in post-hooks/prefix-cache.

Changes:

  • Streaming: capture non-EOF stream read errors inside the c.Stream callback and return them after streaming ends.
  • Non-streaming: return the io.Copy error instead of logging and returning nil.
Comments suppressed due to low confidence (1)

pkg/kthena-router/router/router.go:889

  • Returning io.Copy errors here will also trigger proxy() retries, but io.Copy can fail after partially writing to the downstream client (or due to downstream disconnects). Retrying in those cases can send a corrupted/duplicated response and add unnecessary upstream load. Consider only retrying if no bytes were written (io.Copy’s returned byte count == 0 / c.Writer.Written()==false) and classifying downstream write errors as non-retryable.
		_, err := io.Copy(c.Writer, ttee)
		if err != nil {
			klog.Errorf("copy response to downstream failed: %v", err)
			return err
		}

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 848 to 880
@@ -869,11 +870,13 @@ func proxyRequest(
if err != nil {
if err != io.EOF {
klog.Errorf("error reading stream body: %v", err)
streamErr = err
}
return false
}
return true
})
return streamErr
} else {
Comment on lines 850 to 876
@@ -869,11 +870,13 @@ func proxyRequest(
if err != nil {
if err != io.EOF {
klog.Errorf("error reading stream body: %v", err)
streamErr = err
}
return false
}
Comment on lines 879 to 889
@@ -882,7 +885,7 @@ func proxyRequest(
_, err := io.Copy(c.Writer, ttee)
if err != nil {
klog.Errorf("copy response to downstream failed: %v", err)
return nil
return err
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants