fix(router): return mid-stream and copy errors from proxyRequest by Sanchit2662 · Pull Request #1049 · volcano-sh/kthena

Sanchit2662 · 2026-05-14T11:05:53Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

So I was going through the router code to understand how requests get dispatched, and I found something that felt off in proxyRequest in pkg/kthena-router/router/router.go.

When a streaming response from a backend pod breaks mid-way (like if the pod gets killed during a rolling update), the stream read error was being logged but then just... ignored. The function returned nil like nothing went wrong. Same thing for non-streaming responses - if io.Copy failed while copying the response body, it would log the error and return nil.

The problem with that is the retry loop in proxy() only tries the next pod if it gets a non-nil error back. So when a pod dies mid-response, we never retry to a healthy pod, we don't record any error in metrics, and we still call RunPostHooks which writes the dead pod into the prefix cache as if it handled the request successfully. That last part is pretty bad because future requests with similar prompts will keep getting scored toward a pod that's already gone.

The fix is pretty simple honestly. For streaming I added a streamErr variable that captures any non-EOF error inside the c.Stream closure and returns it after the stream finishes. For non-streaming I just changed return nil to return err in the io.Copy error path.

Now mid-stream failures and copy failures actually surface as errors, so the retry logic works as intended, metrics reflect what actually happened, and the prefixcache doesn't get poisoned with dead pod entries.

Does this PR introduce a user-facing change?:

Fixed a bug where broken streaming or non-streaming responses from backend pods were silently treated as successful requests. Mid-stream upstream failures now correctly surface as errors, enabling retries to healthy pods and accurate error metrics.

Signed-off-by: Sanchit2662 <sanchit2662@gmail.com>

volcano-sh-bot · 2026-05-14T11:05:59Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yaozengzeng for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

pkg/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

gemini-code-assist

Code Review

This pull request updates error handling in the proxyRequest function to return errors for both streaming and non-streaming responses, ensuring failures are surfaced to the caller. Previously, errors during streaming or response copying were logged but often returned nil. The reviewer noted that returning these errors triggers a retry mechanism in the caller, which is problematic if data has already been sent to the client. Specifically, mid-stream retries can lead to corrupted responses and double-counting of token usage. It is recommended to only allow retries if no data has been written to the downstream writer.

gemini-code-assist · 2026-05-14T11:07:52Z

 			}
 			return true
 		})
+		return streamErr


Returning streamErr here correctly surfaces the error for metrics and cache management, but it also triggers the retry loop in the proxy function. For mid-stream failures, retrying is problematic for two reasons: 1) The client receives a corrupted response as the new attempt's data is appended to the partial data already sent. 2) Token usage recorded via onUsage (lines 858-860) will be double-counted if the retry succeeds. Retries should only be attempted if no data has been written to the client yet (i.e., !c.Writer.Written()).

gemini-code-assist · 2026-05-14T11:07:52Z

 		if err != nil {
 			klog.Errorf("copy response to downstream failed: %v", err)
-			return nil
+			return err


Returning err here will cause the proxy function to retry the request. If io.Copy has already written data to the downstream writer, this retry will result in a corrupted response. While returning the error is necessary for accurate metrics and to prevent poisoning the prefix cache, the retry logic in the caller should be guarded by a check to see if the response has already started.

Copilot

Pull request overview

Fixes error propagation in the router’s proxyRequest() so that mid-stream upstream read failures and non-streaming response copy failures are no longer silently treated as success. This is intended to let the upstream retry loop behave correctly and avoid marking failed pods as successful in post-hooks/prefix-cache.

Changes:

Streaming: capture non-EOF stream read errors inside the c.Stream callback and return them after streaming ends.
Non-streaming: return the io.Copy error instead of logging and returning nil.

Comments suppressed due to low confidence (1)

pkg/kthena-router/router/router.go:889

Returning io.Copy errors here will also trigger proxy() retries, but io.Copy can fail after partially writing to the downstream client (or due to downstream disconnects). Retrying in those cases can send a corrupted/duplicated response and add unnecessary upstream load. Consider only retrying if no bytes were written (io.Copy’s returned byte count == 0 / c.Writer.Written()==false) and classifying downstream write errors as non-retryable.

		_, err := io.Copy(c.Writer, ttee)
		if err != nil {
			klog.Errorf("copy response to downstream failed: %v", err)
			return err
		}

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@@ -869,11 +870,13 @@ func proxyRequest(
 			if err != nil {
 				if err != io.EOF {
 					klog.Errorf("error reading stream body: %v", err)
+					streamErr = err
 				}
 				return false
 			}
 			return true
 		})
+		return streamErr
 	} else {


@@ -869,11 +870,13 @@ func proxyRequest(
 			if err != nil {
 				if err != io.EOF {
 					klog.Errorf("error reading stream body: %v", err)
+					streamErr = err
 				}
 				return false
 			}


@@ -882,7 +885,7 @@ func proxyRequest(
 		_, err := io.Copy(c.Writer, ttee)
 		if err != nil {
 			klog.Errorf("copy response to downstream failed: %v", err)
-			return nil
+			return err
 		}


fix(router): return mid-stream and copy errors from proxyRequest

afe783f

Signed-off-by: Sanchit2662 <sanchit2662@gmail.com>

Copilot AI review requested due to automatic review settings May 14, 2026 11:05

volcano-sh-bot added the kind/bug label May 14, 2026

volcano-sh-bot requested review from YaoZengzeng and git-malu May 14, 2026 11:05

volcano-sh-bot added the size/XS label May 14, 2026

Copilot started reviewing on behalf of Sanchit2662 May 14, 2026 11:06 View session

gemini-code-assist Bot reviewed May 14, 2026

View reviewed changes

Copilot AI reviewed May 14, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(router): return mid-stream and copy errors from proxyRequest#1049

fix(router): return mid-stream and copy errors from proxyRequest#1049
Sanchit2662 wants to merge 1 commit into
volcano-sh:mainfrom
Sanchit2662:fix/proxy-mid-stream-silent-error

Sanchit2662 commented May 14, 2026

Uh oh!

volcano-sh-bot commented May 14, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 14, 2026

Uh oh!

gemini-code-assist Bot May 14, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

Sanchit2662 commented May 14, 2026

What type of PR is this?

What this PR does / why we need it:

Does this PR introduce a user-facing change?:

Uh oh!

volcano-sh-bot commented May 14, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants