outbound: implement OutboundPolicy retries #2446
Conversation
```rust
// these tests don't currently work as we would expect, because backend
// request timeouts are not turned into synthesized 504s until a layer above the
// retry layer. instead of being `Ok(Response)`s with a failure status, they're
// `Err(Error)`s when the retry layer sees them, so they don't get retried.
// TODO(eliza): figure that out
```
this will require either moving the actual `backend_request` timeout below the `ClientRescue` layer, so that the timeout can be turned into a 504 (like I initially wanted to do in #2419 before I was talked out of it), or adding a special `errors::respond` layer just for turning that timeout into a 504, which feels gross...
```rust
///
/// This is necessary because we want these timeouts to be retried, and the
/// retry layer only retries requests that fail with an HTTP status code,
/// rather than requests that fail with Rust `Err`s. Errors returned by the
```
If we're doing this to satisfy a deficiency of the retry layer, is there any reason that we can't move this mapping closer to that decision? I.e. why does this belong in the backend stack? Why does the backend stack care about the retry layer?
Furthermore, how does this interact with error metrics? How/where are request timeout errors recorded? Is it consistent with other error metrics?
> If we're doing this to satisfy a deficiency of the retry layer, is there any reason that we can't move this mapping closer to that decision? I.e. why does this belong in the backend stack? Why does the backend stack care about the retry layer?
This is a good point. IIRC, I don't think there's any particular reason for this to be in the backend stack rather than in the route stack, right below the retry layer. I'll see about moving it.
> Furthermore, how does this interact with error metrics? How/where are request timeout errors recorded? Is it consistent with other error metrics?
Regarding error metrics: I believe we don't currently track error metrics per-backend, only request-count metrics, so the timeout will only be recorded by error metrics if it is not retried and the request ultimately fails. Because the design for HTTPRoute outbound policy metrics is still in progress, I didn't implement any new metrics as part of this PR. When we do add HTTPRoute-specific error metrics, I would expect these timeouts to be recorded by any per-backend error counters, but not by per-route error counters when the timeout is retried.
```rust
// Eagerly synthesize 504 responses for backend_request timeout
// errors.
// See the doc comment on `TimeoutRescue` for details.
.push(TimeoutRescue::layer())
.push_on_service(http::BoxResponse::<_>::layer())
```
This is unrelated to this change but highlights something that may appear spooky to inner layers: the timeout layer will simply drop inner request futures, so inner layers (like the balancer) need to handle being dropped as a timeout of some sort. It would be better for us to move the timeout stack down into the endpoint (probably by fetching timeouts out of a request extension).
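As a concrete illustration of what "handle being dropped" means for an inner layer, a drop guard can distinguish cancellation from completion. This is a minimal sketch, not code from the linkerd codebase; `CancelGuard` and `handle_request` are hypothetical names:

```rust
/// Hypothetical guard: detects when a request future is dropped by an
/// outer timeout layer before it completes.
struct CancelGuard {
    completed: bool,
}

impl Drop for CancelGuard {
    fn drop(&mut self) {
        if !self.completed {
            // The future was dropped mid-flight. A real layer would treat
            // this as a timeout (release a balancer slot, decrement an
            // in-flight gauge, etc.); this sketch just logs it.
            eprintln!("request future dropped before completion");
        }
    }
}

async fn handle_request() -> &'static str {
    let mut guard = CancelGuard { completed: false };
    // ... actual request processing here; if the outer timeout layer drops
    // this future at an await point, `guard` is dropped with
    // `completed == false`.
    let response = "ok";
    guard.completed = true;
    response
}
```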
Moving timeouts into the endpoint stack using a request extension was the initial design I used for #2419 (see commit 64aca6a).
At the time, the conclusion in code review was that this approach was too complex. I'd be happy to bring that back, but I think it would make sense to do so in a separate PR?
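For reference, the shape of that rejected design is roughly the following. This is a sketch under assumed names; `BackendRequestTimeout`, `set_timeout`, and `call_endpoint` are illustrative, not the types from commit 64aca6a:

```rust
use std::time::Duration;
use tokio::time;

/// Hypothetical extension type carrying the configured backend-request
/// timeout down the stack in the request's extensions.
#[derive(Clone, Copy)]
struct BackendRequestTimeout(Duration);

/// Upper stack (e.g. MatchedBackend), where the timeout value is known.
fn set_timeout(req: &mut http::Request<()>, timeout: Duration) {
    req.extensions_mut().insert(BackendRequestTimeout(timeout));
}

/// Endpoint stack, where the timeout would actually be applied.
async fn call_endpoint(
    req: http::Request<()>,
) -> Result<http::Response<()>, &'static str> {
    let configured = req.extensions().get::<BackendRequestTimeout>().copied();
    let fut = async {
        // ... issue the request to the endpoint ...
        Ok(http::Response::new(()))
    };
    match configured {
        // Apply the timeout fetched from the request extension.
        Some(BackendRequestTimeout(d)) => time::timeout(d, fut)
            .await
            .map_err(|_| "backend request timed out")?,
        None => fut.await,
    }
}
```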
```diff
 pub fn layer<N>(
-    metrics: metrics::HttpProfileRouteRetry,
+    metrics: Option<metrics::HttpProfileRouteRetry>,
```
I know that the existing retry metrics are a bit odd, but do we understand what visibility we are sacrificing by not porting them forward? What questions can we no longer answer for policy retries?
Force-pushed from aa79e92 to 6975ae9.
Depends on linkerd/linkerd2-proxy-api#256
This branch changes the outbound proxy to honor the retry configurations added to the proxy API in linkerd/linkerd2-proxy-api#256. In particular, this involves the following:

- parsing retry configurations from the `OutboundPolicies` client and converting them to internal representations,
- supporting both `ServiceProfiles`- and `OutboundPolicies`-based retry configurations,
- extending the retry middleware in order to support the per-request retry limit provided by the proxy API,
- and adding the retry layer to the `MatchedRoute` stack.

In addition, in order to support the intended behavior for the `timeouts.backendRequest` timeout configured on HTTPRoutes, where requests which hit this timeout can be retried if 504 status codes are retryable, it was also necessary to add a small `errors::Respond` middleware to the `MatchedBackend` stack, above the backend-request timeout. This is because our retry policies currently only retry failure classifications that are generated from real responses (i.e., failure status codes), and do not retry any Rust `Err`s:

linkerd2-proxy/linkerd/app/outbound/src/http/retry.rs (lines 85 to 100 in 3224560)

Since `Error`s generated by the backend request timeout are currently converted to synthesized HTTP error responses by the `ServerRescue` layer, near the top of the outbound proxy stack, these timeouts are still `Err`s when encountered by the retry layer in the `MatchedRoute` stack. In order to retry backend request timeouts, we add a very small error response layer that just synthesizes HTTP 504s from the timeout error returned by the backend request timeout middleware.
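The essential behavior of that layer amounts to a result mapping along these lines. This is a minimal sketch with assumed error and body types, not the actual middleware:

```rust
use http::{Response, StatusCode};

/// Assumed placeholder for the error emitted by the backend-request
/// timeout middleware; the real proxy uses its own error types.
#[derive(Debug)]
struct BackendRequestTimeoutError;

/// Map a backend-request timeout error into a synthesized 504 response, so
/// that the retry layer sees an `Ok` response with a retryable status code
/// instead of an `Err`.
fn rescue_timeout(
    result: Result<Response<String>, BackendRequestTimeoutError>,
) -> Result<Response<String>, BackendRequestTimeoutError> {
    match result {
        Err(BackendRequestTimeoutError) => {
            let mut rsp = Response::new(String::new());
            *rsp.status_mut() = StatusCode::GATEWAY_TIMEOUT;
            Ok(rsp)
        }
        ok => ok,
    }
}
```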
There are a couple of alternatives to this approach:

- Moving the backend request timeout below the `ClientRescue` layer in the endpoint stack, like I initially proposed in "outbound: implement `OutboundPolicies` backend request timeouts" #2419. This would allow these errors to be converted to synthesized HTTP responses long before they're seen by the retry layer. However, the code for applying this timeout in the endpoint stack, which involved storing the timeout in a request extension in the `MatchedBackend` stack (where we actually know the configured timeout value), was considered too complex previously, so I didn't want to go with this approach.
- Changing the retry middleware so that `Err(Error)` values can be retried, in addition to `Ok(Response)`s which are considered failures. In order to allow the user to configure which failures are retryable using the current configuration interface, which is status-code-based, this would require making our `StatusRanges` response classifier aware of how `Error`s map to HTTP status codes...which is what the `errors::Respond` middleware already does when synthesizing error responses. So, it seemed nicer to not duplicate that information (see the sketch after this list).

However, I'm open to alternative suggestions for this.