diff --git a/.gitattributes b/.gitattributes new file mode 100644 index 0000000..24a8e87 --- /dev/null +++ b/.gitattributes @@ -0,0 +1 @@ +*.png filter=lfs diff=lfs merge=lfs -text diff --git a/slime/.docs/cdf.png b/slime/.docs/cdf.png new file mode 100644 index 0000000..4984402 --- /dev/null +++ b/slime/.docs/cdf.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:b5af00f613aad2a4660209dd5c8bdc902fc500d862988b4a42bf52fe84505d70 +size 25330 diff --git a/slime/.docs/hedging.png b/slime/.docs/hedging.png new file mode 100644 index 0000000..122d70b --- /dev/null +++ b/slime/.docs/hedging.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:2079d71be18c793612bedf7bbbab063eeb18839d875dc5fb6c32acd5804f0b68 +size 309759 diff --git a/slime/.docs/loadbalance.png b/slime/.docs/loadbalance.png new file mode 100644 index 0000000..d5cc822 --- /dev/null +++ b/slime/.docs/loadbalance.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:552009dc9371b806bfd95ed2e0c4b2fc03b7b79eeae1928d5cc7ac0d26027e39 +size 82516 diff --git a/slime/.docs/logs.png b/slime/.docs/logs.png new file mode 100644 index 0000000..6305069 --- /dev/null +++ b/slime/.docs/logs.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:fd0f2242b4438cd12328d8b14c5adc97b643226bad884b6c261730d280924a9b +size 225763 diff --git a/slime/.docs/retry.png b/slime/.docs/retry.png new file mode 100644 index 0000000..a70277b --- /dev/null +++ b/slime/.docs/retry.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:ff62d92a0e309bd2e8f01c45a9da8ee7ff232dffd3c3aeb1f3259743d20fbe60 +size 248002 diff --git a/slime/README.md b/slime/README.md index 493262b..00548b2 100644 --- a/slime/README.md +++ b/slime/README.md @@ -1,10 +1,406 @@ +English | [中文](./README_zh_CN.md) -# Slime +## Supported Protocols +**DO NOT enable retry/hedging for non-idempotent requests**. +Not all protocols can use retry/hedging. 
The yaml config described in the "Slime" chapter may not be suitable for non-tRPC protocols. In that case, you can use the basic retry/hedging packages directly.

-# Slime
-Slime is the implementation fo tRPC-Go retry/hedging policy.

| protocols | retry | hedging | note |
|:---:|:---:|:---:|:---|
|tRPC|✓|✓| Native tRPC protocol. |
|trpc SendOnly|✗|✗| Not supported: retry/hedging decides based on the returned error, but a SendOnly request has no response. |
|trpc Stream|✗|✗| Not supported. |
|[http](https://github.com/trpc-group/trpc-go/tree/main/http)|✓|✓||
|[Kafka](https://github.com/trpc-ecosystem/go-database/tree/main/kafka)|✓|✗| Hedging is not supported. |
|[MySQL](https://github.com/trpc-ecosystem/go-database/tree/main/mysql)|★|★| Supports all methods except [Query](https://github.com/trpc-ecosystem/go-database/blob/6f75e87fecfc5411e54d93fd1aad5e7afa9a0fcf/mysql/client.go#L40) and [Transaction](https://github.com/trpc-ecosystem/go-database/blob/6f75e87fecfc5411e54d93fd1aad5e7afa9a0fcf/mysql/client.go#L42). These two methods take closures as parameters, and slime cannot guarantee their concurrency safety. You can use `slime.WithDisabled` to disable retry/hedging for them. |

-A request in application layer may spans multiple sub-requests after retry/hedging interceptor.
-At last, The first success response or last failed response will be delivered to application.
-Because the procedure look like a slime, splitting and fusing, we call it Slime.
-See [iwiki](https://trpc.group/trpc-go/trpc-wiki/blob/main/user_guide/retry_hedging.md) for how to use Slime.

## Background
Retry is a very simple idea: when the original request fails, initiate a new one. In the narrow sense, retry is a conservative strategy, since a new request is triggered only after the previous one has failed. Users who care about response time may prefer a more aggressive strategy: **hedging**.
Jeffrey Dean first mentioned hedging in [The Tail at Scale](https://cacm.acm.org/magazines/2013/2/160173-the-tail-at-scale/pdf) as a way to mitigate the impact of long-tail requests when the fan-out is large.

Simply put, the hedging strategy does not passively wait for the previous request to time out or fail: if no successful reply is received within the hedging delay (which is less than the request timeout), a new request is triggered. Unlike the retry strategy, there may be multiple in-flight requests at the same time. The first successful response is handed to the application layer, and the responses of all other requests are ignored.

Note that these two strategies are mutually exclusive; users can only choose one of them.

The implementation of the retry strategy is relatively simple. There are also several implementations of the hedging strategy in industry:
* [gRPC](https://github.com/grpc/grpc-java): [A6-client-retries.md](https://github.com/grpc/proposal/blob/master/A6-client-retries.md) gives a very detailed introduction to gRPC's design. gRPC-java has implemented it.
* [bRPC](https://github.com/apache/incubator-brpc): In bRPC, a hedging request is called a backup request. This [doc](https://github.com/apache/incubator-brpc/blob/master/docs/cn/backup_request.md) gives a brief introduction, and its C++ implementation is relatively simple.
* [finagle](https://github.com/twitter/finagle): finagle is an open-source Java RPC framework; it also implements [backup requests](https://twitter.github.io/finagle/guide/MethodBuilder.html#backup-requests).
* [pegasus](https://github.com/apache/incubator-pegasus): Pegasus is a key-value database; it uses [backup requests](https://github.com/apache/incubator-pegasus/issues/251) to read from multiple replicas simultaneously and improve performance.
* [envoy](https://www.envoyproxy.io/docs/envoy/latest/): Envoy, as a proxy service, is widely used in cloud native.
It also supports [request hedging](https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/http/http_routing#request-hedging).

This article introduces the retry/hedging capabilities of the tRPC framework. The next section briefly explains the rationale behind retry and hedging. The two sections after that describe more implementation details: first the basic retry/hedging packages, then slime, a manager built on top of them that provides yaml-based configuration. Finally, we list some problems you may encounter.

## Principles
In this chapter, we show the basic principles of hedging and retrying through examples, and briefly introduce some other capabilities you may need to pay attention to.

### Retry Strategy
As the name suggests, retry on error.

![retry](./.docs/retry.png)

In the figure above, the client made three attempts: orange, blue, and green. The first two failed, and a random backoff was applied before each new attempt to prevent request spikes. The third attempt finally succeeded, and its response was returned to the application layer. Note also that we try to send each attempt to a different node.

Generally, a retry policy requires the following configuration:
- Maximum number of retries: once exhausted, the last error is returned.
- Backoff time: the actual backoff is random(0, delay).
- Retryable error codes: if the returned error is not retryable, retrying stops immediately and the error is returned to the application layer.

### Hedging Strategy
As introduced in the background, hedging can be seen as a more aggressive, and more complex, kind of retry.

![hedging](./.docs/hedging.png)

In the figure above, the client made four attempts in total: orange, blue, green, and purple.
- Orange is the first attempt. Server2 received it soon after the client sent it.
However, due to network or other problems on server2, its response did not arrive until after the green request had succeeded and been returned to the application layer. Even though it succeeded, we must discard it, because another successful response has already been delivered to the application layer.
- Blue is the second attempt. We initiated it because the orange request had not returned within the hedging delay. Server1 was chosen this time (we try to pick a different node for each attempt whenever possible). The blue response came back quickly, before the hedging delay expired, but it failed, so we **immediately** started a new attempt.
- Green is the third attempt. Its response was a bit slow (it exceeded the hedging delay, thus triggering yet another attempt), but it succeeded! As soon as we receive the first successful response, we return it to the application layer.
- Purple is the fourth attempt. Right after it was initiated, we received green's successful response. At that point, the purple request may be in one of several states: still inside the client's tRPC stack, where we still have a chance to cancel it; or already in the client's kernel, or already sent by the NIC, where we have no chance to cancel it. The mark on the purple request in the figure indicates that we cancel it whenever possible. Note that even if the purple request eventually reaches server2 successfully, its response is dropped, just like orange's.

As you can see, hedging is more like a **concurrent** retry with a **waiting time**. Hedging has no backoff mechanism: as soon as it receives an error response, it immediately initiates a new attempt. In general, we recommend using the hedging strategy only when you need to address the long-tail problem. For ordinary error retries, please use the simpler and clearer retry mechanism.
Generally, hedging has the following configuration:
- Maximum number of retries: once exhausted, wait for and return the last response, whether it succeeded or failed.
- Hedging delay: if no response is received within the hedging delay, a new attempt is initiated immediately.
- Non-fatal errors: a fatal error aborts hedging immediately; we then wait for and return the last response, whether it succeeded or failed. A non-fatal error immediately triggers a new attempt (and resets the hedging delay timer).

### The Order of Interceptors
In tRPC-Go, retry/hedging is implemented as an interceptor.

After a request passes through the retry/hedging interceptor, it may generate multiple sub-requests, and each sub-request executes the subsequent interceptors. For monitoring interceptors, you must pay attention to their position relative to the retry/hedging interceptor. If they come before retry/hedging, they count each application-layer request once; if they come after, they count every retry/hedging sub-request.

When you use the retry/hedging interceptor, be sure to think about how it relates to your other interceptors.

### Server Pushback
Server pushback lets the server explicitly control the client's retry/hedging behavior. When the server's load is high and you want clients to reduce their retry/hedging frequency, the server can specify a delay time T in its response, and the client will delay the next retry/hedging sub-request by T. More commonly, this feature is used by the server to instruct the client to stop retrying/hedging altogether, by setting the delay to `-1`.

In general, you should not need to care about setting server pushback yourself. In future versions, the framework will automatically decide how to set server pushback according to the current load of the service.
### Load Balance
Because retry/hedging is implemented as an interceptor, and load balancing happens after the interceptor chain, each sub-request triggers its own load balancing.

![loadbalance](./.docs/loadbalance.png)

For hedging requests, you may want each sub-request to be sent to a different node. We implemented a mechanism that lets sub-requests share which nodes other sub-requests have already visited. A load balancer can take advantage of this mechanism and return only unvisited nodes. Of course, this requires the load balancer's cooperation; currently, only two of the framework's built-in random load-balancing strategies support it.
Don't be discouraged if your load balancer does not support skipping visited nodes. In most cases, a round-robin or random load balancer already tends to spread sub-requests across different nodes, and even if two sub-requests occasionally land on the same node, it is not a big problem. A special hash-type load balancer (one that routes to a specific node according to a specific key, rather than to a class of nodes) may not be able to support this feature at all; in fact, using a hedging strategy with this type of load balancer is pointless.

## Introduce to Basic Retry/Hedging Packages
This chapter briefly introduces the basic retry/hedging packages as a foundation for the next section. Although we provide some usage examples, please avoid using these packages directly in the application layer; you should enable retry/hedging through slime instead.

### [retry](./retry)
[retry](./retry) provides the basic retry strategy.

`New` creates a new retry strategy; you must specify the maximum number of retries and the retryable error codes. You can also customize retryable errors through `WithRetryableErr`, which has an OR relationship with the retryable error codes.

Retry provides two default backoff strategies: `WithExpBackoff` and `WithLinearBackoff`.
You can also customize the backoff strategy via `WithBackoff`. At least one of these three backoff options must be provided. If you provide more than one, their priority is:
`WithBackoff` > `WithExpBackoff` > `WithLinearBackoff`

You may be wondering why `WithSkipVisitedNodes(skip bool)` takes an extra `skip` boolean. In fact, we distinguish three situations:
1. The user does not explicitly specify whether to skip already visited nodes;
2. The user explicitly specifies to skip already visited nodes;
3. The user explicitly specifies not to skip already visited nodes.

These three states affect load balancing differently.
In the first case, the load balancer should return unvisited nodes as much as possible, but may return a visited node if all nodes have been visited. This is the default policy.
In the second case, the load balancer must return unvisited nodes; if all nodes have been visited, it should return a "no nodes available" error.
In the third case, the load balancer may return any node.

As described in the previous section, `WithSkipVisitedNodes` requires the cooperation of the load balancer. If the load balancer does not implement this feature, then whether or not the user sets this option, the behavior falls back to the third case.

`WithThrottle` can specify a throttle for this strategy.

You can specify a retry policy for an RPC request in the following way:
```Go
r, _ := retry.New(4, []int{errs.RetClientNetErr}, retry.WithLinearBackoff(time.Millisecond*5))
rsp, _ := clientProxy.Hello(ctx, req, client.WithFilter(r.Invoke))
```

### [Hedging](./hedging)
[Hedging](./hedging) provides the basic hedging strategy.

`New` creates a new hedging strategy. You must specify the maximum number of retries and the non-fatal error codes. You can also customize non-fatal errors through `WithNonFatalError`, which has an OR relationship with the non-fatal error codes.
The hedging package provides two ways to set the hedging delay. `WithStaticHedgingDelay` sets a static delay. `WithDynamicHedgingDelay` lets you register a function whose return value, on each call, is used as the hedging delay. These two options are mutually exclusive; when specified multiple times, later options override earlier ones.

`WithSkipVisitedNodes` behaves the same as in retry; please refer to the previous section.

`WithThrottle` can specify a throttle for the hedging strategy.

You can specify a hedging strategy for an RPC request in the following way:
```Go
h, _ := hedging.New(2, []int{errs.RetClientNetErr}, hedging.WithStaticHedgingDelay(time.Millisecond*5))
rsp, _ := clientProxy.Hello(ctx, req, client.WithFilter(h.Invoke))
```

### [Throttle](./throttle)
[Throttle](./throttle) is used to avoid write amplification caused by retry/hedging.

The `throttler` interface provides three methods:
```Go
type throttler interface {
	Allow() bool
	OnSuccess()
	OnFailure()
}
```
`Allow` is called every time a retry/hedging sub-request is about to be sent (excluding the first request). If it returns `false`, no further sub-requests are issued for this application-layer request, which is treated as if the maximum number of attempts had been exhausted.

Whenever the response of a retry/hedging sub-request is received, `OnSuccess` or `OnFailure` is called as appropriate.

Retry/hedging generates write amplification; throttling exists to prevent the service avalanche that this amplification could cause.
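The token-bucket semantics used by the throttle can be modeled in a few lines. This sketch only mirrors the behavior documented in this section (the bucket starts full, a failure costs one token, a success refunds `token_ratio`, and retries are allowed while more than half of the tokens remain); it is not the actual `throttle.NewTokenBucket` implementation, and `tokenBucket` is an illustrative name.

```go
package main

import "fmt"

// tokenBucket models the throttle: failures drain it, successes slowly
// refill it, and retries are allowed only above the half-full mark.
type tokenBucket struct {
	tokens, maxTokens, tokenRatio float64
}

func newTokenBucket(maxTokens, tokenRatio float64) *tokenBucket {
	return &tokenBucket{tokens: maxTokens, maxTokens: maxTokens, tokenRatio: tokenRatio}
}

// Allow permits a retry/hedging sub-request only while more than half
// of the tokens remain.
func (b *tokenBucket) Allow() bool { return b.tokens > b.maxTokens/2 }

// OnSuccess refunds tokenRatio tokens, capped at maxTokens.
func (b *tokenBucket) OnSuccess() {
	if b.tokens += b.tokenRatio; b.tokens > b.maxTokens {
		b.tokens = b.maxTokens
	}
}

// OnFailure costs one token, floored at zero.
func (b *tokenBucket) OnFailure() {
	if b.tokens -= 1; b.tokens < 0 {
		b.tokens = 0
	}
}

func main() {
	b := newTokenBucket(10, 0.1)
	for i := 0; i < 4; i++ {
		b.OnFailure()
	}
	fmt.Println(b.Allow()) // 6 tokens left, still above half
	b.OnFailure()
	fmt.Println(b.Allow()) // exactly half (5 of 10): retries are now throttled
}
```

With `max_tokens: 10` and `token_ratio: 0.1`, five consecutive failures are enough to shut retries off, and it then takes successful requests to earn them back.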
When you initialize a `throt` as below and bind it to a `Hello` RPC,
```Go
throt, _ := throttle.NewTokenBucket(10, 0.1)
r, _ := retry.New(3, []int{errs.RetClientNetErr}, retry.WithLinearBackoff(time.Millisecond*5))
tr := r.NewThrottledRetry(throt)
rsp, _ := clientProxy.Hello(ctx, req, client.WithFilter(tr.Invoke))
```
the total number of `Hello` requests, including retries/hedges, will not exceed 110% of the number of application-layer requests (each successful request adds 0.1 token and each failed request costs 1 token, so ten successful requests buy one retry/hedging opportunity), and the number of consecutively failing retry/hedging requests will not exceed 5 (5 = 10 / 2: `Allow` returns `true` only while more than half of the tokens remain).

### About Timeout Errors
In tRPC-Go, [`RetClientTimeout`](https://github.com/trpc-group/trpc-go/blob/71941c0f7e32cec11d48c0d6b5c28122788f57e8/errs/errs.go#L47), i.e. error code 101, corresponds to an application-layer timeout. Retry/hedging follows this mechanism and returns an error as soon as `ctx` times out. Therefore, using 101 as a retryable/non-fatal error code makes no sense. In this situation, we recommend using hedging with a reasonable hedging delay (the hedging delay being your expected per-attempt timeout). Note that the hedging delay should be less than the application-layer timeout.

## Slime

Slime provides configuration-file support on top of the two basic packages, retry and hedging. With slime, you can manage retry/hedging strategies in the framework configuration.
Like other tRPC-Go plugins, first import the slime package anonymously:
```go
import _ "trpc.group/trpc-go/trpc-filter/slime"
```
Then, configure the following yaml:
```yaml
--- # retry/hedging strategies
retry1: &retry1 # a yaml anchor, which allows different services to share the same retry strategy
  # a random name is used if omitted.
  # if you need to customize backoff or retryable business errors, you must explicitly provide a name, which is
  # used as the first parameter of the slime.SetXXX methods.
  name: retry1
  # defaults to 2 when omitted.
  # at most 5; truncated to 5 if exceeded.
  max_attempts: 4
  backoff: # you must provide either exponential or linear
    exponential:
      initial: 10ms
      maximum: 1s
      multiplier: 2
  # when omitted, the following four framework errors are retried by default:
  # 21: RetServerTimeout
  # 111: RetClientConnectFail
  # 131: RetClientRouteErr
  # 141: RetClientNetErr
  # for tRPC-Go framework error codes, please refer to: https://github.com/trpc-group/trpc-go/tree/main/errs
  retryable_error_codes: [ 141 ]

retry2: &retry2
  name: retry2
  max_attempts: 4
  backoff:
    linear: [100ms, 500ms]
  retryable_error_codes: [ 141 ]
  skip_visited_nodes: false # omitted, false and true correspond to three different cases

hedging1: &hedging1
  # a random name is used if omitted.
  # if you need to customize hedging_delay or non-fatal errors, you must explicitly provide a name, which is used
  # as the first parameter of the slime.SetHedgingXXX methods.
  name: hedging1
  # defaults to 2 when omitted.
  # at most 5; truncated to 5 if exceeded.
  max_attempts: 4
  hedging_delay: 0.5s
  # when omitted, the following four errors are non-fatal by default:
  # 21: RetServerTimeout
  # 111: RetClientConnectFail
  # 131: RetClientRouteErr
  # 141: RetClientNetErr
  non_fatal_error_codes: [ 141 ]

hedging2: &hedging2
  name: hedging2
  max_attempts: 4
  hedging_delay: 1s
  non_fatal_error_codes: [ 141 ]
  skip_visited_nodes: true # omitted, false and true correspond to three different cases.

--- # client config
client: &client
  filter: [slime] # the filter must be used together with the plugin; both are required
  service:
    - name: trpc.app.server.Welcome
      retry_hedging_throttle: # all retry/hedging strategies under this service are bound to this throttle
        max_tokens: 100
        token_ratio: 0.5
      retry_hedging: # the service uses strategy retry1 by default
        retry: *retry1 # dereference retry1
      methods:
        - callee: Hello # use retry strategy retry2 instead of the parent service's retry1
          retry_hedging:
            retry: *retry2
        - callee: Hi # use hedging strategy hedging1 instead of the parent service's retry1
          retry_hedging:
            hedging: *hedging1
        - callee: Greet # an empty retry_hedging means no retry/hedging strategy
          retry_hedging: {}
        - callee: Yo # retry_hedging is missing; the parent service's retry1 is used by default
    - name: trpc.app.server.Greeting
      retry_hedging_throttle: {} # forcibly turn off the throttle
      retry_hedging: # the service uses hedging2 by default
        hedging: *hedging2
    - name: trpc.app.server.Bye
      # the throttle is missing; the default one is used.
      # there is no retry/hedging strategy at the service level.
      methods:
        - callee: SeeYou # SeeYou uses retry1 as its own retry strategy
          retry_hedging:
            retry: *retry1

plugins:
  slime:
    # we reference the entire client here. Of course, you can configure client.service separately under default.
    default: *client
```

> The above configuration file uses an important yaml feature, [references](https://en.wikipedia.org/wiki/YAML#Advanced_components).
> For duplicate nodes, you can reuse them by reference.

### Retry/Hedging Policy as an [Entity](https://en.wikipedia.org/wiki/Domain-driven_design#Building_blocks)

In the configuration above, we defined four retry/hedging strategies and referenced them in `client`. Besides its required parameters, each strategy has a field `name`, which serves as the **unique** identifier of the entity.
In the previous chapters, we mentioned some options, such as `WithDynamicHedgingDelay`, that cannot be expressed in the config file and must be set in code; `name` is the key for applying these options in code. Slime provides the following functions to set such extra options:
```Go
func SetHedgingDynamicDelay(name string, dynamicDelay func() time.Duration) error
func SetHedgingNonFatalError(name string, nonFatalErr func(error) bool)
func SetRetryBackoff(name string, backoff func(attempt int) time.Duration) error
func SetRetryRetryableErr(name string, retryableErr func(error) bool) error
```

Note that for the `backoff` of a retry strategy, you can only choose between `exponential` and `linear`. If you provide both, `exponential` takes precedence.

### Unification with Framework Config

In the plugin configuration `plugins`, the plugin type must be `slime` and the plugin name must be `default`. According to the configuration file, slime loads all retry/hedging strategies into one plugin, `default`, which provides an interceptor that automatically takes effect for all services and methods configured with retry/hedging.

As you may have noticed, the `client` key is similar to the client framework configuration, except for some new keys such as `retry_hedging` and `methods`. We deliberately designed it this way in order to reuse the original framework configuration.
If you plan to introduce slime into an existing client, you only need to add some keys under the `client` key of the framework configuration.

Hedging is the more aggressive retry strategy; when configuring retry/hedging strategies, you can only choose one of the two:
```yaml
retry_hedging:
  retry: *retry1
  # hedging: *hedging1 # do not configure hedging if you have chosen retry
```
If you configure both retry and hedging, hedging is used instead of retry.
If you configure `retry_hedging: {}`, the strategy is equivalent to disabling retry/hedging. Note that this is different from `retry_hedging:`: the former has the key `retry_hedging` with empty content, while the latter is equivalent to having no `retry_hedging` key at all.

You can specify a retry/hedging strategy for an entire service by adding the `retry_hedging` key under `service`, or narrow it down to a specific method by adding a `callee` entry under `methods`.

In the configuration file, the service `trpc.app.server.Welcome` uses `retry1` as its retry strategy.
`Hello` overrides the service retry strategy `retry1` with retry strategy `retry2`.
`Hi` overrides the service retry strategy `retry1` with hedging strategy `hedging1`.
`Greet` overrides the service strategy `retry1` with the **null strategy**.
`Yo` inherits the service strategy `retry1`.
Other methods that are not explicitly configured inherit the service strategy `retry1` by default.
All methods of the service `trpc.app.server.Greeting` use the hedging strategy `hedging2`.

### Throttle
In slime, throttling is per service.
By default, slime enables throttling for each service, configured as `max_tokens: 10` and `token_ratio: 0.1`.
You can also customize `max_tokens` and `token_ratio`, as the service `trpc.app.server.Welcome` does.
If you want to turn throttling off, configure it like this: `retry_hedging_throttle: {}`.

### Interceptor
When the slime plugin is initialized, it automatically registers the slime interceptor.
The interceptor `slime` must be added to `filter` to enable the slime plugin.
```yaml
client:
  filter: [slime]
  service:
    - # you can also add the interceptor at the service level
      #filter: [slime]
```
Slime generates multiple sub-requests; pay attention to its order relative to other interceptors.

### Skip the Visited Nodes
As described in the previous section, you can also specify in the configuration whether to skip already visited nodes.
`retry1` and `hedging1` do not configure `skip_visited_nodes`; they correspond to the first case. `retry2` explicitly sets `skip_visited_nodes` to `false`, which corresponds to the third case. `hedging2` explicitly sets `skip_visited_nodes` to `true`, which corresponds to the second case.

Note that this feature requires the cooperation of the load balancer. If the load balancer does not implement it, the behavior falls back to the third case.

### Disable Retry/Hedging for a Single Request
Slime supports disabling retry/hedging for a single request by deriving a new context.
This feature usually cooperates with trpc-database so that the retry/hedging configuration only takes effect for read (or idempotent) requests, while write requests skip it. For example, with trpc-database/mysql:
```go
c := mysql.NewClientProxy(/* omitted args */)
err := c.QueryRow(trpc.BackgroundContext(), /* omitted args */) // retry/hedging is enabled by default via config
_, err = c.Exec(slime.WithDisabled(trpc.BackgroundContext()), /* omitted args */) // disable retry/hedging for this request via ctx
```
Note that this feature is only available in slime; the basic slime/retry and slime/hedging packages do not provide it.

## Visualization
Slime provides two visualization capabilities: conditional logs and metrics.
### Conditional Log
Both hedging and retry have an option called `WithConditionalLog`.
[This one](https://github.com/trpc-ecosystem/go-filter/blob/779875aa0e61af72a7fe5792c7833ceef110adad/slime/retry/retry.go#L237) is for retry, [this one](https://github.com/trpc-ecosystem/go-filter/blob/779875aa0e61af72a7fe5792c7833ceef110adad/slime/hedging/hedging.go#L188) is for hedging, and these two ([retry](https://github.com/trpc-ecosystem/go-filter/blob/779875aa0e61af72a7fe5792c7833ceef110adad/slime/opts.go#L247), [hedging](https://github.com/trpc-ecosystem/go-filter/blob/779875aa0e61af72a7fe5792c7833ceef110adad/slime/opts.go#L96)) are for slime.

Conditional logging requires two parameters. One is a `log.Logger`:
```go
type Logger interface {
	Println(string)
}
```
The other is a condition function, `func(stat view.Stat) bool`.

The `view.Stat` passed to the condition function describes how the application-layer request was executed. Based on this data, you can decide whether to output retry/hedging logs. For example, the following condition function tells slime to output a log only when there were three attempts in total, the first two never completed, and the third succeeded:
```go
var condition = func(stat view.Stat) bool {
	attempts := stat.Attempts()
	return len(attempts) == 3 &&
		attempts[0].Inflight() &&
		attempts[1].Inflight() &&
		!attempts[2].Inflight() &&
		attempts[2].Error() == nil
}
```

`Logger` only needs a simple `Println(string)` method; you can wrap one around any log library. For example, here is a console-based logger:
```go
type ConsoleLog struct{}

func (l *ConsoleLog) Println(s string) {
	log.Println(s)
}
```
Here is a slime log on the console:
![logs](./.docs/logs.png)
There are a few points to note:
* All slime logs for one application-layer request are emitted through a single `Println` call on `log.Logger` — slime calls this lazy logging — as shown in the first line of the screenshot.
* Slime's logs are formatted with newlines, tabs, etc.
* The last slime log line is a summary of all attempts.
### Metrics
Like conditional logs, retry/hedging metrics are also based on [`view.Stat`](./view/stat.go).

Slime provides four metrics: the number of application-layer requests, the number of actual requests, application-layer latency, and actual latency.
All metrics carry three tags: caller, callee, and method.
The application-layer request count and latency carry these additional tags: the total number of attempts, the error code of the final error, whether the request was throttled, the number of unfinished attempts (which can be non-zero only for hedging), and whether the server prohibited retry/hedging.
The actual request count and latency carry these additional tags: the error code, whether the request did not complete, and whether the server explicitly prohibited retry/hedging.

#### Prometheus
Slime supports reporting metrics to Prometheus. Import the dependency:
```go
import prom "trpc.group/trpc-go/trpc-filter/slime/view/metrics/prometheus"
```
Use `prom.NewEmitter` to initialize an Emitter.
For how to use Prometheus, refer to its [official documentation](https://prometheus.io/docs/guides/go-application/).
diff --git a/slime/README_zh_CN.md b/slime/README_zh_CN.md new file mode 100644 index 0000000..1a89e92 --- /dev/null +++ b/slime/README_zh_CN.md @@ -0,0 +1,388 @@ +[English](./README.md) | 中文 + +## 支持的协议 +**请不要对非幂等请求开启重试/对冲功能**。 +并非所有协议都能使用重试/对冲。 +对于非 trpc 协议,可能并不适用 Slime 一章的 yaml 配置,这时,你可以直接使用基础包。 + +| 协议 | 重试 | 对冲 | 备注 | +|:-:|:-:|:-:|:-| +|trpc|✓|✓| 原生的 trpc 协议。 | +|trpc SendOnly|✗|✗| 不支持,重试/对冲根据返回的错误码进行判断,而 SendOnly 请求不会回包。 | +|trpc 流式|✗|✗| 暂不支持。 | +|[http](https://github.com/trpc-group/trpc-go/tree/main/http)|✓|✓|| +|[Kafaka](https://github.com/trpc-ecosystem/go-database/tree/main/kafka)|✓|✗| 不支持对冲功能。 | +|[MySQL](https://github.com/trpc-ecosystem/go-database/tree/main/mysql)|★|★| 除 [Query](https://github.com/trpc-ecosystem/go-database/blob/6f75e87fecfc5411e54d93fd1aad5e7afa9a0fcf/mysql/client.go#L40) 和 [Transaction](https://github.com/trpc-ecosystem/go-database/blob/6f75e87fecfc5411e54d93fd1aad5e7afa9a0fcf/mysql/client.go#L42) 两个方法外,其他都支持。这两个方法以函数闭包作为参数,slime 无法保证数据的并发安全性,可以使用 `slime.WithDisabled` 关闭重试/对冲。 | + +## 前言 +重试是一个很朴素的想法,当原请求失败时,发起重试请求。狭义的重试是一个比较保守的策略,只有当上次请求失败后,才会触发新的请求。对响应时间有要求的用户可能希望使用一种更加激进的策略,**对冲策略**。Jeffrey Dean 在 [the tail at scale](https://cacm.acm.org/magazines/2013/2/160173-the-tail-at-scale/pdf) 中首次提到了策略,以解决扇出数很大时,长尾请求对整个请求时延的影响。 + +简单地讲,对冲策略并不是被动地等待上一次请求超时或失败。在对冲延迟时间(小于超时时间)内,如果未收到成功的回包,就会再触发一个新的请求。与重试策略不同的是,同一时间可能有多个 in-flight 请求。第一个成功的请求会被交给应用层,其他请求的回包会被忽略。 + +注意,这两种策略具有互斥性,用户只能二选一。 + +重试策略实现比较简单。对冲策略业界也有了一些实现: +* [gRPC](https://github.com/grpc/grpc-java):[A6-client-retries.md](https://github.com/grpc/proposal/blob/master/A6-client-retries.md) 详细介绍了 gRPC 的设计方案。gRPC-java 已经实现了该方案。 +* [bRPC](https://github.com/apache/incubator-brpc):在 bRPC 中,hedging request 被称为 backup request。这个[文档](https://github.com/apache/incubator-brpc/blob/master/docs/cn/backup_request.md)作了粗略的介绍,其 c++ 实现也比较简单。 +* [finagle](https://github.com/twitter/finagle):finagle 是一个 java 的 RPC 开源框架,它也实现了 [backup 
request](https://twitter.github.io/finagle/guide/MethodBuilder.html#backup-requests)。
+* [pegasus](https://github.com/apache/incubator-pegasus):pegasus 是一个 kv 型数据库,它通过 [backup request](https://github.com/apache/incubator-pegasus/issues/251) 来支持从多副本同时读取数据以提高性能。
+* [envoy](https://www.envoyproxy.io/docs/envoy/latest/):envoy 作为一个代理服务,在云原生中有广泛应用。它也支持了 [request hedging](https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/http/http_routing#request-hedging)。
+
+本文将介绍 tRPC 框架的重试和对冲能力。在下一章,我们简要介绍了重试对冲的基本原理。随后两章介绍了更多的实现细节。我们介绍了重试/对冲的基础包,然后是对基础包进行管理的 slime,它可以为你提供基于 yaml 的配置能力。最后,我们列举了一些你可能会有疑问的点。
+
+## 原理
+在本章中,我们将通过两张图展示对冲和重试的基本原理,并简要介绍一些其他你可能需要关注的能力。
+
+### 重试策略
+顾名思义,对错误的回包进行重试。
+
+!['image.png'](./.docs/retry.png)
+
+上图中,client 一共尝试了三次:橙、蓝、绿。前两次都失败了,并且在每一次尝试前都会随机退避一段时间,以防止请求毛刺。最终第三次尝试成功了,并返回给了应用层。另外,也可以看到,对于每次尝试,我们都会尽可能地将请求发往不同的节点。
+
+一般,重试策略需要有以下配置:
+- 最大重试次数:一旦耗尽,便返回最后一个错误。
+- 退避时间:实际的退避时间取的是 random(0, delay)。
+- 可重试错误码:如果返回的错误是不可重试的,就立刻停止重试,并将错误返回给应用层。
+
+### 对冲策略
+正如我们在前言中介绍的,对冲可以看作是一种更加激进的重试,它比重试更复杂。
+
+!['image.png'](./.docs/hedging.png)
+
+上图中,client 一共尝试了 4 次:橙、蓝、绿、紫。
+橙色是第一次尝试。在由 client 发起后,server2 很快便收到了。但是 server2 因为网络等问题,直到绿色请求成功,并返回给应用层后,它的正确回包才姗姗来迟。尽管它成功了,但我们必须丢弃它,因为我们已经将另一个成功的回包返回给应用层了。
+蓝色是第二次尝试。因为橙色请求在对冲时延(hedging delay)后还没有回包,因此我们发起了一次新的尝试。这次尝试选择了 server1(我们会尽可能地为每次尝试选择不同的节点)。蓝色尝试的回包比较快,在对冲时延之前便返回了。但是却失败了。我们**立刻**发起了新一次尝试。
+绿色是第三次尝试。尽管它的回包可能有点慢(超过了对冲时延,因此又触发了一次新的尝试),但是它成功了!一旦我们收到第一个成功的回包,便立刻将它返回给了应用层。
+紫色是第四次尝试。刚发起后,我们便收到了绿色成功的回包。对紫色来说,它可能处于很多状态:请求还在 client tRPC 内,这时,我们有机会取消它;请求已经进入了 client 的内核或者已经由网卡发出,无论如何,我们已经没有机会取消它了。紫色请求上的标记表示我们会尽可能地取消紫色请求。注意,即使紫色请求最终成功地到达了 server2,它的回包也会像橙色一样被丢弃。
+
+可以看到,对冲更像是一种添加了**等待时间**的**并发**重试。需要注意的是,对冲没有退避机制,一旦它收到一个错误回包,就会立刻发起新的尝试。通常,我们建议,只有当你需要解决请求的长尾问题时,才使用对冲策略。普通的错误重试请使用更加简单明了的重试机制。
+
+一般,对冲会有以下配置:
+- 最大重试次数:一旦耗尽,便等待并返回最后一个回包,无论它成功还是失败。
+- 对冲时延:在对冲时延内没有收到回包时便会立刻发起新的尝试。
+- 非致命错误:返回致命错误会立刻中止对冲,等待并返回最后一个回包,无论它成功还是失败。返回非致命错误会立刻触发一次新的尝试(对冲时延计时器会被重置)。
+
+### 拦截器次序
+在 tRPC-Go 中,对冲/重试功能是在拦截器中实现的。
+ 
+一个应用层请求在经过重试/对冲拦截器后,可能会产生多个子请求,每个子请求都执行一遍后续的拦截器。
+对于监控类拦截器,你必须注意它们与重试/对冲拦截器的相对位置。如果它们位于重试/对冲之前,那么应用层每一个请求它们只会统计一次;如果它们位于重试/对冲之后,那么,每一次重试对冲请求它们都会统计。
+
+当你使用重试/对冲拦截器时,请务必多思考一下它与其他拦截器的相对关系。
+
+### Server Pushback
+server pushback 用于服务端显式地控制客户端的重试/对冲策略。
+当服务端负载比较高,希望客户端降低重试/对冲频率时,可以在回包中指定延迟时间 T,客户端会将下一次重试/对冲子请求延迟 T 时间后执行。
+该功能更常用于服务端指示客户端停止重试/对冲,通过将 delay 设置为 `-1` 即可。
+
+一般情况下,你不应该关心是否需要设置 server pushback。在后续规划中,框架会根据服务当前的负载情况,自动决定如何设置 server pushback。
+
+### 负载均衡
+因为重试/对冲是以拦截器的方式实现的,而负载均衡发生在拦截器之后,因此,每一个子请求都会触发一次负载均衡。
+
+!['image.png'](./.docs/loadbalance.png)
+
+对于对冲请求,你可能希望每个子请求发往不同的节点。我们实现了一个机制,允许多个子请求间进行通信,以获取其他子请求已经访问过的节点。负载均衡器可以利用该机制,只返回未访问过的节点。当然,这需要负载均衡器的配合,目前只有两个框架内置的随机负载均衡策略支持。
+如果你使用的负载均衡器不支持跳过已经访问过的节点,也不用灰心丧气。一般情况下,轮询或随机的负载均衡器本身就在某种意义上实现了子请求发往不同的节点,即使偶尔发往了同一个节点,也不会有什么大问题。而对于特殊的 hash 类负载均衡器(按某个特定的 key 路由到特定的一个节点,而非一类节点),它可能根本无法支持这个功能,事实上,在这类负载均衡器上使用对冲策略是没有意义的。
+
+## retry hedging 基础包介绍
+本章只是简要介绍重试/对冲的基础包,以作为后一章的基础。尽管我们提供了一些使用范例,但还是请尽量避免直接在应用层使用它们。你应该通过 Slime 来使用重试/对冲功能。
+
+### [retry](./retry)
+[retry](./retry) 包提供了基础的重试策略。
+
+`New` 创建一个新的重试策略,你必须指定最大重试次数和可重试错误码。你也可以通过 `WithRetryableErr` 自定义可重试错误,它和可重试错误码是或关系。
+
+retry 提供了两种默认的退避策略:`WithExpBackoff` 和 `WithLinearBackoff`。你也可以通过 `WithBackoff` 自定义退避策略。这三种退避策略至少需要提供一种,如果你提供了多个,它们的优先级为:
+`WithBackoff` > `WithExpBackoff` > `WithLinearBackoff`
+
+你可能会奇怪,为什么 `WithSkipVisitedNodes(skip bool)` 有一个额外的 `skip` 布尔变量?事实上,我们在这里区分了三种情形:
+1. 用户未显式地指定是否跳过已访问过的节点;
+2. 用户显式地指定跳过已访问过的节点;
+3. 
用户显式地指定不要跳过已访问过的节点。
+
+这三种状态会对负载均衡产生不同的影响。
+对第一种情形,负载均衡应该尽可能地返回未访问过的节点。如果所有节点都已经访问过了,我们允许它返回一个已经访问过的节点。这是默认策略。
+对第二种情形,负载均衡必须返回未访问过的节点。如果所有节点都已经访问过了,它应该返回无可用节点错误。
+对第三种情形,负载均衡可以随意返回任何节点。
+如 2.5 节中描述的,`WithSkipVisitedNodes` 需要负载均衡的配合。如果负载均衡器未实现该功能,无论用户是否调用了该 option,最终都对应于第三种情形。
+
+`WithThrottle` 可以为该策略指定限流器。
+
+你可以通过以下方式为某次 RPC 请求指定重试策略:
+```Go
+r, _ := retry.New(4, []int{errs.RetClientNetErr}, retry.WithLinearBackoff(time.Millisecond*5))
+rsp, _ := clientProxy.Hello(ctx, req, client.WithFilter(r.Invoke))
+```
+
+### [hedging](./hedging)
+[hedging](./hedging) 包提供了基础的对冲策略。
+
+`New` 创建一个新的对冲策略。你必须指定最大重试次数和非致命错误码。你也可以通过 `WithNonFatalError` 自定义非致命错误,它和非致命错误码是或关系。
+
+hedging 包提供两种方式来设置对冲延时。`WithStaticHedgingDelay` 设置一个静态的延迟。`WithDynamicHedgingDelay` 允许你注册一个函数,每次调用时返回一个时间作为对冲延时。这两种方法是互斥的,多次指定时,后者会覆盖前者。
+
+`WithSkipVisitedNodes` 的行为与 retry 一致,请参考上节。
+
+`WithThrottle` 可以为对冲策略指定限流器。
+
+你可以通过以下方式为某次 RPC 请求指定对冲策略:
+```Go
+h, _ := hedging.New(2, []int{errs.RetClientNetErr}, hedging.WithStaticHedgingDelay(time.Millisecond*5))
+rsp, _ := clientProxy.Hello(ctx, req, client.WithFilter(h.Invoke))
+```
+
+### [throttle](./throttle)
+[throttle](./throttle) 用来限制重试/对冲时的写放大。
+
+`throttler` interface 提供了三个方法:
+```Go
+type throttler interface {
+	Allow() bool
+	OnSuccess()
+	OnFailure()
+}
+```
+每次发送重试/对冲子请求(不包括第一次请求),都会调用 `Allow`,如果返回 `false`,那么这个应用层请求的所有后续子请求都不会再执行,视作「最大对冲次数已经耗尽」。
+每当收到重试/对冲子请求的回包时,会根据情况调用 `OnSuccess` 或 `OnFailure`。更多细节还请参考 proposal。
+
+对冲/重试会产生写放大,而限流则是为了避免因重试/对冲造成服务雪崩。当你初始化一个如下的 throt,并将它绑定到一个 `Hello` RPC 时,
+```Go
+throt, _ := throttle.NewTokenBucket(10, 0.1)
+r, _ := retry.New(3, []int{errs.RetClientNetErr}, retry.WithLinearBackoff(time.Millisecond*5))
+tr := r.NewThrottledRetry(throt)
+rsp, _ := clientProxy.Hello(ctx, req, client.WithFilter(tr.Invoke))
+```
+因重试/对冲产生的总 `Hello` 请求数不会超过应用层次数的 110%(每一个成功的请求会使令牌加 0.1,每一个失败的请求会使令牌减少 1,相当于 10 个成功的请求才能换来一次重试/对冲的机会),突增的重试/对冲请求数(连续失败)不会大于 5(5 = 10 / 2,只有令牌数大于一半时,`Allow` 才会返回 `true`)。
+
+### 关于超时错误
+在 tRPC-Go 
中,[`RetClientTimeout`](https://github.com/trpc-group/trpc-go/blob/71941c0f7e32cec11d48c0d6b5c28122788f57e8/errs/errs.go#L47),即 101 错误,对应应用层超时。重试/对冲遵循该机制,只要 `ctx` 超时,就会立刻返回错误。因此,将 101 作为可重试/对冲错误码是没有意义的。对这种情况,我们建议你使用对冲功能,并配置合理的对冲时延(相当于对冲时延即为你期望的超时时间)。注意,对冲时延应该小于应用层超时时间。 + +## slime + +Slime 在 retry 和 hedging 两个基础包之上,提供了文件配置功能。利用 slime,你可以将重试/对冲策略统一管理在框架配置中。和其他 tRPC-Go 的插件一样,首先匿名导入 slime 包: +```go +import _ "trpc.group/trpc-go/trpc-filter/slime" +``` + +我们以下面这个 yaml 文件为例,介绍 slime 是如何解析配置文件的。 +```yaml +--- # 重试/对冲策略 +retry1: &retry1 # 这是 yaml 引用语法,可以允许不同 service 使用相同的重试策略 + # 省略时,将会随机生成一个名字。 + # 如果需要自定义 backoff 或可重试业务错误,必须显式地提供一个名字,它会用于 slime.SetXXX 方法的第一个参数。 + name: retry1 + # 省略时,将取默认值 2。 + # 最大不超过 5。超过时,将自动截断为 5。 + max_attempts: 4 + backoff: # 必须提供 exponential 或 linear 中的一个 + exponential: + initial: 10ms + maximum: 1s + multiplier: 2 + # 省略时,会默认重试以下四种框架错误: + # 21: RetServerTimeout + # 111: RetClientConnectFail + # 131: RetClientRouteErr + # 141: RetClientNetErr + # tRPC-Go 的框架错误码请参考:https://github.com/trpc-group/trpc-go/tree/main/errs + retryable_error_codes: [ 141 ] + +retry2: &retry2 + name: retry2 + max_attempts: 4 + backoff: + linear: [100ms, 500ms] + retryable_error_codes: [ 141 ] + skip_visited_nodes: false # 省略、false 和 true 对应三种不同情形 + +hedging1: &hedging1 + # 省略时,将会随机生成一个名字。 + # 如果需要自定义 hedging_delay 或者非致命错误,必须显式地提供一个名字,它会用于 slime.SetHedgingXXX 方法的第一个参数。 + name: hedging1 + # 省略时,将取默认值 2。 + # 最大不超过 5。超过时,将自动截断为 5。 + max_attempts: 4 + hedging_delay: 0.5s + # 省略时,以下四种错误默认为非致命错误: + # 21: RetServerTimeout + # 111: RetClientConnectFail + # 131: RetClientRouteErr + # 141: RetClientNetErr + non_fatal_error_codes: [ 141 ] + +hedging2: &hedging2 + name: hedging2 + max_attempts: 4 + hedging_delay: 1s + non_fatal_error_codes: [ 141 ] + skip_visited_nodes: true # 省略、false 和 true 对应三种不同情形 + +--- # 配置 +client: &client + filter: [slime] # filter 要和 plugin 相互配合,缺一不可 + service: + - name: trpc.app.server.Welcome + retry_hedging_throttle: # 该 service 下的所有重试/对冲策略都会和该限流绑定 + 
max_tokens: 100
+        token_ratio: 0.5
+      retry_hedging: # service 默认使用策略 retry1
+        retry: *retry1 # dereference retry1
+      methods:
+        - callee: Hello # 使用重试策略 retry2 覆盖 service 策略 retry1
+          retry_hedging:
+            retry: *retry2
+        - callee: Hi # 使用对冲策略 hedging1 覆盖 service 策略 retry1
+          retry_hedging:
+            hedging: *hedging1
+        - callee: Greet # retry_hedging 的内容为空,即不使用任何重试/对冲策略
+          retry_hedging: {}
+        - callee: Yo # 没有 retry_hedging,采用 service 默认策略 retry1
+    - name: trpc.app.server.Greeting
+      retry_hedging_throttle: {} # 强制关闭限流功能
+      retry_hedging: # service 默认使用策略 hedging2
+        hedging: *hedging2
+    - name: trpc.app.server.Bye
+      # 没有配置限流,使用默认限流
+      # 没有配置 service 级别的重试/对冲策略
+      methods:
+        - callee: SeeYou # 为 SeeYou 方法单独配置了重试策略
+          retry_hedging:
+            retry: *retry1
+
+plugins:
+  slime:
+    # 这里引用了整个 client。当然,你可以将 client.service 单独配在 default 下。
+    default: *client
+```
+
+> 上面的配置文件用到了 yaml 中的一个重要的特性,即[引用](https://en.wikipedia.org/wiki/YAML#Advanced_components)。对于重复节点,你可以通过引用复用它们。
+
+### 作为 [Entity](https://en.wikipedia.org/wiki/Domain-driven_design#Building_blocks) 的重试/对冲策略
+
+在上面的配置中,我们定义了四个重试/对冲策略,并在 `client` 中引用了它们。每种策略,除了必要的参数外,都有一个新的字段 `name`,用作实体的**唯一**标识。在上一章中,我们提到一些 option,如 `WithDynamicHedgingDelay`,它们无法在文件中配置,需要在代码中使用,这里的 `name` 就是在代码中使用这些 option 的关键。在 slime 中,我们提供了下面几种函数,来设置额外的 options。
+```Go
+func SetHedgingDynamicDelay(name string, dynamicDelay func() time.Duration) error
+func SetHedgingNonFatalError(name string, nonFatalErr func(error) bool)
+func SetRetryBackoff(name string, backoff func(attempt int) time.Duration) error
+func SetRetryRetryableErr(name string, retryableErr func(error) bool) error
+```
+
+注意,对于重试策略的 `backoff`,你只能在 `exponential` 和 `linear` 之间二选一。如果你同时提供了两个,我们将以 `exponential` 为准。
+
+### 与框架配置的统一
+
+在插件配置 `plugins` 中,插件类型必须是 `slime`,插件名必须是 `default`。slime 会根据配置文件,将所有的重试/对冲策略加载到一个插件中,即 default。default 则提供了拦截器,自动对所有配置了重试/对冲的 service 或方法生效。
+
+你可能发现了,`client` 键与客户端框架配置很像,除了它多了一些新的键,如 `retry_hedging`,`methods` 等。我们是刻意这么设计的,为了能够复用原始的框架配置。如果你打算在现有 client 中引入 slime,那么,你只需要在框架配置的 
`client` 键下新增一些键值即可。
+
+对冲是一种更加激进的重试策略。配置重试/对冲策略时,你只能在它们之间二选一:
+```yaml
+retry_hedging:
+  retry: *retry1
+  # hedging: *hedging1 # 选择了 retry 就不要再填 hedging 了
+```
+如果你既填了 retry,又填了 hedging,那么,我们会以 hedging 为准。
+如果你这么填 `retry_hedging: {}`,那么该策略等同于没有配置重试/对冲。注意,这与 `retry_hedging:` 不同,前者是配置了键 `retry_hedging`,但它的内容是空的,后者相当于没有键 `retry_hedging`。
+
+你可以为整个 service 指定一个重试/对冲策略,在 `service` 下添加 `retry_hedging` 键即可,也可以精细到具体某个方法,在 `methods` 中添加 `callee`。
+在配置文件中,service `trpc.app.server.Welcome` 使用了 `retry1` 作为重试策略。
+`Hello` 使用重试策略 `retry2` 覆盖了 service 重试策略 `retry1`。
+`Hi` 使用对冲策略 `hedging1` 覆盖了 service 重试策略 `retry1`。
+`Greet` 则使用**空策略**覆盖了 service 策略 `retry1`。
+`Yo` 显式地继承了 service 的策略 `retry1`。
+其他未显式配置的方法都默认继承了 service 的策略 `retry1`。
+服务 `trpc.app.server.Greeting` 的所有方法都使用对冲策略 `hedging2`。
+
+### 限流
+在 slime 中,限流是以 service 为单位的。
+slime 默认为每个 service 都开启限流功能,配置为 `max_tokens: 10` 和 `token_ratio: 0.1`。
+你也可以像 service `trpc.app.server.Welcome` 一样,自定义 `max_tokens` 和 `token_ratio`。
+如果你想关闭限流,需要这样配置:`retry_hedging_throttle: {}`。
+
+### 拦截器
+slime 插件在初始化时,会自动注册 slime 拦截器。
+要使 slime 插件生效,你必须在 `filter` 中指定 `slime` 拦截器:
+```yaml
+client:
+  filter: [slime]
+  service:
+    - # 你也可以将拦截器注册在服务内
+      #filter: [slime]
+```
+slime 会产生多个子请求,请注意它与其他拦截器的次序。
+
+### 跳过已访问过的节点
+正如我们在 4.1 节中描述的,你也可以在配置中指定是否跳过已经发送过请求的节点。
+`retry1` 和 `hedging1` 没有配置 `skip_visited_nodes`,它们对应第一种情形。`retry2` 显式地指定 `skip_visited_nodes` 为 `false`,它对应第三种情形。`hedging2` 显式地指定 `skip_visited_nodes` 为 `true`,它对应第二种情形。
+
+请注意,该功能需要负载均衡器配合。如果负载均衡器没有实现对应能力,那么都会对应到情形三。
+
+### 对某次请求关闭重试/对冲
+Slime 支持通过创建一个新的 context 来关闭某次请求的重试/对冲。
+该功能通常与 trpc-database 配合,让重试/对冲配置只对读请求(或者幂等请求)生效,而跳过写请求。比如,对于 trpc-database/mysql:
+```go
+c := mysql.NewClientProxy(/* omitted args */)
+err := c.QueryRow(trpc.BackgroundContext(), /* omitted args */) // 默认配置了重试/对冲
+_, err = c.Exec(slime.WithDisabled(trpc.BackgroundContext()), /* omitted args */) // 通过 ctx 为本次请求关闭重试/对冲
+```
+注意,该功能只对 slime 生效,slime/retry 和 slime/hedging 并不提供该功能。
+
+## 可视化
+Slime 
提供两种可视化能力,一个是条件日志,一个是监控打点。
+### 条件日志
+无论是对冲还是重试,它们都有一个名为 `WithConditionalLog` 的选项。[这](https://github.com/trpc-ecosystem/go-filter/blob/779875aa0e61af72a7fe5792c7833ceef110adad/slime/retry/retry.go#L237)是重试的,[这](https://github.com/trpc-ecosystem/go-filter/blob/779875aa0e61af72a7fe5792c7833ceef110adad/slime/hedging/hedging.go#L188)是对冲的,这两个([retry](https://github.com/trpc-ecosystem/go-filter/blob/779875aa0e61af72a7fe5792c7833ceef110adad/slime/opts.go#L247),[hedging](https://github.com/trpc-ecosystem/go-filter/blob/779875aa0e61af72a7fe5792c7833ceef110adad/slime/opts.go#L96))是 slime 的。
+条件日志需要两个参数,一个是 `log.Logger`
+```go
+type Logger interface {
+	Println(string)
+}
+```
+一个是条件函数 `func(stat view.Stat) bool`。
+
+条件函数中的 `view.Stat` 提供了一个应用层请求执行过程的状态。你可以根据这些数据,决定是否输出重试/对冲日志。比如,下面的条件函数告诉 slime,只有当一共重试了三次,且前两次都没有回包,而第三次成功时,才输出日志:
+```go
+var condition = func(stat view.Stat) bool {
+	attempts := stat.Attempts()
+	return len(attempts) == 3 &&
+		attempts[0].Inflight() &&
+		attempts[1].Inflight() &&
+		!attempts[2].Inflight() &&
+		attempts[2].Error() == nil
+}
+```
+
+`Logger` 只需要一个简单的 `Println(string)` 方法。你可以基于任何 log 库包装一个出来。比如,下面这个是基于控制台的 log:
+```go
+type ConsoleLog struct{}
+
+func (l *ConsoleLog) Println(s string) {
+	log.Println(s)
+}
+```
+这是一条 slime 在控制台输出的日志:
+!['image.png'](./.docs/logs.png)
+有几点你需要特别关注:
+* 一个应用层请求的所有 slime 日志对应 `log.Logger` 中的一次 `Println`,这在 slime 中称为 lazy log,就像截图中第一行显示的那样。
+* slime 的日志通过换行制表符等进行了格式化。
+* 最后一条 slime 日志是对所有尝试的汇总。
+
+### 监控
+与条件日志类似,重试/对冲的监控也是基于 [`view.Stat`](./view/stat.go) 的。
+
+slime 提供了四个监控项:应用层请求数、实际请求数、应用层耗时、实际耗时。
+
+所有监控项都有三种标签:caller、callee、method。
+
+对于应用层请求数与应用层耗时,它们具有以下额外标签:总尝试次数、最终错误的错误码、是否被限流、未完成的请求数(只有对冲才可能非零)、后端是否显式禁止重试/对冲。
+
+对实际请求数与实际耗时,它们具有以下额外标签:错误码、是否未完成、后端是否显式禁止重试/对冲。
+
+#### Prometheus
+Slime 支持 Prometheus 监控。引入依赖:
+```go
+import prom "trpc.group/trpc-go/trpc-filter/slime/view/metrics/prometheus"
+```
+使用 `prom.NewEmitter` 来初始化一个 Emitter。
+Prometheus 的使用方式可以参考[官方文档](https://prometheus.io/docs/guides/go-application/)。
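作为对前文 throttle 一节的补充,下面用一段独立的 Go 示意代码演示文中描述的令牌桶限流行为:成功加 `token_ratio`、失败减 1、仅当令牌数超过 `max_tokens` 的一半时 `Allow` 才返回 `true`。注意,这只是根据本文的文字描述写的示意实现,并非 slime 的实际源码;其中「令牌桶初始为满」是一个假设。

```go
package main

import "fmt"

// tokenBucket 是根据上文描述实现的示意限流器:
// 成功 +tokenRatio,失败 -1,仅当令牌数超过上限的一半时才允许重试/对冲子请求。
type tokenBucket struct {
	tokens     float64
	maxTokens  float64
	tokenRatio float64
}

func newTokenBucket(maxTokens, tokenRatio float64) *tokenBucket {
	// 假设:令牌桶初始为满。
	return &tokenBucket{tokens: maxTokens, maxTokens: maxTokens, tokenRatio: tokenRatio}
}

// Allow 只有在令牌数大于上限一半时返回 true。
func (t *tokenBucket) Allow() bool { return t.tokens > t.maxTokens/2 }

// OnSuccess 每个成功回包补充 tokenRatio 个令牌,不超过上限。
func (t *tokenBucket) OnSuccess() {
	t.tokens += t.tokenRatio
	if t.tokens > t.maxTokens {
		t.tokens = t.maxTokens
	}
}

// OnFailure 每个失败回包扣除 1 个令牌,不低于 0。
func (t *tokenBucket) OnFailure() {
	t.tokens--
	if t.tokens < 0 {
		t.tokens = 0
	}
}

func main() {
	tb := newTokenBucket(10, 0.1)
	// 连续失败 5 次后,令牌数降到 5,不再允许新的重试/对冲子请求。
	for i := 0; i < 5; i++ {
		tb.OnFailure()
	}
	fmt.Println(tb.Allow()) // false
}
```

可以看到,连续失败会很快耗尽重试/对冲的配额,而恢复配额需要约 `1/token_ratio` 个成功请求,这正是上文「10 个成功的请求才能换来一次重试/对冲的机会」的含义。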