What problem are you trying to solve?
Recently we faced a failure of our application, which consists of gRPC services, during an ungraceful node termination.
The problem manifested as TCP connections (carrying gRPC traffic) hanging for up to 15 minutes. Neither the application nor Linkerd could identify these connections as dead in any way; over those 15 minutes such connections accumulated and degraded application performance. Others have faced a similar problem in istio/istio#33466 and istio/istio#28865 when using Istio + Envoy.
In Linkerd, `TCP_KEEPALIVE` is set by default, but keepalive probes are not sent on a half-open TCP connection that has unacknowledged data in its send queue (https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die - "Idle ESTAB is forever"), so in this situation the socket option does not help. The root of the problem is that when the peer node is terminated ungracefully, the TCP socket ends up in a half-open state (https://www.excentis.com/blog/tcp-half-close-a-cool-feature-that-is-now-broken). In that case, if the client has packets queued for sending on the connection, the following happens:
- The TCP stack waits for the RTO (retransmission timeout).
- After the RTO fires, the packet is retransmitted.
- The stack then waits for an exponentially growing RTO before each further retransmission.
- By default, the TCP stack allows 15 such retries, which adds up to roughly ~15 minutes of waiting (see the sketch after this list).
- The connection is then considered broken and is closed with `ETIMEDOUT`.
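To see where the ~15 minutes comes from, here is a back-of-the-envelope sketch, assuming Linux's typical minimum RTO of 200 ms, doubling per retry and capped at 120 s (the numbers match the table in the Cloudflare post linked above):

```rust
// Back-of-the-envelope: cumulative wait with exponential RTO backoff.
// Assumes Linux defaults: initial RTO ~200 ms, doubling each retry,
// capped at TCP_RTO_MAX (120 s), net.ipv4.tcp_retries2 = 15.
fn main() {
    let rto_max = 120.0_f64;
    let retries = 15;

    let mut total = 0.0;
    let mut rto = 0.2_f64; // seconds
    // One RTO wait before each retransmission, plus a final wait
    // before the connection is declared dead with ETIMEDOUT.
    for attempt in 1..=retries + 1 {
        total += rto;
        println!("attempt {attempt:2}: rto = {rto:6.1}s, elapsed = {total:6.1}s");
        rto = (rto * 2.0).min(rto_max);
    }
    // Total comes out to ~924.6 s, i.e. roughly 15 minutes.
}
```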
The time to wait before declaring the connection broken can be controlled via `net.ipv4.tcp_retries2` (system-wide) or via the `TCP_USER_TIMEOUT` socket option (`TCP_USER_TIMEOUT` limits the time allotted for packet retransmission).
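For reference, this is roughly what setting the option looks like at the socket level. A minimal sketch using the `socket2` crate (a Linux-only option, gated behind socket2's "all" feature); linkerd-proxy's actual socket setup differs:

```rust
// Minimal sketch: setting TCP_USER_TIMEOUT with the `socket2` crate.
use socket2::{Domain, Protocol, Socket, Type};
use std::net::SocketAddr;
use std::time::Duration;

fn connect_with_user_timeout(addr: SocketAddr) -> std::io::Result<Socket> {
    let socket = Socket::new(Domain::IPV4, Type::STREAM, Some(Protocol::TCP))?;
    // Cap the time unacknowledged data may sit in the send queue:
    // if the peer never ACKs within 30s, the kernel aborts the
    // connection with ETIMEDOUT instead of retransmitting for ~15 min.
    socket.set_tcp_user_timeout(Some(Duration::from_secs(30)))?;
    socket.connect(&addr.into())?;
    Ok(socket)
}
```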
How should the problem be solved?
Set the `TCP_USER_TIMEOUT` option on all TCP connections (both inbound and outbound).
Setting `TCP_USER_TIMEOUT` on all connections prevents linkerd-proxy from exhausting resources on dead connections to clients or to the control plane (which is also susceptible), improving the fault tolerance of each individual linkerd-proxy.
Configuring `TCP_USER_TIMEOUT` is similar to configuring `TCP_KEEPALIVE`, so the following implementation is proposed:
- Make `TCP_USER_TIMEOUT` configurable from the environment, just like `TCP_KEEPALIVE`.
- Default `TCP_USER_TIMEOUT` to 30s: that is enough for ~7 packet retransmissions (with Linux's minimum RTO of 200 ms doubling on each attempt, 7 retransmissions take roughly 0.2 + 0.4 + ... + 12.8 ≈ 25 s), and beyond that the RTO grows so quickly that waiting longer gains little.
- It is safe to default `TCP_USER_TIMEOUT` to 30s: the setting only affects the peer that sets it and requires no coordination with the other peer. In this it resembles `TCP_KEEPALIVE`, which is already always set.
- Only the following parameters will be configured by default, covering traffic leaving the pod: `INBOUND_ACCEPT_USER_TIMEOUT` = 30s and `OUTBOUND_CONNECT_USER_TIMEOUT` = 30s (see the sketch after this list).
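A rough sketch of how the proxy could pick these up from the environment, mirroring how the keepalive settings are read. The variable names come from this proposal; the parsing helper is purely illustrative and is not linkerd's actual config code:

```rust
// Illustrative only: read a TCP_USER_TIMEOUT value from an env var such
// as "30s", falling back to a 30s default. The real proxy has its own
// config-parsing layer.
use std::{env, time::Duration};

fn user_timeout_from_env(var: &str) -> Duration {
    const DEFAULT: Duration = Duration::from_secs(30);
    env::var(var)
        .ok()
        .and_then(|raw| raw.trim_end_matches('s').parse::<u64>().ok())
        .map(Duration::from_secs)
        .unwrap_or(DEFAULT)
}

fn main() {
    let inbound = user_timeout_from_env("LINKERD2_PROXY_INBOUND_ACCEPT_USER_TIMEOUT");
    let outbound = user_timeout_from_env("LINKERD2_PROXY_OUTBOUND_CONNECT_USER_TIMEOUT");
    println!("inbound: {inbound:?}, outbound: {outbound:?}");
}
```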
Any alternatives you've considered?
`TCP_KEEPALIVE` - despite what one might assume, it does not work for half-open connections (no keepalive probes are sent while unacknowledged data sits in the send queue) - not suitable.
`net.ipv4.tcp_retries2` - requires running sysctl commands (not available inside linkerd-proxy) and affects every connection on the system, with potentially unpleasant side effects - not suitable.
`LINKERD2_PROXY_{INBOUND,OUTBOUND}_SERVER_HTTP2_KEEP_ALIVE_{INTERVAL,TIMEOUT}` - works only for HTTP/2 connections, leaving plain TCP connections that Linkerd also meshes uncovered; for example, a connection through an opaque port to MySQL or Redis could still hang without `TCP_USER_TIMEOUT` set - not suitable.
How would users interact with this feature?
No response
Would you like to work on this feature?
yes
Implement configuration of `LINKERD2_PROXY_INBOUND_ACCEPT_USER_TIMEOUT` and `LINKERD2_PROXY_OUTBOUND_CONNECT_USER_TIMEOUT` for linkerd-proxy. Default values of 30s are enough for the Linux TCP stack to complete about 7 packet retransmissions; after roughly 7 retransmissions the RTO grows rapidly, and there is little point in waiting longer. Setting `TCP_USER_TIMEOUT` between linkerd-proxy and the outside world is sufficient, since connections to containers in the same pod are more stable and reliable.
Fixes linkerd#13023
Signed-off-by: UsingCoding <[email protected]>
@olix0r for the problem to be fully solved, #13024 also needs to be merged; otherwise the parameters will not be set and `TCP_USER_TIMEOUT` will not take effect, since there is no default configuration.
Should we reopen the issue and wait for #13024 to be completed?