
Close backend connection when frontend is not found #417

Merged
merged 13 commits into kubernetes-sigs:master from data-close
Jan 18, 2023

Conversation

tallclair
Contributor

I don't actually understand how we're getting into this state, but this log line shows up with some frequency:

"could not get frontend client" err="can't find connID 2 in the frontends[a19ff692-407c-4326-89e2-c53d8e4d2617]" serverID="1768996f-7935-40e8-9e24-51de50e81663" agentID="a19ff692-407c-4326-89e2-c53d8e4d2617" connectionID=2

This is a non-recoverable condition, indicating that somehow the frontend was disconnected but left the backend connection open. When this happens, reply to the backend with a close request.
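For context, here is a minimal sketch of the intended handling (not the PR's exact diff): when serveRecvBackend receives a DATA packet but no frontend is registered for the connection, it replies to the backend with a CLOSE_REQ. The `backend.Send` call and the konnectivity `client.Packet`/`client.CloseRequest` shapes are assumptions based on the existing code in this repo.

```go
frontend, err := s.getFrontend(agentID, resp.ConnectID)
if err != nil {
	klog.ErrorS(err, "could not get frontend client; closing connection",
		"agentID", agentID, "connectionID", resp.ConnectID)
	// Ask the backend to tear down its side of the connection instead of
	// leaving it open with nowhere to deliver data.
	closePkt := &client.Packet{
		Type: client.PacketType_CLOSE_REQ,
		Payload: &client.Packet_CloseRequest{
			CloseRequest: &client.CloseRequest{ConnectID: resp.ConnectID},
		},
	}
	if err := backend.Send(closePkt); err != nil {
		klog.ErrorS(err, "failed to send CLOSE_REQ to backend", "connectionID", resp.ConnectID)
	}
	continue // skip forwarding; there is no frontend to forward to
}
```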

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 17, 2022
@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Nov 17, 2022
@tallclair
Contributor Author

Actually, it's possible this was already fixed by #386. Even if that's the case, I think this is the correct way of handling this error, should it ever occur.

@tallclair
Contributor Author

It occurred to me that we should probably apply this same change everywhere a data packet is handled (client & agent too).

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Nov 22, 2022
```go
var wg sync.WaitGroup
verify := func() {
	defer wg.Done()

	// run test client
```
Contributor Author

Note to reviewers: This test was reusing the tunnel across multiple requests, which we explicitly say not to do. It worked before because the test only sets up a single backend, and the FE request fits in a single data packet. This PR makes the failure more explicit when the connection is reused, which broke this test.

@tallclair tallclair force-pushed the data-close branch 3 times, most recently from 43255e8 to b86eaa4 on December 2, 2022 19:59
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 9, 2022
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 13, 2023
@tallclair
Contributor Author

Rebased.

/assign @jkh52

@jkh52
Contributor

jkh52 commented Jan 13, 2023

TestServerProxyConnectionMismatch test failure looks deterministic.

@tallclair
Contributor Author

Fixed. I'm not sure why it wasn't failing before. The order isn't supposed to matter on those mocked calls (they're not in an InOrder(...) block), but for some reason it does... I don't have time to dig deeper right now, but reordering fixes it.

```diff
@@ -339,7 +339,15 @@ func (t *grpcTunnel) serve(tunnelCtx context.Context) {
 	conn, ok := t.conns.get(resp.ConnectID)

 	if !ok {
-		klog.V(1).InfoS("Connection not recognized", "connectionID", resp.ConnectID)
+		klog.ErrorS(nil, "Connection not recognized", "connectionID", resp.ConnectID)
```
Contributor

This could be an error, but it can also be a race condition. We used to have this as an error, and it tended to cause undue concern since it tends to happen when a connection is being shut down. t.conns missing a reference to connectID usually just indicates that the connection is being shut down, but that the other end sent a data packet before it realized the connection was going away.

Contributor Author

Hmm, good point. Especially with the fixes in this PR, it's pretty unlikely this would keep happening in another scenario. I'll change it back.

Not in this PR, but we may want to consider keeping a cache of recently closed connections. I've thought of a few cases where it would be useful.
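A rough sketch of what such a cache could look like; every name here is hypothetical and nothing like this is added in this PR:

```go
import (
	"sync"
	"time"
)

// recentlyClosed remembers connection IDs for a short TTL after close, so an
// unrecognized-connection log could distinguish "just closed" from "never existed".
type recentlyClosed struct {
	mu  sync.Mutex
	ttl time.Duration
	ids map[int64]time.Time // connID -> time the connection was closed
}

func newRecentlyClosed(ttl time.Duration) *recentlyClosed {
	return &recentlyClosed{ttl: ttl, ids: make(map[int64]time.Time)}
}

func (r *recentlyClosed) add(connID int64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	now := time.Now()
	for id, closedAt := range r.ids {
		if now.Sub(closedAt) > r.ttl {
			delete(r.ids, id) // opportunistic expiry keeps the map bounded
		}
	}
	r.ids[connID] = now
}

func (r *recentlyClosed) contains(connID int64) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	closedAt, ok := r.ids[connID]
	return ok && time.Since(closedAt) <= r.ttl
}
```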

Contributor

+1 to keep recent connections to understand this case better.

```diff
@@ -501,6 +501,18 @@ func (a *Client) Serve() {
 	ctx, ok := a.connManager.Get(data.ConnectID)
 	if ok {
 		ctx.send(data.Data)
+	} else {
+		klog.ErrorS(nil, "received DATA for unrecognized connection", "connectionID", data.ConnectID)
```
Contributor

Again, a little concerned about having this as an error.

Contributor Author

Done.

```go
for range recvCh {
	// Ignore values as this indicates there was a problem
	// with the remote connection.
}
```
Contributor

Can we add a debug message to let us know how many packets we dropped? Might even be worth adding a metric.

Contributor Author

Done. Also added to the agent version of this.
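For illustration, a sketch of the drain with a drop counter (the helper name and log wording are illustrative, not the exact code added here):

```go
// drainAndCount consumes recvCh until it is closed and reports how many
// packets were discarded, so the caller can log or record a metric.
func drainAndCount(recvCh <-chan *client.Packet) int {
	dropped := 0
	for range recvCh {
		dropped++
	}
	return dropped
}

// At the call site:
if n := drainAndCount(recvCh); n > 0 {
	klog.V(2).InfoS("Discarded packets from closed connection", "count", n)
}
```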

```diff
 	if firstConnID == 0 {
 		firstConnID = connID
 	} else if firstConnID != connID {
-		klog.V(5).InfoS("Data does not match first connection id", "fistConnectionID", firstConnID, "connectionID", connID)
+		klog.ErrorS(nil, "Data does not match first connection id", "fistConnectionID", firstConnID, "connectionID", connID)
```
Contributor

fist?

Contributor

Do we believe we have seen this? If so I would very much like to see a metric added.

Contributor Author

Fixed. This could happen if a frontend client was attempting to reuse a tunnel. Agree on the metric, but I'd add it as part of a generic connection_closed metric with a close_reason. Basically the post-dial equivalent of the dial_failures metric: #410

Contributor

Not sure even a reuse will cause a new connID; the tunnel sets the connID while the tunnel is being set up. The connection ID is set by the agent at https://github.com/kubernetes-sigs/apiserver-network-proxy/blob/master/pkg/agent/client.go#L404. That means it will only be created in response to a dial request, which is created by the client at `Type: client.PacketType_DIAL_REQ`. Within the client, that is only called from `CreateSingleUseGrpcTunnelWithContext(createCtx, tunnelCtx context.Context, address string, opts ...grpc.DialOption) (Tunnel, error)`. So reusing a connection on a tunnel should not create a new connection ID.

pkg/server/server.go (outdated review thread, resolved)
```diff
@@ -856,7 +856,8 @@ func (s *ProxyServer) serveRecvBackend(backend Backend, stream agent.AgentServic
 	klog.V(5).InfoS("Received data from agent", "bytes", len(resp.Data), "agentID", agentID, "connectionID", resp.ConnectID)
 	frontend, err := s.getFrontend(agentID, resp.ConnectID)
 	if err != nil {
-		klog.ErrorS(err, "could not get frontend client", "agentID", agentID, "connectionID", resp.ConnectID)
+		klog.ErrorS(err, "could not get frontend client; closing conenction", "agentID", agentID, "connectionID", resp.ConnectID)
```
Contributor

Data point: in a cluster I'm debugging (with a thrashing controller), I see ~20 of this error log, and ~400k of the info level log below (CLOSE_RSP case - "could not get frontend client for closing")

So we may need this in that case too.

Contributor Author

The close case just means the frontend terminated the connection before the backend responded with the CLOSE_RSP (it could also mean the server initiated the close for some reason). In either case, the backend will have already shut down any connection state after sending a CLOSE_RSP, so no need to respond with a CLOSE_REQ in that case.

That said, it's a good call that in the frontend CLOSE_REQ error cases, we should return a CLOSE_RSP from the server so the client can fail fast. I'll add that as a separate commit.
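A sketch of that fail-fast reply (the payload shape follows the konnectivity client proto; exact placement and error text in the PR may differ):

```go
// If the CLOSE_REQ cannot be forwarded (e.g. no backend for this connection),
// answer the frontend directly so it does not wait for a CLOSE_RSP that will
// never arrive.
closeRsp := &client.Packet{
	Type: client.PacketType_CLOSE_RSP,
	Payload: &client.Packet_CloseResponse{
		CloseResponse: &client.CloseResponse{
			ConnectID: connID,
			Error:     "no backend for connection",
		},
	},
}
if err := stream.Send(closeRsp); err != nil {
	klog.ErrorS(err, "failed to send CLOSE_RSP to frontend", "connectionID", connID)
}
```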

```go
// As the read side of the recvCh channel, we cannot close it.
// However readFrontendToChannel() may be blocked writing to the channel,
// so we need to consume the channel until it is closed.
for range recvCh {
```
Contributor

Is it possible to add unit test coverage?

What is the bad thing this prevents - a leaked goroutine? If so, I suggest making that explicit in this comment.

Contributor Author

Without this, the goroutine that actually receives the packets and queues them up in the channel could be blocked at `case recvCh <- in:` ("Send didn't block, carry on"), which would prevent it from ever reading an EOF from the `in, err := stream.Recv()` call (`if err == io.EOF { ... }`), which is what actually triggers the channel to be closed.

So yes, without this the goroutines could be deadlocked.

Note, this was copied from the agent client, which does the same thing: https://github.com/kubernetes-sigs/apiserver-network-proxy/blob/master/pkg/agent/client.go#L586-L597
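A self-contained toy example (plain Go, not project code) of the blocked-sender problem the drain avoids:

```go
package main

import "fmt"

// producer plays the role of the stream-reading goroutine: it blocks on every
// send, and only closes the channel once all sends have completed (the analogue
// of observing io.EOF from Recv and closing recvCh).
func producer(ch chan<- int, done chan<- struct{}) {
	for i := 0; i < 5; i++ {
		ch <- i // blocks if the consumer stops reading
	}
	close(ch)
	close(done)
}

func main() {
	ch := make(chan int) // like recvCh
	done := make(chan struct{})
	go producer(ch, done)

	first := <-ch
	fmt.Println("processed:", first)

	// The consumer stops processing here, but keeps draining the channel so the
	// producer can finish and close it. Without this loop both goroutines would
	// be stuck: the producer blocked on send, the consumer waiting on done.
	dropped := 0
	for range ch {
		dropped++
	}
	<-done
	fmt.Println("dropped:", dropped)
}
```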

Contributor Author
@tallclair tallclair Jan 14, 2023

I realized that serveRecvBackend also needs the same treatment, and added it there.

EDIT: Actually, this should never happen on the backend side. I'm going to leave the drain there as a safety measure though.

Contributor Author

I modified TestServerProxyRecvChanFull to verify that the deadlock condition doesn't occur. It's a bit contrived, but the scenarios where this would happen are when the frontend client is behaving correctly.

Note that getting the test to work required terminating serveRecvFrontend as soon as the CLOSE_REQ was sent to the backend. This change fixed the duplicate close packet.

Contributor

I think it's fine, but if we're worried we could add a draining-connection metric. Increment when we enter the defer and decrement when we exit. If this gets stuck draining recvCh, the metric will stay up.

Contributor

I think it's fine, but if we're worried we could add a draining-connection metric. Increment when we enter the defer and decrement when we exit. If this gets stuck draining recvCh, the metric will stay up.

IMO: keep the logs for now, and soon add a proxy-server metric for proxy connections labeled by state, similar to the client metric.
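A sketch of the suggested draining-connection gauge using prometheus client_golang (the metric name, labels, and placement are illustrative, not something this PR adds):

```go
import "github.com/prometheus/client_golang/prometheus"

// drainingConnections stays elevated if a drain loop ever gets stuck, which is
// exactly the signal being asked for here.
var drainingConnections = prometheus.NewGauge(prometheus.GaugeOpts{
	Namespace: "konnectivity_network_proxy",
	Subsystem: "server",
	Name:      "draining_frontend_connections",
	Help:      "Number of frontend connections currently being drained.",
})

func init() { prometheus.MustRegister(drainingConnections) }

// In the cleanup path:
defer func() {
	drainingConnections.Inc()
	defer drainingConnections.Dec()
	for range recvCh {
		// drop remaining packets
	}
}()
```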

```diff
@@ -339,7 +339,15 @@ func (t *grpcTunnel) serve(tunnelCtx context.Context) {
 	conn, ok := t.conns.get(resp.ConnectID)

 	if !ok {
-		klog.V(1).InfoS("Connection not recognized", "connectionID", resp.ConnectID)
+		klog.ErrorS(nil, "Connection not recognized", "connectionID", resp.ConnectID)
+		t.stream.Send(&client.Packet{
```
Contributor

Ideally we would add ObservePacket() and a conditional ObserveStreamError() here.

(See the other two stream Sends in this file, and note that t.stream is the raw stream.)

Contributor Author

Done. We may want to consider switching to a wrapper around the ProxyStream client that records the metrics automatically.
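A sketch of what such a wrapper might look like; the `metrics.Metrics.ObservePacket` / `ObserveStreamError` call shapes and the segment constant are assumptions, not verified against the client's metrics package:

```go
// Send wraps the raw stream so every outgoing packet is observed and stream
// errors are recorded, without each call site having to remember to do so.
func (t *grpcTunnel) Send(pkt *client.Packet) error {
	metrics.Metrics.ObservePacket(metrics.SegmentFromClient, pkt.Type) // assumed helper
	err := t.stream.Send(pkt)
	if err != nil && err != io.EOF {
		metrics.Metrics.ObserveStreamError(metrics.SegmentFromClient, err, pkt.Type) // assumed helper
	}
	return err
}
```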

```diff
@@ -520,16 +530,24 @@ func (s *ProxyServer) serveRecvFrontend(stream client.ProxyService_ProxyServer,
 	connID := pkt.GetData().ConnectID
 	data := pkt.GetData().Data
 	klog.V(5).InfoS("Received data from connection", "bytes", len(data), "connectionID", connID)
 	if backend == nil {
 		klog.V(2).InfoS("Backend has not been initialized for the connection. Client should send a Dial Request first", "connectionID", connID)
```
Contributor

Seems more likely the client sent the data packet on the wrong stream than that they attempted to send a data packet before a dial request.

Contributor Author

Agreed. This was copied from the log message under the DATA case, but it's weird in both places. I just removed the second sentence.

@tallclair
Contributor Author

/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Jan 14, 2023
```diff
@@ -339,7 +339,20 @@ func (t *grpcTunnel) serve(tunnelCtx context.Context) {
 	conn, ok := t.conns.get(resp.ConnectID)

 	if !ok {
-		klog.V(1).InfoS("Connection not recognized", "connectionID", resp.ConnectID)
+		klog.ErrorS(nil, "Connection not recognized", "connectionID", resp.ConnectID, "packetType", "DATA")
```
Contributor

I thought we were taking this back to Info?

Contributor Author

The connection ID is only removed from t.conns when receiving a CLOSE_RSP from the server, not for a client-side close. So the only way this would be hit is if the server sent a DATA packet after a CLOSE_RSP, which should never happen.

Note that I did change the CLOSE_RSP version of this log (line 374) to an info log, as there are cases where multiple CLOSE_RSP packets might be sent.

Contributor
@cheftako cheftako Jan 18, 2023

I buy that argument (for now) but it suggests that we only clean up the connection map if we get a CLOSE_RSP packet. That seems like it could represent a possible leak. If we decide to fix that possible leak then we may need to revisit this log message.

@cheftako
Contributor

I really like the drains, good catch. There are a couple of minor issues. Happy to lgtm/approve as soon as they are fixed.

@jkh52
Contributor

jkh52 commented Jan 17, 2023

I really like the drains, good catch. There are a couple of minor issues. Happy to lgtm/approve as soon as they are fixed.

+1, lots of good improvements here. I'm excited to tag a release after merging this and observe the agent memory improvement.

@cheftako
Contributor

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 18, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cheftako, tallclair

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 18, 2023
@k8s-ci-robot k8s-ci-robot merged commit 96f0407 into kubernetes-sigs:master Jan 18, 2023
jkh52 pushed a commit to jkh52/apiserver-network-proxy that referenced this pull request Jan 18, 2023
Close backend connection when frontend is not found (#417)

* [server] Close backend connection when frontend is not found

* [server] Handle non-recoverable frontend errors

* [agent] Handle unrecognized connections

* [agent] clean up typos

* [client] Handle unrecognized connection

* Fix concurrent_test

* Change unknown connection logs to Info level V2

* Log & test drained recv packets

* [server] Return a CLOSE_RSP to the frontend when failing the CLOSE_REQ

* [client] observe close_req metrics

* [server] clean up uninitialized backend log message

* [client] Use Send wrapper for metrics observation

* Fail fast when a DATA packet is missing a connection ID