Description
Several TestCounters subtests are intermittently failing in the e2e-stable CI step on a GKE Autopilot cluster because the Agones SDK sidecar gRPC server on 127.0.0.1:9357 returns connection refused instead of the expected domain-level error responses.
CI Build
- Cloud product: gke-autopilot
- Feature gates active: CountsAndLists=true,SidecarContainers=true,GKEAutopilotExtendedDurationPods=true,DisableResyncOnSDKServer=true
Note: the total failure count in this run (20) exceeded --rerun-fails-max-failures=10, so no automatic re-runs were attempted.
Failing Tests
All failing subtests share the same pattern — the test sends a UDP message to simple-game-server, which then calls the Agones SDK gRPC server on 127.0.0.1:9357. The expected response is a specific SDK-level error (e.g. out-of-range), but instead the game server returns a connection refused error:
Error Trace: /go/src/agones.dev/agones/test/e2e/gameserver_test.go:1651
Error: Not equal:
expected: "ERROR: could not increment Counter games by amount 50: rpc error: code = Unknown desc = out of range. Count must be within range [0,Capacity]. Found Count: 51, Capacity: 50\n"
actual : "ERROR: could not increment Counter games by amount 50: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:9357: connect: connection refused\"\n"
--- FAIL: TestCounters/IncrementCounter_Past_Capacity (0.08s)
Affected subtests:
- TestCounters/IncrementCounter_Past_Capacity
- TestCounters/IncrementCounter_Negative
- TestCounters/DecrementCounter_Past_Capacity
- TestCounters/SetCounterCount_Past_Capacity
- TestCounters/SetCounterCount_Past_Zero
Root Cause Analysis
TestCounters creates a single GameServer, waits for it to reach Ready, then runs all subtests against that shared GameServer. The failing subtests are the ones that trigger an SDK gRPC call from inside simple-game-server to the Agones SDK server on 127.0.0.1:9357.
A connection refused error on a loopback address means that nothing is listening on the SDK server gRPC port at the time of the call. With SidecarContainers=true, the SDK server runs as a proper Kubernetes sidecar container. A likely cause is a race condition: the SDK server has processed the Ready() call (causing the GameServer to transition to the Ready state), but the gRPC listener on port 9357 is momentarily unavailable. This could happen, for example, because of a brief disconnect/reconnect cycle in the SDK server, or because the listener has not yet been re-established after some internal state change triggered by the Ready() transition.
Since the subtests iterate over a Go map (random order), the failing ones are those that happen to be scheduled during the brief window when the port is unavailable.
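Because Go does not guarantee a map iteration order, the set of subtests that land in the failure window can differ from run to run. When trying to reproduce the failure, one option is to iterate the subtest names in sorted order. The sketch below shows this under the assumption that the subtests are keyed by name in a map; sortedNames is a hypothetical helper, not an existing function in the Agones test code, and it needs Go 1.18 or later for generics.

```go
package main

import "sort"

// sortedNames returns the keys of a table-driven test map in lexical order,
// so that subtests can be run in the same sequence on every invocation.
func sortedNames[T any](cases map[string]T) []string {
	names := make([]string, 0, len(cases))
	for name := range cases {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}
```

A test loop would then range over sortedNames(cases) and look up each case by name, instead of ranging over the map directly.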
This may be more likely to surface on GKE Autopilot due to the additional scheduling and networking latency compared to standard GKE.
Potential Solutions / Areas for Exploration
- Investigate SDK server stability after Ready() with SidecarContainers=true: Check whether the SDK server gRPC listener on port 9357 can ever become temporarily unavailable after the Ready() call is processed. A connection refused on loopback suggests the listener has stopped; this should not be possible for a stable sidecar, so it warrants investigation.
- Investigate DisableResyncOnSDKServer=true interaction: This feature gate is active in the e2e-stable run. It is worth confirming whether disabling SDK server resyncs has any effect on port availability around state transitions.
- Add retry in simple-game-server for SDK gRPC calls: The game server could retry the SDK connection on Unavailable errors with a short backoff before returning the error to the caller. This would make the test more resilient to transient SDK server unavailability.
- Add a post-Ready SDK connectivity check in the test framework: Before CreateGameServerAndWaitUntilReady returns, verify the SDK server port is actually accepting connections, ensuring the GameServer is truly ready for SDK interactions.