Flaky E2E: TestCounters subtests fail with SDK gRPC connection refused (127.0.0.1:9357) in e2e-stable #4461

@markmandel

Description

Several TestCounters subtests fail intermittently in the e2e-stable CI step on a GKE Autopilot cluster. Calls to the Agones SDK sidecar gRPC server on 127.0.0.1:9357 fail with connection refused instead of returning the expected domain-level error responses.

CI Build

Logs: https://console.cloud.google.com/cloud-build/builds/fbedf122-d973-46db-ba07-1fbe141601ae;step=2?project=agones-images

  • Cloud product: gke-autopilot
  • Feature gates active: CountsAndLists=true, SidecarContainers=true, GKEAutopilotExtendedDurationPods=true, DisableResyncOnSDKServer=true

Note: the total failure count in this run (20) exceeded --rerun-fails-max-failures=10, so no automatic re-runs were attempted.

Failing Tests

All failing subtests share the same pattern: the test sends a UDP message to simple-game-server, which then calls the Agones SDK gRPC server on 127.0.0.1:9357. The expected response is a specific SDK-level error (e.g. out-of-range), but the game server instead reports a connection refused error from the dial:

Error Trace: /go/src/agones.dev/agones/test/e2e/gameserver_test.go:1651
Error:       Not equal:
             expected: "ERROR: could not increment Counter games by amount 50: rpc error: code = Unknown desc = out of range. Count must be within range [0,Capacity]. Found Count: 51, Capacity: 50\n"
             actual  : "ERROR: could not increment Counter games by amount 50: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:9357: connect: connection refused\"\n"
--- FAIL: TestCounters/IncrementCounter_Past_Capacity (0.08s)

Affected subtests:

  • TestCounters/IncrementCounter_Past_Capacity
  • TestCounters/IncrementCounter_Negative
  • TestCounters/DecrementCounter_Past_Capacity
  • TestCounters/SetCounterCount_Past_Capacity
  • TestCounters/SetCounterCount_Past_Zero

Root Cause Analysis

TestCounters creates a single GameServer, waits for it to reach Ready, then runs all subtests against that shared GameServer. The failing subtests are the ones that trigger an SDK gRPC call from inside simple-game-server to the Agones SDK server on 127.0.0.1:9357.

connection refused on a loopback address means nothing was listening on the SDK server gRPC port at the time of the call. With SidecarContainers=true, the SDK server runs as a proper Kubernetes sidecar container. Because simple-game-server issues Ready() over this same gRPC server, the listener must have been up earlier, so the port appears to have become unavailable after the GameServer reached Ready. A likely cause is a race condition: the SDK server has processed the Ready() call (moving the GameServer to the Ready state), but the gRPC listener on port 9357 is momentarily unavailable. This could happen, for example, through a brief disconnect/reconnect cycle in the SDK server, or because the listener is not re-established immediately after some internal state change triggered by the Ready() transition.

Since the subtests iterate over a Go map (random order), the failing ones are those that happen to be scheduled during the brief window when the port is unavailable.

This may be more likely to surface on GKE Autopilot due to the additional scheduling and networking latency compared to standard GKE.

Potential Solutions / Areas for Exploration

  1. Investigate SDK server stability after Ready() with SidecarContainers=true: Check whether the SDK server gRPC listener on port 9357 can ever become temporarily unavailable after the Ready() call is processed. A connection refused on loopback suggests the listener has stopped — this should not be possible for a stable sidecar, so this warrants investigation.

  2. Investigate DisableResyncOnSDKServer=true interaction: This feature gate is active in the e2e-stable run. It is worth confirming whether disabling SDK server resyncs has any effect on port availability around state transitions.

  3. Add retry in simple-game-server for SDK gRPC calls: The game server could retry the SDK connection on Unavailable errors with a short backoff before returning the error to the caller. This would make the test more resilient to transient SDK server unavailability.

  4. Add a post-Ready SDK connectivity check in the test framework: Before CreateGameServerAndWaitUntilReady returns, verify the SDK server port is actually accepting connections, ensuring the GameServer is truly ready for SDK interactions.

Metadata

Labels

  • area/tests: Unit tests, e2e tests, anything to make sure things don't break
  • help wanted: We would love help on these issues. Please come help us!
  • kind/bug: These are bugs.
