Description
TestGameServerAllocationReturnLabels is flaky and fails intermittently in e2e tests on GKE Autopilot with a nil pointer dereference panic.
Environment
- Cluster: gke-autopilot-1.34
- Test step: e2e-stable
- Build: https://console.cloud.google.com/cloud-build/builds/91967d42-0fee-4ba1-bafc-f4d80eae18f1;step=2?project=agones-images
Error
--- FAIL: TestGameServerAllocationReturnLabels (112.92s)
panic: runtime error: invalid memory address or nil pointer dereference [recovered, repanicked]
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x2c336c3]
goroutine 643 [running]:
testing.tRunner.func1.2({0x2f40480, 0x4948e50})
/usr/local/go/src/testing/testing.go:1872 +0x419
testing.tRunner.func1()
/usr/local/go/src/testing/testing.go:1875 +0x683
panic({0x2f40480?, 0x4948e50?})
/usr/local/go/src/runtime/panic.go:783 +0x132
agones.dev/agones/test/e2e.TestGameServerAllocationReturnLabels(0xc00055cc40)
/go/src/agones.dev/agones/test/e2e/gameserverallocation_test.go:1368 +0x983
testing.tRunner(0xc00055cc40, 0x33b6ed8)
/usr/local/go/src/testing/testing.go:1934 +0x21d
created by testing.(*T).Run in goroutine 1
/usr/local/go/src/testing/testing.go:1997 +0x9d3
Root Cause Analysis
The panic occurs at test/e2e/gameserverallocation_test.go:1368:
assert.Equal(t, t.Name(), gsa.Status.Metadata.Labels[role])
The test creates a Fleet with 1 replica and waits for it to become Ready via AssertFleetCondition. However, on GKE Autopilot, node provisioning can be slow due to scale-from-zero behavior. From the logs, the Fleet spent approximately 110 seconds (from 19:50:41 to 19:52:30) waiting for ReadyReplicas to go from 0 to 1.
The logs show the Fleet stuck at ReadyReplicas:0 while Replicas:1 for an extended period:
time="2026-01-24 19:50:42.290" level=info msg="Checking Fleet Ready replicas" expected=1 fleet=simple-fleet-1.0x7qjj fleetStatus="{Replicas:1 ReadyReplicas:0 ...}"
...
time="2026-01-24 19:52:27.056" level=info msg="Checking Fleet Ready replicas" expected=1 fleet=simple-fleet-1.0x7qjj fleetStatus="{Replicas:1 ReadyReplicas:0 ...}"
time="2026-01-24 19:52:30.251" level=info msg="Checking Fleet Ready replicas" expected=1 fleet=simple-fleet-1.0x7qjj fleetStatus="{Replicas:1 ReadyReplicas:1 ...}"
Whether AssertFleetCondition eventually passes or times out, if the allocation request is made while no GameServer is actually Ready, the allocation is returned in the UnAllocated state. In that state gsa.Status.Metadata is nil, so the test panics when it reads gsa.Status.Metadata.Labels.
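The failure mode can be reproduced in isolation. The following is a minimal sketch, not the Agones code: the `allocation`, `status`, and `metadata` types and the `readLabel` helper are hypothetical stand-ins that only mirror the shape of the fields the test touches, assuming `Metadata` is a pointer that stays nil when nothing was allocated.

```go
package main

import "fmt"

// Hypothetical stand-ins for the allocation types; they only mirror
// the fields the test reads and are not the real Agones definitions.
type metadata struct {
	Labels map[string]string
}

type status struct {
	State    string
	Metadata *metadata // assumed to stay nil when nothing was allocated
}

type allocation struct {
	Status status
}

// readLabel evaluates the same expression shape as line 1368 and
// reports whether evaluating it panicked.
func readLabel(gsa *allocation, key string) (value string, panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	// Dereferencing a nil Metadata pointer panics before the map lookup runs.
	return gsa.Status.Metadata.Labels[key], false
}

func main() {
	unallocated := &allocation{Status: status{State: "UnAllocated"}}
	_, panicked := readLabel(unallocated, "role")
	fmt.Println("unallocated panicked:", panicked)

	allocated := &allocation{Status: status{
		State:    "Allocated",
		Metadata: &metadata{Labels: map[string]string{"role": "demo"}},
	}}
	value, panicked := readLabel(allocated, "role")
	fmt.Println("allocated label:", value, "panicked:", panicked)
}
```

Note that a nil `Labels` map would be harmless here, since reading from a nil map returns the zero value; it is the nil `Metadata` pointer that triggers the panic.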
Observations
- The same test passed in Step #1 (e2e-feature-gates) but failed in Step #2 (e2e-stable)
- This suggests timing-dependent flakiness related to cluster state and node availability
- The panic caused subsequent tests to also fail with -1.00s duration (test runner aborted)
Suggested Fix
The test should verify that gsa.Status.State == GameServerAllocationAllocated before accessing gsa.Status.Metadata. The current code uses assert.Equal, which records the failure but does not stop the test, so the next line still runs:
assert.Equal(t, allocationv1.GameServerAllocationAllocated, gsa.Status.State) // line 1367
assert.Equal(t, t.Name(), gsa.Status.Metadata.Labels[role]) // line 1368 - panics if State != Allocated
This should be changed to use require.Equal for the state check, or to add a nil check:
require.Equal(t, allocationv1.GameServerAllocationAllocated, gsa.Status.State)
// Now safe to access gsa.Status.Metadata
Or add an explicit nil check:
assert.Equal(t, allocationv1.GameServerAllocationAllocated, gsa.Status.State)
require.NotNil(t, gsa.Status.Metadata, "allocation metadata should not be nil for allocated state")
assert.Equal(t, t.Name(), gsa.Status.Metadata.Labels[role])
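Both checks could also be folded into a small guard that the test calls before reading labels. This is only a sketch under stated assumptions: `allocatedLabels`, the `stateAllocated` constant, and the simplified `allocation` types are hypothetical and stand in for the real Agones types, which are not reproduced here.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical, simplified stand-ins for the allocation types.
type metadata struct {
	Labels map[string]string
}

type status struct {
	State    string
	Metadata *metadata
}

type allocation struct {
	Status status
}

// stateAllocated is a placeholder for the real Allocated state constant.
const stateAllocated = "Allocated"

// allocatedLabels returns the labels only after confirming that the
// allocation succeeded and carries metadata, so callers never
// dereference a nil Metadata pointer.
func allocatedLabels(gsa *allocation) (map[string]string, error) {
	if gsa == nil {
		return nil, errors.New("allocation is nil")
	}
	if gsa.Status.State != stateAllocated {
		return nil, fmt.Errorf("allocation state is %q, want %q", gsa.Status.State, stateAllocated)
	}
	if gsa.Status.Metadata == nil {
		return nil, errors.New("allocation metadata is nil for allocated state")
	}
	return gsa.Status.Metadata.Labels, nil
}

func main() {
	// An unallocated result yields an error instead of a panic.
	if _, err := allocatedLabels(&allocation{Status: status{State: "UnAllocated"}}); err != nil {
		fmt.Println("error:", err)
	}

	// An allocated result with metadata yields its labels.
	gsa := &allocation{Status: status{
		State:    stateAllocated,
		Metadata: &metadata{Labels: map[string]string{"role": "demo"}},
	}}
	labels, err := allocatedLabels(gsa)
	fmt.Println(labels["role"], err)
}
```

In a test, the returned error would be checked with a fatal assertion (for example require.NoError) so that the label comparison is skipped when the allocation did not succeed.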