Changes upgrade game server template to use safe-to-evict: Always #4096

Merged Jan 22, 2025 (1 commit)

igooch
Collaborator

@igooch igooch commented Jan 21, 2025

What type of PR is this?

/kind bug

What this PR does / Why we need it:

TL;DR
This PR updates the upgrade test game server to use safe-to-evict: true (AKA eviction: Always). On autopilot clusters this changes the Node-Selectors for game server pods to <none>, the same as the balloon pods that are in place to prevent the need for scale-up. This should make spinning up new game server pods faster on autopilot clusters, and prevent frequent flakes.

A number of the upgrade test flakes appear to be due to issues creating backing pods for the game servers.

The logs on autopilot test clusters show scheduling warnings like the one below:

0/9 nodes are available: 2 Insufficient cpu, 2 Insufficient memory, 3 node(s) had untolerated taint {cloud.google.com/not-target-gke-version: true}, 4 node(s) didn't match Pod's node affinity/selector. preemption: 0/9 nodes are available: 2 Insufficient memory, 7 Preemption is not helpful for scheduling.

The game server is eventually assigned:

Successfully assigned default/sdk-client-test-p8sdx to gk3-gke-autopilot-upgrad-nap-9xmh9i3h-0c0fa920-cg7c

But then the pod repeatedly fails liveness probes:

Liveness probe failed: Get "http://10.8.132.8:8080/healthz": dial tcp 10.8.132.8:8080: connect: connection refused

The pod is then marked as unhealthy:

"SyncLoop (probe)" probe="liveness" status="unhealthy" pod="default/sdk-client-test-p8sdx"

The container is then killed, and the test fails because the game server is marked as unhealthy:

"Killing container with a grace period" pod="default/sdk-client-test-p8sdx" podUID="28085f3c-a7f1-4166-8552-3c48c4c38671" containerName="agones-gameserver-sidecar" containerID="containerd://c87b32ddf0c4f2d9848e2b5f996b0eb618a438019971235043290e15635e263b" gracePeriod=30

Both the "balloon" pods (evictable-pods-deployment) and the game server pods have the same labels and pod affinity:

      labels:
...
        agones.dev/role: gameserver
...
      affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - podAffinityTerm:
                labelSelector:
                  matchLabels:
                    agones.dev/role: gameserver
                topologyKey: kubernetes.io/hostname
              weight: 100

However, they do not have the same Node-Selectors. The evictable-pods-deployment pods have Node-Selectors: <none>, while the game server pods have Node-Selectors: cloud.google.com/extended-duration-pods=0. The difference arises because the game server has a default safe-to-evict value of false (Never). GKE Autopilot sets this node selector automatically whenever the game server spec's eviction.safe is set to Never or OnUpgrade. Setting eviction.safe to Always makes the Node-Selectors for game server pods <none>. The game server pods then have the same node affinity/selector as the "balloon" pods, so they can evict the "balloon" pods and spin up backing pods for the game servers more quickly.
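
For illustration, a minimal sketch of a game server spec with the eviction setting described above. This is not the actual template changed in this PR: the generateName prefix is inferred from the pod name in the logs, and the container name and image are placeholders. Only the eviction.safe field is the point of the example.

```yaml
# Hypothetical GameServer sketch; not the template from this PR.
apiVersion: agones.dev/v1
kind: GameServer
metadata:
  generateName: sdk-client-test-   # assumed prefix, inferred from the pod name in the logs
spec:
  eviction:
    safe: Always                   # the default is Never
  template:
    spec:
      containers:
        - name: sdk-client-test                      # placeholder container name
          image: example.com/sdk-client-test:latest  # placeholder image
```

With this setting, the Node-Selectors shown for the backing pod on an autopilot cluster should read <none>, the same as for the evictable-pods-deployment pods.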

Which issue(s) this PR fixes:

NA

Special notes for your reviewer:

@igooch igooch requested a review from vicentefb January 21, 2025 23:20
@github-actions github-actions bot added the kind/bug and size/XS labels Jan 21, 2025
@agones-bot
Collaborator

Build Succeeded 🥳

Build Id: d3a84d32-b83c-4191-8312-88135672f19a

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/4096/head:pr_4096 && git checkout pr_4096
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.47.0-dev-cc52905

@vicentefb
Collaborator

lgtm

@igooch igooch merged commit bebe915 into googleforgames:main Jan 22, 2025
4 checks passed
@igooch igooch deleted the upgrade-test branch January 22, 2025 04:46