upodroid (Member) commented Nov 2, 2025

This PR tweaks the instance dumping logic:

  1. Prioritises instances that fail to join the cluster
  2. Uses the control plane instances as an SSH bastion if no bastion exists and the control plane instances have public IPs

We need this for troubleshooting infra flakes with kops.
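
As a rough illustration of the two changes listed above (this is not the code in this PR), the selection logic might look like the following Go sketch. The `Instance` type, its fields, and the `prioritize` and `pickBastion` helpers are hypothetical names invented for this example; the real kops types differ.

```go
package main

import "sort"

// Instance is a hypothetical, simplified view of a cluster VM.
// It is an assumption for this sketch, not a kops type.
type Instance struct {
	Name         string
	PublicIP     string // empty when the instance has no external address
	ControlPlane bool
	Joined       bool // true once the node has registered with the cluster
}

// prioritize returns a copy of instances ordered so that the ones that
// failed to join the cluster come first. The sort is stable, so the
// relative order inside each group is preserved.
func prioritize(instances []Instance) []Instance {
	out := append([]Instance(nil), instances...)
	sort.SliceStable(out, func(i, j int) bool {
		return !out[i].Joined && out[j].Joined
	})
	return out
}

// pickBastion returns the address to use as an SSH jump host.
// A dedicated bastion wins when one exists; otherwise the first control
// plane instance with a public IP is used. The boolean is false when no
// usable jump host was found.
func pickBastion(bastionAddr string, instances []Instance) (string, bool) {
	if bastionAddr != "" {
		return bastionAddr, true
	}
	for _, in := range instances {
		if in.ControlPlane && in.PublicIP != "" {
			return in.PublicIP, true
		}
	}
	return "", false
}
```

With this shape, a dumper would walk `prioritize(instances)` in order and stop when it runs out of time or budget, so the instances most likely to explain a flake are collected first.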

k8s-ci-robot added the labels cncf-cla: yes (indicates the PR's author has signed the CNCF CLA) and size/L (denotes a PR that changes 100-499 lines, ignoring generated files) on Nov 2, 2025
k8s-ci-robot added the label area/provider/gcp (issues or PRs related to the GCP provider) on Nov 2, 2025
k8s-ci-robot (Contributor) commented

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign hakman for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

hakman (Member) commented Nov 2, 2025

> Stores ssh logs by instance names instead of IPs

Not sure this makes things better or worse. IPs make it obvious that the node has not registered.

upodroid (Member, Author) commented Nov 2, 2025

> > Stores ssh logs by instance names instead of IPs
>
> Not sure this makes things better or worse. IPs make it obvious that the node has not registered.

kops validate logs failures like this:

VALIDATION ERRORS
KIND	NAME																MESSAGE
Machine	https://www.googleapis.com/compute/v1/projects/k8s-infra-e2e-boskos-027/zones/us-west1-b/instances/nodes-us-west1-b-dksv	machine "https://www.googleapis.com/compute/v1/projects/k8s-infra-e2e-boskos-027/zones/us-west1-b/instances/nodes-us-west1-b-dksv" has not yet joined cluster

Looking the logs up by node name should make this easier.
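
For illustration, the short instance name can be recovered from the machine URL that `kops validate` prints. This is a minimal Go sketch, assuming the URL has the shape shown in the output above; `instanceName` is a hypothetical helper written for this comment, not a kops function.

```go
package main

import (
	"net/url"
	"path"
)

// instanceName returns the last path segment of a GCE instance URL,
// which is the instance name. Input that cannot be parsed as a URL,
// or that has no path, is returned unchanged.
func instanceName(machine string) string {
	u, err := url.Parse(machine)
	if err != nil || u.Path == "" {
		return machine
	}
	return path.Base(u.Path)
}
```

Given the machine URL from the validation error above, `instanceName` returns `nodes-us-west1-b-dksv`, which could then be matched against a log directory name.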

hakman (Member) commented Nov 2, 2025

> > > Stores ssh logs by instance names instead of IPs
> >
> > Not sure this makes things better or worse. IPs make it obvious that the node has not registered.
>
> kops validate logs failures like this:
>
> VALIDATION ERRORS
> KIND	NAME																MESSAGE
> Machine	https://www.googleapis.com/compute/v1/projects/k8s-infra-e2e-boskos-027/zones/us-west1-b/instances/nodes-us-west1-b-dksv	machine "https://www.googleapis.com/compute/v1/projects/k8s-infra-e2e-boskos-027/zones/us-west1-b/instances/nodes-us-west1-b-dksv" has not yet joined cluster
>
> looking by nodes should make it easier

The usual workflow is to just look at the nodes to see why they failed. Knowing the ID of the failed one is not really useful when you already know it will be one of those stored under an IP address.

upodroid (Member, Author) commented Nov 2, 2025

I'll change it back to IPs.

upodroid (Member, Author) commented Nov 2, 2025

/test presubmit-kops-gce-small-scale-ipalias-using-cl2

upodroid (Member, Author) commented Nov 2, 2025

Ready to be merged; it's working correctly.

upodroid (Member, Author) commented Nov 2, 2025

/retest
