rhel9-ppc64le VMs offline due to NVMe failure at OSUOSL #3998

Open
richardlau opened this issue Jan 14, 2025 · 6 comments

@richardlau
Member

Got this email from OSUOSL:

One of our nodes which uses local storage with NVMe has had a failure and all the VMs on that node are offline. Due to the way we had storage configured on this node for performance, we're unlikely to save any of the ephemeral disks. However if you had any volumes attached, those will still be intact since they are on our Ceph cluster.

I'm going to be working on getting the NVMe replaced, in the meantime, if you need us to rebuild any of your VMs, please let us know.

This affects our rhel9-ppc64le VMs, which I had set up with NVMe storage because it resulted in marginally faster builds:

@mhdawson
Member

Removed rhel9-ppc64le from these jobs:

@richardlau said he'd take a look tomorrow.

@mhdawson
Member

There are a couple of other jobs that use rhel9-ppc64le, but they have not run for 5-6 days, so they are less frequent and I've not disabled them yet.

I think we'll get the machine back tomorrow; if not, we can exclude those as well:

@richardlau
Member Author

No updates on the NVMe replacement. I'm going to create new (non-NVMe) VMs.

@richardlau
Member Author

I've created a new test-osuosl-rhel9-ppc64_le-4 and replaced test-osuosl-rhel9-ppc64_le-3 (Ansible inventory update in #3999) with non-NVMe RHEL 9 VMs. That will give us two machines for now.

My plan is to wait to see what the outlook is for getting the NVMe replaced before deciding what to do about test-osuosl-rhel9-ppc64_le-1 and test-osuosl-rhel9-ppc64_le-2.

@mhdawson
Member

@richardlau thanks for the quick work.
