[BUG] - Operator gets stuck deploying a slinkee cluster when nodes have taints (e.g., control plane nodes) #11
Comments
@t1mk1k can you try the newest build and let me know if that resolves?
@vultj I've changed the image tags in the Helm chart to the newest build. However, one of the deployments is still trying to schedule a pod on a control plane node.
It does not appear to affect the SLURM cluster, but it would probably be good to filter out nodes with taints at the deployment stage too.
These issues are worse when your cluster has specific requirements for what is allowed to deploy on it. For example, on OpenShift, infrastructure taints have to be tolerated to stay in license compliance. So ideally, you would want a way to tell the operator not to deploy on certain nodes; one possible shape for that is sketched below.
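One possible shape for such an opt-out, sketched as a hypothetical nodeSelector field on the slinkee custom resource (the API group, kind, and field names here are illustrative assumptions, not part of slinkee's actual API):

```yaml
apiVersion: slinkee.example.com/v1  # hypothetical group/version
kind: Slinkee                       # hypothetical kind
metadata:
  name: demo
spec:
  # Hypothetical field: the operator would schedule slurmabler pods only on
  # nodes matching this selector and wait for labels only on those nodes.
  nodeSelector:
    slinkee.example.com/role: worker
```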
Likely, but I welcome any PRs that add such functionality.
Describe the bug
The slinkee operator gets stuck deploying a slinkee cluster when the cluster contains tainted nodes that will never have a slurmabler deployed on them. The operator waits until every node has been labelled, but the slurmabler will not be scheduled on tainted nodes, so those labels never appear.
In many Kubernetes clusters the control plane nodes are tainted so that regular workloads cannot be scheduled on them, so this issue will affect many clusters.
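For reference, the standard kubeadm control plane taint looks like this on the Node object:

```yaml
spec:
  taints:
    - key: node-role.kubernetes.io/control-plane
      effect: NoSchedule
```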
To Reproduce
Steps to reproduce the behavior using a Kind cluster with a node config at /tmp/one-node-kind.yml (a sketch of the setup follows below).
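A minimal sketch of such a setup, assuming a single-node Kind cluster whose control plane node is re-tainted after creation (Kind removes the control plane taint on single-node clusters); the config contents and commands below are assumptions, not the reporter's exact steps:

```yaml
# /tmp/one-node-kind.yml -- assumed contents: a single control-plane node
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
```

```sh
# Create the cluster, then restore the standard control-plane NoSchedule
# taint so the node rejects regular workloads such as the slurmabler.
kind create cluster --config /tmp/one-node-kind.yml
kubectl taint nodes --all node-role.kubernetes.io/control-plane=:NoSchedule

# Deploy the slinkee operator and a slinkee cluster; the operator then
# waits indefinitely for the tainted node to be labelled.
```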
Expected behavior
The slinkee operator should ignore nodes that will not have a slurmabler scheduled on them because of taints, for example by filtering them out as in the sketch below.
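A minimal sketch of such a filter, assuming the operator is written in Go against k8s.io/api types; the package and function names are illustrative, and skipping every NoSchedule/NoExecute taint (rather than checking the slurmabler's actual tolerations) is a simplifying assumption:

```go
// Package taintfilter is an illustrative sketch, not slinkee's actual code.
package taintfilter

import (
	corev1 "k8s.io/api/core/v1"
)

// SchedulableNodes returns the subset of nodes that carry no NoSchedule or
// NoExecute taints, i.e. nodes where a slurmabler pod with no tolerations
// could actually be placed.
func SchedulableNodes(nodes []corev1.Node) []corev1.Node {
	var schedulable []corev1.Node
	for _, node := range nodes {
		blocked := false
		for _, taint := range node.Spec.Taints {
			if taint.Effect == corev1.TaintEffectNoSchedule ||
				taint.Effect == corev1.TaintEffectNoExecute {
				blocked = true
				break
			}
		}
		if !blocked {
			schedulable = append(schedulable, node)
		}
	}
	return schedulable
}
```

The operator would then wait for the node label only on the nodes this returns, instead of on every node in the cluster.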
Additional context
Commit of repo used for testing: 5538806
Kind version: kind v0.22.0 go1.21.7 darwin/arm64