
[BUG] - Operator gets stuck deploying a slinkee cluster when nodes have taints (e.g., control plane nodes) #11

Open
t1mk1k opened this issue Jun 21, 2024 · 4 comments · Fixed by #15

t1mk1k commented Jun 21, 2024

Describe the bug
The slinkee operator gets stuck deploying a slinkee cluster when the cluster contains tainted nodes that will never have a slurmabler deployed on them. The operator waits until every node has been labelled, but the slurmabler is not scheduled on tainted nodes, so the wait never completes.

In many Kubernetes clusters the control plane nodes are tainted so that regular workloads cannot be scheduled on them, so this will be a common problem.

To Reproduce
Steps to reproduce the behavior using a Kind cluster:

1. Create a kind cluster config for one control plane node and one worker node in /tmp/one-node-kind.yml:
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
2. Create a kind cluster with the config:
kind create cluster --config /tmp/one-node-kind.yml
3. Deploy slinkee via Helm:
helm install -f helm/slinkee/values.yaml slinkee ./helm/slinkee/
4. Deploy the simple slinkee cluster:
kubectl apply -f payloads/simple.yaml 
5. Wait for the slinkee-operator to create the slurmablers and observe that the worker node gets labels while the control plane node does not:
kubectl get nodes --show-labels
6. In the logs of the slinkee-operator you can see it is waiting for nodes to be labelled:
kubectl logs slinkee-operator-767fb59df6-7w66j --tail 10
2024-06-21T11:34:52.987Z	INFO	slurm/create_slurmabler.go:102	github.com/vultr/slinkee/pkg/slurm.buildSlurmablerDaemonSet	node lacking labels...	{"host": "10.99.172.138", "hostname": "slinkee-operator-767fb59df6-7w66j", "pid": 1}
2024-06-21T11:34:53.993Z	INFO	slurm/create_slurmabler.go:102	github.com/vultr/slinkee/pkg/slurm.buildSlurmablerDaemonSet	node lacking labels...	{"host": "10.99.172.138", "hostname": "slinkee-operator-767fb59df6-7w66j", "pid": 1}
2024-06-21T11:34:54.999Z	INFO	slurm/create_slurmabler.go:102	github.com/vultr/slinkee/pkg/slurm.buildSlurmablerDaemonSet	node lacking labels...	{"host": "10.99.172.138", "hostname": "slinkee-operator-767fb59df6-7w66j", "pid": 1}
2024-06-21T11:34:56.006Z	INFO	slurm/create_slurmabler.go:102	github.com/vultr/slinkee/pkg/slurm.buildSlurmablerDaemonSet	node lacking labels...	{"host": "10.99.172.138", "hostname": "slinkee-operator-767fb59df6-7w66j", "pid": 1}
2024-06-21T11:34:57.012Z	INFO	slurm/create_slurmabler.go:102	github.com/vultr/slinkee/pkg/slurm.buildSlurmablerDaemonSet	node lacking labels...	{"host": "10.99.172.138", "hostname": "slinkee-operator-767fb59df6-7w66j", "pid": 1}
2024-06-21T11:34:58.020Z	INFO	slurm/create_slurmabler.go:102	github.com/vultr/slinkee/pkg/slurm.buildSlurmablerDaemonSet	node lacking labels...	{"host": "10.99.172.138", "hostname": "slinkee-operator-767fb59df6-7w66j", "pid": 1}
2024-06-21T11:34:59.027Z	INFO	slurm/create_slurmabler.go:102	github.com/vultr/slinkee/pkg/slurm.buildSlurmablerDaemonSet	node lacking labels...	{"host": "10.99.172.138", "hostname": "slinkee-operator-767fb59df6-7w66j", "pid": 1}
2024-06-21T11:35:00.033Z	INFO	slurm/create_slurmabler.go:102	github.com/vultr/slinkee/pkg/slurm.buildSlurmablerDaemonSet	node lacking labels...	{"host": "10.99.172.138", "hostname": "slinkee-operator-767fb59df6-7w66j", "pid": 1}
2024-06-21T11:35:01.039Z	INFO	slurm/create_slurmabler.go:102	github.com/vultr/slinkee/pkg/slurm.buildSlurmablerDaemonSet	node lacking labels...	{"host": "10.99.172.138", "hostname": "slinkee-operator-767fb59df6-7w66j", "pid": 1}
2024-06-21T11:35:02.046Z	INFO	slurm/create_slurmabler.go:102	github.com/vultr/slinkee/pkg/slurm.buildSlurmablerDaemonSet	node lacking labels...	{"host": "10.99.172.138", "hostname": "slinkee-operator-767fb59df6-7w66j", "pid": 1}
7. Remove the taint from the control plane node:
kubectl taint node kind-control-plane node-role.kubernetes.io/control-plane:NoSchedule-
8. Now watch the simple slinkee cluster being deployed:
kubectl get pods -n default -w

Expected behavior

The slinkee operator should ignore nodes that will not have a slurmabler scheduled on them because of taints.
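
Something along these lines could work as the node filter; this is only a sketch with illustrative names (nodeCanRunSlurmabler is not an existing function in the repo), using the ToleratesTaint helper from k8s.io/api/core/v1:

package slurm

import corev1 "k8s.io/api/core/v1"

// nodeCanRunSlurmabler is a hypothetical helper: it reports whether every
// NoSchedule/NoExecute taint on the node is covered by one of the tolerations
// the operator puts on the slurmabler DaemonSet. Nodes for which this returns
// false will never receive a slurmabler pod and could be excluded from the
// "wait for labels" loop.
func nodeCanRunSlurmabler(node corev1.Node, tolerations []corev1.Toleration) bool {
	for i := range node.Spec.Taints {
		taint := node.Spec.Taints[i]
		// PreferNoSchedule does not block scheduling, so it can be ignored here.
		if taint.Effect != corev1.TaintEffectNoSchedule &&
			taint.Effect != corev1.TaintEffectNoExecute {
			continue
		}
		tolerated := false
		for j := range tolerations {
			if tolerations[j].ToleratesTaint(&taint) {
				tolerated = true
				break
			}
		}
		if !tolerated {
			return false
		}
	}
	return true
}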

Additional context

Commit of repo used for testing: 5538806
Kind version: kind v0.22.0 go1.21.7 darwin/arm64

t1mk1k added the bug label on Jun 21, 2024
vultj closed this as completed in #15 on Jun 24, 2024

vultj (Contributor) commented Jun 24, 2024

@t1mk1k can you try the newest build and let me know if that resolves it?

vultj reopened this on Jun 24, 2024

t1mk1k (Author) commented Jun 25, 2024

@vultj I've changed the image tags in the Helm chart to v0.0.2 and now a cluster is deployed successfully.

However, one of the deployments is still trying to schedule a pod on a control plane node:

✗ kubectl get pods -owide                                     
NAME                                       READY   STATUS    RESTARTS   AGE    IP           NODE          NOMINATED NODE   READINESS GATES
slik-operator-6bf7848d88-wqxqh             1/1     Running   0          102m   10.244.1.4   kind-worker   <none>           <none>
test-kind-control-plane-78c96648b7-jms5g   0/2     Pending   0          99m    <none>       <none>        <none>           <none>
test-kind-worker-cb684697f-9bfhr           2/2     Running   0          99m    10.244.1.7   kind-worker   <none>           <none>
test-slurm-toolbox-64bf746f8c-7s2pl        2/2     Running   0          99m    10.244.1.8   kind-worker   <none>           <none>
test-slurmabler-kz6xb                      1/1     Running   0          99m    10.244.1.5   kind-worker   <none>           <none>
test-slurmctld-dbfcd569f-r7cfd             2/2     Running   0          99m    10.244.1.6   kind-worker   <none>           <none>
✗ kubectl get deployments.apps test-kind-control-plane -owide
NAME                      READY   UP-TO-DATE   AVAILABLE   AGE    CONTAINERS   IMAGES                                SELECTOR
test-kind-control-plane   0/1     1            0           102m   slurmd       ewr.vultrcr.com/slurm/slurmd:v0.0.2   app=test-slurmd,host=kind-control-plane

It does not appear to affect the SLURM cluster, but it would probably be good to filter out nodes with taints at the deployment stage too.
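
A sketch of what that could look like, assuming the operator loops over nodes when it builds the per-node slurmd Deployments (nodeCanRunSlurmabler is the hypothetical check sketched above; createSlurmdDeployment and slurmablerTolerations are placeholders, not the repo's actual identifiers):

for _, node := range nodes.Items {
	// Skip nodes that will never run a slurmabler (and therefore never get
	// labelled), so no slurmd Deployment is created for them either.
	if !nodeCanRunSlurmabler(node, slurmablerTolerations) {
		continue
	}
	if err := createSlurmdDeployment(ctx, node); err != nil {
		return err
	}
}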

odellem commented Aug 1, 2024

These issues are worse when your cluster has specific requirements for what is allowed to deploy on it. For example, on OpenShift, infrastructure node taints have to be tolerated deliberately to stay in license compliance. So ideally, you would want a way to tell the operator not to deploy on certain nodes.
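
One possible shape for that, purely as a sketch (nothing like this exists in the CRD today, and the field names are illustrative), would be optional nodeSelector/tolerations fields on the slinkee spec that the operator applies both when deciding which nodes to wait on and when building the per-node workloads:

// Hypothetical additions to the slinkee CRD spec; corev1 is k8s.io/api/core/v1.
type SlinkeeSpec struct {
	// ... existing fields ...

	// NodeSelector restricts slurm components to nodes carrying these labels.
	NodeSelector map[string]string `json:"nodeSelector,omitempty"`
	// Tolerations is passed through to the slurmabler/slurmd pods so they can
	// run on tainted nodes the user explicitly opts into.
	Tolerations []corev1.Toleration `json:"tolerations,omitempty"`
}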

vultj (Contributor) commented Aug 1, 2024

Likely, but I welcome any PRs that add such functionality.
