Describe the bug
After a payload is applied to the slurm cluster, the operator creates the daemonset for the slurmabler pods. However, if the operator pod crashes or restarts, it gets stuck in an error loop because the daemonset already exists.
To Reproduce
Steps to reproduce the behavior:
1. Install Slik
2. Apply either payload
3. Let the slurmabler pods be created
4. Delete the operator pod, allowing the deployment to recreate it, and check the logs for the error loop.
Expected behavior
The operator should handle the "already exists" error gracefully. Alternatively, if the daemonset really does need to be created again, the operator should delete the existing daemonset and then recreate it.
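One way to picture the graceful handling is an idempotent "ensure" step: try to create the daemonset, and if it already exists, update it in place instead of failing. The Go sketch below is only an illustration under stated assumptions, not the operator's actual code. `daemonSetAPI`, `errAlreadyExists`, `fakeCluster`, and `ensureDaemonSet` are hypothetical names invented here, and the string `spec` stands in for a real daemonset object. It also swaps the delete-and-recreate idea for an update, which avoids tearing down running pods.

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadyExists is a hypothetical sentinel standing in for the
// "already exists" error that a real cluster client would return.
var errAlreadyExists = errors.New("daemonset already exists")

// daemonSetAPI is a hypothetical, minimal stand-in for a cluster client.
type daemonSetAPI interface {
	Create(name, spec string) error
	Update(name, spec string) error
}

// ensureDaemonSet creates the daemonset and falls back to an update when
// it already exists, so repeated calls after an operator restart succeed.
func ensureDaemonSet(api daemonSetAPI, name, spec string) error {
	err := api.Create(name, spec)
	if err == nil {
		return nil
	}
	if errors.Is(err, errAlreadyExists) {
		if uerr := api.Update(name, spec); uerr != nil {
			return fmt.Errorf("update existing daemonset %q: %w", name, uerr)
		}
		return nil
	}
	return fmt.Errorf("create daemonset %q: %w", name, err)
}

// fakeCluster is an in-memory implementation used only to exercise the sketch.
type fakeCluster struct {
	sets map[string]string
}

// Create stores a new entry or reports that the name is already taken.
func (f *fakeCluster) Create(name, spec string) error {
	if _, ok := f.sets[name]; ok {
		return errAlreadyExists
	}
	f.sets[name] = spec
	return nil
}

// Update overwrites an existing entry and fails if the name is unknown.
func (f *fakeCluster) Update(name, spec string) error {
	if _, ok := f.sets[name]; !ok {
		return errors.New("daemonset not found")
	}
	f.sets[name] = spec
	return nil
}
```

Calling `ensureDaemonSet` twice against the same `fakeCluster` returns no error on either call, which mirrors an operator pod restarting while the daemonset is still present. In an operator built on the Kubernetes Go client libraries, the equivalent check would likely be `IsAlreadyExists` from `k8s.io/apimachinery/pkg/api/errors`, though whether this operator uses those libraries is an assumption.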
Additional context
Deleting the daemonset and restarting the operator pod fixes the problem. However, pods are moved around during the rolling update of a cluster upgrade, so any cluster upgrade will break the slurm operator.