
Skaha leaks kubernetes resources #727

Closed
rptaylor opened this issue Nov 13, 2024 · 2 comments · May be fixed by #810

@rptaylor
Contributor

rptaylor commented Nov 13, 2024

Skaha does not appear to clean up the services and ingressroutes that it creates for workload pods, so orphaned resources grow without bound on the clusters it runs on. This adds load on the API server and etcd. If left long enough (especially when running at large scale), there is a risk of consuming all etcd storage and bringing down the cluster. Recovering from that condition would be difficult (we already use the maximum recommended etcd size), so in practice it would likely require destroying and rebuilding the cluster.

Currently we have to remember to remind you :) to run a manual cleanup every few months to avoid this, which is operationally fragile. Skaha should automatically clean up every resource it creates so that nothing is left orphaned indefinitely.

@brianmajor brianmajor self-assigned this Nov 14, 2024
@sbathgate

Ryan and I were discussing this previously, and I wanted to share an idea that might help you move away from the current cleanup.sh script and instead leverage Kubernetes-native garbage collection. Right now, the script finds orphaned sessions and deletes them based on naming conventions. While that works, it isn't the most reliable approach in the long run, especially if the cron job or the session naming scheme breaks. We believe you can improve this by adding ownerReferences to the resources you generate.

The key idea is to let Kubernetes do the cleanup automatically: once a parent Job is deleted, any child objects that reference that Job via ownerReferences are removed as well. You could implement logic similar to what you do with GPU scheduling in SessionJobBuilder#mergeAffinity. In that method, you're already parsing a YAML string, manipulating its data (for GPU affinity), and returning updated YAML. You can follow the same pattern to merge the UID from the starting Job into the session and ingress resources, which are generated around PostAction lines 648–674.
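
To make that concrete, here is a minimal sketch (assuming the official Kubernetes Java client; the class and method names are just illustrative, not skaha's actual code) of the owner reference each child resource would carry:

```java
// A minimal sketch, not skaha's actual code: given the Job object returned by the API
// server after creation (metadata.uid is assigned server-side), build the owner
// reference and attach it to a child Service. The same reference would go on the
// IngressRoute. Assumes the official Kubernetes Java client; names are illustrative.
import io.kubernetes.client.openapi.models.V1Job;
import io.kubernetes.client.openapi.models.V1OwnerReference;
import io.kubernetes.client.openapi.models.V1Service;

public final class SessionOwnership {

    private SessionOwnership() {
    }

    public static V1OwnerReference ownerReferenceFor(final V1Job createdJob) {
        return new V1OwnerReference()
                .apiVersion("batch/v1")
                .kind("Job")
                .name(createdJob.getMetadata().getName())
                .uid(createdJob.getMetadata().getUid())
                .blockOwnerDeletion(Boolean.TRUE)
                .controller(Boolean.FALSE);
    }

    public static void adoptService(final V1Service childService, final V1Job createdJob) {
        // Once this is set, deleting the Job lets the Kubernetes garbage collector
        // delete the Service as well, with no cleanup.sh needed for it.
        childService.getMetadata().addOwnerReferencesItem(ownerReferenceFor(createdJob));
    }
}
```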

Specifically, once you create the Job you'd grab its UID (likely via the Kubernetes Java client) and inject that UID as an owner reference into each child resource. If you add something like a mergeOwnerReference method, mirroring how mergeAffinity works, it can parse the existing YAML, insert the owner-reference details, and return the updated YAML used to create the session and ingress. This should let you rely entirely on Kubernetes for garbage collection and remove the need for a manual cleanup script. A rough sketch is included below.
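
For illustration, a hypothetical mergeOwnerReference in the same spirit as mergeAffinity might look like the following. I'm assuming a SnakeYAML-style parser here, since I haven't checked which YAML library skaha actually uses:

```java
// A hypothetical mergeOwnerReference, mirroring the mergeAffinity pattern: parse the
// child resource's YAML, inject metadata.ownerReferences pointing at the parent Job,
// and return the updated YAML. Assumes SnakeYAML; skaha may use a different parser.
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.yaml.snakeyaml.Yaml;

public final class OwnerReferenceMerger {

    private OwnerReferenceMerger() {
    }

    @SuppressWarnings("unchecked")
    public static String mergeOwnerReference(final String resourceYaml,
                                             final String jobName,
                                             final String jobUid) {
        final Yaml yaml = new Yaml();
        final Map<String, Object> resource = yaml.load(resourceYaml);

        final Map<String, Object> metadata = (Map<String, Object>)
                resource.computeIfAbsent("metadata", k -> new LinkedHashMap<String, Object>());

        // The owner reference entry that makes the child eligible for cascade deletion.
        final Map<String, Object> ownerReference = new LinkedHashMap<>();
        ownerReference.put("apiVersion", "batch/v1");
        ownerReference.put("kind", "Job");
        ownerReference.put("name", jobName);
        ownerReference.put("uid", jobUid);
        ownerReference.put("blockOwnerDeletion", Boolean.TRUE);
        ownerReference.put("controller", Boolean.FALSE);

        final List<Object> ownerReferences = (List<Object>)
                metadata.computeIfAbsent("ownerReferences", k -> new ArrayList<>());
        ownerReferences.add(ownerReference);

        // Re-serialize so the caller can hand the updated manifest to the API / kubectl.
        return yaml.dump(resource);
    }
}
```

The Job's name and UID would come from the object returned when the Job is created. One caveat: owner references are only valid within a single namespace, so this assumes the Service and IngressRoute are created in the same namespace as the Job.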

Happy to discuss further if this is of interest or you'd like clarification on anything above 😄

@brianmajor
Member

@sbathgate @rptaylor - Thanks so much for drawing our attention to this child-objects feature; it sounds perfect for the job that cleanup.sh is currently doing! I'll be turning your comments into a story on our side, so I'm closing this issue off for now.
