
Skaha leaks kubernetes resources #727

Closed
rptaylor opened this issue Nov 13, 2024 · 2 comments · May be fixed by #810

@rptaylor
Contributor

rptaylor commented Nov 13, 2024

Skaha does not appear to clean up the services and ingressroutes that it creates for workload pods, so orphaned resources grow without bound on the clusters it runs on. This adds load on the API server and etcd. If left long enough (especially when running at large scale), there is a risk of consuming all etcd storage and bringing down the cluster. Recovering from that condition would be difficult (we already use the maximum recommended etcd size), so in practice it would likely require destroying and rebuilding the cluster.

Currently we have to remember to remind you :) to run a manual cleanup every few months to avoid this, which is operationally fragile. Skaha should automatically clean up every resource it creates so that nothing is left orphaned indefinitely.

@brianmajor brianmajor self-assigned this Nov 14, 2024
@sbathgate

Ryan and I were discussing this previously, and I wanted to share an idea that might help you move away from the current cleanup.sh script and instead leverage Kubernetes-native garbage collection. Right now, the script finds orphaned sessions and deletes them based on naming conventions. While that works, it isn't the most reliable approach in the long run, especially if the cron job or the session naming scheme breaks. We believe you can improve this by adding ownerReferences to the resources you generate.

The key idea is to let Kubernetes do the cleanup automatically: once a parent Job is deleted, any child objects that reference that Job via ownerReferences are removed as well. You could implement logic similar to what you do with GPU scheduling in SessionJobBuilder#mergeAffinity. In that method, you're already parsing a YAML string, manipulating its data (for GPU affinity), and returning updated YAML. You can follow the same pattern to merge the UID from the starting Job into the session and ingress resources, which are generated around PostAction lines 648–674.
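
To make that concrete, here is a minimal sketch (assuming the official Kubernetes Java client; the class and method names are just illustrative, not skaha's actual code) of the owner reference each child resource would carry:

```java
// A minimal sketch, not skaha's actual code: given the Job object returned by the API
// server after creation (metadata.uid is assigned server-side), build the owner
// reference and attach it to a child Service. The same reference would go on the
// IngressRoute. Assumes the official Kubernetes Java client; names are illustrative.
import io.kubernetes.client.openapi.models.V1Job;
import io.kubernetes.client.openapi.models.V1OwnerReference;
import io.kubernetes.client.openapi.models.V1Service;

public final class SessionOwnership {

    private SessionOwnership() {
    }

    public static V1OwnerReference ownerReferenceFor(final V1Job createdJob) {
        return new V1OwnerReference()
                .apiVersion("batch/v1")
                .kind("Job")
                .name(createdJob.getMetadata().getName())
                .uid(createdJob.getMetadata().getUid())
                .blockOwnerDeletion(Boolean.TRUE)
                .controller(Boolean.FALSE);
    }

    public static void adoptService(final V1Service childService, final V1Job createdJob) {
        // Once this is set, deleting the Job lets the Kubernetes garbage collector
        // delete the Service as well, with no cleanup.sh needed for it.
        childService.getMetadata().addOwnerReferencesItem(ownerReferenceFor(createdJob));
    }
}
```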

Specifically, once you create the Job you'd grab its UID (likely via the Kubernetes Java client) and inject that UID as an owner reference into each child resource. If you add something like a mergeOwnerReference method, mirroring how mergeAffinity works, it can parse the existing YAML, insert the owner-reference details, and return the updated YAML used to create the session and ingress. This should let you rely entirely on Kubernetes for garbage collection and remove the need for a manual cleanup script. A rough sketch is included below.
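
For illustration, a hypothetical mergeOwnerReference in the same spirit as mergeAffinity might look like the following. I'm assuming a SnakeYAML-style parser here, since I haven't checked which YAML library skaha actually uses:

```java
// A hypothetical mergeOwnerReference, mirroring the mergeAffinity pattern: parse the
// child resource's YAML, inject metadata.ownerReferences pointing at the parent Job,
// and return the updated YAML. Assumes SnakeYAML; skaha may use a different parser.
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.yaml.snakeyaml.Yaml;

public final class OwnerReferenceMerger {

    private OwnerReferenceMerger() {
    }

    @SuppressWarnings("unchecked")
    public static String mergeOwnerReference(final String resourceYaml,
                                             final String jobName,
                                             final String jobUid) {
        final Yaml yaml = new Yaml();
        final Map<String, Object> resource = yaml.load(resourceYaml);

        final Map<String, Object> metadata = (Map<String, Object>)
                resource.computeIfAbsent("metadata", k -> new LinkedHashMap<String, Object>());

        // The owner reference entry that makes the child eligible for cascade deletion.
        final Map<String, Object> ownerReference = new LinkedHashMap<>();
        ownerReference.put("apiVersion", "batch/v1");
        ownerReference.put("kind", "Job");
        ownerReference.put("name", jobName);
        ownerReference.put("uid", jobUid);
        ownerReference.put("blockOwnerDeletion", Boolean.TRUE);
        ownerReference.put("controller", Boolean.FALSE);

        final List<Object> ownerReferences = (List<Object>)
                metadata.computeIfAbsent("ownerReferences", k -> new ArrayList<>());
        ownerReferences.add(ownerReference);

        // Re-serialize so the caller can hand the updated manifest to the API / kubectl.
        return yaml.dump(resource);
    }
}
```

The Job's name and UID would come from the object returned when the Job is created. One caveat: owner references are only valid within a single namespace, so this assumes the Service and IngressRoute are created in the same namespace as the Job.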

Happy to discuss further if this is of interest or you'd like clarification on anything above 😄

@brianmajor
Member

@sbathgate @rptaylor - Thanks so much for drawing our attention to this child-objects feature; it sounds perfect for the job that cleanup.sh is currently doing! I'll be turning your comments into a story on our side, so I'm closing this issue off for now.
