Skaha leaks kubernetes resources #727
Comments
Ryan and I were discussing this previously and I wanted to share an idea that might help you move away from the current cleanup approach. The key idea is to let Kubernetes do the cleanup automatically: once a parent Job is deleted, any child objects that reference that Job via an owner reference are garbage-collected along with it. Specifically, once you create the Job, you’d grab its UID (likely via the Kubernetes Java client), and then inject that UID as an owner reference in each child resource. If you add something like a … Happy to discuss further if it is of interest or you'd like clarification on anything above 😄
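To make the suggestion above a bit more concrete, here is a minimal sketch of the "inject the Job UID as an owner reference" step. It deliberately uses plain Java maps rather than the Kubernetes Java client, so `OwnerRefSketch`, `jobOwnerReference` and `withOwner` are hypothetical names for illustration only, not Skaha or client APIs. The field names (`apiVersion`, `kind`, `name`, `uid`) are the standard ones for an entry in `metadata.ownerReferences`; how the UID is actually fetched after creating the Job is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Skaha or the Kubernetes Java client.
final class OwnerRefSketch {

    // Builds one metadata.ownerReferences entry pointing at a batch/v1 Job.
    // jobUid is assumed to be the UID read back from the API server after the Job was created.
    static Map<String, Object> jobOwnerReference(String jobName, String jobUid) {
        Map<String, Object> ref = new LinkedHashMap<>();
        ref.put("apiVersion", "batch/v1");
        ref.put("kind", "Job");
        ref.put("name", jobName);
        ref.put("uid", jobUid);
        return ref;
    }

    // Returns a copy of a child manifest (e.g. a Service) with the owner reference
    // appended to metadata.ownerReferences. The input map is not modified.
    @SuppressWarnings("unchecked")
    static Map<String, Object> withOwner(Map<String, Object> child, Map<String, Object> ownerRef) {
        Map<String, Object> copy = new LinkedHashMap<>(child);
        Map<String, Object> metadata = new LinkedHashMap<>(
            (Map<String, Object>) copy.getOrDefault("metadata", new LinkedHashMap<String, Object>()));
        List<Object> refs = new ArrayList<>(
            (List<Object>) metadata.getOrDefault("ownerReferences", new ArrayList<Object>()));
        refs.add(ownerRef);
        metadata.put("ownerReferences", refs);
        copy.put("metadata", metadata);
        return copy;
    }
}
```

With something along these lines, each Service or IngressRoute manifest would be passed through `withOwner` before being submitted, so that deleting the Job lets the Kubernetes garbage collector remove them too. One caveat worth keeping in mind: a namespaced owner and its dependents have to live in the same namespace for this to work.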
@sbathgate @rptaylor - Thanks so much for drawing our attention to this child objects feature--it sounds perfect for the job cleanup.sh is currently doing! I'll be turning your comments into a story on our side, so closing this issue off for now.
Skaha does not appear to clean up the services and ingressroutes that it creates for workload pods so there is an unbounded growth of orphaned resources on clusters that it runs on. This contributes to load on the API and etcd services. Left long enough (especially running at large scale) there is a risk of consuming all etcd storage and bringing down the cluster. It would be difficult to recover from this condition (we already use the maximum recommended etcd size) so in practice it would likely require destroying and rebuilding the cluster.
Currently we have to occasionally remember to remind you :) to run a manual cleanup every few months to avoid this, which is a bit operationally fragile. Skaha should automatically clean up all resources it creates to avoid leaving orphaned resources indefinitely.