k3d tests frequently time out #9585
Comments
In my experience, k3d is a lot slower at starting and stopping pods. Additionally, logs are lost when a pod is stopped, so to keep a log trail we need to save the logs for all pods every time we stop a pod (because a crash in haproxy will also crash API). I think k3d generally requires more resources than docker on the tiny CI machine, so I do expect tests to be a little slower. One intermediate solution would be to skip all tests that require pod restarts in k3d and see how that improves stability. Ideally, though, we would change these tests so that they don't need service restarts, or find some other way to disconnect a service without stopping it.
Discussing with @garethbowen, we've come up with a plan to:
No longer start and stop sentinel to stop/start transitions and run scheduled tasks. Instead make it listen to two kill signals. Kill the test process if API is down in the before hook. This way we might get logs when the process ends up hanging. #9585
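A rough sketch of what the signal-based approach might look like, assuming one signal toggles transitions and the other triggers the scheduled tasks. The signal names, the `SentinelControls` interface and `listenForTestSignals` are all hypothetical and only illustrate the idea; they are not taken from the sentinel code.

```typescript
// Sketch only: the signal names and the SentinelControls interface are
// assumptions for illustration, not the actual sentinel implementation.
type Listener = () => void;

interface SignalSource {
  on(signal: string, listener: Listener): unknown;
}

interface SentinelControls {
  startTransitions(): void;
  stopTransitions(): void;
  runScheduledTasks(): void;
}

interface SignalNames {
  toggleTransitions: string;
  runTasks: string;
}

function listenForTestSignals(
  source: SignalSource,
  signals: SignalNames,
  sentinel: SentinelControls,
): void {
  // Assume transitions are running when the service starts.
  let transitionsRunning = true;

  // First signal: flip transitions on or off without restarting the service.
  source.on(signals.toggleTransitions, () => {
    transitionsRunning = !transitionsRunning;
    if (transitionsRunning) {
      sentinel.startTransitions();
    } else {
      sentinel.stopTransitions();
    }
  });

  // Second signal: run the scheduled tasks once, on demand.
  source.on(signals.runTasks, () => sentinel.runScheduledTasks());
}
```

In a Node.js service, `process` could be passed as `source`, and a test would then send a signal to the sentinel process instead of restarting the pod. Which two signals to use is left open here.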
Added signals to sentinel to:
Added a killswitch in mocha when API is offline in a beforeEach.
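For illustration, a killswitch of this kind could be sketched roughly as below. `killIfApiDown`, the injected `isApiUp` health check and the exit code are hypothetical names and values chosen for the sketch, not the code that was actually added.

```typescript
// Sketch only: `isApiUp` and `exit` are injected, hypothetical dependencies.
type HealthCheck = () => Promise<boolean>;
type Exit = (code: number) => void;

async function killIfApiDown(isApiUp: HealthCheck, exit: Exit): Promise<void> {
  let up = false;
  try {
    up = await isApiUp();
  } catch {
    // A failed request is treated the same as an offline API.
  }
  if (!up) {
    console.error('API is offline; ending the test run so logs can be collected.');
    exit(1);
  }
}

// Possible wiring in a mocha hook (checkApi is a placeholder for a real request):
// beforeEach(() => killIfApiDown(checkApi, (code) => process.exit(code)));
```

Ending the process right away, rather than letting every remaining test wait on an API that will not come back, is what keeps the run from hanging until the CI timeout.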
Ok! These are the logs and the reason the API pod doesn't start up:
This is probably a bug on the k3d side, because the container had been up and the image has not changed.
I found something relevant: kubernetes/kubernetes#123631. It seems that, at least in Kubernetes, garbage collection can delete an image, which would explain this behavior.
Describe the issue
There are two suites that run with k3d instead of docker compose. These suites are copies of other suites, so they run exactly the same tests; the only difference is how the CHT is launched. Running on k3d is better because it's closer to a production environment and therefore more likely to find issues that actual projects would hit. The eventual goal was to remove all docker compose versions to reduce the number of suites we execute.
However, the k3d suites frequently fail because they hit the 60-minute GHA limit.
The original issue covering this migration is #8909
Describe the improvement you'd like
We need to find a way to stabilise these suites so that we can trust the results of our CI runs. The ideal solution is to figure out why k3d is so much slower than docker compose and fix that. If it's not possible to speed the tests up, we could execute a minimal k3d suite and retain the docker compose suites for the bulk of the testing.
Describe alternatives you've considered
We could just increase the GHA timeout, and that would probably work, but then we would have to wait over an hour for each build to complete, which is not acceptable.