k3d tests frequently timeout #9585

garethbowen · 2024-10-25T08:56:48Z

Describe the issue

Two suites run with k3d instead of docker compose. They are copies of other suites, so they run exactly the same tests; the only difference is how the CHT is launched. Running on k3d is better because it is closer to a production environment and therefore more likely to find issues that actual projects would hit. The eventual goal was to remove all docker compose versions to reduce the number of suites we execute.

However, the k3d suites frequently fail because they hit the 60-minute GHA limit.

The original issue covering this migration is #8909

Describe the improvement you'd like

We need to find a way to stabilise these suites so that we can trust the results of our CI runs. The ideal solution is to figure out why k3d is so much slower than docker compose and fix that. If it's not possible to speed the tests up, we could execute a minimal k3d suite and retain the docker compose suites for the bulk of the testing.

Describe alternatives you've considered

We could just increase the GHA timeout, and that would probably work, but then we would have to wait over an hour for each build to complete, which is not acceptable.

dianabarsan · 2024-10-30T13:53:32Z

In my experience, k3d is a lot slower when starting and stopping pods. Additionally, logs are lost when a pod is stopped, so to keep a log trail we need to save logs for all pods every time we stop a pod (because a crash in haproxy will also crash API).

I think k3d generally requires more resources than docker on the tiny CI machine, so I do expect tests to be a little slower.
I think the solution is to work out a way to change the tests so frequent or regular restarts are not required.
For example, we frequently stop sentinel just to keep it from processing docs, or restart it to force it to run some startup queue instead of waiting for the regular 5 minute timer to elapse.

One intermediate solution would be to skip all tests that require pod restarts in k3d and see how that improves stability. Ideally, though, we would change these tests so that they don't need service restarts, or find some other way to disconnect a service without stopping it.

dianabarsan · 2024-11-07T16:51:12Z

Discussing with @garethbowen, we've come up with a plan to:

Separate tests that genuinely require service restarts into their own suite. This suite should run on both docker and k3d.
Update the service (api? sentinel?) to respond to process signals. Instead of stopping and starting the service to achieve some behavior, set up hooks that run that code when triggered by signals. Add code in e2e tests that passes these signals to the containers/pods.

No longer start and stop sentinel to stop/start transitions and run scheduled tasks. Instead, make it listen for two kill signals. Kill the test process if API is down in the before hook. This way we might get logs when the process ends up hanging. #9585
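As a rough illustration of the "pass these signals to the containers/pods" step, the sketch below builds the command an e2e helper might run for each runtime. Everything named here is an assumption for illustration: the helper names, the target names, the namespace, and the choice of signal are not taken from cht-core. The k3d variant additionally assumes the service runs as PID 1 in its container and that a `kill` binary exists there.

```javascript
const { execFileSync } = require('node:child_process');

// Hypothetical helper: builds the command that delivers a signal to a running
// service without stopping it. All names are illustrative assumptions.
const buildSignalCommand = ({ runtime, target, signal, namespace }) => {
  if (runtime === 'docker') {
    // `docker kill --signal` sends the signal to the container's main process.
    return ['docker', 'kill', `--signal=${signal}`, target];
  }
  if (runtime === 'k3d') {
    // Assumes the service is PID 1 in the container and `kill` is installed.
    const shortName = signal.replace(/^SIG/, '');
    return ['kubectl', '-n', namespace, 'exec', target, '--', 'kill', '-s', shortName, '1'];
  }
  throw new Error(`Unknown runtime: ${runtime}`);
};

// Runs the command; defined here but not called, since it needs a live cluster.
const sendSignal = (options) => {
  const [command, ...args] = buildSignalCommand(options);
  return execFileSync(command, args, { encoding: 'utf8' });
};
```

Keeping the command construction separate from its execution means the same test code can target docker or k3d by changing only the `runtime` option.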

dianabarsan · 2024-11-25T12:16:47Z

Added signals to sentinel to:

toggle transition processing on or off
run scheduled tasks
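A minimal sketch of what such signal hooks could look like in a Node service follows. It is not the sentinel implementation: the signal names (`SIGUSR2`, `SIGHUP`), the state object, and the function names are placeholders chosen for illustration. A plain `EventEmitter` stands in for `process` so the behavior can be exercised without sending real signals.

```javascript
const { EventEmitter } = require('node:events');

// Hypothetical in-memory state standing in for the real service internals.
const createState = () => ({ transitionsEnabled: true, scheduledRuns: 0 });

// Registers one listener per signal. The signal names are placeholders and
// not necessarily the ones the service actually listens for.
const registerSignalHooks = (emitter, state, signals = { toggle: 'SIGUSR2', runTasks: 'SIGHUP' }) => {
  emitter.on(signals.toggle, () => {
    // Flip transition processing on or off.
    state.transitionsEnabled = !state.transitionsEnabled;
  });
  emitter.on(signals.runTasks, () => {
    // A real hook would trigger the scheduled tasks; here we only count calls.
    state.scheduledRuns += 1;
  });
  return state;
};

// In a real service the emitter would be `process`:
//   registerSignalHooks(process, createState());
// Here a plain EventEmitter simulates signal delivery.
const fakeProcess = new EventEmitter();
const state = registerSignalHooks(fakeProcess, createState());
fakeProcess.emit('SIGUSR2'); // transition processing is now off
fakeProcess.emit('SIGHUP');  // scheduled tasks requested once
```

Because the handlers only flip state or trigger work, the service keeps running and its logs stay intact, which is the point of avoiding restarts.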

Added a killswitch in a mocha beforeEach hook that ends the test process when API is offline.
When API fails to start after being scaled down, we should at least have some logs instead of timing out.
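The killswitch idea can be sketched roughly as below. This is an assumption-laden illustration rather than the code that was merged: `createKillswitch`, `isApiOnline`, and the injected `exit` and `log` functions are hypothetical names, and the API check itself is left to the caller so the sketch does not depend on a particular HTTP client.

```javascript
// Hypothetical killswitch factory. `isApiOnline` should resolve to true when
// API responds; `exit` and `log` are injectable so the hook can be tested.
const createKillswitch = ({
  isApiOnline,
  exit = (code) => process.exit(code),
  log = console.error,
}) => {
  return async () => {
    let online = false;
    try {
      online = await isApiOnline();
    } catch (err) {
      // A failed check is treated the same as an offline API.
      log(`API status check failed: ${err.message}`);
    }
    if (!online) {
      log('API is offline, ending the test run early so logs can be collected');
      exit(1);
    }
  };
};

// With mocha this could be wired up as a root-level hook, for example:
//   beforeEach(createKillswitch({ isApiOnline: checkApi }));
// where `checkApi` is a hypothetical function supplied by the test utilities.
```

Exiting with a non-zero code fails the CI job right away, so the log collection step can run long before the 60 minute limit is reached.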

dianabarsan · 2024-11-27T05:29:52Z

Ok!
The api test failed, and my workaround to kill the process early and get logs succeeded: https://github.com/medic/cht-core/actions/runs/12038349738?pr=9670

These are the logs and the reason the API pod doesn't start up:

Error from server (BadRequest): container "cht-api" in pod "cht-api-56474cb78b-bh2rn" is waiting to start: trying and failing to pull image

This is probably a bug on the k3d side, because the container had been up and the image had not changed.

dianabarsan · 2024-12-03T16:11:01Z

I found something relevant: kubernetes/kubernetes#123631

It seems that, at least in kubernetes, garbage collection can sometimes delete an image. This would explain the behavior we are seeing.
I'm going to try to work around it by creating a local registry.

garethbowen added the Flaky, Testing, and Type: Technical issue labels Oct 25, 2024

garethbowen added this to Product Team Activities Oct 25, 2024

github-project-automation bot moved this to Todo in Product Team Activities Oct 25, 2024

dianabarsan self-assigned this Oct 30, 2024

dianabarsan moved this from Todo to In Progress in Product Team Activities Nov 19, 2024

dianabarsan added this to the 4.16.0 milestone Nov 19, 2024

dianabarsan mentioned this issue Nov 19, 2024 in chore(#9585): teach Sentinel sign language #9658 (merged)
