k3d tests frequently timeout #9585

Open
garethbowen opened this issue Oct 25, 2024 · 5 comments
Labels: Flaky (Indicates a flaky or unreliable test) · Testing (Affects how the code is tested) · Type: Technical issue (Improve something that users won't notice)

Comments

@garethbowen (Member)

Describe the issue

There are two suites that run with k3d instead of docker compose. These suites are copies of other suites, so they run exactly the same tests; the only difference is how the CHT is launched. This is better because k3d is closer to a production environment and therefore more likely to find issues that actual projects would hit. Eventually the goal was to remove all docker compose versions to reduce the number of suites we execute.

However, the k3d suites frequently fail because they hit the 60-minute GHA limit.

The original issue covering this migration is #8909

Describe the improvement you'd like

We need to find a way to stabilise these suites so that we can trust the results of our CI runs. The ideal solution is to figure out why k3d is so much slower than docker compose and fix that. If it's not possible to speed the tests up, we could execute a minimal k3d suite and retain the docker compose suites for the bulk of the testing.

Describe alternatives you've considered

We could just increase the GHA timeout, and that would probably work, but then we would have to wait over an hour for each build to complete, which is not acceptable.

garethbowen added the Flaky, Testing, and Type: Technical issue labels Oct 25, 2024
@dianabarsan (Member) commented Oct 30, 2024

From my experience, k3d is a lot slower when starting and stopping pods. Additionally, logs are lost when a pod is stopped, so to keep a log trail we need to save logs for all pods every time we stop a pod (because a crash in haproxy will also crash API).
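As a rough illustration, that log-saving step could be a small helper in the e2e harness along these lines; the namespace, output directory and helper name here are assumptions, not the actual cht-core code:

```js
// Hypothetical helper: dump logs for every pod in the test namespace before a
// pod is stopped, so the trail survives even if dependent pods crash too.
const { execSync } = require('child_process');
const fs = require('fs');
const path = require('path');

const NAMESPACE = 'cht-e2e'; // assumed namespace for the k3d e2e cluster
const LOG_DIR = path.join(__dirname, 'pod-logs'); // assumed output directory

const saveAllPodLogs = (label) => {
  fs.mkdirSync(LOG_DIR, { recursive: true });
  const pods = execSync(`kubectl get pods -n ${NAMESPACE} -o name`)
    .toString()
    .trim()
    .split('\n');
  for (const pod of pods) {
    const logs = execSync(`kubectl logs ${pod} -n ${NAMESPACE} --all-containers`).toString();
    const fileName = `${label}-${pod.replace('pod/', '')}.log`;
    fs.writeFileSync(path.join(LOG_DIR, fileName), logs);
  }
};

module.exports = { saveAllPodLogs };
```

Something like saveAllPodLogs('before-stopping-haproxy') would then be called before every pod stop.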

I think k3d generally requires more resources than docker on the tiny CI machine, so I do expect tests to be a little slower.
I think the solution is to work out a way to change the tests so that frequent or regular restarts are not required.
We frequently stop sentinel just to stop it from processing docs, for example, or to force it to run some startup queue instead of waiting for the regular 5-minute timer to elapse.

One intermediary solution would be to skip all tests that require pod restarts in k3d and see how that improves stability, but ideally we would find a way to change these tests so that they don't need service restarts, or find some other way to disconnect a service without stopping it.

dianabarsan self-assigned this Oct 30, 2024
@dianabarsan (Member)

After discussing with @garethbowen, we've come up with a plan to:

  • separate tests that genuinely require service restarts into their own suite; this suite should run over both docker and k3d.
  • update services (api? sentinel?) to respond to process signals: instead of stopping and starting a service to achieve some behaviour, set up hooks, triggered by signals, that run that code, and add code in the e2e tests that passes these signals to the containers/pods (see the sketch after this list).
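For illustration only, the signal-hook idea could look roughly like this inside the service; the specific signals and function names are assumptions, not the actual implementation:

```js
// Illustrative sketch: react to process signals instead of requiring a
// container stop/start to change behaviour. Names are hypothetical.
let transitionsEnabled = true;

const runScheduledTasks = async () => {
  // placeholder for the real scheduled-task runner
};

process.on('SIGUSR1', () => {
  // toggle transition processing without restarting the service
  transitionsEnabled = !transitionsEnabled;
  console.log(`transitions ${transitionsEnabled ? 'enabled' : 'disabled'}`);
});

process.on('SIGUSR2', () => {
  // run scheduled tasks immediately instead of waiting for the timer
  runScheduledTasks().catch(err => console.error('scheduled tasks failed', err));
});
```

The e2e tests would then deliver the signal to the running container or pod, for example with docker kill --signal=SIGUSR1 <container> under docker compose, or kubectl exec <pod> -- kill -USR1 1 under k3d (assuming the image ships a kill binary).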

dianabarsan moved this from Todo to In Progress in Product Team Activities Nov 19, 2024
dianabarsan added this to the 4.16.0 milestone Nov 19, 2024
dianabarsan added a commit that referenced this issue Nov 25, 2024
No longer start and stop sentinel to stop/start transitions and run scheduled tasks. Instead make it listen to two kill signals.
Kill the test process if API is down in the before hook. This way we might get logs when the process ends up hanging.

#9585
@dianabarsan (Member)

Added signals to sentinel to:

  • toggle transition processing on or off
  • run scheduled tasks

Added a killswitch in mocha that fires in a beforeEach when API is offline (a sketch of the idea is below).
When API fails to start after being scaled down, we should at least get some logs instead of timing out.
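A minimal sketch of how such a killswitch could look in the mocha hooks; the health-check URL and the log-dump helper are assumptions for illustration:

```js
// Hypothetical mocha hook: if API is down, collect logs and abort the run
// instead of letting the suite hang until the GHA timeout.
const { saveAllPodLogs } = require('./save-all-pod-logs'); // hypothetical helper from the earlier sketch

const isApiUp = async () => {
  try {
    const res = await fetch('http://localhost:5988/api/info'); // assumed health endpoint, global fetch (Node 18+)
    return res.ok;
  } catch (err) {
    return false;
  }
};

beforeEach(async () => {
  if (!(await isApiUp())) {
    console.error('API is down - collecting logs and aborting the run');
    saveAllPodLogs('api-down');
    process.exit(1);
  }
});
```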

@dianabarsan (Member)

Ok!
The API test failed, and my workaround to kill the process early and get logs succeeded: https://github.com/medic/cht-core/actions/runs/12038349738?pr=9670

The logs show the reason the API pod doesn't start up:

Error from server (BadRequest): container "cht-api" in pod "cht-api-56474cb78b-bh2rn" is waiting to start: trying and failing to pull image

This is probably a bug on the k3d side, because the container had been up and the image has not changed.

@dianabarsan (Member)

I found something relevant: kubernetes/kubernetes#123631

It seems that, at least for kubernetes, garbage collection can sometimes delete an image, which would explain this behaviour.
I'm going to try to work around it by creating a local registry.
