job: Close the health reporter instead of stopping #20

joamaki · 2024-10-10T15:49:52Z

When timer/one-shot/observer job finishes it should close the health reporting instead of stopping it in order not to accumulate endless number of health statuses for one-off etc. jobs.

When timer/one-shot/observer job finishes it should close the health reporting instead of stopping it in order not to accumulate endless number of health statuses for one-off etc. jobs. Signed-off-by: Jussi Maki <[email protected]>

joamaki · 2024-10-10T15:51:17Z

There's also another related issue on the cilium/cilium pkg/hive/health/provider.go side where the Stopped() method only marks StoppedAt time but doesn't set Level nor Message which can lead to confusion. Will do a separate PR for that.

(https://github.com/cilium/cilium/pull/35154/files#diff-f0f7633343b350adb76fe60850c162ed40b56279936699feffa2c0dce5587411)

joamaki · 2024-10-10T16:05:49Z

I'm slightly unsure of this change. Do we want the jobs themselves to close their health scopes, or should we just have them call Stopped and we'd Close the whole tree when the job.Group is stopped? That might not ever happen though... it's more likely the group stays running and it's used to spawn a lot of one-shot jobs. I do see the benefit of seeing that a one-shot or a timer job was stopped... another way to solve this is to do garbage collection of stopped health entries after awhile...

My main concern is that right now it's way too easy to leave a lot of these around and if we ever start spawning bunch of these with generated names we'll end up with a huge health table, so I feel like erring on the side of cleaning up might be better. Thoughts?

ovidiutirla · 2024-10-13T19:26:52Z

I'm more into regular GC of the stopped health entries. Keeping their status stopped as long as the cilium-agent runs sound like it could easily become a memory leak.

JobGroups might never get stopped, I wouldn't rely on that. (agree on this)

Could we mark them as stopped, the metric should report that they were stopped, and immediately after closing them.
Or in between stop and close have a slight delay instead of having a periodic gc?

Closing them instead of reporting as stopped could produce some weird visuals on the metrics which are not that intuitive.

job: Close the health reporter instead of stopping

af3edec

When timer/one-shot/observer job finishes it should close the health reporting instead of stopping it in order not to accumulate endless number of health statuses for one-off etc. jobs. Signed-off-by: Jussi Maki <[email protected]>

joamaki requested a review from a team as a code owner October 10, 2024 15:49

joamaki requested review from ovidiutirla and removed request for a team October 10, 2024 15:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

job: Close the health reporter instead of stopping #20

job: Close the health reporter instead of stopping #20

joamaki commented Oct 10, 2024

joamaki commented Oct 10, 2024 •

edited

Loading

joamaki commented Oct 10, 2024 •

edited

Loading

ovidiutirla commented Oct 13, 2024 •

edited

Loading

job: Close the health reporter instead of stopping #20

Are you sure you want to change the base?

job: Close the health reporter instead of stopping #20

Conversation

joamaki commented Oct 10, 2024

joamaki commented Oct 10, 2024 • edited Loading

joamaki commented Oct 10, 2024 • edited Loading

ovidiutirla commented Oct 13, 2024 • edited Loading

joamaki commented Oct 10, 2024 •

edited

Loading

joamaki commented Oct 10, 2024 •

edited

Loading

ovidiutirla commented Oct 13, 2024 •

edited

Loading