Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proc_loadavg: Sleep in loadavg thread should not slow down reloads #679

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

DeyanSG
Copy link

@DeyanSG DeyanSG commented Mar 17, 2025

This resolves the issue with reload described in #678

@mihalicyn mihalicyn self-requested a review March 18, 2025 14:20
@Blub
Copy link
Member

Blub commented Mar 19, 2025

The signal handler will set need_reload=1, which is a bit of a weird side effect.
It should be okay, since when we reach this code, we're either in do_reload() where that's already the case, or in main() during shutdown, but personally I don't like depending on this kind of knowledge, as it is a maintenance burden.
(I wonder if we should introduce a more specific signal there (SIGALRM is common to wake up sleeps I think, or some RT signal)

@mihalicyn
Copy link
Member

Hello @DeyanSG,

first of all, thanks for your report and proposed fix!

From #678:

This led us to conclude that we missed the wake-up time for the timer started by the usleep in the loadavg thread. We calculated that it would take about 70 minutes for the timer to overflow and reach the same value again, allowing normal operations to continue.

you are referring to this part of code:

	clock_t time1, time2;

	for (;;) {
		if (loadavg_stop == 1)
			return NULL;

		time1 = clock();
<...>
		if (loadavg_stop == 1)
			return NULL;

		time2 = clock();
		usleep(FLUSH_TIME * 1000000 -
		       (int)((time2 - time1) * 1000000 / CLOCKS_PER_SEC));

right?

Please, can you test the following simple fix:

--- a/src/proc_loadavg.c
+++ b/src/proc_loadavg.c
@@ -504,6 +504,7 @@ static void *load_begin(void *arg)
        int first_node, sum;
        struct load_node *f;
        clock_t time1, time2;
+       int sleep_time;
 
        for (;;) {
                if (loadavg_stop == 1)
@@ -542,8 +543,9 @@ static void *load_begin(void *arg)
                        return NULL;
 
                time2 = clock();
-               usleep(FLUSH_TIME * 1000000 -
-                      (int)((time2 - time1) * 1000000 / CLOCKS_PER_SEC));
+               sleep_time = FLUSH_TIME - (int)((time2 - time1) / CLOCKS_PER_SEC);
+               if ((sleep_time > 0) && (sleep_time <= FLUSH_TIME))
+                       usleep(sleep_time * 1000000);
        }
 }

Kind regards,
Alex

@Blub
Copy link
Member

Blub commented Mar 19, 2025

Ah yes, catching the overflow definitely makes sense. We'd still be a bit delayed, but that's much less problematic than a near-endless sleep call and can still be addressed separately.

@DeyanSG
Copy link
Author

DeyanSG commented Mar 20, 2025

Hi @mihalicyn and @Blub,

Thanks for looking into this.

It seems I somehow missed the overflow in the code and went on to look into the kernel code. Thanks for spotting this, @mihalicyn. We will deploy a patched version today that includes the suggested fix and some additional logging in case we skip the sleep, so we can see all the values.

The issue is tricky to reproduce, as we do not trigger these live migrations. They are part of some GCE maintenance, and so far, the issue during a migration seems to occur rarely. Nevertheless, we will do our best to test and see if we can determine whether this patch will solve the issue.

@Blub, I tested with SIGALRM and SIGVTALRM. However, sending these to the thread seems to kill the whole process. I am also not sure how good of an idea it is to set a custom handler for these signals. I also tested with SIGUSR2, and even without installing a handler for this signal, it seems to work correctly. Let me know if you are keen on merging this with SIGUSR2 (or another signal), or if we will stick to just the solution provided by @mihalicyn if it works.

Regards,
Deyan

@Blub
Copy link
Member

Blub commented Mar 20, 2025

SIGUSR2 is already used up by existing lxcfs code.

I think installing a handler for SIGALRM should be fine, we can choose ourselves how such signals should be dealt with, and SIGALRM is commonly for timeouts (eg. using the alarm(2) call which is meant to interrupt (without killing the process). (Eg. there are no file-locking methods with a timeout, so the only way to try such locks with a time out is to set a timer or alarm or emulate such a thing via threads+sleep)).
If we ever need that behavior we'd want to use timers (timer_create(2)) instead of alarm(2) anyway, as with timers you can target specific threads.

The thread might be sleeping. Make sure to interrupt so we can reload faster.

Signed-off-by: Deyan Doychev <[email protected]>
@DeyanSG
Copy link
Author

DeyanSG commented Mar 21, 2025

Hi @Blub ,

I've modified it to use SIGALRM as suggested and did a quick test. It seems to work fine this way.

Feel free to suggest more improvements.

Regards,
Deyan

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants