Fix signal handling during machine start on macOS#28991
Conversation
23a1432 to
7ea17e4
Compare
|
Converted to draft as the new e2e test is failing consistently in CI: |
c120469 to
08bf3e6
Compare
08bf3e6 to
5a1d276
Compare
|
[NON-BLOCKING] Packit jobs failed. @containers/packit-build please check. Everyone else, feel free to ignore. |
b8fa1cf to
025dba4
Compare
|
All tests are green. The PR is ready for review PTAL @podman-container-tools/podman-maintainers @podman-container-tools/podman-reviewers |
| // a termination signal is sent | ||
| callbackFuncs := machine.CleanUp() | ||
| defer callbackFuncs.CleanIfErr(&err) | ||
| go callbackFuncs.CleanOnSignal() |
There was a problem hiding this comment.
Is this goroutine stopped on a successful start machine?
There was a problem hiding this comment.
I have added a comment because I think you are right; at least we should make it easier to understand for someone reading the code. Anyway, the goroutines are forced to terminate when the main goroutine terminates. This is not ideal, but it's how these cleanup callbackFuncs are currently used (see Init and Start in the current main branch). Considering that podman machine start is a short-lived command, this is not a big problem. And this PR is already big enough.
025dba4 to
8b93ec5
Compare
Luap99
left a comment
There was a problem hiding this comment.
(not a full review github got bugged and I Cannot add more comments)
| // Start a goroutine that will evenutally propagte a | ||
| // terminate signal to the VM process. | ||
| // If the user decides to abort `podman machine start` | ||
| // while the VM is starting, we want the VM to be stopped | ||
| // too. Or the machine will be left in an inconsistent | ||
| // state. | ||
| term := make(chan os.Signal, 1) | ||
| signal.Notify(term, os.Interrupt, syscall.SIGTERM) | ||
| go func() { |
There was a problem hiding this comment.
Setting up yet another signal handler seems to complicate things instead of simplifying. We already have one setup in the main start so we should reuse that, return a proper cleanup function which kill the right processes and than trigger these there.
The more listeners we have to more racy everything gets.
There was a problem hiding this comment.
I agree. This is the big problem with this approach. But returning a proper cleanup function would require changing the interface and implementations across all our providers. Something I wanted to avoid or, at least, do gradually.
There was a problem hiding this comment.
well multiple listeners are racy. there is no guarantee that this here will be called at all before the other listener does the os.Exit(1) so return the cleanup function is the only sane way to keep this correct
There was a problem hiding this comment.
They are racy. You don't need to convince me on that. There are edge cases where the VM doesn't get killed. But that's still better than never killing the VM on a signal. This is a design problem for all providers (note that we spawn 2 more listeners for hyperv: 1 and 2) and I wanted to limit the scope of the PR.
But that's how it goes. You fix one thing, and then you get asked to fix more :-) I can give it a try with a new cleanup mechanism implemented on macOS.
| // To enforce that, the for loop is wrapped with WaitGroup.Go(), and | ||
| // execution is blocked until all goroutines have completed with | ||
| // WaitGroup.Wait(). | ||
| c.wait.Go(func() { | ||
| // Cleanup functions invoked in reverse registration order | ||
| for _, cleanfunc := range slices.Backward(funcs) { | ||
| if err := cleanfunc(); err != nil { | ||
| logrus.Error(err) | ||
| } | ||
| } | ||
| } | ||
| }) | ||
| c.wait.Wait() |
There was a problem hiding this comment.
I am super confused here, I do not see how this fixes anything really. The waitgroup just acts as an extra mutex basically but there i no need to use a wg at all.
If I see the problem right simply move the c.mu.Unlock() to the end of the function, there should be no risk keeping this locked for longer.
There was a problem hiding this comment.
Moving the c.mu.Unlock to the end of the function should work too, yes. But you don't see how this fixes anything?
| // Start a goroutine that waits until the gvproxy process completes. | ||
| // Gvproxy should not exit during the machine startup but, if it does | ||
| // because of an error or a TERM signal, we want to release the | ||
| // associated resources and reap the process. | ||
| // Otherwise the gvproxy completed process will result as a zombie and, | ||
| // most importantly, machine.backoffForProcess fails because | ||
| // Process.IsRunning() returns true | ||
| go func() { | ||
| if err := c.Wait(); err != nil { | ||
| logrus.Debugf("gvproxy exited: %v", err) | ||
| } | ||
| }() |
There was a problem hiding this comment.
that leaks the goroutine forever which is not right.
podman machine start is a short lived process so that does not matter to much but the same can be said for the zombie state, if the parent exists gvproxy get the ini as parent which must reap it.
I think doing the wait on the cleanup code path when we know we killed the process is better.
There was a problem hiding this comment.
Sure, I can move this goroutine there.
| if err = p.Terminate(); err != nil { | ||
| if errors.Is(err, syscall.ESRCH) { | ||
| logrus.Debugf("Gvproxy already dead, exiting cleanly") | ||
| return nil | ||
| } | ||
| return err | ||
| } | ||
|
|
||
| if err = backoffForProcess(p); err == nil { | ||
| return nil | ||
| } |
There was a problem hiding this comment.
Do we need this? any reason why not just sigkill is enough? I don't thin gvpoxy has any state so I doubt it matters and I Rather safe the few lines of code.
There was a problem hiding this comment.
Trying with a sigterm and, if it fails, sending a sigkill seems cleaner. Especially because we are doing this for gvproxy, but we could (should?) reuse the cleanup code for other processes that are started as well. Also, the comment on this function already describes this behavior (the code was updated, but not the comment?), so I was trying to clean things up.
| // callbackFuncs are invoked when errors occurs or term signals | ||
| // are received. Thus we need to defer startingFalse for when | ||
| // Start() completes successfully | ||
| defer func() { _ = startingFalse() }() |
There was a problem hiding this comment.
well that runs always, even if the function returns an error.
so would it not be the right thing to just move the
mc.Starting = false
mc.Write()
to the end of the fucntion instead of using defer at all
There was a problem hiding this comment.
Yes. If an error occurs, those will be called twice. That's why I have added the sync.Once. That happens 4 times in the function Start(): mcunlock, startLockUnlock, startingFalse, releaseCmd. Moving it to the end is ok, but when doing a change or adding a new operation, we may easily forget to update one of the two.
The root of the problem is duplication. And to avoid it, we should change how CallbackFuncs.Add work (a new parameter to specify when it should be called: signal, error, success). Not sure if this should be part of the PR, though.
8b93ec5 to
517528b
Compare
Related to podman-container-tools#28318 Signed-off-by: Mario Loriedo <mario.loriedo@gmail.com>
Signed-off-by: Mario Loriedo <mario.loriedo@gmail.com>
Signed-off-by: Mario Loriedo <mario.loriedo@gmail.com>
Signed-off-by: Mario Loriedo <mario.loriedo@gmail.com>
Signed-off-by: Mario Loriedo <mario.loriedo@gmail.com>
517528b to
12ea195
Compare
| // Otherwise, when errors occur, it's called twice and panics: | ||
| // - defer mcunlock() | ||
| // - defer callbackFuncs.CleanIfErr() | ||
| var mcUnlockOnce sync.Once |
There was a problem hiding this comment.
This feels really gross, but I don't have an immediate suggestion for how to avoid it
|
I don't like the locking here, but I don't see an easy solution to resolve it without some sort of trylock semantics I don't think we have on these locks. LGTM on the whole |
Fixed the state of a machine when the
podman machine startcommand is interrupted on macOS (the details of the problem have been described in this issue):podman machine start(fixed by 23a1432)Related to #28318 (fix it on macOS, not Windows)
Checklist
Ensure you have completed the following checklist for your pull request to be reviewed:
commits. (
git commit -s). (If needed, usegit commit -s --amend). The author email must matchthe sign-off email address. See CONTRIBUTING.md
for more information.
Fixes: #00000in commit message (if applicable)make validatepr(format/lint checks)Noneif no user-facing changes)Does this PR introduce a user-facing change?