Fix signal handling during machine start on macOS by l0rd · Pull Request #28991 · podman-container-tools/podman

l0rd · 2026-06-19T21:48:09Z

Fixed the state of a machine when the podman machine start command is interrupted on macOS (the details of the problem have been described in this issue):

first problem: the machine command is interrupted but the VM process isn't (fixed by bd9b189)
second problem: after signal, machine start returns before machine cleanup completes (fixed by bab5c72)
third problem: error logged when stopping gvproxy during machine cleanup (fixed by b73c74a)
fourth problem: incorrect cleanup callbacks registration during machine start (fixed by 4f4b90b)
fifth problem: no automated test of a SIGTERM sent to podman machine start (fixed by 23a1432)

Related to #28318 (fix it on macOS, not Windows)

Checklist

Ensure you have completed the following checklist for your pull request to be reviewed:

Certify you wrote the patch or otherwise have the right to pass it on as an open-source patch by signing all commits (git commit -s). If needed, use git commit -s --amend. The author email must match the sign-off email address. See CONTRIBUTING.md for more information.
Referenced issues using Fixes: #00000 in commit message (if applicable)
Tests have been added/updated (or no tests are needed)
Documentation has been updated (or no documentation changes are needed)
All commits pass make validatepr (format/lint checks)
Release note entered in the section below (or None if no user-facing changes)

Does this PR introduce a user-facing change?

Fix signal handling during machine start on macOS

l0rd · 2026-06-22T08:16:03Z

Converted to draft as the new e2e test is failing consistently in CI:

2026-06-19T22:52:57.7245830Z ------------------------------
2026-06-19T22:52:57.7246230Z podman machine start
2026-06-19T22:52:57.7246810Z /Users/MacM1-4-worker/ci/podman/podman/pkg/machine/e2e/start_test.go:21
2026-06-19T22:52:57.7247590Z   start interrupted by SIGTERM while waiting for VM start
2026-06-19T22:52:57.7248460Z   /Users/MacM1-4-worker/ci/podman/podman/pkg/machine/e2e/start_test.go:354
2026-06-19T22:52:57.7249770Z   > Enter [BeforeEach] TOP-LEVEL - /Users/MacM1-4-worker/ci/podman/podman/pkg/machine/e2e/machine_test.go:218 @ 06/19/26 22:52:57.724
2026-06-19T22:52:57.7251370Z   < Exit [BeforeEach] TOP-LEVEL - /Users/MacM1-4-worker/ci/podman/podman/pkg/machine/e2e/machine_test.go:218 @ 06/19/26 22:52:57.724 (1ms)
2026-06-19T22:52:57.7253200Z   > Enter [It] start interrupted by SIGTERM while waiting for VM start - /Users/MacM1-4-worker/ci/podman/podman/pkg/machine/e2e/start_test.go:354 @ 06/19/26 22:52:57.724
2026-06-19T22:52:57.7255230Z   /Users/MacM1-4-worker/ci/podman/podman/bin/darwin/podman machine init --disk-size 11 --image /private/tmp/ci/podman-machine.aarch64.applehv.raw 56b7f309dae6
2026-06-19T22:53:04.7094000Z   Machine init complete
2026-06-19T22:53:04.7094690Z   To start your machine run:
2026-06-19T22:53:04.7095090Z 
2026-06-19T22:53:04.7095400Z   	podman machine start 56b7f309dae6
2026-06-19T22:53:04.7095830Z 
2026-06-19T22:53:04.7201930Z   /Users/MacM1-4-worker/ci/podman/podman/bin/darwin/podman machine start 56b7f309dae6
2026-06-19T22:53:04.7543600Z   Starting machine "56b7f309dae6"
2026-06-19T22:53:05.7919990Z   Received a terminate signal
2026-06-19T22:53:05.7920320Z 
2026-06-19T22:53:05.7921040Z   This machine is currently configured in rootless mode. If your containers
2026-06-19T22:53:05.7921560Z   require root permissions (e.g. ports < 1024), or if you run into compatibility
2026-06-19T22:53:05.7922030Z   issues with non-podman clients, you can switch using the following command:
2026-06-19T22:53:05.7933810Z 
2026-06-19T22:53:05.7933950Z   	podman machine set --rootful 56b7f309dae6
2026-06-19T22:53:05.7934120Z 
2026-06-19T22:53:05.7982810Z   Machine command rollback completed
2026-06-19T22:53:05.8027380Z   /Users/MacM1-4-worker/ci/podman/podman/bin/darwin/podman machine inspect --format {{.State}} 56b7f309dae6
2026-06-19T22:53:05.8160760Z   running
2026-06-19T22:53:05.8278510Z   [FAILED] Expected
2026-06-19T22:53:05.8278850Z       <string>: running
2026-06-19T22:53:05.8279040Z   to equal
2026-06-19T22:53:05.8279200Z       <string>: stopped
2026-06-19T22:53:05.8279660Z   In [It] at: /Users/MacM1-4-worker/ci/podman/podman/pkg/machine/e2e/start_test.go:395 @ 06/19/26 22:53:05.827
2026-06-19T22:53:05.8280010Z 
2026-06-19T22:53:05.8280080Z   Full Stack Trace
2026-06-19T22:53:05.8280380Z     go.podman.io/podman/v6/pkg/machine/e2e_test.init.func21.11()
2026-06-19T22:53:05.8280900Z     	/Users/MacM1-4-worker/ci/podman/podman/pkg/machine/e2e/start_test.go:395 +0x61c
2026-06-19T22:53:05.8285430Z   < Exit [It] start interrupted by SIGTERM while waiting for VM start - /Users/MacM1-4-worker/ci/podman/podman/pkg/machine/e2e/start_test.go:354 @ 06/19/26 22:53:05.827 (8.103s)
2026-06-19T22:53:05.8286370Z   > Enter [AfterEach] TOP-LEVEL - /Users/MacM1-4-worker/ci/podman/podman/pkg/machine/e2e/machine_test.go:87 @ 06/19/26 22:53:05.827
2026-06-19T22:53:05.8287150Z   < Exit [AfterEach] TOP-LEVEL - /Users/MacM1-4-worker/ci/podman/podman/pkg/machine/e2e/machine_test.go:87 @ 06/19/26 22:53:05.827 (0s)
2026-06-19T22:53:05.8288000Z   > Enter [DeferCleanup (Each)] podman machine start - /Users/MacM1-4-worker/ci/podman/podman/pkg/machine/e2e/config_init_test.go:95 @ 06/19/26 22:53:05.827
2026-06-19T22:53:05.8288710Z   /Users/MacM1-4-worker/ci/podman/podman/bin/darwin/podman machine rm --force 56b7f309dae6
2026-06-19T22:54:36.1722720Z   time="2026-06-19T22:54:36Z" level=warning msg="Failed to gracefully stop machine, performing hard stop"
2026-06-19T22:54:38.2176650Z   < Exit [DeferCleanup (Each)] podman machine start - /Users/MacM1-4-worker/ci/podman/podman/pkg/machine/e2e/config_init_test.go:95 @ 06/19/26 22:54:38.217 (1m32.391s)
2026-06-19T22:54:38.2177700Z   > Enter [DeferCleanup (Each)] TOP-LEVEL - /Users/MacM1-4-worker/ci/podman/podman/pkg/machine/e2e/machine_test.go:220 @ 06/19/26 22:54:38.217
2026-06-19T22:54:38.2191690Z   < Exit [DeferCleanup (Each)] TOP-LEVEL - /Users/MacM1-4-worker/ci/podman/podman/pkg/machine/e2e/machine_test.go:220 @ 06/19/26 22:54:38.219 (2ms)
2026-06-19T22:54:38.2192330Z • [FAILED] [100.496 seconds]
2026-06-19T22:54:38.2192610Z podman machine start
2026-06-19T22:54:38.2192940Z /Users/MacM1-4-worker/ci/podman/podman/pkg/machine/e2e/start_test.go:21
2026-06-19T22:54:38.2193360Z   [It] start interrupted by SIGTERM while waiting for VM start
2026-06-19T22:54:38.2194130Z   /Users/MacM1-4-worker/ci/podman/podman/pkg/machine/e2e/start_test.go:354

packit-as-a-service · 2026-06-23T14:32:52Z

Cockpit tests failed for commit 08bf3e6. @jelly, @mvollmer please check.

packit-as-a-service · 2026-06-23T15:40:53Z

[NON-BLOCKING] Packit jobs failed. @containers/packit-build please check. Everyone else, feel free to ignore.

l0rd · 2026-06-24T06:50:03Z

All tests are green. The PR is ready for review PTAL @podman-container-tools/podman-maintainers @podman-container-tools/podman-reviewers

Honny1 · 2026-06-24T08:15:58Z

+	// a termination signal is sent
+	callbackFuncs := machine.CleanUp()
+	defer callbackFuncs.CleanIfErr(&err)
+	go callbackFuncs.CleanOnSignal()


Is this goroutine stopped on a successful start machine?

I have added a comment because I think you are right; at least we should make it easier to understand for someone reading the code. Anyway, the goroutines are forced to terminate when the main goroutine terminates. This is not ideal, but it's how these cleanup callbackFuncs are currently used (see Init and Start in the current main branch). Considering that podman machine start is a short-lived command, this is not a big problem. And this PR is already big enough.
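One way the signal goroutine could be stopped after a successful start is sketched below. This is a minimal illustration, not the code in this PR: `watchSignals` and its `stop` function are hypothetical names, and the real `CleanOnSignal` may be structured differently.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"sync"
	"syscall"
)

// watchSignals is a hypothetical helper, not part of podman. It runs cleanup
// if SIGINT or SIGTERM arrives and returns a stop function that ends the
// goroutine and unregisters the handler.
func watchSignals(cleanup func()) (stop func()) {
	sigs := make(chan os.Signal, 1)
	done := make(chan struct{})
	finished := make(chan struct{})
	signal.Notify(sigs, os.Interrupt, syscall.SIGTERM)

	go func() {
		defer close(finished)
		select {
		case <-sigs:
			cleanup()
		case <-done:
			// Start succeeded: nothing to clean up.
		}
	}()

	var once sync.Once
	return func() {
		once.Do(func() {
			signal.Stop(sigs)
			close(done)
			<-finished // wait until the goroutine has really exited
		})
	}
}

func main() {
	stop := watchSignals(func() { fmt.Println("signal received, cleaning up") })
	defer stop()

	// ... the machine would be started here ...
	fmt.Println("machine started")
}
```

Calling `stop` more than once is harmless because of the `sync.Once`. If a signal arrives at the same moment as `stop`, either branch of the `select` may win, so the caller still has to tolerate the cleanup having run.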

Luap99

(Not a full review; GitHub got bugged and I cannot add more comments.)

Luap99 · 2026-06-24T10:58:02Z

+	// Start a goroutine that will eventually propagate a
+	// terminate signal to the VM process.
+	// If the user decides to abort `podman machine start`
+	// while the VM is starting, we want the VM to be stopped
+	// too. Or the machine will be left in an inconsistent
+	// state.
+	term := make(chan os.Signal, 1)
+	signal.Notify(term, os.Interrupt, syscall.SIGTERM)
+	go func() {


Setting up yet another signal handler seems to complicate things rather than simplify them. We already have one set up in the main start, so we should reuse that: return a proper cleanup function which kills the right processes and then trigger it there.

The more listeners we have, the more racy everything gets.

I agree. This is the big problem with this approach. But returning a proper cleanup function would require changing the interface and implementations across all our providers. Something I wanted to avoid or, at least, do gradually.

Well, multiple listeners are racy. There is no guarantee that this one will be called at all before the other listener does the os.Exit(1), so returning the cleanup function is the only sane way to keep this correct.

They are racy. You don't need to convince me on that. There are edge cases where the VM doesn't get killed. But that's still better than never killing the VM on a signal. This is a design problem for all providers (note that we spawn 2 more listeners for hyperv: 1 and 2) and I wanted to limit the scope of the PR.

But that's how it goes. You fix one thing, and then you get asked to fix more :-) I can give it a try with a new cleanup mechanism implemented on macOS.
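The design suggested above (one listener, with each start step returning its own cleanup function) could look roughly like the sketch below. All names (`cleanupFunc`, `startProcess`, `runCleanups`) are hypothetical and do not reflect the provider interface in podman; the signal listener itself is left out so the ordering stays easy to follow.

```go
package main

import "fmt"

// cleanupFunc undoes one step of the start sequence.
type cleanupFunc func() error

// startProcess stands in for a provider call such as starting gvproxy or the
// VM (hypothetical, not the podman interface). It installs no signal handler
// of its own; it returns the function that stops what it started.
func startProcess(name string, log *[]string) cleanupFunc {
	*log = append(*log, "start "+name)
	return func() error {
		*log = append(*log, "stop "+name)
		return nil
	}
}

// runCleanups calls fns in reverse registration order and returns the
// number of functions that failed.
func runCleanups(fns []cleanupFunc) int {
	failed := 0
	for i := len(fns) - 1; i >= 0; i-- {
		if err := fns[i](); err != nil {
			failed++
		}
	}
	return failed
}

func main() {
	var log []string
	var cleanups []cleanupFunc
	cleanups = append(cleanups, startProcess("gvproxy", &log))
	cleanups = append(cleanups, startProcess("vm", &log))

	// The single signal listener owned by the start command would call this.
	runCleanups(cleanups)
	fmt.Println(log) // [start gvproxy start vm stop vm stop gvproxy]
}
```

With this shape only the start command listens for signals, so there is no second listener that could lose the race against os.Exit(1).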

Luap99 · 2026-06-24T11:04:17Z

+	// To enforce that, the for loop is wrapped with WaitGroup.Go(), and
+	// execution is blocked until all goroutines have completed with
+	// WaitGroup.Wait().
+	c.wait.Go(func() {
+		// Cleanup functions invoked in reverse registration order
+		for _, cleanfunc := range slices.Backward(funcs) {
+			if err := cleanfunc(); err != nil {
+				logrus.Error(err)
+			}
 		}
-	}
+	})
+	c.wait.Wait()


I am super confused here; I do not really see how this fixes anything. The WaitGroup basically just acts as an extra mutex, but there is no need to use a WaitGroup at all.

If I understand the problem correctly, simply move the c.mu.Unlock() to the end of the function; there should be no risk in keeping this locked for longer.

Moving the c.mu.Unlock to the end of the function should work too, yes. But you don't see how this fixes anything?

Luap99 · 2026-06-24T11:08:05Z

+	// Start a goroutine that waits until the gvproxy process completes.
+	// Gvproxy should not exit during the machine startup but, if it does
+	// because of an error or a TERM signal, we want to release the
+	// associated resources and reap the process.
+	// Otherwise the gvproxy completed process will result as a zombie and,
+	// most importantly, machine.backoffForProcess fails because
+	// Process.IsRunning() returns true
+	go func() {
+		if err := c.Wait(); err != nil {
+			logrus.Debugf("gvproxy exited: %v", err)
+		}
+	}()


That leaks the goroutine forever, which is not right.

podman machine start is a short-lived process, so that does not matter too much, but the same can be said for the zombie state: if the parent exits, gvproxy gets init as its parent, which must reap it.

I think doing the wait on the cleanup code path, when we know we killed the process, is better.

Sure, I can move this goroutine there.

Luap99 · 2026-06-24T11:13:00Z

+	if err = p.Terminate(); err != nil {
+		if errors.Is(err, syscall.ESRCH) {
+			logrus.Debugf("Gvproxy already dead, exiting cleanly")
+			return nil
+		}
+		return err
+	}
+
+	if err = backoffForProcess(p); err == nil {
+		return nil
+	}


Do we need this? Is there any reason why a SIGKILL alone is not enough? I don't think gvproxy has any state, so I doubt it matters, and I would rather save the few lines of code.

Trying with a sigterm and, if it fails, sending a sigkill seems cleaner. Especially because we are doing this for gvproxy, but we could (should?) reuse the cleanup code for other processes that are started as well. Also, the comment on this function already describes this behavior (the code was updated, but not the comment?), so I was trying to clean things up.

Luap99 · 2026-06-24T11:17:17Z

+	// callbackFuncs are invoked when errors occur or term signals
+	// are received. Thus we need to defer startingFalse for when
+	// Start() completes successfully
+	defer func() { _ = startingFalse() }()


Well, that always runs, even if the function returns an error.

So would it not be the right thing to just move the

mc.Starting = false
mc.Write()

to the end of the function instead of using defer at all?

Yes. If an error occurs, those will be called twice. That's why I have added the sync.Once. That happens 4 times in the function Start(): mcunlock, startLockUnlock, startingFalse, releaseCmd. Moving it to the end is ok, but when doing a change or adding a new operation, we may easily forget to update one of the two.

The root of the problem is duplication. To avoid it, we should change how CallbackFuncs.Add works (a new parameter to specify when it should be called: signal, error, success). Not sure if this should be part of the PR, though.
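The idea of an Add that takes a "when" parameter could be sketched as follows. Nothing here exists in podman today: the `trigger` type, its constants and the `callbacks` registry are hypothetical and only illustrate how a function could be registered once for several outcomes.

```go
package main

import "fmt"

// trigger says which outcome of Start() a callback applies to.
type trigger uint8

const (
	onSignal trigger = 1 << iota
	onError
	onSuccess
)

type callback struct {
	when trigger
	fn   func()
}

// callbacks is a hypothetical variant of the cleanup registry in which each
// function is registered once, together with the outcomes it applies to.
type callbacks struct {
	entries []callback
}

func (c *callbacks) add(when trigger, fn func()) {
	c.entries = append(c.entries, callback{when: when, fn: fn})
}

// run calls, in reverse registration order, the functions registered for t.
func (c *callbacks) run(t trigger) {
	for i := len(c.entries) - 1; i >= 0; i-- {
		if c.entries[i].when&t != 0 {
			c.entries[i].fn()
		}
	}
}

func main() {
	var c callbacks
	// Runs on every outcome, so no separate defer plus sync.Once is needed.
	c.add(onSignal|onError|onSuccess, func() { fmt.Println("starting = false") })
	// Runs only when the start is aborted.
	c.add(onSignal|onError, func() { fmt.Println("stop VM") })

	c.run(onSuccess) // prints only "starting = false"
}
```

Because each function is registered exactly once, operations such as mcunlock or startingFalse would no longer need to appear both in a defer and in the callback list.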


mheon · 2026-06-24T18:21:03Z

+		// Otherwise, when errors occur, it's called twice and panics:
+		//   - defer mcunlock()
+		//   - defer callbackFuncs.CleanIfErr()
+		var mcUnlockOnce sync.Once


This feels really gross, but I don't have an immediate suggestion for how to avoid it

mheon · 2026-06-24T18:24:44Z

I don't like the locking here, but I don't see an easy solution to resolve it without some sort of trylock semantics I don't think we have on these locks.

LGTM on the whole

Honny1

LGTM

inknos · 2026-06-25T14:59:05Z

Went through it and I agree with the comments from @mheon and @Luap99.

I would only ask to make sure that the comment about SIGKILL is addressed. I didn't quite understand the final solution :)

Besides this doubt, the PR LGTM.

github-actions Bot added the machine label Jun 19, 2026

l0rd force-pushed the fix-machine-start-signal-handling branch from 23a1432 to 7ea17e4 Compare June 19, 2026 22:27

l0rd marked this pull request as draft June 22, 2026 08:13

l0rd force-pushed the fix-machine-start-signal-handling branch 2 times, most recently from c120469 to 08bf3e6 Compare June 23, 2026 14:01

l0rd force-pushed the fix-machine-start-signal-handling branch from 08bf3e6 to 5a1d276 Compare June 23, 2026 15:00

l0rd force-pushed the fix-machine-start-signal-handling branch 3 times, most recently from b8fa1cf to 025dba4 Compare June 23, 2026 22:03

l0rd marked this pull request as ready for review June 24, 2026 06:48

Honny1 reviewed Jun 24, 2026

l0rd force-pushed the fix-machine-start-signal-handling branch from 025dba4 to 8b93ec5 Compare June 24, 2026 11:06

Luap99 reviewed Jun 24, 2026

l0rd force-pushed the fix-machine-start-signal-handling branch from 8b93ec5 to 517528b Compare June 24, 2026 15:43

l0rd added 5 commits June 24, 2026 18:52

Propagate SIGTERM to the VM process during machine start on macOS

c694574

Related to podman-container-tools#28318 Signed-off-by: Mario Loriedo <mario.loriedo@gmail.com>

Machine: extend lock to ensure cleanup callbacks completion

37c8dea

Signed-off-by: Mario Loriedo <mario.loriedo@gmail.com>

Reap completed gvproxy process if machine start fails on Unix

2f3e645

Signed-off-by: Mario Loriedo <mario.loriedo@gmail.com>

Fix cleanup callbacks registration in machines Start() function

f5d5159

Signed-off-by: Mario Loriedo <mario.loriedo@gmail.com>

Add new machine test that covers interrupted start command

12ea195

Signed-off-by: Mario Loriedo <mario.loriedo@gmail.com>

l0rd force-pushed the fix-machine-start-signal-handling branch from 517528b to 12ea195 Compare June 24, 2026 16:59

mheon reviewed Jun 24, 2026

Honny1 reviewed Jun 25, 2026

5 participants
