Fix signal handling during machine start on macOS #28991

Open
l0rd wants to merge 5 commits into podman-container-tools:main from l0rd:fix-machine-start-signal-handling

Conversation

@l0rd

@l0rd l0rd commented Jun 19, 2026

Contributor

Fixed the state of a machine when the podman machine start command is interrupted on macOS (the details of the problem have been described in this issue):

  • first problem: the machine command is interrupted but the VM process isn't (fixed by bd9b189)
  • second problem: after signal, machine start returns before machine cleanup completes (fixed by bab5c72)
  • third problem: error logged when stopping gvproxy during machine cleanup (fixed by b73c74a)
  • fourth problem: incorrect cleanup callbacks registration during machine start (fixed by 4f4b90b)
  • fifth problem: no automated test of a SIGTERM sent to podman machine start (fixed by 23a1432)

Related to #28318 (fix it on macOS, not Windows)

Checklist

Ensure you have completed the following checklist for your pull request to be reviewed:

  • Certify you wrote the patch or otherwise have the right to pass it on as an open-source patch by signing all
    commits. (git commit -s). (If needed, use git commit -s --amend). The author email must match
    the sign-off email address. See CONTRIBUTING.md
    for more information.
  • Referenced issues using Fixes: #00000 in commit message (if applicable)
  • Tests have been added/updated (or no tests are needed)
  • Documentation has been updated (or no documentation changes are needed)
  • All commits pass make validatepr (format/lint checks)
  • Release note entered in the section below (or None if no user-facing changes)

Does this PR introduce a user-facing change?

Fix signal handling during machine start on macOS

@l0rd l0rd force-pushed the fix-machine-start-signal-handling branch from 23a1432 to 7ea17e4 Compare June 19, 2026 22:27
@l0rd l0rd marked this pull request as draft June 22, 2026 08:13
@l0rd

l0rd commented Jun 22, 2026

Contributor Author

Converted to draft as the new e2e test is failing consistently in CI:

2026-06-19T22:52:57.7245830Z ------------------------------
2026-06-19T22:52:57.7246230Z podman machine start
2026-06-19T22:52:57.7246810Z /Users/MacM1-4-worker/ci/podman/podman/pkg/machine/e2e/start_test.go:21
2026-06-19T22:52:57.7247590Z   start interrupted by SIGTERM while waiting for VM start
2026-06-19T22:52:57.7248460Z   /Users/MacM1-4-worker/ci/podman/podman/pkg/machine/e2e/start_test.go:354
2026-06-19T22:52:57.7249770Z   > Enter [BeforeEach] TOP-LEVEL - /Users/MacM1-4-worker/ci/podman/podman/pkg/machine/e2e/machine_test.go:218 @ 06/19/26 22:52:57.724
2026-06-19T22:52:57.7251370Z   < Exit [BeforeEach] TOP-LEVEL - /Users/MacM1-4-worker/ci/podman/podman/pkg/machine/e2e/machine_test.go:218 @ 06/19/26 22:52:57.724 (1ms)
2026-06-19T22:52:57.7253200Z   > Enter [It] start interrupted by SIGTERM while waiting for VM start - /Users/MacM1-4-worker/ci/podman/podman/pkg/machine/e2e/start_test.go:354 @ 06/19/26 22:52:57.724
2026-06-19T22:52:57.7255230Z   /Users/MacM1-4-worker/ci/podman/podman/bin/darwin/podman machine init --disk-size 11 --image /private/tmp/ci/podman-machine.aarch64.applehv.raw 56b7f309dae6
2026-06-19T22:53:04.7094000Z   Machine init complete
2026-06-19T22:53:04.7094690Z   To start your machine run:
2026-06-19T22:53:04.7095090Z 
2026-06-19T22:53:04.7095400Z   	podman machine start 56b7f309dae6
2026-06-19T22:53:04.7095830Z 
2026-06-19T22:53:04.7201930Z   /Users/MacM1-4-worker/ci/podman/podman/bin/darwin/podman machine start 56b7f309dae6
2026-06-19T22:53:04.7543600Z   Starting machine "56b7f309dae6"
2026-06-19T22:53:05.7919990Z   Received a terminate signal
2026-06-19T22:53:05.7920320Z 
2026-06-19T22:53:05.7921040Z   This machine is currently configured in rootless mode. If your containers
2026-06-19T22:53:05.7921560Z   require root permissions (e.g. ports < 1024), or if you run into compatibility
2026-06-19T22:53:05.7922030Z   issues with non-podman clients, you can switch using the following command:
2026-06-19T22:53:05.7933810Z 
2026-06-19T22:53:05.7933950Z   	podman machine set --rootful 56b7f309dae6
2026-06-19T22:53:05.7934120Z 
2026-06-19T22:53:05.7982810Z   Machine command rollback completed
2026-06-19T22:53:05.8027380Z   /Users/MacM1-4-worker/ci/podman/podman/bin/darwin/podman machine inspect --format {{.State}} 56b7f309dae6
2026-06-19T22:53:05.8160760Z   running
2026-06-19T22:53:05.8278510Z   [FAILED] Expected
2026-06-19T22:53:05.8278850Z       <string>: running
2026-06-19T22:53:05.8279040Z   to equal
2026-06-19T22:53:05.8279200Z       <string>: stopped
2026-06-19T22:53:05.8279660Z   In [It] at: /Users/MacM1-4-worker/ci/podman/podman/pkg/machine/e2e/start_test.go:395 @ 06/19/26 22:53:05.827
2026-06-19T22:53:05.8280010Z 
2026-06-19T22:53:05.8280080Z   Full Stack Trace
2026-06-19T22:53:05.8280380Z     go.podman.io/podman/v6/pkg/machine/e2e_test.init.func21.11()
2026-06-19T22:53:05.8280900Z     	/Users/MacM1-4-worker/ci/podman/podman/pkg/machine/e2e/start_test.go:395 +0x61c
2026-06-19T22:53:05.8285430Z   < Exit [It] start interrupted by SIGTERM while waiting for VM start - /Users/MacM1-4-worker/ci/podman/podman/pkg/machine/e2e/start_test.go:354 @ 06/19/26 22:53:05.827 (8.103s)
2026-06-19T22:53:05.8286370Z   > Enter [AfterEach] TOP-LEVEL - /Users/MacM1-4-worker/ci/podman/podman/pkg/machine/e2e/machine_test.go:87 @ 06/19/26 22:53:05.827
2026-06-19T22:53:05.8287150Z   < Exit [AfterEach] TOP-LEVEL - /Users/MacM1-4-worker/ci/podman/podman/pkg/machine/e2e/machine_test.go:87 @ 06/19/26 22:53:05.827 (0s)
2026-06-19T22:53:05.8288000Z   > Enter [DeferCleanup (Each)] podman machine start - /Users/MacM1-4-worker/ci/podman/podman/pkg/machine/e2e/config_init_test.go:95 @ 06/19/26 22:53:05.827
2026-06-19T22:53:05.8288710Z   /Users/MacM1-4-worker/ci/podman/podman/bin/darwin/podman machine rm --force 56b7f309dae6
2026-06-19T22:54:36.1722720Z   time="2026-06-19T22:54:36Z" level=warning msg="Failed to gracefully stop machine, performing hard stop"
2026-06-19T22:54:38.2176650Z   < Exit [DeferCleanup (Each)] podman machine start - /Users/MacM1-4-worker/ci/podman/podman/pkg/machine/e2e/config_init_test.go:95 @ 06/19/26 22:54:38.217 (1m32.391s)
2026-06-19T22:54:38.2177700Z   > Enter [DeferCleanup (Each)] TOP-LEVEL - /Users/MacM1-4-worker/ci/podman/podman/pkg/machine/e2e/machine_test.go:220 @ 06/19/26 22:54:38.217
2026-06-19T22:54:38.2191690Z   < Exit [DeferCleanup (Each)] TOP-LEVEL - /Users/MacM1-4-worker/ci/podman/podman/pkg/machine/e2e/machine_test.go:220 @ 06/19/26 22:54:38.219 (2ms)
2026-06-19T22:54:38.2192330Z • [FAILED] [100.496 seconds]
2026-06-19T22:54:38.2192610Z podman machine start
2026-06-19T22:54:38.2192940Z /Users/MacM1-4-worker/ci/podman/podman/pkg/machine/e2e/start_test.go:21
2026-06-19T22:54:38.2193360Z   [It] start interrupted by SIGTERM while waiting for VM start
2026-06-19T22:54:38.2194130Z   /Users/MacM1-4-worker/ci/podman/podman/pkg/machine/e2e/start_test.go:354

@l0rd l0rd force-pushed the fix-machine-start-signal-handling branch 2 times, most recently from c120469 to 08bf3e6 Compare June 23, 2026 14:01
@packit-as-a-service


Cockpit tests failed for commit 08bf3e6. @jelly, @mvollmer please check.

@l0rd l0rd force-pushed the fix-machine-start-signal-handling branch from 08bf3e6 to 5a1d276 Compare June 23, 2026 15:00
@packit-as-a-service


[NON-BLOCKING] Packit jobs failed. @containers/packit-build please check. Everyone else, feel free to ignore.

@l0rd l0rd force-pushed the fix-machine-start-signal-handling branch 3 times, most recently from b8fa1cf to 025dba4 Compare June 23, 2026 22:03
@l0rd l0rd marked this pull request as ready for review June 24, 2026 06:48
@l0rd

l0rd commented Jun 24, 2026

Contributor Author

All tests are green. The PR is ready for review, PTAL @podman-container-tools/podman-maintainers @podman-container-tools/podman-reviewers

Comment thread pkg/machine/apple/apple.go Outdated
Comment thread pkg/machine/apple/apple.go Outdated
Comment thread pkg/machine/apple/apple.go
Comment thread pkg/machine/cleanup.go Outdated
Comment thread pkg/machine/shim/host.go Outdated
// a termination signal is sent
callbackFuncs := machine.CleanUp()
defer callbackFuncs.CleanIfErr(&err)
go callbackFuncs.CleanOnSignal()

Contributor

Is this goroutine stopped on a successful start machine?

Contributor Author

I have added a comment because I think you are right; at least we should make it easier to understand for someone reading the code. Anyway, the goroutines are forced to terminate when the main goroutine terminates. This is not ideal, but it's how these cleanup callbackFuncs are currently used (see Init and Start in the current main branch). Considering that podman machine start is a short-lived command, this is not a big problem. And this PR is already big enough.
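
One way to make the lifetime of such a watcher goroutine explicit is to hand the caller a stop function. The following is a minimal Go sketch, not the PR's code: `watchSignals` and its single `cleanup` callback are hypothetical names used for illustration, and podman's `CleanOnSignal` may behave differently.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"sync"
	"syscall"
)

// watchSignals runs cleanup if a termination signal arrives before stop is
// called. The returned stop function unregisters the handler and waits for
// the watcher goroutine to exit, so nothing outlives the caller.
// Hypothetical helper, not part of the podman code base.
func watchSignals(cleanup func()) (stop func()) {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, os.Interrupt, syscall.SIGTERM)

	done := make(chan struct{})
	exited := make(chan struct{})
	go func() {
		defer close(exited)
		select {
		case <-sigs:
			cleanup()
		case <-done:
			// Start succeeded: leave without running cleanup.
		}
	}()

	var once sync.Once
	return func() {
		once.Do(func() {
			signal.Stop(sigs) // stop delivering signals to sigs
			close(done)       // wake the watcher if it is still waiting
			<-exited          // wait until the goroutine has returned
		})
	}
}

func main() {
	cleaned := false
	stop := watchSignals(func() { cleaned = true })
	// ... start the machine here ...
	stop() // successful start: the watcher goroutine exits here
	fmt.Println("cleanup ran:", cleaned)
}
```

Calling `stop` more than once is harmless because of the `sync.Once`, and if a signal is being handled when `stop` is called, `stop` blocks until the cleanup has finished.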

@l0rd l0rd force-pushed the fix-machine-start-signal-handling branch from 025dba4 to 8b93ec5 Compare June 24, 2026 11:06

@Luap99 Luap99 left a comment

Member

(Not a full review: GitHub got bugged and I cannot add more comments.)

Comment thread pkg/machine/apple/apple.go Outdated
Comment on lines +203 to +211
// Start a goroutine that will eventually propagate a
// terminate signal to the VM process.
// If the user decides to abort `podman machine start`
// while the VM is starting, we want the VM to be stopped
// too. Or the machine will be left in an inconsistent
// state.
term := make(chan os.Signal, 1)
signal.Notify(term, os.Interrupt, syscall.SIGTERM)
go func() {

Member

Setting up yet another signal handler seems to complicate things instead of simplifying them. We already have one set up in the main start path, so we should reuse that: return a proper cleanup function that kills the right processes, and then trigger it there.

The more listeners we have, the more racy everything gets.

Contributor Author

I agree. This is the big problem with this approach. But returning a proper cleanup function would require changing the interface and implementations across all our providers. Something I wanted to avoid or, at least, do gradually.

Member

Well, multiple listeners are racy. There is no guarantee that this one will be called at all before the other listener does the os.Exit(1), so returning the cleanup function is the only sane way to keep this correct.

Contributor Author

They are racy. You don't need to convince me on that. There are edge cases where the VM doesn't get killed. But that's still better than never killing the VM on a signal. This is a design problem for all providers (note that we spawn 2 more listeners for hyperv: 1 and 2) and I wanted to limit the scope of the PR.

But that's how it goes. You fix one thing, and then you get asked to fix more :-) I can give it a try with a new cleanup mechanism implemented on macOS.
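
The "return a cleanup function" idea discussed in this thread can be sketched roughly as follows. This is an illustration under stated assumptions, not the design that was adopted: the `step` type, `runSteps`, and `errInterrupted` are hypothetical names, and the sketch only checks for an interruption between steps rather than during a long-running one.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// step is a hypothetical start step. Instead of installing its own signal
// handler, it returns a function that undoes whatever it started.
type step func() (cleanup func() error, err error)

var errInterrupted = errors.New("start interrupted by signal")

// runSteps executes steps in order. When a step fails, or when interrupted is
// closed before the next step begins, completed steps are undone in reverse.
func runSteps(steps []step, interrupted <-chan struct{}) error {
	var cleanups []func() error
	rollback := func() {
		for i := len(cleanups) - 1; i >= 0; i-- {
			if err := cleanups[i](); err != nil {
				fmt.Fprintln(os.Stderr, "cleanup failed:", err)
			}
		}
	}
	for _, s := range steps {
		select {
		case <-interrupted:
			rollback()
			return errInterrupted
		default:
		}
		cleanup, err := s()
		if err != nil {
			rollback()
			return err
		}
		cleanups = append(cleanups, cleanup)
	}
	return nil
}

func main() {
	// A single listener for the whole command.
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()

	startProxy := func() (func() error, error) {
		fmt.Println("proxy started")
		return func() error { fmt.Println("proxy stopped"); return nil }, nil
	}
	startVM := func() (func() error, error) {
		fmt.Println("vm started")
		return func() error { fmt.Println("vm stopped"); return nil }, nil
	}
	if err := runSteps([]step{startProxy, startVM}, ctx.Done()); err != nil {
		fmt.Println("start failed:", err)
	}
}
```

With one listener and one place that runs the rollback, there is no second handler that could call `os.Exit` before the VM process has been stopped.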

Comment thread pkg/machine/apple/apple.go Outdated
Comment thread pkg/machine/cleanup.go Outdated
Comment on lines +65 to +76
// To enforce that, the for loop is wrapped with WaitGroup.Go(), and
// execution is blocked until all goroutines have completed with
// WaitGroup.Wait().
c.wait.Go(func() {
// Cleanup functions invoked in reverse registration order
for _, cleanfunc := range slices.Backward(funcs) {
if err := cleanfunc(); err != nil {
logrus.Error(err)
}
}
}
})
c.wait.Wait()

Member

I am super confused here; I do not see how this really fixes anything. The WaitGroup basically just acts as an extra mutex, but there is no need to use a WaitGroup at all.

If I see the problem right, simply move the c.mu.Unlock() to the end of the function; there should be no risk in keeping this locked for longer.

Contributor Author

Moving the c.mu.Unlock to the end of the function should work too, yes. But you don't see how this fixes anything?
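
The mutex-only variant suggested above could look roughly like this. It is a sketch with hypothetical names (`cleanupFuncs`, `Add`, `clean`), not the code in pkg/machine/cleanup.go, and it assumes that no cleanup function calls back into `Add` or `clean`, which would deadlock.

```go
package main

import (
	"fmt"
	"sync"
)

// cleanupFuncs is a hypothetical stand-in for the type discussed in the PR.
type cleanupFuncs struct {
	mu    sync.Mutex
	funcs []func() error
}

// Add registers a cleanup function.
func (c *cleanupFuncs) Add(f func() error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.funcs = append(c.funcs, f)
}

// clean runs the registered functions once, newest first. The mutex stays
// locked until the last function returns, so a concurrent caller blocks
// until cleanup is complete and then finds nothing left to do.
func (c *cleanupFuncs) clean() {
	c.mu.Lock()
	defer c.mu.Unlock()
	funcs := c.funcs
	c.funcs = nil
	for i := len(funcs) - 1; i >= 0; i-- {
		if err := funcs[i](); err != nil {
			fmt.Println("cleanup error:", err)
		}
	}
}

func main() {
	var c cleanupFuncs
	c.Add(func() error { fmt.Println("release lock"); return nil })
	c.Add(func() error { fmt.Println("stop VM"); return nil })

	// Two callers race, as an error path and a signal path might.
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.clean()
		}()
	}
	wg.Wait()
}
```

Whichever caller takes the lock first runs every function, and the other returns only after that work is done, which is the property the second problem in the PR description is about.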

Comment thread pkg/machine/shim/networking.go Outdated
Comment on lines +90 to +101
// Start a goroutine that waits until the gvproxy process completes.
// Gvproxy should not exit during the machine startup but, if it does
// because of an error or a TERM signal, we want to release the
// associated resources and reap the process.
// Otherwise the gvproxy completed process will result as a zombie and,
// most importantly, machine.backoffForProcess fails because
// Process.IsRunning() returns true
go func() {
	if err := c.Wait(); err != nil {
		logrus.Debugf("gvproxy exited: %v", err)
	}
}()

Member

That leaks the goroutine forever, which is not right.

podman machine start is a short-lived process, so that does not matter too much, but the same can be said for the zombie state: if the parent exits, gvproxy gets init as its parent, which must reap it.

I think doing the wait on the cleanup code path, when we know we killed the process, is better.

Contributor Author

Sure, I can move this goroutine there.

Comment on lines +67 to +77
if err = p.Terminate(); err != nil {
	if errors.Is(err, syscall.ESRCH) {
		logrus.Debugf("Gvproxy already dead, exiting cleanly")
		return nil
	}
	return err
}

if err = backoffForProcess(p); err == nil {
	return nil
}

Member

Do we need this? Is there any reason why a plain SIGKILL is not enough? I don't think gvproxy has any state, so I doubt it matters, and I would rather save the few lines of code.

Contributor Author

Trying with a sigterm and, if it fails, sending a sigkill seems cleaner. Especially because we are doing this for gvproxy, but we could (should?) reuse the cleanup code for other processes that are started as well. Also, the comment on this function already describes this behavior (the code was updated, but not the comment?), so I was trying to clean things up.
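
The "SIGTERM first, SIGKILL as a fallback" pattern described here can be sketched in plain Go as follows. `stopProcess`, its parameters, and the `sleep 30` child are assumptions made for the example; the PR itself uses `p.Terminate()` and `backoffForProcess`, whose implementations are not shown in this conversation.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
	"syscall"
	"time"
)

// stopProcess asks a process to exit via terminate and escalates to kill if
// exited is not closed within grace. exited must be closed by whoever waits
// on the process. Hypothetical helper for illustration only.
func stopProcess(terminate, kill func() error, exited <-chan struct{}, grace time.Duration) error {
	if err := terminate(); err != nil {
		if errors.Is(err, os.ErrProcessDone) || errors.Is(err, syscall.ESRCH) {
			return nil // the process is already gone
		}
		return err
	}

	select {
	case <-exited:
		return nil // graceful exit within the grace period
	case <-time.After(grace):
	}

	if err := kill(); err != nil && !errors.Is(err, os.ErrProcessDone) {
		return err
	}
	<-exited // wait until the process has actually been reaped
	return nil
}

func main() {
	cmd := exec.Command("sleep", "30") // stand-in for a helper such as gvproxy
	if err := cmd.Start(); err != nil {
		fmt.Println("could not start child:", err)
		return
	}

	exited := make(chan struct{})
	go func() {
		_ = cmd.Wait() // reaps the child so it does not linger as a zombie
		close(exited)
	}()

	terminate := func() error { return cmd.Process.Signal(syscall.SIGTERM) }
	err := stopProcess(terminate, cmd.Process.Kill, exited, 2*time.Second)
	fmt.Println("stopped, err =", err)
}
```

Passing the two actions as functions keeps the helper reusable for other child processes, which is the reuse argument made in the reply above.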

Comment thread pkg/machine/shim/host.go
Comment on lines +607 to +610
// callbackFuncs are invoked when errors occur or term signals
// are received. Thus we need to defer startingFalse for when
// Start() completes successfully
defer func() { _ = startingFalse() }()

Member

Well, that always runs, even if the function returns an error.

So would it not be the right thing to just move the

mc.Starting = false
mc.Write()

to the end of the function instead of using defer at all?

Contributor Author

Yes. If an error occurs, those will be called twice. That's why I have added the sync.Once. That happens 4 times in the function Start(): mcunlock, startLockUnlock, startingFalse, releaseCmd. Moving it to the end is ok, but when doing a change or adding a new operation, we may easily forget to update one of the two.

The root of the problem is duplication. And to avoid it, we should change how CallbackFuncs.Add works (a new parameter to specify when it should be called: signal, error, success). Not sure if this should be part of the PR, though.
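
The sync.Once guard mentioned in this reply can be illustrated with a small sketch. The function `start` and its `unlock` parameter are hypothetical; the second deferred call only stands in for what `callbackFuncs.CleanIfErr(&err)` does on the error path.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// start simulates a function whose unlock step is reachable from two places:
// a plain defer and an error-only cleanup path.
func start(fail bool, unlock func()) (err error) {
	var once sync.Once
	unlockOnce := func() { once.Do(unlock) }

	// Runs on every return, including the successful one.
	defer unlockOnce()
	// Stand-in for the error-only cleanup: runs only when err is set.
	defer func() {
		if err != nil {
			unlockOnce()
		}
	}()

	if fail {
		return errors.New("start failed")
	}
	return nil
}

func main() {
	var mu sync.Mutex

	mu.Lock()
	err := start(true, mu.Unlock) // both paths fire, unlock still runs once
	fmt.Println("error path:", err)

	mu.Lock()
	err = start(false, mu.Unlock)
	fmt.Println("success path:", err)
}
```

Without the guard, the error path would call `mu.Unlock` twice, and unlocking an already unlocked `sync.Mutex` is a fatal runtime error in Go.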

@l0rd l0rd force-pushed the fix-machine-start-signal-handling branch from 8b93ec5 to 517528b Compare June 24, 2026 15:43
l0rd added 5 commits June 24, 2026 18:52
Related to podman-container-tools#28318

Signed-off-by: Mario Loriedo <mario.loriedo@gmail.com>
Signed-off-by: Mario Loriedo <mario.loriedo@gmail.com>
Signed-off-by: Mario Loriedo <mario.loriedo@gmail.com>
Signed-off-by: Mario Loriedo <mario.loriedo@gmail.com>
Signed-off-by: Mario Loriedo <mario.loriedo@gmail.com>
@l0rd l0rd force-pushed the fix-machine-start-signal-handling branch from 517528b to 12ea195 Compare June 24, 2026 16:59
Comment thread pkg/machine/shim/host.go
// Otherwise, when errors occur, it's called twice and panics:
// - defer mcunlock()
// - defer callbackFuncs.CleanIfErr()
var mcUnlockOnce sync.Once

Contributor

This feels really gross, but I don't have an immediate suggestion for how to avoid it

@mheon

mheon commented Jun 24, 2026

Contributor

I don't like the locking here, but I don't see an easy way to resolve it without some sort of trylock semantics, which I don't think we have on these locks.

LGTM on the whole

@Honny1 Honny1 left a comment

Contributor

LGTM

@inknos

inknos commented Jun 25, 2026

Contributor

Went through it and I agree with the comments from @mheon and @Luap99.

I would only ask to make sure that the comment about sigkill is addressed. I didn't quite understand the final solution :)

Besides this doubt, the PR LGTM.


5 participants