minikube stop very slow with vfkit driver #20503

Closed
nirs opened this issue Mar 9, 2025 · 1 comment · Fixed by #20504
Labels
co/vfkit VFkit related issues kind/bug Categorizes issue or PR as related to a bug. os/macos

Comments

@nirs
Contributor

nirs commented Mar 9, 2025

What Happened?

Stopping a minikube cluster with the vfkit driver is very slow due to an unreliable implementation, incorrect error handling, and pointless backup attempts.

To reproduce

  1. Configure minikube
minikube config set vm-driver vfkit
minikube config set container-runtime containerd
  2. Start the cluster
minikube start
  3. Stop the cluster
minikube stop --v=4 --alsologtostderr 2>stop.log

Timeline

  1. Stopping the node started
I0309 15:00:42.297876   13723 out.go:177] * Stopping node "minikube"  ...
  2. Performing the first backup
I0309 15:00:42.307878   13723 machine.go:156] backing up vm config to /var/lib/minikube/backup: [/etc/cni /etc/kubernetes]
I0309 15:00:42.307953   13723 ssh_runner.go:195] Run: sudo mkdir -p /var/lib/minikube/backup
I0309 15:00:42.307982   13723 sshutil.go:53] new ssh client: &{IP:192.168.106.7 Port:22 SSHKeyPath:/Users/nir/.minikube/machines/minikube/id_rsa Username:docker}
I0309 15:00:42.339888   13723 ssh_runner.go:195] Run: sudo rsync --archive --relative /etc/cni /var/lib/minikube/backup
I0309 15:00:42.386662   13723 ssh_runner.go:195] Run: sudo rsync --archive --relative /etc/kubernetes /var/lib/minikube/backup
  3. Stopping the driver
I0309 15:00:42.433319   13723 main.go:141] libmachine: Stopping "minikube"...
I0309 15:00:42.433868   13723 main.go:141] libmachine: get state: {State:VirtualMachineStateRunning}

We got the state of the vfkit process via the HTTP API.

  4. Trying to set the state to HardStop via the HTTP API
I0309 15:00:42.452986   13723 stop.go:66] stop err: Post "http://_/vm/state": EOF
W0309 15:00:42.453070   13723 stop.go:165] stop host returned error: Temporary Error: stop: Post "http://_/vm/state": EOF

This always fails with EOF, probably because vfkit exits before returning the response.

  5. Starting the retry
I0309 15:00:42.453102   13723 retry.go:31] will retry after 1.148164137s: Temporary Error: stop: Post "http://_/vm/state": EOF
  6. Stopping the host again
I0309 15:00:43.601796   13723 stop.go:39] StopHost: minikube
I0309 15:00:43.608726   13723 out.go:177] * Stopping node "minikube"  ...
I0309 15:00:43.616885   13723 machine.go:156] backing up vm config to /var/lib/minikube/backup: [/etc/cni /etc/kubernetes]
I0309 15:00:43.617271   13723 ssh_runner.go:195] Run: sudo mkdir -p /var/lib/minikube/backup
I0309 15:00:43.617329   13723 sshutil.go:53] new ssh client: &{IP:192.168.106.7 Port:22 SSHKeyPath:/Users/nir/.minikube/machines/minikube/id_rsa Username:docker}
W0309 15:01:58.617799   13723 sshutil.go:64] dial failure (will retry): dial tcp 192.168.106.7:22: connect: operation timed out
W0309 15:01:58.618132   13723 stop.go:55] failed to complete vm config backup (will continue): create dir: NewSession: new client: new client: dial tcp 192.168.106.7:22: connect: operation timed out

The retry is at the wrong level, or maybe the backup is at the wrong level. We just backed up this host, and there is no reason to repeat the backup.

The backup attempt times out after 135 seconds, since the vfkit process is not running at this point.

  7. Stopping the driver again
I0309 15:01:58.618315   13723 main.go:141] libmachine: Stopping "minikube"...
I0309 15:01:58.619098   13723 stop.go:66] stop err: Machine "minikube" is already stopped.
I0309 15:01:58.619145   13723 stop.go:69] host is already stopped

The stop attempt fails because the host is already stopped. It should succeed without bogus errors.

Issues

  • Using HardStop instead of Stop, which does not do a graceful shutdown of the host
  • The HardStop request always fails, and we don't handle the error properly
  • The retry is at the wrong level, trying to back up again after a successful backup
  • The backup times out because the host is not running and the previous error was not handled properly

Attach the log file

I tried several times and the behavior is the same.

stop.log

Operating System

macOS (Default)

Driver

N/A

nirs added a commit to nirs/minikube that referenced this issue Mar 9, 2025
GetState() had several issues:

- When accessing the vfkit HTTP API, we handled only "running",
  "VirtualMachineStateRunning", "stopped", and
  "VirtualMachineStateStopped", but there are 10 other possible states,
  which we handled as state.None even though vfkit is running and needs
  to be stopped. This can lead to wrong handling in the caller.

- When handling "stopped" and "VirtualMachineStateStopped" we returned
  state.Stopped, but did not remove the pidfile. This can cause
  termination of an unrelated process, or reporting the wrong status
  when the pid is reused.

- Accessing the HTTP API will fail after we stop or kill vfkit. This
  causes GetState() to fail when the process is actually stopped, which
  can lead to unnecessary retries and long delays (kubernetes#20503).

- When returning state.None during Remove(), we tried to do a graceful
  shutdown, which does not make sense in the minikube delete flow and
  is not consistent with the state.Running handling.

Accessing the vfkit API to check the state does not add much value for
our use case, checking whether the vfkit process is running, and it is
not reliable.

Fix all the issues by not using the HTTP API in GetState(), using only
the process state. We still use the API for stopping and killing vfkit
to do a graceful shutdown. This also simplifies Remove(), since we need
to handle only the state.Running state.

With this change we consider vfkit as stopped only when the process does
not exist, which takes about 3 seconds after the state is reported as
"stopped".

Example stop flow:

    I0309 18:15:40.260249   18857 main.go:141] libmachine: Stopping "minikube"...
    I0309 18:15:40.263225   18857 main.go:141] libmachine: set state: {State:Stop}
    I0309 18:15:46.266902   18857 main.go:141] libmachine: Machine "minikube" was stopped.
    I0309 18:15:46.267122   18857 stop.go:75] duration metric: took 6.127761459s to stop

Example delete flow:

    I0309 17:00:49.483078   18127 out.go:177] * Deleting "minikube" in vfkit ...
    I0309 17:00:49.499252   18127 main.go:141] libmachine: set state: {State:HardStop}
    I0309 17:00:49.569938   18127 lock.go:35] WriteFile acquiring /Users/nir/.kube/config: ...
    I0309 17:00:49.573977   18127 out.go:177] * Removed all traces of the "minikube" cluster.
nirs added a commit to nirs/minikube that referenced this issue Mar 9, 2025
@afbjorklund afbjorklund added the co/vfkit VFkit related issues label Mar 10, 2025
@afbjorklund
Collaborator

afbjorklund commented Mar 10, 2025

Thanks for fixing, the current state handling was more of a quick-and-dirty when "porting" from qemu to vfkit...

The retry at the wrong level is a minikube "feature", though. So I claim no responsibility for that particular part.


@afbjorklund afbjorklund added kind/bug Categorizes issue or PR as related to a bug. os/macos labels Mar 10, 2025
nirs added a commit to nirs/minikube that referenced this issue Mar 10, 2025