-
Notifications
You must be signed in to change notification settings - Fork 5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
minikube stop very slow with vfkit driver #20503
Comments
nirs
added a commit
to nirs/minikube
that referenced
this issue
Mar 9, 2025
GetState() had several issues: - When accessing vfkit HTTP API, we handled only "running", "VirtualMachineStateRunning", "stopped", and "VirtualMachineStateStopped", but there are other 10 possible states, which we handled as state.None, when vfkit is running and need to be stopped. This can lead to wrong handling in the caller. - When handling "stopped" and "VirtualMachineStateStopped" we returned state.Stopped, but did not remove the pidfile. This can cause termination of unrelated process or reporting wrong status when the pid is reused. - Accessing the HTTP API will fail after we stop or kill it. This cause GetState() to fail when the process is actually stopped, which can lead to unnecessary retries and long delays (kubernetes#20503). - When retuning state.None during Remove(), we use tried to do graceful shutdown which does not make sense in minikube delete flow, and is not consistent with state.Running handling. Accessing vfkit API to check for state does not add much value for our used case, checking if the vfkit process is running, and it is not reliable. Fix all the issues by not using the HTTP API in GetState(), and use only the process state. We still use the API for stopping and killing vfkit to do graceful shutdown. This also simplifies Remove(), since we need to handle only the state.Running state. With this change we consider vfkit as stopped only when the process does not exist, which takes about 3 seconds after the state is reported as "stopped". Example stop flow: I0309 18:15:40.260249 18857 main.go:141] libmachine: Stopping "minikube"... I0309 18:15:40.263225 18857 main.go:141] libmachine: set state: {State:Stop} I0309 18:15:46.266902 18857 main.go:141] libmachine: Machine "minikube" was stopped. I0309 18:15:46.267122 18857 stop.go:75] duration metric: took 6.127761459s to stop Example delete flow: I0309 17:00:49.483078 18127 out.go:177] * Deleting "minikube" in vfkit ... I0309 17:00:49.499252 18127 main.go:141] libmachine: set state: {State:HardStop} I0309 17:00:49.569938 18127 lock.go:35] WriteFile acquiring /Users/nir/.kube/config: ... I0309 17:00:49.573977 18127 out.go:177] * Removed all traces of the "minikube" cluster.
nirs
added a commit
to nirs/minikube
that referenced
this issue
Mar 9, 2025
GetState() had several issues: - When accessing vfkit HTTP API, we handled only "running", "VirtualMachineStateRunning", "stopped", and "VirtualMachineStateStopped", but there are other 10 possible states, which we handled as state.None, when vfkit is running and need to be stopped. This can lead to wrong handling in the caller. - When handling "stopped" and "VirtualMachineStateStopped" we returned state.Stopped, but did not remove the pidfile. This can cause termination of unrelated process or reporting wrong status when the pid is reused. - Accessing the HTTP API will fail after we stop or kill it. This cause GetState() to fail when the process is actually stopped, which can lead to unnecessary retries and long delays (kubernetes#20503). - When retuning state.None during Remove(), we use tried to do graceful shutdown which does not make sense in minikube delete flow, and is not consistent with state.Running handling. Accessing vfkit API to check for state does not add much value for our used case, checking if the vfkit process is running, and it is not reliable. Fix all the issues by not using the HTTP API in GetState(), and use only the process state. We still use the API for stopping and killing vfkit to do graceful shutdown. This also simplifies Remove(), since we need to handle only the state.Running state. With this change we consider vfkit as stopped only when the process does not exist, which takes about 3 seconds after the state is reported as "stopped". Example stop flow: I0309 18:15:40.260249 18857 main.go:141] libmachine: Stopping "minikube"... I0309 18:15:40.263225 18857 main.go:141] libmachine: set state: {State:Stop} I0309 18:15:46.266902 18857 main.go:141] libmachine: Machine "minikube" was stopped. I0309 18:15:46.267122 18857 stop.go:75] duration metric: took 6.127761459s to stop Example delete flow: I0309 17:00:49.483078 18127 out.go:177] * Deleting "minikube" in vfkit ... I0309 17:00:49.499252 18127 main.go:141] libmachine: set state: {State:HardStop} I0309 17:00:49.569938 18127 lock.go:35] WriteFile acquiring /Users/nir/.kube/config: ... I0309 17:00:49.573977 18127 out.go:177] * Removed all traces of the "minikube" cluster.
Thanks for fixing, the current state handling was more of a quick-and-dirty when "porting" from qemu to vfkit... The retry at the wrong level is a minikube "feature", though. So I claim no responsibility for that particular part. |
nirs
added a commit
to nirs/minikube
that referenced
this issue
Mar 10, 2025
GetState() had several issues: - When accessing vfkit HTTP API, we handled only "running", "VirtualMachineStateRunning", "stopped", and "VirtualMachineStateStopped", but there are other 10 possible states, which we handled as state.None, when vfkit is running and need to be stopped. This can lead to wrong handling in the caller. - When handling "stopped" and "VirtualMachineStateStopped" we returned state.Stopped, but did not remove the pidfile. This can cause termination of unrelated process or reporting wrong status when the pid is reused. - Accessing the HTTP API will fail after we stop or kill it. This cause GetState() to fail when the process is actually stopped, which can lead to unnecessary retries and long delays (kubernetes#20503). - When retuning state.None during Remove(), we use tried to do graceful shutdown which does not make sense in minikube delete flow, and is not consistent with state.Running handling. Accessing vfkit API to check for state does not add much value for our used case, checking if the vfkit process is running, and it is not reliable. Fix all the issues by not using the HTTP API in GetState(), and use only the process state. We still use the API for stopping and killing vfkit to do graceful shutdown. This also simplifies Remove(), since we need to handle only the state.Running state. With this change we consider vfkit as stopped only when the process does not exist, which takes about 3 seconds after the state is reported as "stopped". Example stop flow: I0309 18:15:40.260249 18857 main.go:141] libmachine: Stopping "minikube"... I0309 18:15:40.263225 18857 main.go:141] libmachine: set state: {State:Stop} I0309 18:15:46.266902 18857 main.go:141] libmachine: Machine "minikube" was stopped. I0309 18:15:46.267122 18857 stop.go:75] duration metric: took 6.127761459s to stop Example delete flow: I0309 17:00:49.483078 18127 out.go:177] * Deleting "minikube" in vfkit ... I0309 17:00:49.499252 18127 main.go:141] libmachine: set state: {State:HardStop} I0309 17:00:49.569938 18127 lock.go:35] WriteFile acquiring /Users/nir/.kube/config: ... I0309 17:00:49.573977 18127 out.go:177] * Removed all traces of the "minikube" cluster.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
What Happened?
Stopping minikube driver is very slow due to unreliable implementation, incorrect error handling, and pointless backup attempts.
To reproduce
minikube start
minikube stop --v=4 --alsologtostderr 2>stop.log
Timeline
I0309 15:00:42.297876 13723 out.go:177] * Stopping node "minikube" ...
We got vfkit process via the HTTP API.
HardStop
via HTTP APIThis always fails with EOF - maybe because vfkit is exiting immediately before returning the response.
I0309 15:00:42.453102 13723 retry.go:31] will retry after 1.148164137s: Temporary Error: stop: Post "http://_/vm/state": EOF
The retry attempt is in the wrong level, or maybe the backup is in the wrong level. We just backed up this host, and there is no reason to repeat the backup.
The backup attempt times out after 135 seconds, since the vfkit process is not running at this point.
The stop attempt fails because the host is already stopped. It should succeed without bogus errors.
Issues
HardStop
instead ofStop
- does not do graceful shutdown of the hostHardStop
always fails, and we don't handle the error properlyAttach the log file
I tried several times and the behavior is same.
stop.log
Operating System
macOS (Default)
Driver
N/A
The text was updated successfully, but these errors were encountered: