Starting EKS Anywhere on VMware, etcd fails to start #9047

tcooperma opened this issue Dec 6, 2024 · 26 comments


What happened:

tmcooper@ubuntu-server-2404:~$ eksctl anywhere create cluster -f eksa-w01-cluster.yaml
Warning: VSphereDatacenterConfig configured in insecure mode
Performing setup and validations
Warning: VSphereDatacenterConfig configured in insecure mode
✅ Connected to server
✅ Authenticated to vSphere
✅ Datacenter validated
✅ Network validated
✅ Datastore validated
✅ Folder validated
✅ Resource pool validated
✅ Datastore validated
✅ Folder validated
✅ Resource pool validated
✅ Datastore validated
✅ Folder validated
✅ Resource pool validated
✅ Machine config tags validated
✅ Control plane and Workload templates validated
[email protected] user vSphere privileges validated
✅ Vsphere Provider setup is valid
✅ Validate OS is compatible with registry mirror configuration
✅ Validate certificate for registry mirror
✅ Validate authentication for git provider
✅ Validate cluster's eksaVersion matches EKS-A version
✅ Validate cluster's kubelet configuration for Bottlerocket OS
✅ Validate cluster's worker node kubelet configuration for Bottlerocket OS
Creating new bootstrap cluster
Provider specific pre-capi-install-setup on bootstrap cluster
Installing cluster-api providers on bootstrap cluster
Provider specific post-setup
Installing EKS-A custom components on bootstrap cluster
Installing EKS-D components
Installing EKS-A custom components (CRD and controller)
Creating new management cluster
(gets hung)

From the log where it is repeating the problem:
{"T":1733494047021445202,"M":"Sleeping before next retry","time":"1s"}
{"T":1733494048021896267,"M":"Executing command","cmd":"/usr/bin/docker exec -i eksa_1733493551451751668 kubectl get --ignore-not-found -o json --kubeconfig kn01/generated/kn01.kind.kubeconfig --namespace default kn01"}
{"T":1733494048518935993,"M":"Cluster generation and observedGeneration","Generation":1,"ObservedGeneration":1}
{"T":1733494048519028611,"M":"Error happened during retry","error":"cluster condition ControlPlaneReady is False: Etcd is not ready","retries":59}
{"T":1733494048519061258,"M":"Sleeping before next retry","time":"1s"}
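The retry loop above is easier to read once the repeating lines are filtered down to the ones that carry an `error` field. A minimal Python sketch, assuming the log is one JSON object per line and that `T` is nanoseconds since the Unix epoch (an assumption based on its magnitude); the function name is made up for illustration:

```python
import json
from datetime import datetime, timezone


def summarize_retry_errors(lines):
    """Return (UTC timestamp, retries, error) for every log line with an "error" field."""
    results = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if "error" not in entry:
            continue
        # Assumption: "T" is nanoseconds since the Unix epoch.
        seconds = entry["T"] // 1_000_000_000
        stamp = datetime.fromtimestamp(seconds, tz=timezone.utc)
        results.append((stamp.isoformat(), entry.get("retries"), entry["error"]))
    return results


if __name__ == "__main__":
    import sys

    # Usage: python summarize.py < path/to/eksa-cli.log (the log path is up to you)
    for stamp, retries, error in summarize_retry_errors(sys.stdin):
        print(stamp, retries, error)
```

Only the last distinct error usually matters here; the earlier entries just show how long the CLI has been waiting.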

Why is etcd not becoming ready and does it have some log?

The file with the YAML configuration is attached.

What you expected to happen:

My EKS Kubernetes to be set up

How to reproduce it (as minimally and precisely as possible):

Set up EKS Anywhere and Docker,
set the environment passwords for VMware
export EKSA_VSPHERE_USERNAME=[email protected]
eksctl anywhere create cluster -f eksa-w01-cluster.yaml

Anything else we need to know?:


  • Ubuntu 24.04.1 LTS (GNU/Linux 6.8.0-49-generic x86_64)
  • eksctl 0.197.0
  • VMware:
    Client version: 2.14.0
    Client build number: 21993070
    ESXi version: 8.0.2
    ESXi build number: 22380479
  • vCenter 8
  • EKS Anywhere Release: 0.197.0
  • EKS Distro Release:
  • kubectl version
    Client Version: v1.31.3
    Kustomize Version: v5.4.2

/usr/bin/docker exec -i eksa_1733493551451751668 kubectl get --ignore-not-found -o json --kubeconfig kn01/generated/kn01.kind.kubeconfig --namespace default kn01

"apiVersion": "",
"kind": "Cluster",
"metadata": {
"annotations": {
"": "true",
"": "v0.21.1"
"creationTimestamp": "2024-12-06T14:04:03Z",
"finalizers": [
"generation": 1,
"name": "kn01",
"namespace": "default",
"resourceVersion": "11330",
"uid": "7d1ae9a9-48f3-4312-b2ed-09c01e61ee67"
"spec": {
"clusterNetwork": {
"cniConfig": {
"cilium": {}
"dns": {},
"pods": {
"cidrBlocks": [
"services": {
"cidrBlocks": [
"controlPlaneConfiguration": {
"count": 2,
"endpoint": {
"host": ""
"machineGroupRef": {
"kind": "VSphereMachineConfig",
"name": "kn01-cp"
"machineHealthCheck": {
"maxUnhealthy": "100%"
"datacenterRef": {
"kind": "VSphereDatacenterConfig",
"name": "kn01"
"eksaVersion": "v0.21.1",
"externalEtcdConfiguration": {
"count": 3,
"machineGroupRef": {
"kind": "VSphereMachineConfig",
"name": "kn01-etcd"
"kubernetesVersion": "1.31",
"machineHealthCheck": {
"maxUnhealthy": "100%",
"nodeStartupTimeout": "10m0s",
"unhealthyMachineTimeout": "5m0s"
"managementCluster": {
"name": "kn01"
"workerNodeGroupConfigurations": [
"count": 2,
"machineGroupRef": {
"kind": "VSphereMachineConfig",
"name": "kn01"
"machineHealthCheck": {
"maxUnhealthy": "40%"
"name": "md-0"
"status": {
"conditions": [
"lastTransitionTime": "2024-12-06T14:05:31Z",
"reason": "OutdatedInformation",
"severity": "Info",
"status": "False",
"type": "Ready"
"lastTransitionTime": "2024-12-06T14:05:31Z",
"reason": "OutdatedInformation",
"severity": "Info",
"status": "False",
"type": "ControlPlaneInitialized"
"lastTransitionTime": "2024-12-06T14:05:31Z",
"message": "Etcd is not ready",
"reason": "RollingUpgradeInProgress",
"severity": "Info",
"status": "False",
"type": "ControlPlaneReady"
"lastTransitionTime": "2024-12-06T14:04:03Z",
"reason": "ControlPlaneNotReady",
"severity": "Info",
"status": "False",
"type": "DefaultCNIConfigured"
"lastTransitionTime": "2024-12-06T14:04:03Z",
"reason": "ControlPlaneNotInitialized",
"severity": "Info",
"status": "False",
"type": "WorkersReady"
"failureMessage": "validating vCenter setup for VSphereMachineConfig kn01: resource pool '/Datacenter/host/' not found",
"failureReason": "MachineConfigInvalid",
"observedGeneration": 1

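The status block in that output already names a concrete failure (`failureMessage`), which is easy to miss among the conditions. A minimal Python sketch for pulling it out, assuming the well-formed JSON that `kubectl get ... -o json` prints (the copy pasted above lost its braces); the function name and the trimmed sample are illustrative only:

```python
import json


def explain_cluster_status(cluster):
    """List the failure message (if any) and every condition that is not "True"."""
    status = cluster.get("status", {})
    findings = []
    if status.get("failureMessage"):
        reason = status.get("failureReason", "Failure")
        findings.append(f"{reason}: {status['failureMessage']}")
    for cond in status.get("conditions", []):
        if cond.get("status") != "True":
            detail = cond.get("message") or cond.get("reason", "")
            findings.append(f"{cond.get('type')} is {cond.get('status')}: {detail}")
    return findings


# Trimmed copy of the status shown above, re-wrapped as valid JSON for the example.
SAMPLE = json.loads("""
{
  "status": {
    "conditions": [
      {"type": "ControlPlaneReady", "status": "False",
       "reason": "RollingUpgradeInProgress", "message": "Etcd is not ready"},
      {"type": "WorkersReady", "status": "False",
       "reason": "ControlPlaneNotInitialized"}
    ],
    "failureMessage": "validating vCenter setup for VSphereMachineConfig kn01: resource pool '/Datacenter/host/' not found",
    "failureReason": "MachineConfigInvalid"
  }
}
""")

if __name__ == "__main__":
    for finding in explain_cluster_status(SAMPLE):
        print(finding)
```

The failure message is listed first because it points at the machine config rather than at etcd itself.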

ahreehong commented Dec 6, 2024

Could you share your cluster config? And could you also validate that the resource pool /Datacenter/host/ exists in your vCenter?


I do not have a kubeconfig. I am relying on EKS Anywhere to write one. Maybe that was my first mistake? I am a noob with Kubernetes and was hoping that the config would be written for me. I will attempt to figure out how to write one.

That directory did not exist, but now I made it and I am trying the whole procedure again.


Pre-creating the directory did not work :-( I will try to figure out the kubeconfig.


@tcooperma - this seems like it could be similar to my issue in #9040.

Are any of the nodes spinning up in VMware - can you look at the console and see if it has any errors similar to:

"Container kubeadm-bootstrap exited with non-zero status"


I tried the following kubeconfig file, which did not work.

There are no errors on the spun-up VMware machine kn01-etcd-9b7dp.

I cannot figure out which IP address it is on so that I can log into it. I could attach a screenshot, but it does not show anything. I am going to try to figure that out.

> #Token ID: <token_id>
> #Username: <username@domain>
> apiVersion: v1
> clusters:
> - cluster:
> #    certificate-authority-data: <certificate>
>     insecure: true
>     server:
>   name: kn01
> contexts:
> - context:
>     cluster: kn01
> #    user: target
>   name: kn01
> current-context: target
> kind: Config
> preferences: {}
> users:
> - name: Administrator
>   user:
>     token: "<redacted>"


If you don't see any IPs in vSphere, it sounds like the VMs are not being assigned any. Do you have DHCP handing out IPs to the VMs? It is a requirement for EKS-A.


I just do not know how to get the IP addresses, which I assume you get from eksctl or kubectl.
I have a DHCP server on my network.


If the node is up, its IP should be shown in the Virtual Machine Details in vSphere. If there is no IP there, the VM is not being assigned one, and that may be the reason it is failing.


I am a noob with VMware also. I am fairly sure it's getting an IP address; I just cannot figure out what it is.


I figured that out and I am on the machine which has an IP number


In vCenter, if you click on the etcd host, there should be a details page which has the IP address in it. If the node doesn't have an IP, then that is likely the reason why it is failing to join.

If it is just the one host this would make sense as the other hosts would need that IP to join the quorum.


What can I look at on the Bottlerocket OS for etcd? Is there a log file somewhere?

BTW, thanks for all the help


You will need to follow the steps for logging in using the SSH key, and then you should be able to access the logs. It is detailed in "Check VM Logs".


I followed the page the best I could.
I get no logs from using the kubectl line:

mcooper@ubuntu-server-2404:~$ kubectl -n etcdadm-bootstrap-provider-system logs etcdadm-bootstrap-provider-controller-8hc97 --kubeconfig kubeconfig.yaml 
Please enter Username: Administrator 
Please enter Password: E1209 17:43:35.433035  108342 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://vmcenter.local:8500/api?timeout=32s\": dial tcp: lookup vmcenter.local on server misbehaving"
E1209 17:43:35.499930  108342 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://vmcenter.local:8500/api?timeout=32s\": dial tcp: lookup vmcenter.local on server misbehaving"
E1209 17:43:35.648916  108342 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://vmcenter.local:8500/api?timeout=32s\": dial tcp: lookup vmcenter.local on server misbehaving"
E1209 17:43:35.730851  108342 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://vmcenter.local:8500/api?timeout=32s\": dial tcp: lookup vmcenter.local on server misbehaving"
Unable to connect to the server: dial tcp: lookup vmcenter.local on server misbehaving
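Every error above is a DNS lookup failure for the host named in the kubeconfig's `server` field, so kubectl never reaches the point of authenticating. A minimal Python sketch for checking name resolution from the admin machine; the function name is made up, and `vmcenter.local` is simply the host taken from the log:

```python
import socket


def resolve_host(host):
    """Return the sorted list of addresses the local resolver gives for host, or [] on failure."""
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return []  # the resolver could not answer for this name
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    host = "vmcenter.local"  # host name copied from the kubectl error above
    addresses = resolve_host(host)
    if addresses:
        print(f"{host} resolves to {', '.join(addresses)}")
    else:
        print(f"{host} does not resolve from this machine")
```

If the name does not resolve, no kubectl command against that kubeconfig can work, whatever state the cluster is in.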

The Bottlerocket etcd machine has no usable logs:

[ec2-user@admin]$ ls -l /var/log
total 296
-rw-------. 1 root utmp      0 Oct  9 21:09 btmp
-rw-r--r--. 1 root root 292292 Dec  9 17:38 lastlog
-rw-------. 1 root root      0 Oct  9 21:09 tallylog
-rw-rw-r--. 1 root utmp   1152 Dec  9 17:38 wtmp
-rw-------. 1 root root   2871 Oct  9 21:10 yum.log


It is doubtful you will be able to connect to the cluster using kubectl, as it would need the control plane nodes for the API.

As for the node: did you follow the instructions in the post? You need to run `sudo sheltie` to get into the underlying environment. This should give you access to the logs.

Once you are in there I would recommend running journalctl _COMM=host-ctr --no-pager


It looks like etcd is starting OK according to its log. The journal has no errors in it.

Dec 10 18:44:57 kn01-etcd-z5hbt host-containers@kubeadm-bootstrap[1849]: Waiting for etcd static pods
Dec 10 18:44:57 kn01-etcd-z5hbt host-containers@kubeadm-bootstrap[1849]: Running etcdadm init health phase
Dec 10 18:45:06 kn01-etcd-z5hbt host-containers@kubeadm-bootstrap[1849]: Phase command output:
Dec 10 18:45:06 kn01-etcd-z5hbt host-containers@kubeadm-bootstrap[1849]: --------
Dec 10 18:45:06 kn01-etcd-z5hbt host-containers@kubeadm-bootstrap[1849]: time="2024-12-10T18:44:57Z" level=info msg="[health] Checking local etcd endpoint health"
Dec 10 18:45:06 kn01-etcd-z5hbt host-containers@kubeadm-bootstrap[1849]: time="2024-12-10T18:45:06Z" level=info msg="[health] Local etcd endpoint is healthy"
Dec 10 18:45:06 kn01-etcd-z5hbt host-containers@kubeadm-bootstrap[1849]: --------
Dec 10 18:45:06 kn01-etcd-z5hbt host-containers@kubeadm-bootstrap[1849]: Bottlerocket bootstrap was successful. Disabled bootstrap container
Dec 10 18:45:06 kn01-etcd-z5hbt host-containers@kubeadm-bootstrap[1849]: time="2024-12-10T18:45:06Z" level=info msg="container task exited" code=0
Dec 10 18:45:07 kn01-etcd-z5hbt host-containers@kubeadm-bootstrap[1849]: time="2024-12-10T18:45:07Z" level=info msg="received signal: terminated"
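When the same check has to be repeated on several nodes, scanning the journal by eye gets tedious. A rough Python sketch that flags suspicious lines, assuming the `level=... msg="..." code=N` layout seen in the excerpt above plus the "exited with non-zero status" text quoted earlier in the thread; the function name is made up, and messages containing escaped quotes are not handled:

```python
import re

# Matches the logrus-style fields seen in the host-ctr journal excerpt.
_FIELDS = re.compile(r'level=(?P<level>\w+) msg="(?P<msg>[^"]*)"(?: code=(?P<code>\d+))?')


def find_bootstrap_problems(journal_lines):
    """Return journal lines that report an error level or a non-zero exit code."""
    problems = []
    for raw in journal_lines:
        line = raw.strip()
        match = _FIELDS.search(line)
        if match:
            code = match.group("code")
            bad_level = match.group("level") in ("error", "fatal")
            bad_exit = code is not None and code != "0"
            if bad_level or bad_exit:
                problems.append(line)
        elif "exited with non-zero status" in line:
            problems.append(line)
    return problems


if __name__ == "__main__":
    import sys

    # Usage: save the output of `journalctl _COMM=host-ctr --no-pager` and pipe it in.
    for problem in find_bootstrap_problems(sys.stdin):
        print(problem)
```

An empty result for all three etcd nodes would back up the reading that the bootstrap container itself is fine.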


It might be worth running

`systemctl list-units`

to see if any other processes have failed.
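As a rough aid for reading that output, here is a small Python sketch that flags units whose LOAD or ACTIVE column looks unhealthy. It assumes the default five-column layout (UNIT, LOAD, ACTIVE, SUB, DESCRIPTION) and is meant to be run on a saved copy of the output, not on the node itself; the function name is made up for illustration. On the node, `systemctl list-units --state=failed` gives a similar answer directly.

```python
def find_problem_units(list_units_output):
    """Return (unit, load, active, sub) for units that are not loaded or have failed."""
    problems = []
    for raw in list_units_output.splitlines():
        # Failed units are prefixed with a bullet in the default output.
        line = raw.strip().lstrip("●").strip()
        parts = line.split(None, 4)
        if len(parts) < 4:
            continue
        unit, load, active, sub = parts[:4]
        # Unit names always carry a type suffix; this skips the header and footer lines.
        if "." not in unit:
            continue
        if load != "loaded" or active == "failed":
            problems.append((unit, load, active, sub))
    return problems


if __name__ == "__main__":
    import sys

    for unit, load, active, sub in find_problem_units(sys.stdin.read()):
        print(f"{unit}: load={load} active={active} sub={sub}")
```

Units that are `active exited` are normal for one-shot services, so they are deliberately not flagged.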

Have you got any other etcd nodes or is it just the one?


I have 3 etcd machines every time this starts up. I will check the logs on all 3, but I suspect that they will not be different.

I cannot tell what is normal with the list-units output, but it looks like they are all ready for processing.

  UNIT                                                                                                                              LOAD   ACTIVE SUB     DESCRIPTION                                                                                                                 
  sys-devices-pci0000:00-0000:00:15.0-0000:03:00.0-net-eth0.device                                                                  loaded active plugged /sys/devices/pci0000:00/0000:00:15.0/0000:03:00.0/net/eth0
  sys-devices-pci0000:00-0000:00:16.0-0000:0b:00.0-nvme-nvme0-nvme0n1-nvme0n1p1.device                                              loaded active plugged VMware Virtual NVMe Disk BIOS-BOOT
  sys-devices-pci0000:00-0000:00:16.0-0000:0b:00.0-nvme-nvme0-nvme0n1-nvme0n1p10.device                                             loaded active plugged VMware Virtual NVMe Disk BOTTLEROCKET-HASH-B
  sys-devices-pci0000:00-0000:00:16.0-0000:0b:00.0-nvme-nvme0-nvme0n1-nvme0n1p11.device                                             loaded active plugged VMware Virtual NVMe Disk BOTTLEROCKET-RESERVED-B
  sys-devices-pci0000:00-0000:00:16.0-0000:0b:00.0-nvme-nvme0-nvme0n1-nvme0n1p12.device                                             loaded active plugged VMware Virtual NVMe Disk BOTTLEROCKET-PRIVATE
  sys-devices-pci0000:00-0000:00:16.0-0000:0b:00.0-nvme-nvme0-nvme0n1-nvme0n1p13.device                                             loaded active plugged VMware Virtual NVMe Disk BOTTLEROCKET
  sys-devices-pci0000:00-0000:00:16.0-0000:0b:00.0-nvme-nvme0-nvme0n1-nvme0n1p2.device                                              loaded active plugged VMware Virtual NVMe Disk EFI-SYSTEM
  sys-devices-pci0000:00-0000:00:16.0-0000:0b:00.0-nvme-nvme0-nvme0n1-nvme0n1p3.device                                              loaded active plugged VMware Virtual NVMe Disk BOTTLEROCKET-BOOT-A
  sys-devices-pci0000:00-0000:00:16.0-0000:0b:00.0-nvme-nvme0-nvme0n1-nvme0n1p4.device                                              loaded active plugged VMware Virtual NVMe Disk BOTTLEROCKET-ROOT-A
  sys-devices-pci0000:00-0000:00:16.0-0000:0b:00.0-nvme-nvme0-nvme0n1-nvme0n1p5.device                                              loaded active plugged VMware Virtual NVMe Disk BOTTLEROCKET-HASH-A
  sys-devices-pci0000:00-0000:00:16.0-0000:0b:00.0-nvme-nvme0-nvme0n1-nvme0n1p6.device                                              loaded active plugged VMware Virtual NVMe Disk BOTTLEROCKET-RESERVED-A
  sys-devices-pci0000:00-0000:00:16.0-0000:0b:00.0-nvme-nvme0-nvme0n1-nvme0n1p7.device                                              loaded active plugged VMware Virtual NVMe Disk EFI-BACKUP
  sys-devices-pci0000:00-0000:00:16.0-0000:0b:00.0-nvme-nvme0-nvme0n1-nvme0n1p8.device                                              loaded active plugged VMware Virtual NVMe Disk BOTTLEROCKET-BOOT-B
  sys-devices-pci0000:00-0000:00:16.0-0000:0b:00.0-nvme-nvme0-nvme0n1-nvme0n1p9.device                                              loaded active plugged VMware Virtual NVMe Disk BOTTLEROCKET-ROOT-B
  sys-devices-pci0000:00-0000:00:16.0-0000:0b:00.0-nvme-nvme0-nvme0n1.device                                                        loaded active plugged VMware Virtual NVMe Disk
  sys-devices-platform-serial8250-tty-ttyS0.device                                                                                  loaded active plugged /sys/devices/platform/serial8250/tty/ttyS0
  sys-devices-platform-serial8250-tty-ttyS1.device                                                                                  loaded active plugged /sys/devices/platform/serial8250/tty/ttyS1
  sys-devices-platform-serial8250-tty-ttyS2.device                                                                                  loaded active plugged /sys/devices/platform/serial8250/tty/ttyS2
  sys-devices-platform-serial8250-tty-ttyS3.device                                                                                  loaded active plugged /sys/devices/platform/serial8250/tty/ttyS3
  sys-devices-virtual-block-dm\x2d0.device                                                                                          loaded active plugged /sys/devices/virtual/block/dm-0
  sys-devices-virtual-block-loop0.device                                                                                            loaded active plugged /sys/devices/virtual/block/loop0
  sys-devices-virtual-block-loop1.device                                                                                            loaded active plugged /sys/devices/virtual/block/loop1
  sys-module-configfs.device                                                                                                        loaded active plugged /sys/module/configfs
  sys-module-fuse.device                                                                                                            loaded active plugged /sys/module/fuse
  sys-subsystem-net-devices-eth0.device                                                                                             loaded active plugged /sys/subsystem/net/devices/eth0                                                                                             
  -.mount                                                                                                                           loaded active mounted Root Mount
  boot.mount                                                                                                                        loaded active mounted /boot
  dev-hugepages.mount                                                                                                               loaded active mounted Huge Pages File System
  dev-mqueue.mount                                                                                                                  loaded active mounted POSIX Message Queue File System
  etc-cni.mount                                                                                                                     loaded active mounted CNI Configuration Directory (/etc/cni)
  etc-containerd.mount                                                                                                              loaded active mounted Containerd Configuration Directory (/etc/containerd)
  etc-host\x2dcontainers.mount                                                                                                      loaded active mounted Host containers Configuration Directory (/etc/host-containers)
  etc-kubernetes-pki-private.mount                                                                                                  loaded active mounted Kubernetes PKI private directory (/etc/kubernetes/pki/private)
  local-mnt.mount                                                                                                                   loaded active mounted local-mnt.mount
  local-opt.mount                                                                                                                   loaded active mounted local-opt.mount
  local-var.mount                                                                                                                   loaded active mounted local-var.mount
  local.mount                                                                                                                       loaded active mounted Local Directory (/local)
  mnt.mount                                                                                                                         loaded active mounted Mnt Directory (/mnt)
  opt-cni.mount                                                                                                                     loaded active mounted CNI Plugin Directory (/opt/cni)
  opt-csi.mount                                                                                                                     loaded active mounted CSI Helper Directory (/opt/csi)
  opt.mount                                                                                                                         loaded active mounted Opt Directory (/opt)                                                                                                                   loaded active mounted AWS configuration directory (/root/.aws)
  run-containerd-io.containerd.grpc.v1.cri-sandboxes-6fa742b1bfcf08582901f94e49a43cae0f730c83bf27f531f665358a4addb92f-shm.mount     loaded active mounted /run/containerd/io.containerd.grpc.v1.cri/sandboxes/6fa742b1bfcf08582901f94e49a43cae0f730c83bf27f531f665358a4addb92f/shm loaded active mounted /run/containerd/io.containerd.runtime.v2.task/ loaded active mounted /run/containerd/io.containerd.runtime.v2.task/
  run-credentials-systemd\x2dsysctl.service.mount                                                                                   loaded active mounted /run/credentials/systemd-sysctl.service
  run-credentials-systemd\x2dsysusers.service.mount                                                                                 loaded active mounted /run/credentials/systemd-sysusers.service
  run-credentials-systemd\x2dtmpfiles\x2dsetup.service.mount                                                                        loaded active mounted /run/credentials/systemd-tmpfiles-setup.service
  run-credentials-systemd\x2dtmpfiles\x2dsetup\x2ddev.service.mount                                                                 loaded active mounted /run/credentials/systemd-tmpfiles-setup-dev.service
  run-host\x2dcontainerd-io.containerd.runtime.v2.task-default-admin-rootfs.mount                                                   loaded active mounted /run/host-containerd/io.containerd.runtime.v2.task/default/admin/rootfs
  run-netdog.mount                                                                                                                  loaded active mounted Ephemeral netdog configuration directory
  sys-fs-fuse-connections.mount                                                                                                     loaded active mounted FUSE Control File System
  sys-kernel-config.mount                                                                                                           loaded active mounted Kernel Configuration File System
  sys-kernel-debug.mount                                                                                                            loaded active mounted Kernel Debug File System
  sys-kernel-tracing.mount                                                                                                          loaded active mounted Kernel Trace File System
  tmp.mount                                                                                                                         loaded active mounted Temporary Directory /tmp
  var-lib-bottlerocket.mount                                                                                                        loaded active mounted Private Directory (/var/lib/bottlerocket)
  var-lib-kernel\x2ddevel-.overlay-lower.mount                                                                                      loaded active mounted Kernel Development Sources (Read-Only)
  var.mount                                                                                                                         loaded active mounted Var Directory (/var)
  x86_64\x2dbottlerocket\x2dlinux\x2dgnu-sys\x2droot-usr-lib-modules.mount                                                          loaded active mounted Kernel Modules (Read-Write)
  x86_64\x2dbottlerocket\x2dlinux\x2dgnu-sys\x2droot-usr-share-licenses.mount                                                       loaded active mounted License files
  x86_64\x2dbottlerocket\x2dlinux\x2dgnu-sys\x2droot-usr-src-kernels.mount                                                          loaded active mounted Kernel Development Sources (Read-Write)                                                                                     
  cri-containerd-6fa742b1bfcf08582901f94e49a43cae0f730c83bf27f531f665358a4addb92f.scope                                             loaded active running libcontainer container 6fa742b1bfcf08582901f94e49a43cae0f730c83bf27f531f665358a4addb92f
  cri-containerd-b6a8b1cc23aedb31da386bb23be4f2503a19afdab92822b9d35b935f95c1bb59.scope                                             loaded active running libcontainer container b6a8b1cc23aedb31da386bb23be4f2503a19afdab92822b9d35b935f95c1bb59
  init.scope                                                                                                                        loaded active running System and Service Manager                                                                                                  
  acpid.service                                                                                                                     loaded active running ACPI event daemon
  activate-configured.service                                                                                                       loaded active exited  Isolates
  activate-multi-user.service                                                                                                       loaded active exited  Isolates
  apiserver.service                                                                                                                 loaded active running Bottlerocket API server
  audit-rules.service                                                                                                               loaded active exited  Load audit rules
  bootstrap-commands.service                                                                                                        loaded active exited  Bootstrap Commands
  chronyd.service                                                                                                                   loaded active running A versatile implementation of the Network Time Protocol
  containerd.service                                                                                                                loaded active running containerd container runtime
  dbus-broker.service                                                                                                               loaded active running D-Bus System Message Bus
  disable-kexec-load.service                                                                                                        loaded active exited  Disable kexec load syscalls
  disable-udp-offload.service                                                                                                       loaded active exited  Disables UDP offload
  generate-network-config.service                                                                                                   loaded active exited  Generate network configuration
  has-boot-ever-succeeded.service                                                                                                   loaded active exited  Checks and marks if boot has ever succeeded before
  host-containerd.service                                                                                                           loaded active running containerd runtime for host containers
  host-containers@admin.service                                                                                                     loaded active running Host container: admin
  kmod-static-nodes.service                                                                                                         loaded active exited  Create List of Static Device Nodes
  kubelet.service                                                                                                                   loaded active running Kubelet
  ldconfig.service                                                                                                                  loaded active exited  Rebuild Dynamic Linker Cache
  load-crash-kernel.service                                                                                                         loaded active exited  Load crash kernel
  mark-successful-boot.service                                                                                                      loaded active exited  Call signpost to mark the boot as successful after all required targets are met.
  mask-local-mnt.service                                                                                                            loaded active exited  Mask Local Mnt Directory (/local/mnt)
  mask-local-opt.service                                                                                                            loaded active exited  Mask Local Opt Directory (/local/opt)
  mask-local-var.service                                                                                                            loaded active exited  Mask Local Var Directory (/local/var)
  migrator.service                                                                                                                  loaded active exited  Bottlerocket data store migrator
  modprobe@configfs.service                                                                                                         loaded active exited  Load Kernel Module configfs
  modprobe@drm.service                                                                                                              loaded active exited  Load Kernel Module drm
  modprobe@efi_pstore.service                                                                                                       loaded active exited  Load Kernel Module efi_pstore
  modprobe@fuse.service                                                                                                             loaded active exited  Load Kernel Module fuse
  prepare-boot.service                                                                                                              loaded active exited  Prepare Boot Directory (/boot)
  prepare-local-fs.service                                                                                                          loaded active exited  Prepare Local Filesystem (/local)
  prepare-opt.service                                                                                                               loaded active exited  Prepare Opt Directory (/opt)
  prepare-var-lib-containerd.service                                                                                                loaded active exited  Prepare Containerd Directory (/var/lib/containerd)
  prepare-var-lib-kubelet.service                                                                                                   loaded active exited  Prepare Kubelet Directory (/var/lib/kubelet)
  prepare-var.service                                                                                                               loaded active exited  Prepare Var Directory (/var)
  repart-local.service                                                                                                              loaded active exited  Resize Data Partition
  selinux-policy-files.service                                                                                                      loaded active exited  Copy SELinux policy files
  send-boot-success.service                                                                                                         loaded active exited  Send boot success
  set-hostname.service                                                                                                              loaded active exited  Sets the hostname
  settings-applier.service                                                                                                          loaded active exited  Applies settings to create config files
  storewolf.service                                                                                                                 loaded active exited  Datastore creator
  sundog.service                                                                                                                    loaded active exited  User-specified setting generators
  systemd-journal-flush.service                                                                                                     loaded active exited  Flush Journal to Persistent Storage
  systemd-journald.service                                                                                                          loaded active running Journal Service
  systemd-logind.service                                                                                                            loaded active running User Login Management
  systemd-machine-id-commit.service                                                                                                 loaded active exited  Commit a transient machine-id on disk
  systemd-modules-load.service                                                                                                      loaded active exited  Load Kernel Modules
  systemd-network-generator.service                                                                                                 loaded active exited  Generate network units from Kernel command line
  systemd-networkd-wait-online.service                                                                                              loaded active exited  Wait for Network to be Configured
  systemd-networkd.service                                                                                                          loaded active running Network Configuration
  systemd-random-seed.service                                                                                                       loaded active exited  Load/Save Random Seed
  systemd-remount-fs.service                                                                                                        loaded active exited  Remount Root and Kernel File Systems
  systemd-resolved.service                                                                                                          loaded active running Network Name Resolution
  systemd-sysctl.service                                                                                                            loaded active exited  Apply Kernel Variables
  systemd-sysusers.service                                                                                                          loaded active exited  Create System Users
  systemd-tmpfiles-setup-dev.service                                                                                                loaded active exited  Create Static Device Nodes in /dev
  systemd-tmpfiles-setup.service                                                                                                    loaded active exited  Create Volatile Files and Directories
  systemd-udev-trigger.service                                                                                                      loaded active exited  Coldplug All udev Devices
  systemd-udevd.service                                                                                                             loaded active running Rule-based Manager for Device Events and Files
  systemd-update-done.service                                                                                                       loaded active exited  Update is Completed
  vmtoolsd.service                                                                                                                  loaded active running VMware Tools service
  write-network-status.service                                                                                                      loaded active exited  Write network status                                                                                                        
  -.slice                                                                                                                           loaded active active  Root Slice
  kubepods-besteffort.slice                                                                                                         loaded active active  libcontainer container kubepods-besteffort.slice
  kubepods-burstable-pod685bdd4b55003427ad0393d955f76f2e.slice                                                                      loaded active active  libcontainer container kubepods-burstable-pod685bdd4b55003427ad0393d955f76f2e.slice
  kubepods-burstable.slice                                                                                                          loaded active active  libcontainer container kubepods-burstable.slice
  kubepods.slice                                                                                                                    loaded active active  libcontainer container kubepods.slice
  runtime.slice                                                                                                                     loaded active active  Kubernetes and container runtime slice
  system-host\x2dcontainers.slice                                                                                                   loaded active active  Slice /system/host-containers
  system-modprobe.slice                                                                                                             loaded active active  Slice /system/modprobe
  system.slice                                                                                                                      loaded active active  System Slice
  user.slice                                                                                                                        loaded active active  User and Session Slice                                                                                                      
  dbus.socket                                                                                                                       loaded active running D-Bus System Message Bus Socket
  systemd-journald-audit.socket                                                                                                     loaded active running Journal Audit Socket
  systemd-journald-dev-log.socket                                                                                                   loaded active running Journal Socket (/dev/log)
  systemd-journald.socket                                                                                                           loaded active running Journal Socket
  systemd-networkd.socket                                                                                                           loaded active running Network Service Netlink Socket
  systemd-udevd-control.socket                                                                                                      loaded active running udev Control Socket
  systemd-udevd-kernel.socket                                                                                                       loaded active running udev Kernel Socket
                                                                                                                                    loaded active active  Basic System
                                                                                                                                    loaded active active  Bottlerocket final configuration complete
                                                                                                                                    loaded active active  First Boot Complete
                                                                                                                                    loaded active active  Login Prompts
                                                                                                                                    loaded active active  Preparation for Local File Systems
                                                                                                                                    loaded active active  Local File Systems
                                                                                                                                    loaded active active  Multi-User System
                                                                                                                                    loaded active active  Network is Online
                                                                                                                                    loaded active active  Preparation for Network
                                                                                                                                    loaded active active  Network
                                                                                                                                    loaded active active  Host and Network Name Lookups
                                                                                                                                    loaded active active  Path Units
                                                                                                                                    loaded active active  Bottlerocket initial configuration complete
                                                                                                                                    loaded active active  Remote File Systems
                                                                                                                                    loaded active active  Slice Units
                                                                                                                                    loaded active active  Socket Units
                                                                                                                                    loaded active active  Swaps
                                                                                                                                    loaded active active  System Initialization
                                                                                                                                    loaded active active  Timer Units
  metricdog.timer                                                                                                                   loaded active waiting Scheduled Metricdog Pings
  systemd-tmpfiles-clean.timer                                                                                                      loaded active waiting Daily Cleanup of Temporary Directories

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.
164 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.

So I can't see anything wrong there. All the units seem to be OK and none have failed. Can your bootstrap node (assuming it's your laptop) reach the nodes on their network? I think the bootstrap node needs to communicate directly with the etcd service to know it is healthy.

But it is now getting to the limit of my debugging abilities..!

I have been starting it from a desktop running Ubuntu 24.04. I also tried a VM running Ubuntu 24.04 on VMware. Both hit the same problem bringing the cluster up, and both can ssh into the etcd machine. Is there a port number for the etcd daemon I can try to connect to, to see whether it is starting? What do I send it once connected, for low-level debugging?

Is there someone on the EKS Anywhere project who wrote or ported the VMware code? Is there a different configuration I could adapt to my network and test with? Maybe it's my configuration?
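
On the port question: etcd conventionally serves clients on TCP 2379 and peers on TCP 2380. A minimal sketch of a reachability check from the bootstrap machine follows. The function name `port_is_open` is made up for illustration, and a successful connect only shows that something is listening, not that etcd is healthy.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    This only proves that a listener accepted the connection. It says nothing
    about TLS or about etcd's own health.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_etcd_ports(host: str) -> dict:
    """Check the conventional etcd client (2379) and peer (2380) ports."""
    return {port: port_is_open(host, port) for port in (2379, 2380)}
```

Calling `check_etcd_ports("<etcd node address>")` from the admin machine returns a small dict such as `{2379: True, 2380: False}`, depending on what is reachable.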

I am starting classes on Kubernetes. Would it make sense to use etcdctl to talk to etcd and see whether it is listening and healthy?

tcooperma commented Dec 12, 2024

Talking to etcd with etcdctl, it looks like TLS is not working. Do I need to specify the certificate somehow?

[ec2-user@admin]$ /tmp/etcd-download-test/etcdctl --endpoints=localhost:2379 endpoint status
{"level":"warn","ts":"2024-12-12T17:45:24.635105Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:63","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0000ae000/localhost:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"error reading server preface: EOF\""}
Failed to get the status of endpoint localhost:2379 (context deadline exceeded)
[ec2-user@admin]$ /tmp/etcd-download-test/etcdctl --endpoints=localhost:2379 endpoint status
{"level":"warn","ts":"2024-12-12T17:45:56.815841Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:63","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0002e6000/localhost:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"error reading server preface: EOF\""}
Failed to get the status of endpoint localhost:2379 (context deadline exceeded)
[ec2-user@admin]$ /tmp/etcd-download-test/etcdctl --endpoints=localhost:2379 endpoint health
{"level":"warn","ts":"2024-12-12T17:46:55.045687Z","logger":"client","caller":"[email protected]/retry_interceptor.go:63","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0002e0000/localhost:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"error reading server preface: EOF\""}
localhost:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Error: unhealthy cluster

The container log looks like:

2024-12-12T17:46:52.538570028Z stderr F {"level":"warn","ts":"2024-12-12T17:46:52.538280Z","caller":"embed/config_logging.go:170","msg":"rejected connection on client endpoint","remote-addr":"","server-name":"","error":"tls: first record does not look like a TLS handshake"}
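
The server-side message "first record does not look like a TLS handshake", paired with the client-side "error reading server preface: EOF", is what one would expect when a client speaks plaintext to a TLS listener. etcdctl takes the global flags `--cacert`, `--cert`, and `--key` for this, together with an `https://` endpoint. Below is a rough sketch that assembles such a command line. The certificate directory is taken from the `--certs-dir` value that appears in this thread, but the file names `ca.crt`, `etcdctl-etcd-client.crt`, and `etcdctl-etcd-client.key` are assumptions and should be checked against what is actually in that directory. The helper name `etcdctl_tls_command` is hypothetical.

```python
from pathlib import PurePosixPath


def etcdctl_tls_command(
    endpoint_host: str,
    certs_dir: str = "/var/lib/etcd/pki",
    port: int = 2379,
    etcdctl: str = "etcdctl",
    subcommand: tuple = ("endpoint", "health"),
) -> list:
    """Build an etcdctl argv that uses TLS instead of plaintext.

    endpoint_host must already be bracketed if it is an IPv6 literal.
    The certificate file names below are ASSUMED, not confirmed by the thread.
    """
    certs = PurePosixPath(certs_dir)
    return [
        etcdctl,
        f"--endpoints=https://{endpoint_host}:{port}",
        f"--cacert={certs / 'ca.crt'}",                    # assumed file name
        f"--cert={certs / 'etcdctl-etcd-client.crt'}",     # assumed file name
        f"--key={certs / 'etcdctl-etcd-client.key'}",      # assumed file name
        *subcommand,
    ]
```

The resulting list can be passed to `subprocess.run(...)` on the etcd node, or joined with spaces and pasted into the shell, with `etcdctl` replaced by the path of the downloaded binary.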

tcooperma commented Dec 30, 2024

Could it be that etcd is configured to listen on an IPv6 address?
root 2318 2077 0 17:56 ? 00:00:00 /opt/bin/etcdadm join phase membership https://2601:189:427f:e270:250:56ff:fea8:8357:2379 -l debug --version 3.5.15-eks-1-31-7 --init-system kubelet --image-repository --certs-dir /var/lib/etcd/pki --data-dir /var/lib/etcd/data --kubelet-pod-manifest-path ./manifests --cipher-suites TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256

Also there must be a log file somewhere on disk?
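
One detail in the `etcdadm join` command line above: the endpoint URL carries a bare IPv6 literal followed by `:2379`. URL syntax (RFC 3986) requires IPv6 literals to be wrapped in square brackets, since otherwise the port cannot be told apart from the address. Whether this is what breaks the join here is only a guess. A small sketch of how an endpoint URL can be built safely for either address family follows. The function name `etcd_endpoint_url` is hypothetical.

```python
import ipaddress


def etcd_endpoint_url(host: str, port: int = 2379, scheme: str = "https") -> str:
    """Build an endpoint URL, bracketing the host if it is an IPv6 literal."""
    try:
        is_v6 = ipaddress.ip_address(host).version == 6
    except ValueError:
        # Not an IP literal at all, so treat it as a DNS name.
        is_v6 = False
    netloc = f"[{host}]" if is_v6 else host
    return f"{scheme}://{netloc}:{port}"
```

For the address in the process listing, this yields `https://[2601:189:427f:e270:250:56ff:fea8:8357]:2379`, which is the form a URL parser can split into host and port.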

I have the same problem. The first etcd VM itself got stuck, with a message lingering on the VM console in the vSphere web UI, as seen below.


After almost a week of troubleshooting (reinstalled the admin machine, rebuilt the Ubuntu node templates), nothing worked.

Finally, I am resorting to a workaround: when the etcd machine hangs, I spot it in the vSphere web UI and power it off and on, and then everything comes up smoothly through to full cluster creation. In some cases, the etcd VM had to be hard-slapped with multiple power off/on cycles. In a couple of instances, the same issue happened with worker nodes, and the same workaround applied. I hope there is a solution on the roadmap.

markoradisa commented Mar 10, 2025

Exact same issues today. Both for Nutanix and Docker providers. Stops in the same place as you and unable to create clusters. I think there has been some serious bug introduced recently. If you get this resolved please let me know.

I never had issues creating the clusters before. Tried so many things today.
