
[Bug]: One control plane node stuck waiting for MicroOS #1484

Closed
heysarver opened this issue Sep 20, 2024 · 11 comments
Labels
bug Something isn't working

Comments

@heysarver

heysarver commented Sep 20, 2024

Description

I'm trying to deploy a cluster with 3 or 5 control plane nodes; both configurations give the same result. N-1 nodes come up successfully, but across several terraform destroy and apply runs there is always one control node stuck at "Waiting for MicroOS to become available..." until Terraform times out.

Kube.tf file

module "kube-hetzner" {
  providers = {
    hcloud = hcloud
  }
  source = "kube-hetzner/kube-hetzner/hcloud"
  version = "~> 2.14.0"
  hcloud_token = var.hcloud_token

  ssh_public_key = var.ssh_public_key
  ssh_private_key = var.ssh_private_key

  network_region = "us-east"
  
  initial_k3s_channel    = "v1.29"

  cluster_name = "k8s-primary"
  base_domain = "hzr.*******.net"

  control_plane_nodepools = [
    {
      name        = "control",
      server_type = "cpx41",
      location    = "ash",
      count       = 3,
      labels      = [],
      taints      = []
    }
  ]

  agent_nodepools = [
    {
      name        = "worker",
      server_type = "cpx21",
      location    = "ash",
      labels      = [],
      taints      = [],
      count       = 0
    }
  ]

  allow_scheduling_on_control_plane = true

  load_balancer_type     = "lb11"
  load_balancer_location = "ash"
  load_balancer_disable_ipv6 = true
  load_balancer_algorithm_type = "least_connections"
  load_balancer_health_check_interval = "5s"
  load_balancer_health_check_timeout = "3s"

  use_control_plane_lb = true
  control_plane_lb_type = "lb11"

  cluster_autoscaler_version   = "20240226"
  cluster_autoscaler_log_level = 4

  ingress_controller = "nginx"
  ingress_target_namespace = "ingress-nginx"
  ingress_replica_count = 3

  kured_options = {
    "reboot-days": "su",
    "start-time": "3am",
    "end-time": "8am",
    "time-zone": "America/New_York",
    "lock-ttl" : "30m",
  }

  dns_servers = [
    "8.8.8.8",
    "1.1.1.1",
  ]

  cert_manager_values = <<-EOT
installCRDs: true
replicaCount: 3
webhook:
  replicaCount: 3
cainjector:
  replicaCount: 3
  EOT

  nginx_values = <<-EOT
controller:
  watchIngressWithoutClass: "true"
  kind: "DaemonSet"
  config:
    "use-forwarded-headers": "true"
    "compute-full-forwarded-for": "true"
    "use-proxy-protocol": "true"
  service:
    annotations:
      "load-balancer.hetzner.cloud/name": "k8s-primary-nginx"
      "load-balancer.hetzner.cloud/use-private-ip": "false"
      "load-balancer.hetzner.cloud/disable-private-ingress": "false"
      "load-balancer.hetzner.cloud/location": "ash"
      "load-balancer.hetzner.cloud/type": "lb11"
      "load-balancer.hetzner.cloud/uses-proxyprotocol": "true"
  EOT

}

Screenshots

Failed node: (screenshot of the Terraform output, stuck at "Waiting for MicroOS to become available...")

Platform

macOS, Terraform Cloud

@heysarver heysarver added the bug Something isn't working label Sep 20, 2024
@mysticaltech
Collaborator

@heysarver Please try rebooting the node with hcloud and see if that fixes it.
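For reference, a minimal sketch using the hcloud CLI (the server name/ID is a placeholder; the actual one comes from `hcloud server list`):

```shell
# List servers to find the stuck control plane node
hcloud server list

# Soft reboot it; use "hcloud server reset" for a hard power cycle instead
hcloud server reboot <server-name-or-id>
```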

@JWDobken

I'm experiencing the same problem!

Two things I noted:

  1. The affected nodes don't have a private IP.
  2. They are not connected to the private network.

  • Rebooting does not help.
  • Manually connecting them to the network doesn't help either.

@heysarver
Author

Rebooting solved it, but it's still an issue. I added another worker pool and got the same result: all but one node came up OK, and rebooting that one fixed it again.

@mysticaltech
Collaborator

mysticaltech commented Sep 24, 2024

@heysarver Remove the lock-ttl entry from kured_options. Also remove cluster_autoscaler_version (the default value is needed). Then run:

terraform init -upgrade

Plan B

Make sure the underlying image is good; rebuild it with the Packer command if needed.
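Rebuilding the MicroOS snapshot typically looks like the following (a sketch assuming the template file name shipped in the kube-hetzner repo; check your checkout for the exact name):

```shell
export HCLOUD_TOKEN="<your-hcloud-api-token>"

# Template name as shipped in the kube-hetzner repo (may differ by version)
packer init hcloud-microos-snapshots.pkr.hcl
packer build hcloud-microos-snapshots.pkr.hcl
```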

Debug cloud-init and what could be happening at boot; ask https://claude.ai for the exact commands and give it the logs.
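A few commands that are commonly useful for this on the stuck node (assuming SSH access as root; the paths are cloud-init's defaults):

```shell
ssh root@<node-public-ip>

# Overall cloud-init result for this boot
cloud-init status --long

# Boot-time logs from cloud-init
journalctl -u cloud-init --no-pager | tail -n 100
cat /var/log/cloud-init-output.log
```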

@mysticaltech
Collaborator

@JWDobken please create a new issue with all the details.

@JWDobken

Rebuilding the image seems to have solved my issue, thank you.

@heysarver
Author

@mysticaltech I've already started using the cluster and have hit the server limits on my new account, so I'll have to wait to try that, but it sounds reasonable.

@heysarver
Author

heysarver commented Nov 1, 2024

I can confirm this was my issue: kured_options lock-ttl set to 30m.

When I made a new cluster to confirm, I also had to manually open the firewall ports for the nginx ingress load balancer with this config. Any ideas on that, or should I open a new issue?

@mysticaltech
Collaborator

@heysarver Please rephrase the issue; I'm not clearly understanding what problems you are still facing.

@heysarver
Author

@mysticaltech After creating the cluster, I have to manually add firewall rules for the nginx-ingress destination ports; otherwise all the load balancer targets are unhealthy. This also causes the Terraform state to get out of sync.

(Screenshots: the manually added firewall rules and the load balancer target health status)

@mysticaltech
Collaborator

@heysarver Please open a new issue with the full working kube.tf (apart from private info) and steps to reproduce.
