
[Bug]: One control plane node stuck waiting for MicroOS #1484

Closed
heysarver opened this issue Sep 20, 2024 · 11 comments
Labels
bug Something isn't working

Comments

@heysarver

heysarver commented Sep 20, 2024

Description

I'm trying to deploy a cluster with 3 or 5 control plane nodes; both configurations give the same result. N-1 nodes come up successfully, but across several terraform destroy and apply runs there is always one control node stuck at "Waiting for MicroOS to become available..." until Terraform times out.

Kube.tf file

module "kube-hetzner" {
  providers = {
    hcloud = hcloud
  }
  source = "kube-hetzner/kube-hetzner/hcloud"
  version = "~> 2.14.0"
  hcloud_token = var.hcloud_token

  ssh_public_key = var.ssh_public_key
  ssh_private_key = var.ssh_private_key

  network_region = "us-east"
  
  initial_k3s_channel    = "v1.29"

  cluster_name = "k8s-primary"
  base_domain = "hzr.*******.net"

  control_plane_nodepools = [
    {
      name        = "control",
      server_type = "cpx41",
      location    = "ash",
      count       = 3,
      labels      = [],
      taints      = []
    }
  ]

  agent_nodepools = [
    {
      name        = "worker",
      server_type = "cpx21",
      location    = "ash",
      labels      = [],
      taints      = [],
      count       = 0
    }
  ]

  allow_scheduling_on_control_plane = true

  load_balancer_type     = "lb11"
  load_balancer_location = "ash"
  load_balancer_disable_ipv6 = true
  load_balancer_algorithm_type = "least_connections"
  load_balancer_health_check_interval = "5s"
  load_balancer_health_check_timeout = "3s"

  use_control_plane_lb = true
  control_plane_lb_type = "lb11"

  cluster_autoscaler_version   = "20240226"
  cluster_autoscaler_log_level = 4

  ingress_controller = "nginx"
  ingress_target_namespace = "ingress-nginx"
  ingress_replica_count = 3

  kured_options = {
    "reboot-days": "su",
    "start-time": "3am",
    "end-time": "8am",
    "time-zone": "America/New_York",
    "lock-ttl" : "30m",
  }

  dns_servers = [
    "8.8.8.8",
    "1.1.1.1",
  ]

  cert_manager_values = <<-EOT
installCRDs: true
replicaCount: 3
webhook:
  replicaCount: 3
cainjector:
  replicaCount: 3
  EOT

  nginx_values = <<-EOT
controller:
  watchIngressWithoutClass: "true"
  kind: "DaemonSet"
  config:
    "use-forwarded-headers": "true"
    "compute-full-forwarded-for": "true"
    "use-proxy-protocol": "true"
  service:
    annotations:
      "load-balancer.hetzner.cloud/name": "k8s-primary-nginx"
      "load-balancer.hetzner.cloud/use-private-ip": "false"
      "load-balancer.hetzner.cloud/disable-private-ingress": "false"
      "load-balancer.hetzner.cloud/location": "ash"
      "load-balancer.hetzner.cloud/type": "lb11"
      "load-balancer.hetzner.cloud/uses-proxyprotocol": "true"
  EOT

}

Screenshots

Failed node: (screenshot of the Terraform output, stuck at "Waiting for MicroOS to become available...")

Platform

macOS, Terraform Cloud

@heysarver heysarver added the bug Something isn't working label Sep 20, 2024
@mysticaltech
Collaborator

@heysarver Please try rebooting the node with hcloud and see if that fixes it.
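For reference, a minimal sketch using the hcloud CLI (the server name/ID is a placeholder; the actual one comes from `hcloud server list`):

```shell
# List servers to find the stuck control plane node
hcloud server list

# Soft reboot it; use "hcloud server reset" for a hard power cycle instead
hcloud server reboot <server-name-or-id>
```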

@JWDobken

I'm experiencing the same problem!

Two things I noted:

  1. The affected nodes don't have a private IP.
  2. They are not connected to the private network.

  • Rebooting does not help.
  • Manually connecting them to the network doesn't help either.

@heysarver
Author

Rebooting solved it, but it's still an issue. I added another worker pool and got the same result: all but one node came up OK, and rebooting that one fixed it again.

@mysticaltech
Collaborator

mysticaltech commented Sep 24, 2024

@heysarver Remove the lock-ttl entry from kured_options. Also remove cluster_autoscaler_version (the default value is needed). Then run:

terraform init -upgrade

Plan B

Make sure the underlying image is good; rebuild it with the Packer command if needed.
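Rebuilding the MicroOS snapshot typically looks like the following (a sketch assuming the template file name shipped in the kube-hetzner repo; check your checkout for the exact name):

```shell
export HCLOUD_TOKEN="<your-hcloud-api-token>"

# Template name as shipped in the kube-hetzner repo (may differ by version)
packer init hcloud-microos-snapshots.pkr.hcl
packer build hcloud-microos-snapshots.pkr.hcl
```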

Debug cloud-init and what could be happening at boot; ask https://claude.ai for the exact commands and give it the logs.
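A few commands that are commonly useful for this on the stuck node (assuming SSH access as root; the paths are cloud-init's defaults):

```shell
ssh root@<node-public-ip>

# Overall cloud-init result for this boot
cloud-init status --long

# Boot-time logs from cloud-init
journalctl -u cloud-init --no-pager | tail -n 100
cat /var/log/cloud-init-output.log
```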

@mysticaltech
Collaborator

@JWDobken please create a new issue with all the details.

@JWDobken

Rebuilding the image seems to have solved my issue, thank you.

@heysarver
Author

@mysticaltech I've already started using the cluster and have hit the server limits on my new account, so I'll have to wait to try that, but it sounds reasonable.

@heysarver
Author

heysarver commented Nov 1, 2024

I can confirm this was my issue: kured_options lock-ttl set to 30m.

When I made a new cluster to confirm, I also had to manually open the firewall ports for the nginx ingress load balancer with this config. Any ideas on that, or should I open a new issue?

@mysticaltech
Collaborator

@heysarver Please rephrase the issue; I'm not clearly understanding what problems you are still facing.

@heysarver
Author

@mysticaltech After creating the cluster, I have to manually add firewall rules for the nginx-ingress destination ports; otherwise all the load balancer targets are unhealthy. This also causes the Terraform state to get out of sync.

(Screenshots: the manually added firewall rules and the load balancer target health status)

@mysticaltech
Collaborator

@heysarver Please open a new issue with the full working kube.tf (apart from private info) and steps to reproduce.
