Merge pull request #527 from vitobotta/masters-different-locations
Masters in different locations
vitobotta authored Jan 30, 2025
2 parents 20362f1 + 9695aa9 commit dd802e8
Showing 20 changed files with 276 additions and 72 deletions.
2 changes: 2 additions & 0 deletions README.md
@@ -58,6 +58,8 @@ See my public profile with links for connecting with me [here](https://vitobotta

- [Installation](docs/Installation.md)
- [Creating a cluster](docs/Creating_a_cluster.md)
- [Masters in different locations](docs/Masters_in_different_locations.md)
- [Upgrading a 1.x cluster to 2.x](docs/Upgrading_a_cluster_from_1x_to_2x.md)
- [Setting up a cluster](docs/Setting%20up%20a%20cluster.md)
- [Recommendations](docs/Recommendations.md)
- [Maintenance](docs/Maintenance.md)
7 changes: 5 additions & 2 deletions docs/Creating_a_cluster.md
@@ -57,8 +57,11 @@ schedule_workloads_on_masters: false

masters_pool:
instance_type: cpx21
instance_count: 3
location: nbg1
instance_count: 3 # 3 masters for HA; a single-master cluster is also possible for dev/testing (not recommended for production)
locations: # specify a single location for single-master clusters or to keep all masters in the same location; for regional clusters (eu-central network zone only), each master must be in a different location
- fsn1
- hel1
- nbg1

worker_node_pools:
- name: small-static
8 changes: 4 additions & 4 deletions docs/Installation.md
@@ -33,15 +33,15 @@ You need to install these dependencies first:
##### Intel / x86

```bash
wget https://github.com/vitobotta/hetzner-k3s/releases/download/v2.2.2/hetzner-k3s-macos-amd64
wget https://github.com/vitobotta/hetzner-k3s/releases/download/v2.2.3/hetzner-k3s-macos-amd64
chmod +x hetzner-k3s-macos-amd64
sudo mv hetzner-k3s-macos-amd64 /usr/local/bin/hetzner-k3s
```

##### Apple Silicon / ARM

```bash
wget https://github.com/vitobotta/hetzner-k3s/releases/download/v2.2.2/hetzner-k3s-macos-arm64
wget https://github.com/vitobotta/hetzner-k3s/releases/download/v2.2.3/hetzner-k3s-macos-arm64
chmod +x hetzner-k3s-macos-arm64
sudo mv hetzner-k3s-macos-arm64 /usr/local/bin/hetzner-k3s
```
@@ -51,15 +51,15 @@ sudo mv hetzner-k3s-macos-arm64 /usr/local/bin/hetzner-k3s
#### amd64

```bash
wget https://github.com/vitobotta/hetzner-k3s/releases/download/v2.2.2/hetzner-k3s-linux-amd64
wget https://github.com/vitobotta/hetzner-k3s/releases/download/v2.2.3/hetzner-k3s-linux-amd64
chmod +x hetzner-k3s-linux-amd64
sudo mv hetzner-k3s-linux-amd64 /usr/local/bin/hetzner-k3s
```

#### arm

```bash
wget https://github.com/vitobotta/hetzner-k3s/releases/download/v2.2.2/hetzner-k3s-linux-arm64
wget https://github.com/vitobotta/hetzner-k3s/releases/download/v2.2.3/hetzner-k3s-linux-arm64
chmod +x hetzner-k3s-linux-arm64
sudo mv hetzner-k3s-linux-arm64 /usr/local/bin/hetzner-k3s
```
63 changes: 63 additions & 0 deletions docs/Masters_in_different_locations.md
@@ -0,0 +1,63 @@
# Masters in Different Locations

You can set up a regional cluster for maximum availability by placing each master in a different European location: the first master in Falkenstein (fsn1), the second in Helsinki (hel1), and the third in Nuremberg (nbg1), with locations assigned in alphabetical order. This setup is only possible in network zones with multiple locations; currently the only such zone is `eu-central`, which includes these three locations. Other regions support only zonal clusters. Regional clusters are also limited to 3 masters, since only three locations are available.

To create a regional cluster, simply set the `instance_count` for the masters pool to 3 and specify the `locations` setting as `fsn1`, `hel1`, and `nbg1`.
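
For reference, a minimal `masters_pool` sketch for a regional cluster, matching the example in the [Creating a cluster](Creating_a_cluster.md) docs (the instance type is just an example):

```yaml
masters_pool:
  instance_type: cpx21 # example instance type
  instance_count: 3    # regional clusters require exactly 3 masters
  locations:
    - fsn1 # Falkenstein
    - hel1 # Helsinki
    - nbg1 # Nuremberg
```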

## Converting a Single Master or Zonal Cluster to a Regional One

If you already have a cluster with a single master or three masters in the same European location, converting it to a regional cluster is straightforward. Just follow these steps carefully and be patient. Note that this requires hetzner-k3s version 2.2.3 or higher.

Before you begin, make sure to back up all your applications and data! This is crucial. While the migration process is relatively simple, there is always some level of risk involved.

- [ ] Set the `instance_count` for the masters pool to 3 if your cluster currently has only one master.
- [ ] Update the `locations` setting for the masters pool to include `fsn1`, `hel1`, and `nbg1`, like this:

```yaml
locations:
- fsn1
- hel1
- nbg1
```
The locations are always processed in alphabetical order, regardless of how you list them in the `locations` property. This ensures consistency, especially when replacing a master due to node failure or other issues.

- [ ] If your cluster currently has a single master, run the `create` command with the updated configuration. This will create `master2` in Helsinki and `master3` in Nuremberg. Wait for the operation to complete and confirm that all three masters are in a ready state.
- [ ] If `master1` is not in Falkenstein (fsn1):
  - Drain `master1`.
  - Delete `master1` from Kubernetes with `kubectl delete node {cluster-name}-master1`.
  - Remove the `master1` instance via the Hetzner Console or the `hcloud` utility (see: https://github.com/hetznercloud/cli). A sketch of these commands is shown after the etcd output below.
  - Run the `create` command again. This will recreate `master1` in Falkenstein.
  - SSH into each master and run the following commands to verify that `master1` has joined the cluster correctly:

```bash
sudo apt-get update
sudo apt-get install etcd-client
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
export ETCDCTL_CACERT=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
export ETCDCTL_CERT=/var/lib/rancher/k3s/server/tls/etcd/server-client.crt
export ETCDCTL_KEY=/var/lib/rancher/k3s/server/tls/etcd/server-client.key
etcdctl member list
```

The last command should display something like this if everything is working properly:

```
285ab4b980c2c8c, started, test-master2-d25722af, https://10.0.0.3:2380, https://10.0.0.3:2379, false
aad3fac89b68bfb7, started, test-master1-5e550de0, https://10.0.0.4:2380, https://10.0.0.4:2379, false
c11852e25aef34e8, started, test-master3-0ed051a3, https://10.0.0.2:2380, https://10.0.0.2:2379, false
```
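
For reference, a sketch of the drain/delete/recreate steps above, assuming a cluster named `test`, a config file named `cluster_config.yaml`, and the `hcloud` CLI configured for your project (adjust the names to match your cluster):

```bash
# Drain master1 so workloads move to the other nodes
kubectl drain test-master1 --ignore-daemonsets --delete-emptydir-data

# Remove master1 from the Kubernetes cluster
kubectl delete node test-master1

# Delete the actual instance in Hetzner Cloud
hcloud server delete test-master1

# Recreate master1 in the correct location
hetzner-k3s create --config cluster_config.yaml
```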

- [ ] If `master2` is not in Helsinki, follow the same steps as with `master1` but for `master2`. This will recreate `master2` in Helsinki.
- [ ] If `master3` is not in Nuremberg, repeat the process for `master3`. This will recreate `master3` in Nuremberg.

That’s it! You now have a regional cluster, which will keep operating even if one of the Hetzner locations suffers a temporary failure. I also recommend setting `create_load_balancer_for_the_kubernetes_api` to `true` if you don’t already have a load balancer for the Kubernetes API.
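
If you enable it, a minimal sketch of that setting (assuming it is a top-level option, as the name suggests; check the [Creating a cluster](Creating_a_cluster.md) docs for the exact placement in the config file):

```yaml
create_load_balancer_for_the_kubernetes_api: true
```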

## Performance Considerations

This feature has been frequently requested, but I delayed implementing it until I could thoroughly test the configuration. I was concerned about latency issues, as etcd is sensitive to delays, and I wanted to ensure that the latency between the German locations and Helsinki wouldn’t cause problems.

It turns out that the default heartbeat interval for etcd is 100ms, while the latency between Helsinki and Falkenstein/Nuremberg is only 25-27ms, so the total round-trip time (RTT) for the Raft consensus is around 60-70ms, which is well within etcd's acceptable limits. Benchmarks confirmed that everything runs smoothly, so there's no need to adjust the etcd configuration for this setup.
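
If you want to verify the latency and etcd health yourself, a quick sketch to run from one of the masters (the private IP is an example; the etcd environment variables are the same as in the verification steps above):

```bash
# Round-trip latency over the private network to another master
ping -c 5 10.0.0.3

# etcd's own view of the cluster and request latency
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
export ETCDCTL_CACERT=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
export ETCDCTL_CERT=/var/lib/rancher/k3s/server/tls/etcd/server-client.crt
export ETCDCTL_KEY=/var/lib/rancher/k3s/server/tls/etcd/server-client.key
etcdctl endpoint status -w table
etcdctl check perf
```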
88 changes: 88 additions & 0 deletions docs/Upgrading_a_cluster_from_1x_to_2x.md
@@ -0,0 +1,88 @@
# Upgrading a cluster created with hetzner-k3s v1.x to v2.x

The v1 version of hetzner-k3s is quite old and hasn't been supported for a while, but I know that some people haven't upgraded to v2 because until now there wasn't a straightforward process to do this.

The migration is now straightforward, provided you follow these instructions carefully and are patient. It also gives you the opportunity to replace deprecated instance types (the `CX` series) with newer ones. It requires hetzner-k3s v2.2.3 or higher.

## Prerequisites

- [ ] I recommend installing the [hcloud utility](https://github.com/hetznercloud/cli) so you can delete old instances more easily and quickly (see the sketch below)
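
A quick sketch of the `hcloud` commands used later in this guide (the server name is an example; you first need a context with your Hetzner Cloud API token):

```bash
# One-time setup: create a context with your Hetzner Cloud API token
hcloud context create my-cluster

# List servers to find the one to delete
hcloud server list

# Delete an old instance after it has been drained and removed from Kubernetes
hcloud server delete my-cluster-master1
```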

## Upgrading configuration and first steps

- [ ] ==Back up apps and data== - as with all migrations, there is some risk involved, so be prepared in case something doesn't go according to plan
- [ ] ==Back up the kubeconfig and the old config file==
- [ ] Uninstall the System Upgrade Controller
- [ ] Create a resolv file on the existing nodes, either manually or automatically with the `hcloud` CLI:
```bash
hcloud server list | awk '{print $4}' | tail -n +2 | while read ip; do
echo "Setting DNS for ${ip}"
ssh -n root@${ip} "echo nameserver 8.8.8.8 | tee /etc/k8s-resolv.conf"
ssh -n root@${ip} "cat /etc/k8s-resolv.conf"
done
```
- [ ] Convert the config file to the new format (see https://github.com/vitobotta/hetzner-k3s/releases/tag/v2.0.0)
- [ ] Comment out or remove empty node pools from the config file
- [ ] Set `embedded_registry_mirror: enabled: false` if needed, depending on the current version of k3s (https://docs.k3s.io/installation/registry-mirror)
- [ ] Add `legacy_instance_type` to ==ALL== node pools, both masters and workers, set to the current instance type (regardless of whether it's deprecated or not). ==This is crucial for the migration== (see the example config after this list)
- [ ] Run the `create` command ==with the latest hetzner-k3s, using the new config file==
- [ ] Wait for all CSI pods in `kube-system` to restart and ==ensure everything is running==
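
For illustration, a partial config sketch covering the two settings above (pool names and instance types are just examples; keep the rest of your converted config unchanged):

```yaml
embedded_registry_mirror:
  enabled: false # only if required by your current k3s version

masters_pool:
  instance_type: cpx21       # the new instance type to migrate to
  legacy_instance_type: cx21 # the instance type the masters currently use
  instance_count: 3
  # ...rest of your masters_pool settings unchanged

worker_node_pools:
  - name: small-static
    instance_type: cpx31
    legacy_instance_type: cx31 # the instance type these workers currently use
    instance_count: 3
    location: nbg1
    # ...rest of the pool settings unchanged
```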

## Rotating control plane instances with the new instance type

One master at a time (==switch context before rotating master1==, unless your cluster has a load balancer for the Kubernetes API):

- [ ] Drain the master, then delete it both with kubectl and from the Hetzner Console (or with the `hcloud` CLI) so that the actual instance is removed as well
- [ ] Rerun the `create` command to recreate the master with the new instance type, wait for it to join the control plane and be in "ready" status
- [ ] SSH into each master and verify that the etcd members have been updated correctly and are in sync
```bash
sudo apt-get update
sudo apt-get install etcd-client

export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
export ETCDCTL_CACERT=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
export ETCDCTL_CERT=/var/lib/rancher/k3s/server/tls/etcd/server-client.crt
export ETCDCTL_KEY=/var/lib/rancher/k3s/server/tls/etcd/server-client.key

etcdctl member list
```

Repeat the process for each master carefully. After the three masters have been replaced:

- [ ] Rerun the `create` command once or twice to ensure the config is stable and the masters no longer get restarted
- [ ] [Debug DNS resolution](https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/). If there are issues, restart the k3s agents with the command below, then restart CoreDNS (a sketch for that follows the script)
```bash
hcloud server list | grep worker | awk '{print $4}' | while read ip; do
  echo "${ip}"
  ssh -n root@${ip} "systemctl restart k3s-agent"
  sleep 10
done
```
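
To restart CoreDNS afterwards, a rollout restart of its deployment is usually enough (a sketch; on a standard k3s install the deployment is called `coredns` in the `kube-system` namespace):

```bash
kubectl -n kube-system rollout restart deployment coredns
kubectl -n kube-system rollout status deployment coredns
```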
- [ ] Address any issues with your workloads before proceeding with the rotation of the worker nodes

## Rotating a worker node pool

- [ ] Increase the `instance_count` for the pool by 1 (see the sketch below)
- [ ] Run the `create` command to create the extra node required during the pool rotation
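
For example, a pool that currently has three workers would temporarily go to four (pool name, instance type and location are illustrative):

```yaml
worker_node_pools:
  - name: small-static
    instance_type: cpx31
    instance_count: 4 # temporarily 3 + 1 for the rotation
    location: nbg1
```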

One worker node at a time (excluding the extra node you've just added):

- [ ] Drain a node
- [ ] Delete the drained node both with kubectl and from the Hetzner console (or using the `hcloud` CLI)
- [ ] Rerun the `create` command to recreate the deleted node
- [ ] Verify that all works as expected before proceeding with the next node in the pool

Once all the existing nodes have been rotated:

- [ ] Drain the very last node in the pool (the extra one we added earlier)
- [ ] Verify that all looks good
- [ ] Delete the very last node both with kubectl and from the Hetzner console (or using the `hcloud` CLI)
- [ ] Decrease the `instance_count` for the node pool by 1
- [ ] Proceed with the next pool

## Finalizing

- [ ] Remove the `legacy_instance_type` setting from both master and worker node pools
- [ ] Re-run the `create` command once more to double-check that everything is stable
- [ ] Optionally, convert the current zonal cluster to a regional one with masters in different locations (see [Masters in different locations](Masters_in_different_locations.md))
42 changes: 29 additions & 13 deletions src/cluster/create.cr
@@ -18,7 +18,7 @@ class Cluster::Create
private getter configuration : Configuration::Loader
private getter hetzner_client : Hetzner::Client { configuration.hetzner_client }
private getter settings : Configuration::Main { configuration.settings }
private getter autoscaling_worker_node_pools : Array(Configuration::NodePool) { settings.worker_node_pools.select(&.autoscaling_enabled) }
private getter autoscaling_worker_node_pools : Array(Configuration::WorkerNodePool) { settings.worker_node_pools.select(&.autoscaling_enabled) }
private getter ssh_client : Util::SSH { Util::SSH.new(settings.networking.ssh.private_key_path, settings.networking.ssh.public_key_path) }
private getter network : Hetzner::Network?
private getter ssh_key : Hetzner::SSHKey
@@ -102,16 +102,16 @@ class Cluster::Create
"#{settings.cluster_name}-#{instance_type_part}#{prefix}#{index + 1}"
end

private def create_master_instance(index : Int32, placement_group : Hetzner::PlacementGroup?) : Hetzner::Instance::Create
legacy_instance_type = settings.masters_pool.legacy_instance_type
instance_type = settings.masters_pool.instance_type
private def create_master_instance(index : Int32, placement_group : Hetzner::PlacementGroup?, location : String) : Hetzner::Instance::Create
legacy_instance_type = masters_pool.legacy_instance_type
instance_type = masters_pool.instance_type

legacy_instance_name = build_instance_name(legacy_instance_type, index, true)
instance_name = build_instance_name(instance_type, index, settings.include_instance_type_in_instance_name)

image = settings.masters_pool.image || settings.image
additional_packages = settings.masters_pool.additional_packages || settings.additional_packages
additional_post_create_commands = settings.masters_pool.post_create_commands || settings.post_create_commands
image = masters_pool.image || settings.image
additional_packages = masters_pool.additional_packages || settings.additional_packages
additional_post_create_commands = masters_pool.post_create_commands || settings.post_create_commands

Hetzner::Instance::Create.new(
settings: settings,
@@ -125,16 +125,20 @@ class Cluster::Create
network: network,
placement_group: placement_group,
additional_packages: additional_packages,
additional_post_create_commands: additional_post_create_commands
additional_post_create_commands: additional_post_create_commands,
location: location
)
end

private def initialize_master_instances
masters_pool = settings.masters_pool
placement_group = create_placement_group_for_masters
location_counts = Hash(String, Int32).new(0)

Array(Hetzner::Instance::Create).new(masters_pool.instance_count) do |i|
create_master_instance(i, placement_group)
location = masters_locations.min_by { |loc| location_counts[loc] }
location_counts[location] += 1

create_master_instance(i, placement_group, location)
end
end

@@ -157,7 +161,7 @@
instance_name: instance_name,
instance_type: instance_type,
image: image,
location: node_pool.location,
location: node_pool.location || default_masters_Location,
ssh_key: ssh_key,
network: network,
placement_group: placement_group,
@@ -304,7 +308,7 @@ class Cluster::Create
@load_balancer = Hetzner::LoadBalancer::Create.new(
settings: settings,
hetzner_client: hetzner_client,
location: configuration.masters_location,
location: default_masters_Location,
network_id: network.try(&.id)
).run

@@ -332,10 +336,18 @@ class Cluster::Create
settings: settings,
hetzner_client: hetzner_client,
network_name: settings.cluster_name,
locations: configuration.locations
network_zone: ::Configuration::Settings::NodePool::Location.network_zone_by_location(default_masters_Location)
).run
end

private def masters_locations
masters_pool.locations.sort
end

private def default_masters_Location
masters_locations.first
end

private def find_or_create_network
find_existing_network(settings.networking.private_network.existing_network_name) || create_new_network
end
@@ -359,4 +371,8 @@ class Cluster::Create
settings: settings
).run
end

private def masters_pool
settings.masters_pool
end
end
14 changes: 7 additions & 7 deletions src/configuration/loader.cr
@@ -46,15 +46,15 @@ class Configuration::Loader
Path[settings.kubeconfig_path].expand(home: true).to_s
end

getter masters_location : String | Nil do
settings.masters_pool.try &.location
getter masters_pool : Configuration::MasterNodePool do
settings.masters_pool
end

getter instance_types : Array(Hetzner::InstanceType) do
hetzner_client.instance_types
end

getter locations : Array(Hetzner::Location) do
getter all_locations : Array(Hetzner::Location) do
hetzner_client.locations
end

@@ -135,9 +135,9 @@ class Configuration::Loader
errors: errors,
pool: settings.masters_pool,
pool_type: :masters,
masters_location: masters_location,
masters_pool: masters_pool,
instance_types: instance_types,
locations: locations,
all_locations: all_locations,
datastore: settings.datastore
).validate
end
@@ -172,9 +172,9 @@ class Configuration::Loader
errors: errors,
pool: worker_node_pool,
pool_type: :workers,
masters_location: masters_location,
masters_pool: masters_pool,
instance_types: instance_types,
locations: locations,
all_locations: all_locations,
datastore: settings.datastore
).validate
end
7 changes: 4 additions & 3 deletions src/configuration/main.cr
@@ -1,6 +1,7 @@
require "yaml"

require "./node_pool"
require "./master_node_pool"
require "./worker_node_pool"
require "./datastore"
require "./manifests"
require "./embedded_registry_mirror"
@@ -15,8 +16,8 @@ class Configuration::Main
getter k3s_version : String
getter api_server_hostname : String?
getter schedule_workloads_on_masters : Bool = false
getter masters_pool : Configuration::NodePool
getter worker_node_pools : Array(Configuration::NodePool) = [] of Configuration::NodePool
getter masters_pool : Configuration::MasterNodePool
getter worker_node_pools : Array(Configuration::WorkerNodePool) = [] of Configuration::WorkerNodePool
getter post_create_commands : Array(String) = [] of String
getter additional_packages : Array(String) = [] of String
getter kube_api_server_args : Array(String) = [] of String
5 changes: 5 additions & 0 deletions src/configuration/master_node_pool.cr
@@ -0,0 +1,5 @@
require "./node_pool"

class Configuration::MasterNodePool < Configuration::NodePool
property locations : Array(String) = ["fsn1"] of String
end
3 changes: 1 addition & 2 deletions src/configuration/node_pool.cr
@@ -4,13 +4,12 @@ require "./node_label"
require "./node_taint"
require "./autoscaling"

class Configuration::NodePool
abstract class Configuration::NodePool
include YAML::Serializable

property name : String?
property legacy_instance_type : String = ""
property instance_type : String
property location : String
property image : String | Int64 | Nil
property instance_count : Int32 = 1
property labels : Array(::Configuration::NodeLabel) = [] of ::Configuration::NodeLabel