You can use a saved etcd backup to restore your cluster to a previous state.
Prerequisites

- Access to the cluster as a user with the cluster-admin role.
- SSH access to master hosts.
- An etcd backup file that includes an etcd snapshot and static Kubernetes API server resources. This backup file must be in the format snapshot_db_kuberesources_<datetimestamp>.tar.gz.

  Note: You must use the same etcd backup file on all master hosts in the cluster.
Procedure

- Prepare each master host in your cluster to be restored.

  You should run the restore script on all of your master hosts within a short period of time so that the cluster members come up at about the same time and form a quorum. For this reason, it is recommended to stage each master host in a separate terminal, so that the restore script can be started quickly on each.
  - Copy the etcd backup file to a master host.

    This procedure assumes that you copied the snapshot_db_kuberesources_<datetimestamp>.tar.gz file, which contains the etcd snapshot and static Kubernetes API server resources, to the /home/core/ directory of your master host.
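    For example, you might copy the backup over SSH as the core user. This is a minimal sketch; the key path and master hostname are placeholders for your own values, and it assumes the master is directly reachable from the machine holding the backup.

      $ scp -i ~/.ssh/id_rsa snapshot_db_kuberesources_<datetimestamp>.tar.gz core@ip-10-0-143-125.ec2.internal:/home/core/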
  - Access the master host.
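    For example, with SSH access as the core user (the key path and hostname are placeholders):

      $ ssh -i ~/.ssh/id_rsa core@ip-10-0-143-125.ec2.internal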
  - Set the INITIAL_CLUSTER variable to the list of members in the format <name>=<url>. This variable is passed to the restore script and must be exactly the same for each member.

      [core@ip-10-0-143-125 ~]$ export INITIAL_CLUSTER="etcd-member-ip-10-0-143-125.ec2.internal=https://etcd-0.clustername.devcluster.openshift.com:2380,etcd-member-ip-10-0-35-108.ec2.internal=https://etcd-1.clustername.devcluster.openshift.com:2380,etcd-member-ip-10-0-10-16.ec2.internal=https://etcd-2.clustername.devcluster.openshift.com:2380"
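    If you need to look up the member peer URLs, one option is to query the etcd SRV records for the cluster domain. This is a sketch that assumes your cluster publishes the standard _etcd-server-ssl records and that they resolve from where you run it; the domain is a placeholder.

      [core@ip-10-0-143-125 ~]$ dig +short srv _etcd-server-ssl._tcp.clustername.devcluster.openshift.com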
  - Repeat these steps on your other master hosts, each in a separate terminal. Be sure to use the same etcd backup file on each master host.
- Run the restore script on all of your master hosts.
  - Start the etcd-snapshot-restore.sh script on your first master host. Pass in two parameters: the path to the etcd backup file and the list of members, which is defined by the INITIAL_CLUSTER variable.

      [core@ip-10-0-143-125 ~]$ sudo /usr/local/bin/etcd-snapshot-restore.sh /home/core/snapshot_db_kuberesources_<datetimestamp>.tar.gz $INITIAL_CLUSTER
      Creating asset directory ./assets
      Downloading etcdctl binary..
      etcdctl version: 3.3.10
      API version: 3.3
      Backing up /etc/kubernetes/manifests/etcd-member.yaml to ./assets/backup/
      Stopping all static pods..
      ..stopping kube-scheduler-pod.yaml
      ..stopping kube-controller-manager-pod.yaml
      ..stopping kube-apiserver-pod.yaml
      ..stopping etcd-member.yaml
      Stopping etcd..
      Waiting for etcd-member to stop
      Stopping kubelet..
      Stopping all containers..
      bd44e4bc942276eb1a6d4b48ecd9f5fe95570f54aa9c6b16939fa2d9b679e1ea
      d88defb9da5ae623592b81619e3690faeb4fa645440e71c029812cb960ff586f
      3920ced20723064a379739c4a586f909497a7b6705a5b3cf367d9b930f23a5f1
      d470f7a2d962c90f3a21bcc021970bde96bc8908f317ec70f1c21720b322c25c
      Backing up etcd data-dir..
      Removing etcd data-dir /var/lib/etcd
      Restoring etcd member etcd-member-ip-10-0-143-125.ec2.internal from snapshot..
      2019-05-15 19:03:34.647589 I | pkg/netutil: resolving etcd-0.clustername.devcluster.openshift.com:2380 to 10.0.143.125:2380
      2019-05-15 19:03:34.883545 I | mvcc: restore compact to 361491
      2019-05-15 19:03:34.915679 I | etcdserver/membership: added member cbe982c74cbb42f [https://etcd-0.clustername.devcluster.openshift.com:2380] to cluster 807ae3bffc8d69ca
      Starting static pods..
      ..starting kube-scheduler-pod.yaml
      ..starting kube-controller-manager-pod.yaml
      ..starting kube-apiserver-pod.yaml
      ..starting etcd-member.yaml
      Starting kubelet..
  - Once the restore starts, run the script on your other master hosts.
- Verify that the Machine Configs have been applied.

  In a terminal that has access to the cluster as a cluster-admin user, run the following command:

    $ oc get machineconfigpool
    NAME     CONFIG                                             UPDATED   UPDATING
    master   rendered-master-50e7e00374e80b767fcc922bdfbc522b   True      False
  When the snapshot has been applied, the currentConfig of the master will match the ID from when the etcd snapshot was taken. The currentConfig name for masters is in the format rendered-master-<currentConfig>.
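  If you want to read just the rendered config name for the master pool, one option is a jsonpath query. This is a minimal sketch; the field path assumes the usual MachineConfigPool status layout, and the output shown reuses the example value from the table above.

    $ oc get machineconfigpool master -o jsonpath='{.status.configuration.name}'
    rendered-master-50e7e00374e80b767fcc922bdfbc522b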
- Verify that all master hosts have started and joined the cluster.
  - Access a master host and connect to the running etcd container.

      [core@ip-10-0-143-125 ~]$ id=$(sudo crictl ps --name etcd-member | awk 'FNR==2{ print $1}') && sudo crictl exec -it $id /bin/sh
  - In the etcd container, export the variables needed for connecting to etcd.

      sh-4.3# export ETCDCTL_API=3 ETCDCTL_CACERT=/etc/ssl/etcd/ca.crt ETCDCTL_CERT=$(find /etc/ssl/ -name '*peer*crt') ETCDCTL_KEY=$(find /etc/ssl/ -name '*peer*key')
  - In the etcd container, execute etcdctl member list and verify that the three members show as started.

      sh-4.3# etcdctl member list -w table
      +------------------+---------+------------------------------------------+----------------------------------------------------------+---------------------------+
      |        ID        | STATUS  |                   NAME                   |                        PEER ADDRS                        |        CLIENT ADDRS       |
      +------------------+---------+------------------------------------------+----------------------------------------------------------+---------------------------+
      | 29e461db6be4eaaa | started | etcd-member-ip-10-0-164-170.ec2.internal | https://etcd-2.clustername.devcluster.openshift.com:2380 | https://10.0.164.170:2379 |
      |  cbe982c74cbb42f | started | etcd-member-ip-10-0-143-125.ec2.internal | https://etcd-0.clustername.devcluster.openshift.com:2380 | https://10.0.143.125:2379 |
      | a752f80bcb0da3e8 | started | etcd-member-ip-10-0-156-2.ec2.internal   | https://etcd-1.clustername.devcluster.openshift.com:2380 | https://10.0.156.2:2379   |
      +------------------+---------+------------------------------------------+----------------------------------------------------------+---------------------------+
    It may take up to 20 minutes for each new member to start.
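    As an optional extra check, you can ask etcd for the health of every member. This is a sketch; it assumes your etcdctl build supports the --cluster flag (added in the 3.3 series), which queries all endpoints taken from the member list.

      sh-4.3# etcdctl endpoint health --cluster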