# Disaster Recovery

## Backup
The state of a Hopsworks cluster is split between data and metadata and distributed across multiple services. This section explains how to take consistent backups of the offline and online feature stores as well as the cluster metadata.

In Hopsworks, a consistent backup should cover the following services:

* **RonDB**: cluster metadata and the online feature store data.
* **HopsFS**: offline feature store data plus checkpoints and logs for feature engineering applications.
* **OpenSearch**: search metadata, logs, dashboards, and user embeddings.
* **Kubernetes objects**: cluster credentials, backup metadata, serving metadata, and project namespaces with their service accounts, roles, secrets, and configmaps.

Besides the services above, Hopsworks also uses Apache Kafka, which carries in-flight data on its way to the online feature store. In the event of a total cluster loss, running jobs with in-flight data must be replayed.

### Prerequisites
When backups are enabled in Hopsworks, cron jobs are configured for RonDB and OpenSearch. For HopsFS, backups rely on versioning in the object store. For Kubernetes objects, Hopsworks uses Velero to snapshot the required resources. Before enabling backups:

- Enable versioning on the S3-compatible bucket used for HopsFS (see the example below).
- Install and configure Velero with the AWS plugin (S3).
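
For the first prerequisite, the sketch below shows one way to enable versioning with the AWS CLI; the bucket name is a placeholder, and S3-compatible stores usually expose the same API through their own endpoint (add `--endpoint-url` in that case).

```bash
# Enable object versioning on the bucket backing HopsFS (BUCKET_NAME is a placeholder).
aws s3api put-bucket-versioning \
  --bucket BUCKET_NAME \
  --versioning-configuration Status=Enabled

# Confirm that versioning is now enabled.
aws s3api get-bucket-versioning --bucket BUCKET_NAME
```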

#### Install Velero
Velero provides backup and restore for Kubernetes resources. Install it with either the Velero CLI or Helm (see the Velero docs [here](https://velero.io/docs/v1.17/basic-install/)).

- Using the Velero CLI, set up the CRDs and deployment:
```bash
velero install \
  --plugins velero/velero-plugin-for-aws:v1.13.0 \
  --no-default-backup-location \
  --no-secret \
  --use-volume-snapshots=false \
  --wait
```

- Using the Velero Helm chart:
```bash
helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
helm repo update

helm install velero vmware-tanzu/velero \
  --namespace velero \
  --create-namespace \
  --set "initContainers[0].name=velero-plugin-for-aws" \
  --set "initContainers[0].image=velero/velero-plugin-for-aws:v1.13.0" \
  --set "initContainers[0].volumeMounts[0].mountPath=/target" \
  --set "initContainers[0].volumeMounts[0].name=plugins" \
  --set-json configuration.backupStorageLocation='[]' \
  --set "credentials.useSecret=false" \
  --set "snapshotsEnabled=false" \
  --wait
```
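
Whichever installation method you use, it is worth checking that the Velero server is running and that the AWS object store plugin is registered before enabling backups. A quick sanity check, assuming the `velero` namespace used above:

```bash
# The Velero server deployment should be available in the velero namespace.
kubectl get deployment velero -n velero

# The AWS plugin should be listed among the registered object store plugins.
velero plugin get

# Client and server versions confirm the CLI can reach the server.
velero version
```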

### Configuring Backup
!!! Note
    Backup is only supported for clusters that use S3-compatible object storage.

You can enable backups during installation or a later upgrade. Set the schedule with a cron expression in the values file:

```yaml
global:
  _hopsworks:
    backups:
      enabled: true
      schedule: "@weekly"
```
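
The `schedule` field accepts a standard five-field cron expression (for example `"0 2 * * 0"` for 02:00 every Sunday) as well as macros such as `@weekly`. As a sketch, assuming the cluster was installed from the Hopsworks Helm chart with release name `hopsworks` in namespace `hopsworks` (both placeholders), the new values can be rolled out with a Helm upgrade:

```bash
# Apply the backup settings on an existing installation; release name, namespace,
# chart reference, and values file name are assumptions for this example.
helm upgrade hopsworks hopsworks/hopsworks \
  --namespace hopsworks \
  --reuse-values \
  -f backup-values.yaml
```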

After configuring backups, go to the cluster settings and open the Backup tab. You should see `enabled` at the top level and for all services if everything is configured correctly.

<figure>
  <img width="800px" src="../../../../assets/images/admin/ha_dr/backup.png" alt="Backup overview page"/>
  <figcaption>Backup overview page</figcaption>
</figure>

If any service is misconfigured, the backup status shows as `partial`. In the example below, Velero is disabled because it was not configured correctly. Fix partial backups before relying on them for recovery.

<figure>
  <img width="800px" src="../../../../assets/images/admin/ha_dr/backup_partial.png" alt="Backup overview page (partial setup)"/>
  <figcaption>Backup overview page (partial setup)</figcaption>
</figure>
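
Besides the UI, you can confirm from the command line that the scheduled backups exist and have started producing backups. A sketch using the Velero schedule names referenced later in this guide, and assuming the Hopsworks services run in the `hopsworks` namespace:

```bash
# Velero schedules created for the Kubernetes object backups (e.g. k8s-backups-main).
velero schedule get

# Backups produced so far by those schedules.
velero backup get

# Cron jobs created for the RonDB and OpenSearch backups (namespace is a placeholder).
kubectl get cronjobs -n hopsworks
```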

## Restore
!!! Note
    Restore is only supported into a newly created cluster; in-place restore is not supported.

The restore process has two phases:

- Restore the Kubernetes objects required for the cluster restore.
- Install the cluster with Helm using the correct backup IDs.

### Restore Kubernetes objects
Restore the Kubernetes objects that were backed up using Velero.

- Ensure that Velero is installed and configured with the AWS plugin as described in the [prerequisites](#prerequisites).
- Set up a [Velero backup storage location](https://velero.io/docs/v1.17/api-types/backupstoragelocation/) that points to the S3 bucket.
    - If you are using AWS S3:
      ```bash
      kubectl apply -f - <<EOF
      apiVersion: velero.io/v1
      kind: BackupStorageLocation
      metadata:
        name: hopsworks-bsl
        namespace: velero
      spec:
        provider: aws
        config:
          region: REGION
        objectStorage:
          bucket: BUCKET_NAME
          prefix: k8s_backup
      EOF
      ```
    - If you are using S3-compatible object storage, provide the credentials and the endpoint:
      ```bash
      cat << EOF > bsl-credentials
      [default]
      aws_access_key_id=YOUR_ACCESS_KEY
      aws_secret_access_key=YOUR_SECRET_KEY
      EOF

      kubectl create secret generic -n velero bsl-credentials --from-file=cloud=bsl-credentials

      kubectl apply -f - <<EOF
      apiVersion: velero.io/v1
      kind: BackupStorageLocation
      metadata:
        name: hopsworks-bsl
        namespace: velero
      spec:
        provider: aws
        config:
          region: REGION
          s3Url: ENDPOINT
        credential:
          key: cloud
          name: bsl-credentials
        objectStorage:
          bucket: BUCKET_NAME
          prefix: k8s_backup
      EOF
      ```
- After the backup storage location becomes available, restore the backups. The following script restores the latest available backup. To restore a specific backup, set `backupName` instead of `scheduleName`.

```bash
echo "=== Waiting for Velero BackupStorageLocation hopsworks-bsl to become Available ==="
until [ "$(kubectl get backupstoragelocations hopsworks-bsl -n velero -o jsonpath='{.status.phase}' 2>/dev/null)" = "Available" ]; do
  echo "Still waiting..."; sleep 5;
done

echo "=== Waiting for Velero to sync the backups from hopsworks-bsl ==="
until [ "$(kubectl get backups -n velero -ojson | jq -r '[.items[] | select(.spec.storageLocation == "hopsworks-bsl")] | length' 2>/dev/null)" != "0" ]; do
  echo "Still waiting..."; sleep 5;
done

# Restores the latest backup; set backupName instead of scheduleName to restore a specific one.
echo "=== Creating Velero Restore object for k8s-backups-main ==="
RESTORE_SUFFIX=$(date +%s)
kubectl apply -f - <<EOF
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: k8s-backups-main-restore-$RESTORE_SUFFIX
  namespace: velero
spec:
  scheduleName: k8s-backups-main
EOF

echo "=== Waiting for Velero restore to finish ==="
until [ "$(kubectl get restore k8s-backups-main-restore-$RESTORE_SUFFIX -n velero -o jsonpath='{.status.phase}' 2>/dev/null)" = "Completed" ]; do
  echo "Still waiting..."; sleep 5;
done

# Restores the latest backup; set backupName instead of scheduleName to restore a specific one.
echo "=== Creating Velero Restore object for k8s-backups-users-resources ==="
kubectl apply -f - <<EOF
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: k8s-backups-users-resources-restore-$RESTORE_SUFFIX
  namespace: velero
spec:
  scheduleName: k8s-backups-users-resources
EOF

echo "=== Waiting for Velero restore to finish ==="
until [ "$(kubectl get restore k8s-backups-users-resources-restore-$RESTORE_SUFFIX -n velero -o jsonpath='{.status.phase}' 2>/dev/null)" = "Completed" ]; do
  echo "Still waiting..."; sleep 5;
done
```
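
Once the script finishes, you can double-check that both restores completed without errors and that the expected project namespaces, secrets, and configmaps are back:

```bash
# Both Restore objects should report phase "Completed".
velero restore get

# Inspect warnings and errors of a specific restore in detail
# (replace the suffix with the one printed by the script above).
velero restore describe k8s-backups-main-restore-<RESTORE_SUFFIX> --details

# The restored project namespaces should now be present.
kubectl get namespaces
```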

### Restore on Cluster Installation
To restore a cluster during installation, configure the backup ID in the values YAML file:

```yaml
global:
  _hopsworks:
    backups:
      enabled: true
      schedule: "@weekly"
      restoreFromBackup:
        backupId: "254811200"
```
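
The backup ID is the identifier of the backup you want to restore; `254811200` above is just the example value used in this guide. With the values file prepared, the installation itself is a regular Helm install, sketched here with placeholder release name, namespace, chart reference, and values file:

```bash
# Install the new cluster and restore it from the configured backup ID.
helm install hopsworks hopsworks/hopsworks \
  --namespace hopsworks \
  --create-namespace \
  -f restore-values.yaml \
  --timeout 60m \
  --wait
```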

#### Customizations
!!! Warning
    Even if you override the backup IDs for RonDB and OpenSearch, you must still set `.global._hopsworks.restoreFromBackup.backupId` to ensure HopsFS is restored.

To restore a different backup ID for RonDB:

```yaml
global:
  _hopsworks:
    backups:
      enabled: true
      schedule: "@weekly"
      restoreFromBackup:
        backupId: "254811200"

rondb:
  rondb:
    restoreFromBackup:
      backupId: "254811140"
```

To restore a different backup for OpenSearch:

```yaml
global:
  _hopsworks:
    backups:
      enabled: true
      schedule: "@weekly"
      restoreFromBackup:
        backupId: "254811200"

olk:
  opensearch:
    restore:
      repositories:
        default:
          snapshots:
            default:
              snapshot_name: "254811140"
```
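
If you need to check which snapshot names are available before picking one, you can query the snapshot API of the repository. This is a sketch that assumes the repository is named `default` (as in the values above) and that OpenSearch is reachable from where you run the command; host, port, and credentials are placeholders:

```bash
# List all snapshots in the "default" repository.
curl -sk -u ADMIN_USER:ADMIN_PASSWORD \
  "https://OPENSEARCH_HOST:9200/_snapshot/default/_all" | jq -r '.snapshots[].snapshot'
```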

You can also customize the OpenSearch restore process to skip specific indices:

```yaml
global:
  _hopsworks:
    backups:
      enabled: true
      schedule: "@weekly"
      restoreFromBackup:
        backupId: "254811200"

olk:
  opensearch:
    restore:
      repositories:
        default:
          snapshots:
            default:
              snapshot_name: "254811140"
              payload:
                indices: "-myindex"
```