Commit e2c8184: [CLOUD-785] Rewrite backups page
1 parent 537d4ff

4 files changed, +246 -122 lines

docs/setup_installation/admin/ha-dr/dr.md

Lines changed: 234 additions & 107 deletions

# Disaster Recovery

## Backup

The state of a Hopsworks cluster is split between data and metadata and distributed across multiple services. This section explains how to take consistent backups of the offline and online feature stores as well as the cluster metadata.

In Hopsworks, a consistent backup should cover the following services:

* **RonDB**: cluster metadata and the online feature store data.
* **HopsFS**: offline feature store data plus checkpoints and logs for feature engineering applications.
* **Opensearch**: search metadata, logs, dashboards, and user embeddings.
* **Kubernetes objects**: cluster credentials, backup metadata, serving metadata, and project namespaces with their service accounts, roles, secrets, and configmaps.

Besides the services above, Hopsworks also uses Apache Kafka, which carries in-flight data heading to the online feature store. In the event of a total cluster loss, running jobs with in-flight data must be replayed.

### Prerequisites

When you enable backups in Hopsworks, cron jobs are configured for RonDB and Opensearch. For HopsFS, backups rely on versioning in the object store. For Kubernetes objects, Hopsworks uses Velero to snapshot the required resources. Before enabling backups:

- Enable versioning on the S3-compatible bucket used for HopsFS.
- Install and configure Velero with the AWS plugin (S3).
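As a sketch of the first prerequisite, versioning can be enabled on an AWS S3 bucket with the AWS CLI. `BUCKET_NAME` is a placeholder; for S3-compatible stores the same API is typically available via a custom `--endpoint-url`. This is a cloud-side configuration step, not something Hopsworks runs for you:

```shell
# Enable object versioning on the bucket backing HopsFS (BUCKET_NAME is a placeholder).
aws s3api put-bucket-versioning \
  --bucket BUCKET_NAME \
  --versioning-configuration Status=Enabled

# Verify: the returned JSON should report "Status": "Enabled".
aws s3api get-bucket-versioning --bucket BUCKET_NAME
```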

#### Install Velero

Velero provides backup and restore for Kubernetes resources. Install it with either the Velero CLI or Helm (see the [Velero documentation](https://velero.io/docs/v1.17/basic-install/)).

- Using the Velero CLI, set up the CRDs and the deployment:

    ```bash
    velero install \
      --plugins velero/velero-plugin-for-aws:v1.13.0 \
      --no-default-backup-location \
      --no-secret \
      --use-volume-snapshots=false \
      --wait
    ```

- Using the Velero Helm chart:

    ```bash
    helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
    helm repo update

    helm install velero vmware-tanzu/velero \
      --namespace velero \
      --create-namespace \
      --set "initContainers[0].name=velero-plugin-for-aws" \
      --set "initContainers[0].image=velero/velero-plugin-for-aws:v1.13.0" \
      --set "initContainers[0].volumeMounts[0].mountPath=/target" \
      --set "initContainers[0].volumeMounts[0].name=plugins" \
      --set-json configuration.backupStorageLocation='[]' \
      --set "credentials.useSecret=false" \
      --set "snapshotsEnabled=false" \
      --wait
    ```

### Configuring Backup

!!! Note
    Backup is only supported for clusters that use S3-compatible object storage.

You can enable backups during installation or during a later upgrade. Set the schedule with a cron expression in the values file:

```yaml
global:
  _hopsworks:
    backups:
      enabled: true
      schedule: "@weekly"
```
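The `schedule` field takes a cron expression; the example above uses the `@weekly` shortcut. As an illustration, a hypothetical helper (not part of Hopsworks) can sanity-check a schedule string before it goes into the values file, accepting the common `@`-shortcuts or a plain five-field expression:

```python
# Hypothetical sanity check for the backup schedule string; not part of
# Hopsworks. Accepts common cron shortcuts or a five-field cron expression.
CRON_SHORTCUTS = {"@hourly", "@daily", "@weekly", "@monthly", "@yearly", "@annually"}

def is_plausible_schedule(expr: str) -> bool:
    expr = expr.strip()
    if expr in CRON_SHORTCUTS:
        return True
    # Five whitespace-separated fields: minute hour day-of-month month day-of-week
    return len(expr.split()) == 5

print(is_plausible_schedule("@weekly"))    # True
print(is_plausible_schedule("0 3 * * 0"))  # True: every Sunday at 03:00
print(is_plausible_schedule("weekly"))     # False
```

This only checks the shape of the expression, not field ranges; a full validator would parse each field.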

After configuring backups, go to the cluster settings and open the Backup tab. If everything is configured correctly, you should see `enabled` at the top level and for all services.

<figure>
  <img width="800px" src="../../../../assets/images/admin/ha_dr/backup.png" alt="Backup overview page"/>
  <figcaption>Backup overview page</figcaption>
</figure>

If any service is misconfigured, the overall backup status shows as `partial`. In the example below, Velero is disabled because it was not configured correctly. Fix partial backups before relying on them for recovery.

<figure>
  <img width="800px" src="../../../../assets/images/admin/ha_dr/backup_partial.png" alt="Backup overview page (partial setup)"/>
  <figcaption>Backup overview page (partial setup)</figcaption>
</figure>

## Restore

!!! Note
    Restore is only supported in a newly created cluster; in-place restore is not supported.

The restore process has two phases:

- Restore the Kubernetes objects required for the cluster restore.
- Install the cluster with Helm using the correct backup IDs.

### Restore Kubernetes objects

Restore the Kubernetes objects that were backed up with Velero:

- Ensure that Velero is installed and configured with the AWS plugin as described in the [prerequisites](#prerequisites).
- Set up a [Velero backup storage location](https://velero.io/docs/v1.17/api-types/backupstoragelocation/) pointing to the S3 bucket.
- If you are using AWS S3:

    ```bash
    kubectl apply -f - <<EOF
    apiVersion: velero.io/v1
    kind: BackupStorageLocation
    metadata:
      name: hopsworks-bsl
      namespace: velero
    spec:
      provider: aws
      config:
        region: REGION
      objectStorage:
        bucket: BUCKET_NAME
        prefix: k8s_backup
    EOF
    ```
- If you are using an S3-compatible object storage, provide the credentials and endpoint:

    ```bash
    cat << EOF > bsl-credentials
    [default]
    aws_access_key_id=YOUR_ACCESS_KEY
    aws_secret_access_key=YOUR_SECRET_KEY
    EOF

    kubectl create secret generic -n velero bsl-credentials --from-file=cloud=bsl-credentials

    kubectl apply -f - <<EOF
    apiVersion: velero.io/v1
    kind: BackupStorageLocation
    metadata:
      name: hopsworks-bsl
      namespace: velero
    spec:
      provider: aws
      config:
        region: REGION
        s3Url: ENDPOINT
      credential:
        key: cloud
        name: bsl-credentials
      objectStorage:
        bucket: BUCKET_NAME
        prefix: k8s_backup
    EOF
    ```
- After the backup storage location becomes available, restore the backups. The following script restores the latest available backup; to restore a specific backup, set `backupName` instead of `scheduleName`.

    ```bash
    echo "=== Waiting for Velero BackupStorageLocation hopsworks-bsl to become Available ==="
    until [ "$(kubectl get backupstoragelocations hopsworks-bsl -n velero -o jsonpath='{.status.phase}' 2>/dev/null)" = "Available" ]; do
      echo "Still waiting..."; sleep 5;
    done

    echo "=== Waiting for Velero to sync the backups from hopsworks-bsl ==="
    until [ "$(kubectl get backups -n velero -ojson | jq -r '[.items[] | select(.spec.storageLocation == "hopsworks-bsl")] | length' 2>/dev/null)" != "0" ]; do
      echo "Still waiting..."; sleep 5;
    done

    # Restores the latest backup; set backupName instead if a specific backup is needed
    echo "=== Creating Velero Restore object for k8s-backups-main ==="
    RESTORE_SUFFIX=$(date +%s)
    kubectl apply -f - <<EOF
    apiVersion: velero.io/v1
    kind: Restore
    metadata:
      name: k8s-backups-main-restore-$RESTORE_SUFFIX
      namespace: velero
    spec:
      scheduleName: k8s-backups-main
    EOF

    echo "=== Waiting for Velero restore to finish ==="
    until [ "$(kubectl get restore k8s-backups-main-restore-$RESTORE_SUFFIX -n velero -o jsonpath='{.status.phase}' 2>/dev/null)" = "Completed" ]; do
      echo "Still waiting..."; sleep 5;
    done

    # Restores the latest backup; set backupName instead if a specific backup is needed
    echo "=== Creating Velero Restore object for k8s-backups-users-resources ==="
    kubectl apply -f - <<EOF
    apiVersion: velero.io/v1
    kind: Restore
    metadata:
      name: k8s-backups-users-resources-restore-$RESTORE_SUFFIX
      namespace: velero
    spec:
      scheduleName: k8s-backups-users-resources
    EOF

    echo "=== Waiting for Velero restore to finish ==="
    until [ "$(kubectl get restore k8s-backups-users-resources-restore-$RESTORE_SUFFIX -n velero -o jsonpath='{.status.phase}' 2>/dev/null)" = "Completed" ]; do
      echo "Still waiting..."; sleep 5;
    done
    ```
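The polling loops above all share the same shape. As a hypothetical refactor (not part of the shipped script), they can be factored into a generic helper with a timeout, so a restore that never completes fails loudly instead of hanging forever:

```shell
# Hypothetical helper: poll a condition command until it succeeds, or fail
# after WAIT_TIMEOUT seconds (default 300).
wait_for() {
  desc="$1"; shift
  timeout="${WAIT_TIMEOUT:-300}"; elapsed=0
  until "$@"; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "Timed out waiting for $desc" >&2
      return 1
    fi
    echo "Still waiting for $desc..."
    sleep 5
    elapsed=$((elapsed + 5))
  done
}

# Example usage, mirroring the first loop of the script above:
# wait_for "BSL hopsworks-bsl" sh -c \
#   '[ "$(kubectl get backupstoragelocations hopsworks-bsl -n velero -o jsonpath="{.status.phase}" 2>/dev/null)" = "Available" ]'
```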

### Restore on cluster installation

To restore a cluster during installation, configure the backup ID in the values YAML file:

```yaml
global:
  _hopsworks:
    backups:
      enabled: true
      schedule: "@weekly"
      restoreFromBackup:
        backupId: "254811200"
```

#### Customizations

!!! Warning
    Even if you override the backup IDs for RonDB and Opensearch, you must still set `.global._hopsworks.restoreFromBackup.backupId` to ensure HopsFS is restored.

To restore a different backup ID for RonDB:

```yaml
global:
  _hopsworks:
    backups:
      enabled: true
      schedule: "@weekly"
      restoreFromBackup:
        backupId: "254811200"

rondb:
  rondb:
    restoreFromBackup:
      backupId: "254811140"
```
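Helm deep-merges these values with the chart defaults key by key, which is why the `rondb` override can coexist with the global backup ID. A minimal sketch of that merge behavior (illustrative only; Helm's actual value coalescing has additional rules):

```python
# Illustrative deep-merge of Helm-style values: override keys win, and
# nested dictionaries are merged recursively rather than replaced.
def deep_merge(base: dict, override: dict) -> dict:
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

base = {"global": {"_hopsworks": {"backups": {"restoreFromBackup": {"backupId": "254811200"}}}}}
override = {"rondb": {"rondb": {"restoreFromBackup": {"backupId": "254811140"}}}}
values = deep_merge(base, override)

print(values["rondb"]["rondb"]["restoreFromBackup"]["backupId"])                  # 254811140
print(values["global"]["_hopsworks"]["backups"]["restoreFromBackup"]["backupId"]) # 254811200
```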

To restore a different backup for Opensearch:

```yaml
global:
  _hopsworks:
    backups:
      enabled: true
      schedule: "@weekly"
      restoreFromBackup:
        backupId: "254811200"

olk:
  opensearch:
    restore:
      repositories:
        default:
          snapshots:
            default:
              snapshot_name: "254811140"
```

You can also customize the Opensearch restore process to skip specific indices:

```yaml
global:
  _hopsworks:
    backups:
      enabled: true
      schedule: "@weekly"
      restoreFromBackup:
        backupId: "254811200"

olk:
  opensearch:
    restore:
      repositories:
        default:
          snapshots:
            default:
              snapshot_name: "254811140"
              payload:
                indices: "-myindex"
```

The `indices` value is passed to the Opensearch snapshot restore API; a leading `-` excludes the matching indices from the restore.
