chore(design): adding pv migration proposal #336

Merged · 7 commits · Feb 20, 2024
design/pv-migration.md (87 additions, 0 deletions)

---
title: Volume Migration for ZFS-LocalPV
authors:
- "@pawanpraka1"
owners:
- "@kmova"
creation-date: 2021-05-21
last-updated: 2021-05-21
---

# Volume Migration for ZFS-LocalPV

## Table of Contents

* [Table of Contents](#table-of-contents)
* [Summary](#summary)
* [Problem](#problem)
* [Current Solution](#current-solution)
* [Proposal](#proposal)
* [Keys Per ZPOOL](#keys-per-zpool)
* [Migrator](#migrator)
* [Workflow](#workflow)
* [Implementation Plan](#implementation-plan)

## Summary

This is a design proposal for a feature to migrate a volume from one node to another. The document describes how the persistent volumes and the applications using them can be moved to another node for the ZFS-LocalPV CSI driver. The design expects that, as part of replacing a node, the administrator will move the disks to the new node and import the pool there. The goal is that the volume, and the pod consuming it, get moved to the new node. The design also assumes that administrators do not have a large number of ZFS pools configured on a single node.

## Problem

The problem with LocalPV is that the affinity is set on the PersistentVolume object, which makes the k8s scheduler place the pods on that node because the data is available only there. The ZFS-LocalPV driver uses the node name to set this affinity, which creates the problem here: if we replace a node, the node name changes and the k8s scheduler will not be able to schedule the pods to the new node even after the disks have been moved there.
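
The constraint can be seen directly on the PV object. The check below is only illustrative: the PV name is hypothetical and the exact topology key used in the affinity depends on the driver version and configuration.

```
# illustrative check with a hypothetical PV name: the PV provisioned by the
# ZFS-LocalPV driver carries a required node affinity that pins it to the
# original node, so pods using it cannot be scheduled anywhere else
$ kubectl get pv pvc-34133838-0d0d-11ea-96e3-42010a800114 -o yaml | grep -A 8 nodeAffinity
```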

## Current Solution

The current solution depends on the `openebs.io/nodeid` node label. The admin has to set the same label on the replacement node, which allows the k8s scheduler to automatically schedule the old pods to this new node. You can read more about the solution [here](https://github.com/openebs/zfs-localpv/blob/master/docs/faq.md#8-how-to-migrate-pvs-to-the-new-node-in-case-old-node-is-not-accessible).

The problem with the above approach is that we cannot move the volumes to an existing node, since a node can carry only one label value for a given key. For any existing node in the cluster, the `openebs.io/nodeid` label will already be set, so we cannot update it to a new value: the pods already running on that node would no longer be schedulable there.
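
For reference, the current approach boils down to copying the old node's `openebs.io/nodeid` value onto the replacement node. The node name below is hypothetical; the value matches the `openebs.io/nodeid=node1` label shown in the example later in this document.

```
# current approach (for reference only): label the replacement node with the
# nodeid of the node being replaced so the scheduler can place the old pods there
$ kubectl label node new-node-1 openebs.io/nodeid=node1
```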


## Proposal

### Keys Per ZPOOL

We are proposing a label key dedicated to each ZFS pool. The ZFS-LocalPV driver will use this key to label the nodes where the pool is present. This allows a ZFS pool to move from any node to any other node, because the key is tied to the pool rather than to the node. Concretely, the node hosting a pool will carry a `guid.openebs.io/<pool-guid>=true` label. Assuming admins do not have a large number of pools on a node, this does not add many labels per node.
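
As a rough sketch of how this label could be derived and applied (the driver would do this internally; the CLI commands below are only illustrative), the pool GUID is a standard ZFS pool property:

```
# sketch: read the pool's guid property and label the hosting node with it
$ zpool get -H -o value guid pool1
14820954593456176137
$ kubectl label node node-1 guid.openebs.io/14820954593456176137=true
```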

### Migrator

For ZFS-LocalPV the ZFS pool name is expected to be the same across all the nodes, so if we move a pool to an existing node that already has a pool of that name, we have to rename the incoming pool. We therefore need a Migrator workflow to update the pool name in the ZFSVolume objects: it finds all the volumes present in a ZFS pool on that node and updates each ZFSVolume object with the correct PoolName.

**Note:** We cannot edit the PV's volumeAttributes with the new pool name, as it is an immutable field.

The migrator will iterate over all the volumes of all the pools present on the node, look up the corresponding ZFSVolume object for each, and update it with the correct PoolName and OwnerNodeID fields.
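
A minimal sketch of the per-volume work, expressed as CLI commands rather than the migrator's real Kubernetes API calls. The renamed pool, volume name, and namespace are hypothetical, and the spec field names are assumed to be `poolName` and `ownerNodeID`, matching the PoolName/OwnerNodeID references above.

```
# sketch: pool1 was re-imported on node-2 as pool1old; list the datasets in it
$ zfs list -H -o name -r pool1old
pool1old
pool1old/pvc-34133838-0d0d-11ea-96e3-42010a800114
# patch the matching ZFSVolume with the new pool name and the new owner node
$ kubectl patch zfsvolume pvc-34133838-0d0d-11ea-96e3-42010a800114 -n openebs \
    --type merge -p '{"spec":{"poolName":"pool1old","ownerNodeID":"node-2"}}'
```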

### Workflow

- The user will set up all the nodes and create the ZFS pool on each of them.
- The ZFS-LocalPV CSI driver will look for all the pools on the node and will set the `guid.openebs.io/<pool-guid>=true` label for every ZFS pool present on that node. For example, if node-1 has two pools (say pool1 with guid 14820954593456176137 and pool2 with guid 16291571091328403547), the labels will look like this:
> **Member:** q_: Does k8s put a limit on the number of labels that can be attached to the node?
>
> **Contributor Author:** There are cloud providers that do impose a limit; GCP, for example, has a limit of 64.

```
$ kubectl get node node-1 --show-labels
NAME STATUS ROLES AGE VERSION LABELS
node-1 Ready worker 351d v1.17.4 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=node-1,kubernetes.io/os=linux,node-role.kubernetes.io/worker=true,openebs.io/nodeid=node1,openebs.io/nodename=node-1,guid.openebs.io/14820954593456176137=true,guid.openebs.io/16291571091328403547=true
```
- If we are moving pool1 from node-1 to node-2, there are two cases:

#### 1. If node-2 is a fresh node

- We can simply import the pool and restart the ZFS-LocalPV driver so that it becomes aware of the pool and sets the corresponding node topology (see the sketch after this list).
- The ZFS-LocalPV driver will look for `guid.openebs.io/14820954593456176137=true` and will remove the label from any node where the pool is not present.
- The ZFS-LocalPV driver will add the `guid.openebs.io/14820954593456176137=true` label to the new node.
> **Member:** q_: How do we avoid a race condition where two nodes in the cluster end up with the same label? Also consider cases like: the node is shut down or in a NotReady state and the pools have been moved to a new node.
>
> **Contributor Author:** The previous step handles that; it has to make sure there is no node with the same label.
>
> **Member:** In the previous steps, "ZFS-LocalPV driver" - is that the node driver running on the old node or the new node?
>
> **Contributor Author:** Yes, the node driver will remove/set the label.
>
> **Member:** In the uninstall case, will the labels be cleared from the nodes?
>
> **Contributor Author (Jun 3, 2021):** Do we need that? At installation time the node labels will be handled/moved accordingly. Btw, CSI does not provide any framework to unset the topology; we can only set it at registration time. But it can be done out of the box.

- The migrator will look up the ZFSVolume resources and update the OwnerNodeID with the new node ID for all the volumes.
- The k8s scheduler will see the new label and should schedule the pods to this new node.
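
A hedged sketch of the admin-facing steps for case 1; the pod name placeholder is illustrative, and the namespace of the node-plugin pod depends on how the driver was installed:

```
# case 1 sketch: pool1 moves to the fresh node-2
$ zpool import pool1        # import the pool from the relocated disks on node-2
# restart the ZFS-LocalPV node-plugin pod running on node-2 so it re-reads the
# pools and re-registers the guid.openebs.io/<pool-guid> topology label
$ kubectl -n kube-system delete pod <openebs-zfs-node pod running on node-2>
```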

#### 2. If node-2 is an existing node and a pool of the same name is present there

- Here we need to import the pool under a different name and restart the ZFS-LocalPV driver so that it becomes aware of the pool and sets the corresponding node topology (see the sketch after this list).
- The ZFS-LocalPV driver will look for `guid.openebs.io/14820954593456176137=true` and will remove the label from any node where the pool is not present.
- The ZFS-LocalPV driver will add the `guid.openebs.io/14820954593456176137=true` label to the new node.
- The migrator will look up the ZFSVolume resources and update the PoolName and OwnerNodeID for all the volumes.
- The k8s scheduler will see the new label and should schedule the pods to this new node.
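
A hedged sketch of the admin-facing steps for case 2; the new pool name `pool1old` is hypothetical, and the ZFSVolume updates are the migrator steps sketched in the [Migrator](#migrator) section:

```
# case 2 sketch: node-2 already has a pool named pool1, so import the incoming
# pool by its guid and give it a different name
$ zpool import 14820954593456176137 pool1old
# restart the ZFS-LocalPV node-plugin pod on node-2 (as in case 1), then let the
# migrator update each affected ZFSVolume with poolName=pool1old, ownerNodeID=node-2
```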

## Implementation Plan

### Phase 1
1. Implement replacement with a fresh node.

### Phase 2
1. Implement replacement with an existing node (implement the Migrator).