How to create a copy of a volume on a different node with SPDK
In this wiki we describe how to create a copy of a volume on a different node where another instance of SPDK is running. The copy can be performed while the owner of the volume continues to read from and write to it, so we call this procedure Online Volume Rebuild.
A Longhorn volume is made of different replicas of the same data, put together into an SPDK RAID1 bdev. Each replica is a stack of SPDK logical volumes: the head lvol (live data) and all of its snapshots. So, to create a new copy of the volume on a different node, we have to recreate every layer of the stack and link them together.

By copying all the volumes a node must contain, it is possible to rebuild a new node with all the data it has to manage.
Even if in SPDK both the head lvol and the snapshots are lvols (the only difference being that a snapshot is a read-only lvol), in this page for simplicity we will refer only to the head lvol as lvol.
During the rebuild of the snapshot stack we want to keep writing to the Longhorn volume, and to do this we have to perform a COW (copy on write): when we write data to the head lvol into a cluster that is not yet allocated, SPDK first reads the cluster data from the underlying snapshot and then writes the new data. So the newly created lvol needs a "temporary" snapshot to read from: this can be the external snapshot. In this way we preserve the writes that happen to the new lvol while the rebuild of the layered snapshots goes on in the background.
Suppose we have a node, called `node1`, where an instance of `spdk_tgt` is running: here we have a logical volume store called `lvstore1` and a volume called `lvol1`, which has a snapshot called `snapshot1_2`.

Generally we would have another node where the different replicas of the Longhorn volume are put together into a RAID1; to make the process easier, we suppose here that the RAID bdev is created directly on `node1`.
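Assuming the RPCs in this page are issued with SPDK's `scripts/rpc.py` against the local `spdk_tgt` (default RPC socket), the starting layout on `node1` can be inspected like this:

```bash
# On node1: the head lvol (live data) and its snapshot, as described above
./scripts/rpc.py bdev_get_bdevs -b lvstore1/lvol1
./scripts/rpc.py bdev_get_bdevs -b lvstore1/snapshot1_2
```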
At this point we create a new node called `node2`, and what we want to do is to create a copy of `lvol1` and `snapshot1_2` on `node2`.

First of all we have to start the SPDK app on `node2` with the command

```
./build/bin/spdk_tgt --json disk.json
```

where `disk.json` has this content:
```json
{
  "subsystems": [
    {
      "subsystem": "bdev",
      "config": [
        {
          "method": "bdev_aio_create",
          "params": {
            "block_size": 4096,
            "name": "Aio2",
            "filename": "/dev/sdc"
          }
        }
      ]
    }
  ]
}
```
`/dev/sdc` in this example is the path to the block device of the physical disk.

One important thing to pay attention to is the `block_size` parameter, which is the block size of the bdev upon which the lvol store will be created: the block size of the `node1` aio bdev must be equal to, or a multiple of, the block size of the `node2` aio bdev. This constraint has been put into the shallow copy operation we will see afterwards to ensure better performance.
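For example, a quick way to verify the constraint is to compare the block sizes of the two aio bdevs with `bdev_get_bdevs`; the `node1` bdev name `Aio1` below is an assumption, since only the `node2` configuration is shown here:

```bash
# Block size of the aio bdev backing lvstore1 on node1 (bdev name "Aio1" is assumed)
./scripts/rpc.py bdev_get_bdevs -b Aio1 | jq '.[0].block_size'

# Block size of the aio bdev that will back lvstore2 on node2
./scripts/rpc.py bdev_get_bdevs -b Aio2 | jq '.[0].block_size'
```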
Now we have to pause the I/O over the block device connected to the RAID1. We can do this, for example, using device mapper. Once paused, we must ensure that all pending I/O has been completed; we can obtain this, for example, by performing a `sync` over the block device connected to the RAID1. The `sync` sends a flush command via NVMe-oF to the RAID1, which routes this command to all its base bdevs (we have implemented the flush command in lvolstore).
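A minimal sketch of this pause, assuming the RAID1 volume is exposed to the initiator over NVMe-oF and wrapped in a device-mapper target named `longhorn-vol1` (both the setup and the name are assumptions):

```bash
# Flush the block device first: dirty pages are written back and a flush
# command travels over NVMe-oF down to the RAID1 and its base bdevs.
sync /dev/mapper/longhorn-vol1

# Then suspend the device-mapper target: in-flight I/O is drained and any
# new I/O is queued until "dmsetup resume" is issued later.
dmsetup suspend longhorn-vol1
```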
At this point we take a snapshot of all the replicas, i.e. of all the lvols that are base bdevs of the RAID1 (in our case we have only `lvol1`).

On `node1` we create `snapshot1_1`:

```
bdev_lvol_snapshot lvstore1/lvol1 snapshot1_1
```

`snapshot1_1` becomes the topmost snapshot of the stack, sitting above `snapshot1_2`.
We have to export `snapshot1_1` via NVMe-oF. Note that if the RAID were not created on `node1`, we would have exported `lvol1` towards the node where the RAID resides, so the following subsystem would already have been created and `lvol1` added to it as a namespace.

First of all, on `node1` we have to create the transport for the fabric, in this case TCP:

```
nvmf_create_transport -t tcp
```

Then we create an NVMe-oF subsystem:

```
nvmf_create_subsystem nqn.2023-07.io.spdk:cnode1 -a -s SPDK00000000000001 -d SPDK_Controller
```

and we add `snapshot1_1` as a namespace to this subsystem with:

```
nvmf_subsystem_add_ns nqn.2023-07.io.spdk:cnode1 lvstore1/snapshot1_1
```

To complete the operation we create a listener for this subsystem:

```
nvmf_subsystem_add_listener nqn.2023-07.io.spdk:cnode1 -t tcp -a <node1_ipaddr> -s 4420
```
On `node2` we have to attach to the exported `snapshot1_1`:

```
bdev_nvme_attach_controller -b nvme1 -t tcp -a <node1_ipaddr> -n nqn.2023-07.io.spdk:cnode1 -s 4420 -f ipv4
```

This last command will create the bdev `nvme1n1`, which points to `snapshot1_1`.
Then we have to create a new logical volume store called `lvstore2`:

```
bdev_lvol_create_lvstore Aio2 lvstore2
```

and upon it we create `lvol2` as a clone of the external snapshot `snapshot1_1`:

```
bdev_lvol_clone_bdev nvme1n1 lvstore2 lvol2
```
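Optionally, we can check the result on `node2`: the clone should appear as a bdev named `lvstore2/lvol2`, backed by the external snapshot:

```bash
# On node2: inspect the newly created clone (rpc.py invocation as assumed above)
./scripts/rpc.py bdev_get_bdevs -b lvstore2/lvol2
```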
We have to export the newly created volume `lvol2` in the same way we exported `snapshot1_1`. So, on `node2`:

```
nvmf_create_transport -t tcp
nvmf_create_subsystem nqn.2023-07.io.spdk:cnode2 -a -s SPDK00000000000002 -d SPDK_Controller
nvmf_subsystem_add_ns nqn.2023-07.io.spdk:cnode2 lvstore2/lvol2
nvmf_subsystem_add_listener nqn.2023-07.io.spdk:cnode2 -t tcp -a <node2_ipaddr> -s 4420
```
On `node1` we have to attach to the exported `lvol2`:

```
bdev_nvme_attach_controller -b nvme2 -t tcp -a <node2_ipaddr> -n nqn.2023-07.io.spdk:cnode2 -s 4420 -f ipv4
```

This will create the bdev `nvme2n1`.
Then we add `nvme2n1` to the RAID1:

```
bdev_raid_grow_base_bdev <raid_bdev_name> nvme2n1
```

At this point we can resume the I/O over the block device connected to the RAID1.
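Optionally, on `node1` we can verify that the RAID1 now reports `nvme2n1` among its base bdevs before resuming the device-mapper target suspended earlier (same assumed name as above):

```bash
# On node1: the raid bdev should list nvme2n1 among its base bdevs
./scripts/rpc.py bdev_raid_get_bdevs online

# Resume the I/O that was paused with "dmsetup suspend"
dmsetup resume longhorn-vol1
```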
On `node2` we have to create a temporary lvol, called `lvol2_temp`, into which we will copy all the snapshots of `lvol1`:

```
bdev_lvol_create -l lvstore2 -t lvol2_temp 20
```

Here we must pay attention to the size of `lvol2_temp`, which must be equal to or greater than the size of the source lvol `lvol1`.
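The size of `lvol1` can be read on `node1` with `bdev_get_bdevs`; for example, assuming `jq` is available and that the positional size argument of `bdev_lvol_create` above is in MiB, the required size can be computed like this:

```bash
# On node1: size of lvol1 in MiB = block_size * num_blocks / 1 MiB
./scripts/rpc.py bdev_get_bdevs -b lvstore1/lvol1 \
    | jq '.[0] | .block_size * .num_blocks / (1024 * 1024)'
```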
Then we add this temporary lvol to the nvmf subsystem:

```
nvmf_subsystem_add_ns nqn.2023-07.io.spdk:cnode2 lvstore2/lvol2_temp
```

This operation will create on `node1`, where we are attached to this subsystem, the bdev `nvme2n2`.
We have to recreate the entire snapshot stack of `lvol1`, and this has to be done layer by layer. To copy the data we perform a shallow copy operation, which copies only the clusters allocated to the logical volume. For example, if from `snapshot1_1` we read data belonging to a cluster that is not allocated, the data will be read from the parent of this volume, i.e. `snapshot1_2`. So, if we want to create a copy of a snapshot, we only have to copy the clusters allocated to that snapshot, and this can be done with the shallow copy.
We start on `node1` by shallow-copying `snapshot1_2` over `nvme2n2`:

```
bdev_lvol_shallow_copy lvstore1/snapshot1_2 nvme2n2
```

and then on `node2` we take a snapshot of `lvol2_temp`:

```
bdev_lvol_snapshot lvstore2/lvol2_temp snapshot2_2
```

Now we repeat the same operations for the layer above, so on `node1`:

```
bdev_lvol_shallow_copy lvstore1/snapshot1_1 nvme2n2
```

and on `node2`:

```
bdev_lvol_snapshot lvstore2/lvol2_temp snapshot2_1
```
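The same two-step sequence generalizes to deeper snapshot stacks. Below is a minimal sketch, assuming `NODE1_RPC` and `NODE2_RPC` are hypothetical wrappers around `scripts/rpc.py` that reach the two `spdk_tgt` instances (for example over ssh), and that the layers are processed from the oldest snapshot upwards:

```bash
# Source snapshots on node1, bottom of the stack first, and the names
# to give their copies on node2 (all names follow the example above).
SRC_SNAPS=(snapshot1_2 snapshot1_1)
DST_SNAPS=(snapshot2_2 snapshot2_1)

for i in "${!SRC_SNAPS[@]}"; do
    # Copy only the clusters allocated to this snapshot into lvol2_temp,
    # which node1 sees as the NVMe-oF bdev nvme2n2.
    $NODE1_RPC bdev_lvol_shallow_copy "lvstore1/${SRC_SNAPS[$i]}" nvme2n2
    # Freeze the data just copied as a snapshot of lvol2_temp on node2.
    $NODE2_RPC bdev_lvol_snapshot lvstore2/lvol2_temp "${DST_SNAPS[$i]}"
done
```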
At this point we have:

- `snapshot2_1`--`snapshot2_2` is a copy of `snapshot1_1`--`snapshot1_2`, i.e. each layer contains the same data as the corresponding one
- `lvol2_temp` is empty
- live data is contained both in `lvol1` and in `lvol2`

So, if we could make `lvol2` point to the snapshot stack `snapshot2_1`--`snapshot2_2`, we would have completed the rebuild of `lvol1`.
Before starting, we have to pause the I/O again over the block device connected to the RAID1.

On `node2` we change the parent of `lvol2` from the external snapshot `nvme1n1` to the local snapshot `snapshot2_1`:

```
bdev_lvol_set_parent lvstore2/lvol2 lvstore2/snapshot2_1
```

Now `lvol2` is a complete copy of `lvol1`, so finally we can resume the I/O over the block device connected to the RAID1.
On `node2` we can detach from `node1`:

```
bdev_nvme_detach_controller nvme1 -t tcp -a <node1_ipaddr> -n nqn.2023-07.io.spdk:cnode1 -s 4420 -f ipv4
```

and on `node1` from `node2`:

```
bdev_nvme_detach_controller nvme2 -t tcp -a <node2_ipaddr> -n nqn.2023-07.io.spdk:cnode2 -s 4420 -f ipv4
```

We can also remove `snapshot1_1` from the nvmf subsystem on `node1`:

```
nvmf_subsystem_remove_ns nqn.2023-07.io.spdk:cnode1 1
```

Note that if the RAID were not created on `node1`, `snapshot1_1` would have namespace index 2 instead of 1, because 1 would be the namespace created by adding `lvol1` to the subsystem (the operation done to export `lvol1` towards the node where the RAID resides).
Then we remove `lvol2_temp` from the nvmf subsystem on `node2`:

```
nvmf_subsystem_remove_ns nqn.2023-07.io.spdk:cnode2 2
```

Finally we delete `lvol2_temp` on `node2`:

```
bdev_lvol_delete lvstore2/lvol2_temp
```
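As a final check, we can list the rebuilt chain on `node2`: `lvol2` and its snapshots should now all be local lvols of `lvstore2`:

```bash
# On node2: the rebuilt volume and its snapshot stack
./scripts/rpc.py bdev_get_bdevs -b lvstore2/lvol2
./scripts/rpc.py bdev_get_bdevs -b lvstore2/snapshot2_1
./scripts/rpc.py bdev_get_bdevs -b lvstore2/snapshot2_2
```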