How to create a copy of a volume on a different node with SPDK
In this wiki we describe how to create a copy of a volume on a different node where another instance of SPDK is running. The copy can be performed while the owner of the volume continues to read from and write to it, so we call this procedure Online Volume Rebuild.
A Longhorn volume is made of different replicas of the same data, put together into an SPDK RAID1 bdev. Each replica is a stack of SPDK logical volumes: the head lvol (live data) and all of its snapshots. So, to create a new copy of the volume on a different node, we have to recreate every layer of the stack and link them together.

By copying all the volumes a node must contain, it is possible to rebuild a new node with all the data it has to manage.
Even if in SPDK both the head lvol and the snapshots are lvols (the only difference being that a snapshot is a read-only lvol), in this page for simplicity we will refer only to the head lvol as lvol.
During the rebuild of the snapshot stack we want to keep writing to the Longhorn volume, and to do this we have to perform a COW (copy on write): when we write data to the head lvol into a cluster that is not yet allocated, SPDK first reads the cluster data from the underlying snapshot and then writes the new data. So the newly created lvol needs a "temporary" snapshot to read from: this can be the external snapshot. In this way we preserve the writes that happen to the new lvol while the rebuild of the layered snapshots goes on in the background.
Suppose we have a node, called `node1`, where an instance of `spdk_tgt` is running: here we have a logical volume store called `lvstore1` and a volume called `lvol1`, which has a snapshot called `snapshot1_2`.

Generally we would have another node where the different replicas of the Longhorn volume are put together into a RAID1; to make the process easier, we suppose here that the RAID bdev is created directly on `node1`.
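Assuming the RPCs in this page are issued with SPDK's `scripts/rpc.py` against the local `spdk_tgt` (default RPC socket), the starting layout on `node1` can be inspected like this:

```bash
# On node1: the head lvol (live data) and its snapshot, as described above
./scripts/rpc.py bdev_get_bdevs -b lvstore1/lvol1
./scripts/rpc.py bdev_get_bdevs -b lvstore1/snapshot1_2
```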
At this point we create a new node called `node2`, and what we want to do is to create a copy of `lvol1` and `snapshot1_2` on `node2`.

First of all we have to start the SPDK app on `node2` with the command

```
./build/bin/spdk_tgt --json disk.json
```

where `disk.json` has this content:
```json
{
  "subsystems": [
    {
      "subsystem": "bdev",
      "config": [
        {
          "method": "bdev_aio_create",
          "params": {
            "block_size": 4096,
            "name": "Aio2",
            "filename": "/dev/sdc"
          }
        }
      ]
    }
  ]
}
```
`/dev/sdc` in this example is the path to the block device of the physical disk.

One important thing to pay attention to is the `block_size` parameter, which is the block size of the bdev upon which the lvol store will be created: the block size of the `node1` aio bdev must be equal to, or a multiple of, the block size of the `node2` aio bdev. This constraint has been put into the shallow copy operation we will see afterwards to ensure better performance.
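For example, a quick way to verify the constraint is to compare the block sizes of the two aio bdevs with `bdev_get_bdevs`; the `node1` bdev name `Aio1` below is an assumption, since only the `node2` configuration is shown here:

```bash
# Block size of the aio bdev backing lvstore1 on node1 (bdev name "Aio1" is assumed)
./scripts/rpc.py bdev_get_bdevs -b Aio1 | jq '.[0].block_size'

# Block size of the aio bdev that will back lvstore2 on node2
./scripts/rpc.py bdev_get_bdevs -b Aio2 | jq '.[0].block_size'
```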
Now we have to pause the I/O over the block device connected to the RAID1. We can do this, for example, using device mapper. Once paused, we must ensure that all pending I/O has been completed; we can obtain this, for example, by performing a `sync` over the block device connected to the RAID1. The `sync` sends a flush command via NVMe-oF to the RAID1, which routes this command to all its base bdevs (we have implemented the flush command in lvolstore).
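A minimal sketch of this pause, assuming the RAID1 volume is exposed to the initiator over NVMe-oF and wrapped in a device-mapper target named `longhorn-vol1` (both the setup and the name are assumptions):

```bash
# Flush the block device first: dirty pages are written back and a flush
# command travels over NVMe-oF down to the RAID1 and its base bdevs.
sync /dev/mapper/longhorn-vol1

# Then suspend the device-mapper target: in-flight I/O is drained and any
# new I/O is queued until "dmsetup resume" is issued later.
dmsetup suspend longhorn-vol1
```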
At this point we take a snapshot of all the replicas, i.e. of all the lvols that are base bdevs of the RAID1 (in our case we have only `lvol1`).

On `node1` we create `snapshot1_1`:

```
bdev_lvol_snapshot lvstore1/lvol1 snapshot1_1
```

`snapshot1_1` becomes the topmost snapshot of the stack, sitting above `snapshot1_2`.
We have to export `snapshot1_1` via NVMe-oF. Note that if the RAID were not created on `node1`, we would have exported `lvol1` towards the node where the RAID resides, so the following subsystem would already have been created and `lvol1` added to it as a namespace.

First of all, on `node1` we have to create the transport for the fabric, in this case TCP:

```
nvmf_create_transport -t tcp
```

Then we create an NVMe-oF subsystem:

```
nvmf_create_subsystem nqn.2023-07.io.spdk:cnode1 -a -s SPDK00000000000001 -d SPDK_Controller
```

and we add `snapshot1_1` as a namespace to this subsystem with:

```
nvmf_subsystem_add_ns nqn.2023-07.io.spdk:cnode1 lvstore1/snapshot1_1
```

To complete the operation we create a listener for this subsystem:

```
nvmf_subsystem_add_listener nqn.2023-07.io.spdk:cnode1 -t tcp -a <node1_ipaddr> -s 4420
```
On `node2` we have to attach to the exported `snapshot1_1`:

```
bdev_nvme_attach_controller -b nvme1 -t tcp -a <node1_ipaddr> -n nqn.2023-07.io.spdk:cnode1 -s 4420 -f ipv4
```

This last command will create the bdev `nvme1n1`, which points to `snapshot1_1`.
Then we have to create a new logical volume store called `lvstore2`:

```
bdev_lvol_create_lvstore Aio2 lvstore2
```

and upon it we create `lvol2` as a clone of the external snapshot `snapshot1_1`:

```
bdev_lvol_clone_bdev nvme1n1 lvstore2 lvol2
```
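Optionally, we can check the result on `node2`: the clone should appear as a bdev named `lvstore2/lvol2`, backed by the external snapshot:

```bash
# On node2: inspect the newly created clone (rpc.py invocation as assumed above)
./scripts/rpc.py bdev_get_bdevs -b lvstore2/lvol2
```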
We have to export the newly created volume `lvol2` in the same way we exported `snapshot1_1`. So, on `node2`:

```
nvmf_create_transport -t tcp
nvmf_create_subsystem nqn.2023-07.io.spdk:cnode2 -a -s SPDK00000000000002 -d SPDK_Controller
nvmf_subsystem_add_ns nqn.2023-07.io.spdk:cnode2 lvstore2/lvol2
nvmf_subsystem_add_listener nqn.2023-07.io.spdk:cnode2 -t tcp -a <node2_ipaddr> -s 4420
```
On `node1` we have to attach to the exported `lvol2`:

```
bdev_nvme_attach_controller -b nvme2 -t tcp -a <node2_ipaddr> -n nqn.2023-07.io.spdk:cnode2 -s 4420 -f ipv4
```

This will create the bdev `nvme2n1`.
Then we add `nvme2n1` to the RAID1:

```
bdev_raid_grow_base_bdev <raid_bdev_name> nvme2n1
```

At this point we can resume the I/O over the block device connected to the RAID1.
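Optionally, on `node1` we can verify that the RAID1 now reports `nvme2n1` among its base bdevs before resuming the device-mapper target suspended earlier (same assumed name as above):

```bash
# On node1: the raid bdev should list nvme2n1 among its base bdevs
./scripts/rpc.py bdev_raid_get_bdevs online

# Resume the I/O that was paused with "dmsetup suspend"
dmsetup resume longhorn-vol1
```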
On `node2` we have to create a temporary lvol, called `lvol2_temp`, into which we will copy all the snapshots of `lvol1`:

```
bdev_lvol_create -l lvstore2 -t lvol2_temp 20
```

Here we must pay attention to the size of `lvol2_temp`, which must be equal to or greater than the size of the source lvol `lvol1`.
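The size of `lvol1` can be read on `node1` with `bdev_get_bdevs`; for example, assuming `jq` is available and that the positional size argument of `bdev_lvol_create` above is in MiB, the required size can be computed like this:

```bash
# On node1: size of lvol1 in MiB = block_size * num_blocks / 1 MiB
./scripts/rpc.py bdev_get_bdevs -b lvstore1/lvol1 \
    | jq '.[0] | .block_size * .num_blocks / (1024 * 1024)'
```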
Then we add this temporary lvol to the nvmf subsystem:

```
nvmf_subsystem_add_ns nqn.2023-07.io.spdk:cnode2 lvstore2/lvol2_temp
```

This operation will create on `node1`, where we are attached to this subsystem, the bdev `nvme2n2`.
We have to recreate the entire snapshot stack of `lvol1`, and this has to be done layer by layer. To copy the data we perform a shallow copy operation, which copies only the clusters allocated to the logical volume. For example, if from `snapshot1_1` we read data belonging to a cluster that is not allocated, the data will be read from the parent of this volume, i.e. `snapshot1_2`. So, if we want to create a copy of a snapshot, we only have to copy the clusters allocated to that snapshot, and this can be done with the shallow copy.
We start on `node1` by shallow-copying `snapshot1_2` over `nvme2n2`:

```
bdev_lvol_shallow_copy lvstore1/snapshot1_2 nvme2n2
```

and then on `node2` we take a snapshot of `lvol2_temp`:

```
bdev_lvol_snapshot lvstore2/lvol2_temp snapshot2_2
```

Now we repeat the same operations for the layer above, so on `node1`:

```
bdev_lvol_shallow_copy lvstore1/snapshot1_1 nvme2n2
```

and on `node2`:

```
bdev_lvol_snapshot lvstore2/lvol2_temp snapshot2_1
```
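The same two-step sequence generalizes to deeper snapshot stacks. Below is a minimal sketch, assuming `NODE1_RPC` and `NODE2_RPC` are hypothetical wrappers around `scripts/rpc.py` that reach the two `spdk_tgt` instances (for example over ssh), and that the layers are processed from the oldest snapshot upwards:

```bash
# Source snapshots on node1, bottom of the stack first, and the names
# to give their copies on node2 (all names follow the example above).
SRC_SNAPS=(snapshot1_2 snapshot1_1)
DST_SNAPS=(snapshot2_2 snapshot2_1)

for i in "${!SRC_SNAPS[@]}"; do
    # Copy only the clusters allocated to this snapshot into lvol2_temp,
    # which node1 sees as the NVMe-oF bdev nvme2n2.
    $NODE1_RPC bdev_lvol_shallow_copy "lvstore1/${SRC_SNAPS[$i]}" nvme2n2
    # Freeze the data just copied as a snapshot of lvol2_temp on node2.
    $NODE2_RPC bdev_lvol_snapshot lvstore2/lvol2_temp "${DST_SNAPS[$i]}"
done
```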
At this point we have:

- `snapshot2_1`--`snapshot2_2` is a copy of `snapshot1_1`--`snapshot1_2`, i.e. each layer contains the same data as the corresponding one
- `lvol2_temp` is empty
- live data is contained both in `lvol1` and in `lvol2`

So, if we could make `lvol2` point to the snapshot stack `snapshot2_1`--`snapshot2_2`, we would have completed the rebuild of `lvol1`.
Before starting, we have to pause the I/O again over the block device connected to the RAID1.

On `node2` we change the parent of `lvol2` from the external snapshot `nvme1n1` to the local snapshot `snapshot2_1`:

```
bdev_lvol_set_parent lvstore2/lvol2 lvstore2/snapshot2_1
```

Now `lvol2` is a complete copy of `lvol1`, so finally we can resume the I/O over the block device connected to the RAID1.
On `node2` we can detach from `node1`:

```
bdev_nvme_detach_controller nvme1 -t tcp -a <node1_ipaddr> -n nqn.2023-07.io.spdk:cnode1 -s 4420 -f ipv4
```

and on `node1` from `node2`:

```
bdev_nvme_detach_controller nvme2 -t tcp -a <node2_ipaddr> -n nqn.2023-07.io.spdk:cnode2 -s 4420 -f ipv4
```

We can also remove `snapshot1_1` from the nvmf subsystem on `node1`:

```
nvmf_subsystem_remove_ns nqn.2023-07.io.spdk:cnode1 1
```

Note that if the RAID were not created on `node1`, `snapshot1_1` would have namespace index 2 instead of 1, because 1 would be the namespace created by adding `lvol1` to the subsystem (the operation done to export `lvol1` towards the node where the RAID resides).
Then we remove `lvol2_temp` from the nvmf subsystem on `node2`:

```
nvmf_subsystem_remove_ns nqn.2023-07.io.spdk:cnode2 2
```

Finally we delete `lvol2_temp` on `node2`:

```
bdev_lvol_delete lvstore2/lvol2_temp
```
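As a final check, we can list the rebuilt chain on `node2`: `lvol2` and its snapshots should now all be local lvols of `lvstore2`:

```bash
# On node2: the rebuilt volume and its snapshot stack
./scripts/rpc.py bdev_get_bdevs -b lvstore2/lvol2
./scripts/rpc.py bdev_get_bdevs -b lvstore2/snapshot2_1
./scripts/rpc.py bdev_get_bdevs -b lvstore2/snapshot2_2
```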