ZNS support #1298

Draft
wants to merge 11 commits into
base: develop
from
18 changes: 18 additions & 0 deletions Cargo.lock

1 change: 1 addition & 0 deletions ci.nix
@@ -26,6 +26,7 @@ mkShell {
buildInputs = [
autoconf
automake
btrfs-progs
clang
cowsay
docker
8 changes: 7 additions & 1 deletion doc/build.md
@@ -91,6 +91,12 @@ $ sudo nixos-rebuild switch --update
>
> Don't want to use `nixUnstable`? **That's ok!** Use `nix-shell` and `nix-build` as you normally would.

Check out the submodules:

```bash
git submodule update --init
```

**Want to run or hack on Mayastor?** _You need more configuration!_ See
[running][doc-run], then [testing][doc-test].

@@ -127,7 +133,7 @@ cargo build --release
```

**Want to run or hack on Mayastor?** _You need more configuration!_ See
[running][doc-running], then [testing][doc-testing].
[running][doc-run], then [testing][doc-test].

Whilst `nix develop` lets you build Mayastor exactly as the image build does, it might not include all the components required for testing.
For that, you might want to use the explicit shell configuration file, `ci.nix`:
4 changes: 2 additions & 2 deletions doc/run.md
@@ -83,14 +83,14 @@ In order to use the full feature set of Mayastor, some or all of the following c
```nix
# /etc/nixos/configuration.nix
boot.kernelModules = [
"nbd" "xfs" "nvmet" "nvme_fabrics" "nvmet_rdma" "nvme_tcp" "nvme_rdma" "nvme_loop"
"nbd" "xfs" "btrfs" "nvmet" "nvme_fabrics" "nvmet_rdma" "nvme_tcp" "nvme_rdma" "nvme_loop"
];
```

To load these on non-NixOS machines:

```bash
modprobe nbd nvmet nvmet_rdma nvme_fabrics nvme_tcp nvme_rdma nvme_loop
modprobe nbd xfs btrfs nvmet nvmet_rdma nvme_fabrics nvme_tcp nvme_rdma nvme_loop
```

- For Asymmetric Namespace Access (ANA) support (early preview), the following kernel build configuration must be enabled:
53 changes: 51 additions & 2 deletions doc/test.md
@@ -14,7 +14,7 @@ Or, for ad-hoc:
- Ensure several kernel modules are installed:

```bash
modprobe nbd xfs nvmet nvme_fabrics nvmet_rdma nvme_tcp nvme_rdma nvme_loop
modprobe nbd xfs btrfs nvmet nvme_fabrics nvmet_rdma nvme_tcp nvme_rdma nvme_loop
```

## Running the test suite
@@ -29,7 +29,33 @@ Mayastor's unit tests, integration tests, and documentation tests via the conven
Mayastor uses [spdk][spdk] which is quite sensitive to threading. This means tests need to run one at a time:

```bash
cargo test -- --test-threads 1
cd io-engine
cargo test -- --test-threads 1 --nocapture
```

## Using your own SPDK version

In order to use your own SPDK version, your SPDK tree must be rebased onto the latest `vYY.mm.x-mayastor`
branch of the https://github.com/openebs/spdk repo.
Build SPDK inside your nix shell with these instructions:

```bash
cd spdk-rs
git clone https://github.com/openebs/spdk
cd spdk
git checkout vYY.mm.x-mayastor
# Rebase your branch
git submodule update --init
cd -
./build_spdk.sh
```

Before you run the cargo tests again, make sure spdk-rs is rebuilt:

```bash
cd ../io-engine
cargo clean -p spdk-rs
cargo test -- --test-threads 1 --nocapture
```

## Running the end-to-end test suite
@@ -55,6 +81,29 @@ Then, to run the tests:
./node_modules/mocha/bin/mocha test_csi.js
```

## Using PCIe NVMe devices in cargo tests while developing

When developing new features, it can be handy to test them against real PCIe devices.
To do so, the PCIe device first needs to be bound to the vfio driver:

```bash
sudo PCI_ALLOWED="<PCI-ADDRESS>" ./spdk-rs/spdk/scripts/setup.sh
```

The bdev name in the cargo test case can then follow the PCIe URI pattern:

```rust
static BDEVNAME1: &str = "pcie:///<PCI-ADDRESS>";
```

After testing, the device can be rebound to the NVMe driver:

```bash
sudo PCI_ALLOWED="<PCI-ADDRESS>" ./spdk-rs/spdk/scripts/setup.sh reset
```

Please do not submit pull requests with cargo test cases that require PCIe devices to be present.
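
For local experiments, one way to keep the PCI address out of the committed test code is to read it from the environment and skip the test when it is unset. The following is a minimal sketch under that assumption: the variable name `MAYASTOR_TEST_PCIE_ADDR` and both helper functions are hypothetical and not part of Mayastor.

```rust
use std::env;

/// Hypothetical environment variable naming the PCI address to test against.
const PCIE_ADDR_VAR: &str = "MAYASTOR_TEST_PCIE_ADDR";

/// Build a bdev URI following the `pcie:///<PCI-ADDRESS>` pattern.
/// Returns `None` for an empty or whitespace-only address.
pub fn pcie_bdev_uri(pci_address: &str) -> Option<String> {
    let addr = pci_address.trim();
    if addr.is_empty() {
        None
    } else {
        Some(format!("pcie:///{addr}"))
    }
}

/// Read the PCI address from the environment, if it is set and non-empty.
pub fn pcie_bdev_uri_from_env() -> Option<String> {
    env::var(PCIE_ADDR_VAR)
        .ok()
        .and_then(|addr| pcie_bdev_uri(&addr))
}
```

A test could call `pcie_bdev_uri_from_env()` first and return early when it yields `None`, so that the test body only runs on machines where a device has been bound.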

[spdk]: https://spdk.io/
[doc-run]: ./run.md
[mocha]: https://mochajs.org/
25 changes: 25 additions & 0 deletions doc/zns.md
@@ -0,0 +1,25 @@
# Zoned Storage Support
Mayastor supports zoned storage in the form of PCIe ZNS devices and zoned SPDK uring block devices.

## Overview
Zoned storage is a class of storage that divides its address space into zones. These zones come with a sequential write constraint: writes can only be issued at the zone's write pointer, which advances with each successful write operation. When the zone's capacity is reached, the device controller transitions the zone to the 'Full' state, and it cannot be rewritten until the user explicitly resets it. As of now, zoned storage is available in the form of SMR HDDs and ZNS SSDs. This proposal focuses on ZNS SSDs.
For more information about zoned storage visit [zonedstorage.io](https://zonedstorage.io/).
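
The write-pointer rule can be illustrated with a small in-memory model. This is only a sketch of the constraint described above, not Mayastor or SPDK code: the `Zone`, `ZoneState` and `ZoneError` types are hypothetical, and the state set is simplified (real devices define more states).

```rust
/// Simplified zone states; real devices define additional ones.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ZoneState {
    Empty,
    Open,
    Full,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ZoneError {
    /// The write did not start at the current write pointer.
    NotAtWritePointer,
    /// The write would run past the zone's capacity.
    ExceedsCapacity,
    /// The zone is full and must be reset first.
    ZoneFull,
}

#[derive(Debug)]
pub struct Zone {
    start: u64,
    capacity: u64,
    write_pointer: u64,
    state: ZoneState,
}

impl Zone {
    pub fn new(start: u64, capacity: u64) -> Self {
        Zone { start, capacity, write_pointer: start, state: ZoneState::Empty }
    }

    /// Accept a write of `num_blocks` blocks only if it starts at the write pointer.
    pub fn write(&mut self, lba: u64, num_blocks: u64) -> Result<(), ZoneError> {
        if self.state == ZoneState::Full {
            return Err(ZoneError::ZoneFull);
        }
        if lba != self.write_pointer {
            return Err(ZoneError::NotAtWritePointer);
        }
        let end = self.start + self.capacity;
        if lba + num_blocks > end {
            return Err(ZoneError::ExceedsCapacity);
        }
        // A successful write advances the write pointer.
        self.write_pointer += num_blocks;
        self.state = if self.write_pointer == end {
            ZoneState::Full
        } else {
            ZoneState::Open
        };
        Ok(())
    }

    /// An explicit reset rewinds the write pointer and empties the zone.
    pub fn reset(&mut self) {
        self.write_pointer = self.start;
        self.state = ZoneState::Empty;
    }

    pub fn state(&self) -> ZoneState {
        self.state
    }

    pub fn write_pointer(&self) -> u64 {
        self.write_pointer
    }
}
```

In this model, rewriting block 0 of a partially written zone is rejected with `NotAtWritePointer`, which is the behaviour that makes in-place updates impossible without a zone reset.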

Zoned Namespace (ZNS) NVMe SSDs are defined as part of an NVMe Command Set (see 'NVM Express Zoned Namespace Command Set Specification' in the [NVMe Command Set Specifications](https://nvmexpress.org/developers/nvme-command-set-specifications/)) and have been supported since Linux kernel v5.9. SPDK has supported zoned storage since v20.10.

Because ZNS SSDs align their flash media with zones, no on-device garbage collection is needed. Compared to conventional SSDs, this results in better throughput, predictable latency and higher capacity per dollar (because over-provisioning and DRAM for page mapping are not needed).

The concept of ZNS SSDs and their advantages are discussed in depth in the ['ZNS: Avoiding the Block Interface Tax for Flash-based SSDs'](https://www.usenix.org/conference/atc21/presentation/bjorling) paper.

[RocksDB](https://github.com/facebook/rocksdb) and [TerarkDB](https://github.com/bytedance/terarkdb) are example applications with end-to-end integration with zoned storage through [ZenFS](https://github.com/westerndigitalcorporation/zenfs).
POSIX file systems like f2fs and btrfs also support zoned devices.

## Requirements for Mayastor
Initially, ZNS support in Mayastor targets the non-replicated volume I/O path with volume partitioning disabled.
Replication and volume partitioning can be addressed later, as those features require special care with regard to the sequential write constraint and the device's max active zones and max open zones restrictions.

The NexusChild of a non-replicated Nexus should accept ZNS NVMe devices via the PCIe URI scheme as well as zoned SPDK uring devices via the uring URI scheme. This automatically results in a zoned nexus, which is exposed to the user as a raw zoned NVMe-oF target or formatted with btrfs.

## Prerequisites
- Linux kernel v5.15.68 or higher is needed because of the patch [nvmet: fix mar and mor off-by-one errors](https://lore.kernel.org/lkml/[email protected]/)
- SPDK 23.01 is needed because of [ZNS support for NVMe-oF](https://review.spdk.io/gerrit/c/spdk/spdk/+/16044/7)
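
A test helper could check the kernel prerequisite before exercising ZNS paths. The sketch below compares a kernel release string against v5.15.68; the function names are hypothetical, and a plain version comparison is only an approximation, since distribution kernels may backport the fix to older version numbers.

```rust
/// Minimum kernel version named in the prerequisites above.
pub const MIN_KERNEL: (u32, u32, u32) = (5, 15, 68);

/// Parse the leading `major.minor.patch` of a kernel release string such as
/// `5.15.68-generic`. Missing or non-numeric components count as 0.
pub fn parse_kernel_release(release: &str) -> (u32, u32, u32) {
    let mut parts = release.trim().split('.').map(|part| {
        // Keep only the leading digits, so "68-generic" parses as 68.
        let digits: String = part
            .chars()
            .take_while(|c| c.is_ascii_digit())
            .collect();
        digits.parse::<u32>().unwrap_or(0)
    });
    (
        parts.next().unwrap_or(0),
        parts.next().unwrap_or(0),
        parts.next().unwrap_or(0),
    )
}

/// Tuples compare lexicographically, which matches version ordering here.
pub fn kernel_meets_zns_minimum(release: &str) -> bool {
    parse_kernel_release(release) >= MIN_KERNEL
}
```

On Linux, the release string can be obtained from `uname -r` or by reading `/proc/sys/kernel/osrelease`.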
119 changes: 115 additions & 4 deletions io-engine-tests/src/lib.rs
@@ -4,7 +4,7 @@
//! panic macros. The caller can decide how to handle the error appropriately.
//! Panics and asserts in this file are still ok for usage & programming errors.

use std::{io, io::Write, process::Command, time::Duration};
use std::{fmt, io, io::Write, process::Command, time::Duration};

use crossbeam::channel::{after, select, unbounded};
use once_cell::sync::OnceCell;
@@ -143,7 +143,7 @@ pub fn mayastor_test_init_ex(log_format: LogFormat) {
})
}

["dd", "mkfs.xfs", "mkfs.ext4", "cmp", "fsck", "truncate"]
["dd", "mkfs.xfs", "mkfs.ext4", "mkfs.btrfs", "cmp", "fsck", "truncate"]
.iter()
.for_each(|binary| {
if binary_present(binary).is_err() {
@@ -202,8 +202,9 @@ pub fn fscheck(device: &str) {

pub fn mkfs(path: &str, fstype: &str) -> bool {
let (fs, args) = match fstype {
"xfs" => ("mkfs.xfs", ["-f", path]),
"ext4" => ("mkfs.ext4", ["-F", path]),
"xfs" => ("mkfs.xfs", vec!["-f", path]),
"ext4" => ("mkfs.ext4", vec!["-F", path]),
"btrfs" => ("mkfs.btrfs", vec!["-f", "-m", "single", "-d", "single", path]),
_ => {
panic!("unsupported fstype");
}
@@ -568,4 +569,114 @@ macro_rules! test_diag {
}}
}

/// The null block device driver emulates block devices and is used for benchmarking and testing.
/// https://docs.kernel.org/block/null_blk.html
pub struct NullBlk(u32);
impl Drop for NullBlk {
fn drop(&mut self) {
delete_nullblk_device(self.0);
}
}
impl fmt::Display for NullBlk {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
write!(f, "{}", self.0)
}
}

/// Create a zoned nullblk device with the given parameters. This emulated device exists entirely
/// in memory.
pub fn create_zoned_nullblk_device(
Contributor

Please add some doc comments explaining what this nullblk is, for example, does it just ignore writes and return zeroes?

block_size: u32,
zone_size: u32,
zone_cap: u32,
nr_conv_zones: u32,
nr_seq_zones: u32,
max_active_zones: u32,
max_open_zones: u32,
) -> Result<NullBlk, (i32, String)> {
// Get the next free nullblk device number
let mut nid = 1;
while std::path::Path::new(&format!(
"/sys/kernel/config/nullb/nullb{}",
nid
))
.exists()
{
nid += 1;
}
let (exit, stdout, stderr) = run_script::run(
r#"
set -e
modprobe null_blk nr_devices=0 > /dev/null || exit $?
nid=$1
bs=$2
zs=$3
zc=$4
nr_conv=$5
nr_seq=$6
max_active_zones=$7
max_open_zones=$8

cap=$(( zs * (nr_conv + nr_seq) ))

dev="/sys/kernel/config/nullb/nullb$nid"
mkdir "$dev"

echo $bs > "$dev"/blocksize
echo 0 > "$dev"/completion_nsec
echo 0 > "$dev"/irqmode
echo 2 > "$dev"/queue_mode
echo 1024 > "$dev"/hw_queue_depth
echo 1 > "$dev"/memory_backed
echo 1 > "$dev"/zoned

echo $cap > "$dev"/size
echo $zs > "$dev"/zone_size
echo $zc > "$dev"/zone_capacity
echo $nr_conv > "$dev"/zone_nr_conv
echo $max_active_zones > "$dev"/zone_max_active
echo $max_open_zones > "$dev"/zone_max_open

echo 1 > "$dev"/power

echo mq-deadline > /sys/block/nullb$nid/queue/scheduler

echo "$nid"
"#,
&vec![
nid.to_string(),
block_size.to_string(),
zone_size.to_string(),
zone_cap.to_string(),
nr_conv_zones.to_string(),
nr_seq_zones.to_string(),
max_active_zones.to_string(),
max_open_zones.to_string(),
],
&run_script::ScriptOptions::new(),
)
.unwrap();
if exit != 0 {
return Err((exit, stderr));
}
Ok(NullBlk(stdout.trim().parse::<u32>().unwrap()))
}

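/// Power off and remove the nullblk device with the given number.
/// Returns the exit code of the removal script.
pub fn delete_nullblk_device(nid: u32) -> i32 {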
let (exit, _, _) = run_script::run(
r#"
set -e
nid=$1
dev="/sys/kernel/config/nullb/nullb$nid"

echo 0 > "$dev"/power
rmdir "$dev"
"#,
&vec![nid.to_string()],
&run_script::ScriptOptions::new(),
)
.unwrap();
exit
}

pub use io_engine_tests_macros::spdk_test;
1 change: 1 addition & 0 deletions io-engine/Cargo.toml
@@ -98,6 +98,7 @@ async-process = { version = "1.8.1" }
rstack = { version = "0.3.3" }
tokio-stream = "0.1.14"
rustls = "0.21.12"
jemalloc-sys = "0.5.2+5.3.0-patched"
Contributor

What is this used for? A patched version makes me wonder if this is stable enough.

Author

Thank you for mentioning this!
I needed a way to allocate memory for the spdk_bdev_zone_info c struct. I am using calloc and free of this allocator crate.

Can you suggest a nicer way to do this? :)

@hrudaya21 Unfortunately, there is no convenient way to allocate this struct within SPDK.
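
One possible alternative to `calloc`/`free` from an allocator crate is to let Rust own the zeroed buffer and hand its pointer to C. The sketch below assumes the struct is plain old data for which all-zero bytes are valid. `ZoneInfo` is a hypothetical stand-in for the bindgen-generated `spdk_bdev_zone_info` (its field layout here is an assumption), and `fake_get_zone_info` merely imitates a C function that fills a caller-provided buffer.

```rust
use std::mem;

/// Stand-in for the generated `spdk_bdev_zone_info` binding; the fields are
/// assumed for illustration only.
#[repr(C)]
#[derive(Debug, Clone, Copy)]
pub struct ZoneInfo {
    pub zone_id: u64,
    pub write_pointer: u64,
    pub capacity: u64,
    pub state: u32,
}

/// Allocate `count` zero-initialised entries on the Rust heap, comparable to
/// `calloc(count, size_of::<ZoneInfo>())`. The Vec frees itself on drop.
pub fn zeroed_zone_infos(count: usize) -> Vec<ZoneInfo> {
    // SAFETY: ZoneInfo is a `repr(C)` struct of integers, so the all-zero
    // bit pattern is a valid value.
    vec![unsafe { mem::zeroed::<ZoneInfo>() }; count]
}

/// Hypothetical stand-in for a C function that fills `count` entries.
///
/// # Safety
/// `buf` must point to at least `count` writable `ZoneInfo` entries.
pub unsafe fn fake_get_zone_info(buf: *mut ZoneInfo, count: usize, zone_size: u64) {
    for i in 0..count {
        // SAFETY: the caller guarantees `buf` covers `count` entries.
        let info = unsafe { &mut *buf.add(i) };
        info.zone_id = i as u64 * zone_size;
        info.write_pointer = info.zone_id;
        info.capacity = zone_size;
    }
}
```

The buffer would be passed as `infos.as_mut_ptr()`. If the C side fills it asynchronously, the `Vec` has to be kept alive (for example, moved into the completion context) until the callback has run.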


devinfo = { path = "../utils/dependencies/devinfo" }
jsonrpc = { path = "../jsonrpc"}