2.2.8 staging prep #17325

robn · 2025-05-12T03:49:46Z

Motivation and Context

Bringing various things back to 2.2: a Linux kernel version update, the recent dataset hold fixes, and a handful of other bits.

Description

Includes:

How Has This Been Tested?

Compile checked latest release of all LTS kernels back to 4.19, and all 6.x kernels.

Also compile checked on FreeBSD 14.2.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

For UIO_ITER, we are just wrapping a kernel iterator. It will take care of its own offsets if necessary. We don't need to do anything, and if we do try to do anything with it (like advancing the iterator by the skip in zfs_uio_advance) we're just confusing the kernel iterator, ending up at the wrong position or worse, off the end of the memory region.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tino Reichardt <[email protected]>
Reviewed-by: Brian Atkinson <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#17298
(cherry picked from commit 2ee5b51)

This is a convenience for filesystems that need the inode of their parent or their own name, as it's often complicated to get that information. We don't need those things, so this is just detecting which prototype is expected and adjusting our callback to match.

Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
(cherry picked from commit 7ef6b70)

According to the upstream change, all callers set it, and all block devices either honoured it or ignored it, so removing it entirely allows a bunch of handling for the "unset" case to be removed, and it becomes effectively implied. We follow suit, and keep setting it for older kernels.

Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
(cherry picked from commit 2ca91ba)

since 4.10, bio->bi_opf needs to be checked to determine all kinds of flush requests. this was the case prior to the commit referenced below, but the order of ifdefs was not the usual one (newest up top), which might have caused this to slip through.

this fixes a regression when using zvols as QEMU block devices, but might have broken other use cases as well. the symptoms are that all sync writes from within a VM configured to use such a virtual block device are ignored and treated as async writes by the host ZFS layer.

this can be verified using fio in sync mode inside the VM, for example with

    fio \
      --filename=/dev/sda --ioengine=libaio --loops=1 --size=10G \
      --time_based --runtime=60 --group_reporting --stonewall --name=cc1 \
      --description="CC1" --rw=write --bs=4k --direct=1 --iodepth=1 \
      --numjobs=1 --sync=1

which shows an IOPS number way above what the physical device underneath supports, with "zpool iostat -r 1" on the hypervisor side showing no sync IO occurring during the benchmark.

with the regression fixed, both fio inside the VM and the IO stats on the host show the expected numbers.

Fixes: 846b598 "config: remove HAVE_REQ_OP_* and HAVE_REQ_*"
Signed-off-by: Fabian-Gruenbichler <[email protected]>
Co-authored-by: Alexander Motin <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Closes openzfs#17131
(cherry picked from commit 41823a0)

It turns out that the approach taken in the original version of the patch was wrong. So now, we're taking an approach in line with how the kernel actually does it: when sb is being torn down, access to it is serialized via the sb->s_umount rwsem, and only when that lock is taken is it okay to work with s_flags. The other mistake I was making was trying to make SB_ACTIVE work, but apparently the kernel checks the negative variant: not SB_DYING and not SB_BORN. Kernels pre-6.6 don't have SB_DYING, but check if sb is hashed instead.

Signed-off-by: Pavel Snajdr <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Closes: openzfs#17121
(cherry picked from commit a0e6271)

Avoids using fallback_migrate_folio, which starts unnecessary writeback (leading to BUG in migrate_folio_extra).

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Brian Atkinson <[email protected]>
Signed-off-by: tstabrawa <[email protected]>
Closes openzfs#16568
Closes openzfs#16723
(cherry picked from commit 7b6e967)

The 6.0 kernel removes the 'migratepage' VFS op. Check for migratepage.

Signed-off-by: Tony Hutter <[email protected]>
Reviewed-by: Rob Norris <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Closes openzfs#17217
(cherry picked from commit ab9bb19)

robn · 2025-05-12T03:50:54Z

@tonyhutter I dunno if a 2.2.8 is planned (probably entirely dependent on demand). If it does happen, I'm very happy to assist with testing to get it out the door.

tonyhutter · 2025-05-12T16:25:18Z

I tried this on Fedora 42 (6.14 kernel) and wasn't able to get it to build. Pulling in 155847c helped, but then I hit:

make[2]: Entering directory '/home/hutter/zfs'
  CC       module/zfs/libzpool_la-ddt.lo
module/zfs/ddt.c: In function 'ddt_stat_update':
module/zfs/ddt.c:445:9: error: array subscript -1 is below array bounds of 'ddt_stat_t[64]' {aka 'struct ddt_stat[64]'} [-Werror=array-bounds=]
  445 |         ddt_stat_add(&ddh->ddh_stat[bucket], &dds, neg);
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from ./include/sys/spa.h:44,
                 from module/zfs/ddt.c:29:
./include/sys/fs/zfs.h:1331:25: note: while referencing 'ddh_stat'
 1331 |         ddt_stat_t      ddh_stat[64];   /* power-of-two histogram buckets */
      |                         ^~~~~~~~
cc1: all warnings being treated as errors

tonyhutter · 2025-05-12T16:26:06Z

> @tonyhutter I dunno if a 2.2.8 is planned (probably entirely dependent on demand).

I didn't have any plans one way or the other, but I'd be open to a 2.2.8 release.

amotin · 2025-05-14T18:48:55Z

It seems #17334 would be good too. Though it is not in 2.3 yet.

lzsaver · 2025-05-17T05:35:25Z

#17340 and #17343 look important. Both build well on top of 2.2.7.

behlendorf · 2025-05-19T17:06:22Z

These are all now in master (#17334, #17340 and #17343) and would be good to include in a 2.2.8 release.

ryao · 2025-05-19T22:44:52Z

We might as well include #17353 once it is merged.

shodanshok · 2025-05-20T13:46:53Z

What about #17255 ?

For what it is worth, I welcome a 2.2.8 release - maybe even as the new default/stable branch for EL 9 (which is currently at 2.1.16).

I've found that QEMU/KVM guest memory accounted as shared is also included in NR_FILE_PAGES. But it is actually non-evictable anonymous memory. Using it as a base for the zfs_arc_pc_percent parameter makes ARC ignore shrinker requests while the page cache does not really have anything to evict, ending up with the OOM killer killing the QEMU process.

Instead, using NR_ACTIVE_FILE + NR_INACTIVE_FILE should represent the part of the page cache that is actually evictable, which should be safer to use as a reference for ARC scaling.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Ameer Hamza <[email protected]>
Reviewed-by: Pavel Snajdr <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Closes openzfs#17334
(cherry picked from commit 0aa83dc)
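To make the difference concrete, here is a small sketch with made-up numbers. It is a simplified model, not the ARC code: it assumes the parameter acts as a percentage of a page-cache reference size, and the struct and `demo_` function names are hypothetical. Only the counter names come from the message above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of the page counters named above. */
typedef struct demo_vmstat {
	uint64_t nr_file_pages;		/* includes shared guest memory */
	uint64_t nr_active_file;
	uint64_t nr_inactive_file;
} demo_vmstat_t;

/* Reference size derived from NR_FILE_PAGES (the old basis). */
static uint64_t
demo_ref_old(const demo_vmstat_t *vs, uint64_t pc_percent)
{
	assert(vs != NULL);
	return (vs->nr_file_pages * pc_percent / 100);
}

/* Reference size derived from the evictable file LRU lists only. */
static uint64_t
demo_ref_new(const demo_vmstat_t *vs, uint64_t pc_percent)
{
	assert(vs != NULL);
	return ((vs->nr_active_file + vs->nr_inactive_file) *
	    pc_percent / 100);
}
```

With 1,000,000 file pages of which only 100,000 sit on the active and inactive file lists, a setting of 50 gives a reference of 500,000 pages on the old basis and 50,000 on the new one, so in this model the ARC would stop resisting shrinker requests much sooner.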

Bisecting identified the redacted send/receive as the source of the bug for issue openzfs#12014. Specifically the call to dsl_dataset_hold_obj(&fromds) has been replaced by dsl_dataset_hold_obj_flags() which passes a DECRYPT flag and creates a key mapping. A subsequent dsl_dataset_rele_flag(&fromds) is missing and the key mapping is not cleared. This may be inadvertently used, which results in arc_untransform failing with ECKSUM in:

    arc_untransform+0x96/0xb0 [zfs]
    dbuf_read_verify_dnode_crypt+0x196/0x350 [zfs]
    dbuf_read+0x56/0x770 [zfs]
    dmu_buf_hold_by_dnode+0x4a/0x80 [zfs]
    zap_lockdir+0x87/0xf0 [zfs]
    zap_lookup_norm+0x5c/0xd0 [zfs]
    zap_lookup+0x16/0x20 [zfs]
    zfs_get_zplprop+0x8d/0x1d0 [zfs]
    setup_featureflags+0x267/0x2e0 [zfs]
    dmu_send_impl+0xe7/0xcb0 [zfs]
    dmu_send_obj+0x265/0x360 [zfs]
    zfs_ioc_send+0x10c/0x280 [zfs]

Fix this by restoring the call to dsl_dataset_hold_obj().

The same applies for to_ds: here replace dsl_dataset_rele(&to_ds) with dsl_dataset_rele_flags().

Both leaked key mappings will cause a panic when exporting the sending pool or unloading the zfs module after a non-raw send from an encrypted filesystem.

Contributions-by: Hank Barta <[email protected]>
Contributions-by: Paul Dagnelie <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Richard Yao <[email protected]>
Reviewed-by: Rob Norris <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: George Amanakis <[email protected]>
Closes openzfs#12014
Closes openzfs#17340
(cherry picked from commit ea74cde)
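The mismatch described above can be pictured with a toy model. This is only a sketch under loud assumptions: the `demo_` types, functions, and flag are hypothetical and do not reflect the real dsl_dataset implementation. It shows why a hold taken with a decrypt flag has to be released with the same flag, or the side state created by the hold is left behind.

```c
#include <assert.h>
#include <stddef.h>

#define	DEMO_HOLD_DECRYPT	0x1	/* hypothetical flag value */

/* Hypothetical dataset handle tracking holds and key mappings. */
typedef struct demo_ds {
	int holds;
	int key_mappings;
} demo_ds_t;

/* Take a hold; a decrypting hold also creates a key mapping. */
static void
demo_hold_flags(demo_ds_t *ds, int flags)
{
	assert(ds != NULL);
	ds->holds++;
	if (flags & DEMO_HOLD_DECRYPT)
		ds->key_mappings++;
}

/* Drop a hold; the key mapping is removed only if the flag is passed. */
static void
demo_rele_flags(demo_ds_t *ds, int flags)
{
	assert(ds != NULL && ds->holds > 0);
	ds->holds--;
	if (flags & DEMO_HOLD_DECRYPT) {
		assert(ds->key_mappings > 0);
		ds->key_mappings--;
	}
}
```

Holding with `DEMO_HOLD_DECRYPT` and releasing with flags `0` leaves `key_mappings` at 1 in this model, which is the kind of leak the message describes. Releasing with the matching flag returns both counters to zero.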

d634d20 had been intended to fix a potential information leak issue where the compiler's optimization passes appeared to remove `memset()` operations that sanitize sensitive data before memory is freed for use by the rest of the kernel.

When I wrote it, I had assumed that the compiler would not remove the other `memset()` operations, but upon reflection, I have realized that this was a bad assumption to make. I would rather have a very slight amount of additional overhead when calling `gcm_clear_ctx()` than risk a future compiler removing `memset()` calls. This is likely to happen if someone decides to try doing link time optimization and the person will not think to audit the assembly output for issues like this, so it is best to preempt the possibility before it happens.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Rob Norris <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Jorgen Lundman <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes openzfs#17343
(cherry picked from commit d8a33bc)
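For readers unfamiliar with the problem, here is a sketch of one common userspace idiom for keeping a clearing store from being optimized away. It is not the approach taken in the commit, which the message does not spell out. The `demo_ctx_t` type and `demo_clear_ctx()` are hypothetical stand-ins, and the technique shown is calling `memset()` through a volatile function pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a crypto context holding secrets. */
typedef struct demo_ctx {
	unsigned char key[32];
	unsigned char iv[12];
} demo_ctx_t;

/*
 * Because the pointer is volatile, the compiler has to load it at
 * run time and cannot assume the callee is memset(), so it cannot
 * drop the call as a dead store even when the buffer is about to
 * be freed.
 */
static void *(*const volatile demo_memset_v)(void *, int, size_t) = memset;

/* Wipe the whole context before it is released. */
static void
demo_clear_ctx(demo_ctx_t *ctx)
{
	assert(ctx != NULL);
	demo_memset_v(ctx, 0, sizeof (*ctx));
}
```

The cost is one indirect call per wipe, which matches the "very slight amount of additional overhead" trade-off mentioned above in spirit, though the actual mechanism used in the tree may differ.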

The intent is that the filesystem may have a reference to an "old" version of the new directory, eg if it was keeping it alive because a remote NFS client still had it open. We don't need anything like that, so this really just changes things so we return error codes encoded in pointers.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Pavel Snajdr <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#17229
(cherry picked from commit bb740d6)
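"Error codes encoded in pointers" refers to the kernel convention of using the top few thousand addresses as negative errno values. The following is a userspace model of that convention, with hypothetical `demo_` names rather than the kernel's own macros, and a made-up callback to show how an int-returning implementation could be adapted to a pointer-returning prototype.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define	DEMO_MAX_ERRNO	4095	/* size of the reserved address range */

/* Encode a negative errno value as a pointer. */
static void *
demo_err_ptr(long err)
{
	return ((void *)(intptr_t)err);
}

/* Decode the errno value from an error pointer. */
static long
demo_ptr_err(const void *p)
{
	return ((long)(intptr_t)p);
}

/* True if the pointer falls in the reserved error range. */
static int
demo_is_err(const void *p)
{
	return ((uintptr_t)p >= (uintptr_t)-DEMO_MAX_ERRNO);
}

/* Hypothetical existing implementation that returns 0 or -errno. */
static int
demo_create_dir_impl(int should_fail)
{
	return (should_fail ? -EEXIST : 0);
}

/*
 * Hypothetical adapter for a pointer-returning prototype: NULL on
 * success (no replacement object to hand back), error pointer on
 * failure.
 */
static void *
demo_create_dir(int should_fail)
{
	int err = demo_create_dir_impl(should_fail);
	return (err != 0 ? demo_err_ptr(err) : NULL);
}
```

A caller would test the result with `demo_is_err()` and recover the code with `demo_ptr_err()`, treating `NULL` as plain success.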

Renamed in 6.2, and the compat wrapper removed in 6.15. No signature or functional change apart from that, so a very minimal update for us.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Pavel Snajdr <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#17229
(cherry picked from commit 841be1d)
(cherry picked from commit 1dc9bec)

This was caught when doing a manual check to see if openzfs#17352 needed to be improved to catch mismatches across stack frames of the kind that were first found in openzfs#17340.

Reviewed-by: George Amanakis <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes openzfs#17353
(cherry picked from commit 83fa80a)
(cherry picked from commit 07d815f2a06573514f51c8601aa60db6fe1a76ad)

With certain combinations of target ARC state balance and ghost hit rates it was possible to get the fractions outside of the allowed range. This patch limits the maximum balance adjustment speed, which should make it impossible, and also asserts it.

Fixes openzfs#17210
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Reviewed-by: Rob Norris <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
(cherry picked from commit b1ccab1)
(cherry picked from commit 677cb91)

robn · 2025-05-23T13:09:15Z

All suggestions included, top post updated. Looks like we're putting a release together!

I went back through the ~600 commits on master that aren't on the 2.2 series, and there's a good amount of fixes for leaks, overflows, null derefs and so on. I didn't really want to load up 2.2 this late though; it feels like we should be very sparing with what goes on the LTS branch.

Anyway, feel free to nominate stuff I guess :)

aerusso · 2025-05-23T14:11:20Z

Could you include 8bf1e83 please too? It fixes some ZTS failures on Debian.

robn and others added 7 commits May 12, 2025 13:40

amotin approved these changes May 12, 2025

behlendorf approved these changes May 16, 2025

amotin and others added 3 commits May 23, 2025 22:25

robn changed the title ~~[2.2.8] various Linux kernel updates~~ 2.2.8 staging prep May 23, 2025

robn and others added 4 commits May 23, 2025 22:56

robn force-pushed the zfs-2.2.8-staging branch from 81f2feb to a6865d7 Compare May 23, 2025 12:57

amotin approved these changes May 23, 2025
