Skip to content

Allow txg_wait_synced_flags() and dmu_tx_assign() to return when the pool suspends #17355

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

robn
Copy link
Member

@robn robn commented May 20, 2025

[Sponsors: Klara, Inc., Wasabi Technology, Inc.]

Motivation and Context

Under failmode=continue, if the pool suspends we will return EIO on syscalls at the "edge" of OpenZFS, but once they're inside the house, they will usually block. We would like to get to a world where it does what it sounds like it will do: throw out in-flight syscalls rather than blocking them in the kernel. This is the first part of realising that.

Description

There's two parts to this change.

At the pool level, txg_wait_synced_flags() gets a new flag TXG_WAIT_SUSPEND. If provided, the call will be allowed to block, but will return if the pool suspends while waiting. To trigger this, a new function txg_wait_kicked() will throw all waiters out of the condvar so they can reconsider their exit conditions. zio_suspend() now calls this.

Further up, dmu_tx_try_assign() gets changed to make use of this flag. It already had a suspend check on first entry and returns EIO or ERESTART depending on the the presence of DMU_TX_WAIT and the current failmode. However, if it then ended up failing the assign because there is no space left in the tx (usually the write throttle), it would fall back into an infinite wait for the next transaction. That wait is now done with TXG_WAIT_SUSPEND, and if the pool suspends, it loops and reconsiders.

There's a few callers of the form VERIFY0(dmu_tx_assign(tx, DMU_TX_WAIT)), which (obviously) cannot handle a surprise error return if the pool suspends. To support them, DMU_TX_SUSPEND has been added, which lets them get the previous behaviour of blocking until the next txg, even if it has to wait until the heat death of the universe. This is sort of a cop-out, but working out a whole bunch of additional error responses is not what I want for an incremental step.

Future plans

On the surface, this looks like a lot of change with a somewhat dubious justification: make the much-maligned failmode=continue work in slightly more places than it does currently. But, this is also laying the groundwork for some future work:

  • being able to break out of txg_wait_synced_flags() on suspend is needed to make syscalls backed by zil_commit() (eg fsync()) also return error on suspend. That code is written, but not shown here because the changes inside the ZIL are very invasive, and I wanted to keep this PR focused.
  • being able to break out of txg_wait_synced_flags() and dmu_tx_assign() is also needed for forced export (currently being rebuilt on top of this) and for some other ideas we're playing with (eg failmode=readonly).
  • there's general interest in making failmode=continue do what people expect from it when they first discover it: eject everything from the kernel. I want to get us there.

How Has This Been Tested?

Tests included to try to test dmu_tx_assign() behaviour during suspend with different failmodes. It's not perfect, as it's somewhat at-a-distance, but it should be good enough for now.

Full ZTS run completed against Linux 6.1.137. FreeBSD 14.2-p1 in progress (will update when done).

“Success” possibly doesn’t say a lot, as there’s few (no?) tests that change failmode= in any meaningful way, or that take much interest in userspace applications when the pool suspends, but at least it suggests normal operation is not broken.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@amotin
Copy link
Member

amotin commented May 20, 2025

FreeBSD is unhappy:

ZTS run /usr/local/share/zfs/zfs-tests/tests/functional/failmode/failmode_dmu_tx_wait
  panic: VERIFY3S(zio->io_error, ==, ENXIO) failed (5 == 6)
  
  cpuid = 0
  time = 1747707582
  KDB: stack backtrace:
  #0 0xffffffff80c3dd05 at kdb_backtrace+0x65
  #1 0xffffffff80bf1882 at vpanic+0x152
  #2 0xffffffff82a4250a at spl_panic+0x3a
  #3 0xffffffff82c79220 at zio_vdev_io_done+0x140
  #4 0xffffffff82c70548 at zio_execute+0x78
  #5 0xffffffff80c52322 at taskqueue_run_locked+0x182
  #6 0xffffffff80c535a2 at taskqueue_thread_loop+0xc2
  #7 0xffffffff80bad28d at fork_exit+0x7d
  #8 0xffffffff8108610e at fork_trampoline+0xe

@amotin amotin added the Status: Code Review Needed Ready for review and testing label May 20, 2025
@robn
Copy link
Member Author

robn commented May 20, 2025

ZTS run /usr/local/share/zfs/zfs-tests/tests/functional/failmode/failmode_dmu_tx_wait
panic: VERIFY3S(zio->io_error, ==, ENXIO) failed (5 == 6)

Ahh, I didn't have a debug build. It has to be something to do with the errors I'm injecting but I'm not seeing it. I'll dig further but it'll be nbd.

robn added 2 commits May 21, 2025 10:25
…suspended

This allows a caller to request a wait for txg sync, with an appropriate
error return if the pool is suspended or becomes suspended during the
wait.

To support this, txg_wait_kick() is added to signal the sync condvar,
which wakes up the waiters, causing them to loop and reconsider their
wait conditions again. zio_suspend() now calls this to trigger the break
if the pool suspends while waiting.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
Mostly for a little more type checking and debugging visibility.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
@robn robn force-pushed the dmu-tx-suspend-break branch from 22d4666 to a9ca834 Compare May 21, 2025 01:13
@robn
Copy link
Member Author

robn commented May 21, 2025

ZTS run /usr/local/share/zfs/zfs-tests/tests/functional/failmode/failmode_dmu_tx_wait
panic: VERIFY3S(zio->io_error, ==, ENXIO) failed (5 == 6)

See top commit. Basically, injecting ENXIO actually injects EIO, which can't beat the assert when injected into a top-level vdev_geom device. It's a whole stupid mess; this patch is the narrowest I could think of to retain the intent while still allowing it to happen. Tell me if its bonkers (my FreeBSD-fu is not strong) and/or if you'd prefer it in a separate PR.

(running my own ZTS run on 14.2 right now).

@robn robn force-pushed the dmu-tx-suspend-break branch from a9ca834 to 414f513 Compare May 21, 2025 03:47
Copy link
Contributor

@pcd1193182 pcd1193182 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Despite what the PR intro says, I don't see the DMU_TX_NOSUSPEND flag; is that just outdated?

robn added 4 commits May 22, 2025 10:24
This adjusts dmu_tx_assign/dmu_tx_wait to be interruptable if the pool
suspends while they're waiting, rather than just on the initial check
before falling back into a wait.

Since that's not always wanted, add a DMU_TX_SUSPEND flag to ignore
suspend entirely, effectively returning to the previous behaviour.

With that, it shouldn't be possible for anything with a standard
dmu_tx_assign/wait/abort loop to block under failmode=continue.

Also should be a bit tighter than the old behaviour, where a
VERIFY0(dmu_tx_assign(DMU_TX_WAIT)) could technically fail if the pool
is already suspended and failmode=continue.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
This is the cheap way to keep non-user functions working after
break-on-suspend becomes default.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
By the assertion, vdev_geom_io_done() only expects ENXIO on an error
when the geom is a top-level (allocating) vdev[1][2]. However, zinject
currently can't insert ENXIO directly, possibly because on Solaris
outright disk failures were reported with EIO[2][3].

This is a narrow workaround to convert EIO to ENXIO when injections are
enabled, to avoid the assertion and allow the test suite to test
behaviour related to probe failure on FreeBSD.

1. freebsd/freebsd-src@37ec52ca7a2a
2. freebsd/freebsd-src@cd730bd6b2b1
3. illumos/illumos-gate@ea8dc4b6d2

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
@robn robn force-pushed the dmu-tx-suspend-break branch from 414f513 to 5879832 Compare May 22, 2025 00:24
@robn
Copy link
Member Author

robn commented May 22, 2025

Despite what the PR intro says, I don't see the DMU_TX_NOSUSPEND flag; is that just outdated?

It was changed to DMU_TX_SUSPEND upthread. I hadn't updated the OP or commit messages, have done now, ty.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Code Review Needed Ready for review and testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants