-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Allow txg_wait_synced_flags()
and dmu_tx_assign()
to return when the pool suspends
#17355
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
FreeBSD is unhappy:
|
Ahh, I didn't have a debug build. It has to be something to do with the errors I'm injecting but I'm not seeing it. I'll dig further but it'll be nbd. |
…suspended This allows a caller to request a wait for txg sync, with an appropriate error return if the pool is suspended or becomes suspended during the wait. To support this, txg_wait_kick() is added to signal the sync condvar, which wakes up the waiters, causing them to loop and reconsider their wait conditions again. zio_suspend() now calls this to trigger the break if the pool suspends while waiting. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
Mostly for a little more type checking and debugging visibility. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
22d4666
to
a9ca834
Compare
See top commit. Basically, injecting ENXIO actually injects EIO, which can't beat the assert when injected into a top-level (running my own ZTS run on 14.2 right now). |
a9ca834
to
414f513
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Despite what the PR intro says, I don't see the DMU_TX_NOSUSPEND flag; is that just outdated?
This adjusts dmu_tx_assign/dmu_tx_wait to be interruptable if the pool suspends while they're waiting, rather than just on the initial check before falling back into a wait. Since that's not always wanted, add a DMU_TX_SUSPEND flag to ignore suspend entirely, effectively returning to the previous behaviour. With that, it shouldn't be possible for anything with a standard dmu_tx_assign/wait/abort loop to block under failmode=continue. Also should be a bit tighter than the old behaviour, where a VERIFY0(dmu_tx_assign(DMU_TX_WAIT)) could technically fail if the pool is already suspended and failmode=continue. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
This is the cheap way to keep non-user functions working after break-on-suspend becomes default. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
By the assertion, vdev_geom_io_done() only expects ENXIO on an error when the geom is a top-level (allocating) vdev[1][2]. However, zinject currently can't insert ENXIO directly, possibly because on Solaris outright disk failures were reported with EIO[2][3]. This is a narrow workaround to convert EIO to ENXIO when injections are enabled, to avoid the assertion and allow the test suite to test behaviour related to probe failure on FreeBSD. 1. freebsd/freebsd-src@37ec52ca7a2a 2. freebsd/freebsd-src@cd730bd6b2b1 3. illumos/illumos-gate@ea8dc4b6d2 Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
414f513
to
5879832
Compare
It was changed to |
[Sponsors: Klara, Inc., Wasabi Technology, Inc.]
Motivation and Context
Under
failmode=continue
, if the pool suspends we will returnEIO
on syscalls at the "edge" of OpenZFS, but once they're inside the house, they will usually block. We would like to get to a world where it does what it sounds like it will do: throw out in-flight syscalls rather than blocking them in the kernel. This is the first part of realising that.Description
There's two parts to this change.
At the pool level,
txg_wait_synced_flags()
gets a new flagTXG_WAIT_SUSPEND
. If provided, the call will be allowed to block, but will return if the pool suspends while waiting. To trigger this, a new functiontxg_wait_kicked()
will throw all waiters out of the condvar so they can reconsider their exit conditions.zio_suspend()
now calls this.Further up,
dmu_tx_try_assign()
gets changed to make use of this flag. It already had a suspend check on first entry and returnsEIO
orERESTART
depending on the the presence ofDMU_TX_WAIT
and the current failmode. However, if it then ended up failing the assign because there is no space left in the tx (usually the write throttle), it would fall back into an infinite wait for the next transaction. That wait is now done withTXG_WAIT_SUSPEND
, and if the pool suspends, it loops and reconsiders.There's a few callers of the form
VERIFY0(dmu_tx_assign(tx, DMU_TX_WAIT))
, which (obviously) cannot handle a surprise error return if the pool suspends. To support them,DMU_TX_SUSPEND
has been added, which lets them get the previous behaviour of blocking until the next txg, even if it has to wait until the heat death of the universe. This is sort of a cop-out, but working out a whole bunch of additional error responses is not what I want for an incremental step.Future plans
On the surface, this looks like a lot of change with a somewhat dubious justification: make the much-maligned
failmode=continue
work in slightly more places than it does currently. But, this is also laying the groundwork for some future work:txg_wait_synced_flags()
on suspend is needed to make syscalls backed byzil_commit()
(egfsync()
) also return error on suspend. That code is written, but not shown here because the changes inside the ZIL are very invasive, and I wanted to keep this PR focused.txg_wait_synced_flags()
anddmu_tx_assign()
is also needed for forced export (currently being rebuilt on top of this) and for some other ideas we're playing with (egfailmode=readonly
).failmode=continue
do what people expect from it when they first discover it: eject everything from the kernel. I want to get us there.How Has This Been Tested?
Tests included to try to test
dmu_tx_assign()
behaviour during suspend with different failmodes. It's not perfect, as it's somewhat at-a-distance, but it should be good enough for now.Full ZTS run completed against Linux 6.1.137. FreeBSD 14.2-p1 in progress (will update when done).
“Success” possibly doesn’t say a lot, as there’s few (no?) tests that change
failmode=
in any meaningful way, or that take much interest in userspace applications when the pool suspends, but at least it suggests normal operation is not broken.Types of changes
Checklist:
Signed-off-by
.