
Conversation

pcd1193182 (Contributor)

Sponsored by: [Klara, Inc., Wasabi Technology, Inc.]

Motivation and Context

failmode=continue is in a sorry state. Originally designed to fix a very specific problem, it causes crashes and panics for most people who end up trying to use it. At this point, we should either remove it entirely, or try to make it more usable.

Description

With this patch, I choose the latter. While the feature is fundamentally unpredictable and prone to race conditions, it should be possible to get it to the point where it is at least sometimes useful to some users. This patch fixes one of the major issues with failmode=continue: it interrupts even ZIOs that are patiently waiting in line behind stuck IOs. Advancing a ZIO from the VDEV_IO_START stage to the VDEV_IO_DONE stage is a problem if the IO was never actually issued, because the IO_DONE stage unconditionally removes the zio from the active list, and that removal fails if the IO was never added to the active list in the first place.

To prevent this, we now only wake IOs that are actually active leaf-vdev IOs. Any other IO is ignored. This narrows the scope of what the continue failmode can address, but I think it does so in a helpful way. Feedback on other ideas is welcome, though.

How Has This Been Tested?

I used mdadm to create virtual devices, and then built a pool on top of them. I then used mdadm suspend to make the IOs to one of the devices hang, and set the failmode to continue after the deadman messages started to appear in the log. Without the patch, the system reliably kernel panics. With the patch, the IOs are re-executed.
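For anyone wanting to reproduce this, a rough sketch of the procedure follows. The author used mdadm; this sketch swaps in device-mapper, since `dmsetup suspend` is a simple way to make in-flight IOs hang indefinitely. All paths and names (`/tmp/zfs-test.img`, `hangdev`, `testpool`) are made up for illustration, and the commands need root and a loaded ZFS module.

```shell
#!/bin/sh
# Back a loop device with a sparse file and wrap it in a dm-linear
# target so we can suspend IO to it on demand.
truncate -s 1G /tmp/zfs-test.img
LOOP=$(losetup -f --show /tmp/zfs-test.img)
SECTORS=$(blockdev --getsz "$LOOP")
dmsetup create hangdev --table "0 $SECTORS linear $LOOP 0"

# Build a pool on top of the suspendable device.
zpool create testpool /dev/mapper/hangdev

# Hang all further IO to the vdev; pool writes now block.
dmsetup suspend --noflush hangdev
dd if=/dev/zero of=/testpool/f bs=1M count=16 oflag=sync &

# Once deadman messages appear in dmesg, switch the deadman failmode
# (note: the tunable, not the failmode pool property).
echo continue > /sys/module/zfs/parameters/zfs_deadman_failmode

# Resume the device; with the patch, the stuck IOs are re-executed.
dmsetup resume hangdev
```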

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

failmode=continue is in a sorry state. Originally designed to fix a very
specific problem, it causes crashes and panics for most people who end
up trying to use it. At this point, we should either remove it entirely,
or try to make it more usable.

With this patch, I choose the latter. While the feature is fundamentally
unpredictable and prone to race conditions, it should be possible to get
it to the point where it can at least sometimes be useful for some
users. This patch fixes one of the major issues with failmode=continue:
it interrupts even ZIOs that are patiently waiting in line behind stuck
IOs.

Signed-off-by: Paul Dagnelie <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
@robn robn (Member) left a comment


To be clear: this is about zfs_deadman_failmode=continue (tunable/modparam), not failmode=continue (pool property), that is, spa->spa_deadman_failmode vs spa->spa_failmode.

(Important to me, because I'm in the middle of rehabilitating failmode=continue proper, via #17355 and beyond. So if we are bumping into each other, we should have a chat about that. But I don't think we are!).

Property lawyering aside, this is good for what it is! My gut feel is that at least some of this should be "inside" the queue, but that's vague and this function obviously needs more work than this, so we good. Good stuff!

@amotin amotin added the Status: Code Review Needed Ready for review and testing label May 27, 2025
@amotin (Member)

amotin commented May 27, 2025

It may crash less often, as you say, but damn, it is dirty! I'd vote for removal. Every I/O sent to the block layer must complete. Period. Otherwise we are looking at an indefinite number of problems, including random memory corruption, etc.

@behlendorf (Contributor)

Do we know if this is actually still useful to anybody? This code was originally added long ago as an ugly last resort to deal with some broken SCSI error handling observed in a now-ancient and unsupported Linux kernel. Things should be better now; personally, I haven't needed to set this in many years. Unless we have a real need, I'd be inclined to remove it as well.

@pcd1193182 (Contributor, Author)

The reason I filed the PR is that we had a customer who ran into a case where their disks dropped an I/O on the floor. The pool hung waiting for the I/O to complete for multiple hours, so it was clearly never going to complete. While forcing a reset would not have been the end of the world, it would have been better if they could have forced the I/O to retry and kept the system running, even if it resulted in a memory leak of the bio or some other issue.

I would prefer not to remove the functionality, personally. It's true that it's not super stable, and it's impossible to guarantee that it will work correctly. But in the cases where it comes up, it is reasonable for the user to want to attempt something, rather than just letting the system hang or panic. That said, I do think that we should either try to improve it or get rid of it. I think this patch makes it usable enough to be worth keeping, but I am open to the argument that it's not even worth attempting.

@amotin (Member)

amotin commented May 28, 2025

If a disk is misbehaving, it should be detached or disabled in some way, so that its HBA can complete all stuck I/Os. Just making ZFS issue new I/Os does not mean they won't get stuck either. At the same time, if the previous I/Os proceed, they might read or write memory that is no longer allocated, or even worse, repurposed. And if they complete at some point, ZFS may no longer be waiting for them, so we'll get a double completion. Considering this is a pretty isolated chunk of code, I don't mind this patch going in, but I personally won't touch it with a 10-foot pole.

@behlendorf (Contributor)

Ok, well, since this functionality is still clearly handy for its original purpose, I agree: let's improve it as much as we reasonably can. It'll never be entirely safe, but sometimes you're just left with no good options.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels May 28, 2025
@behlendorf behlendorf merged commit c464f1d into openzfs:master May 28, 2025
21 of 24 checks passed
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Jun 13, 2025
failmode=continue is in a sorry state. Originally designed to fix a very
specific problem, it causes crashes and panics for most people who end
up trying to use it. At this point, we should either remove it entirely,
or try to make it more usable.

With this patch, I choose the latter. While the feature is fundamentally
unpredictable and prone to race conditions, it should be possible to get
it to the point where it can at least sometimes be useful for some
users. This patch fixes one of the major issues with failmode=continue:
it interrupts even ZIOs that are patiently waiting in line behind stuck
IOs.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Rob Norris <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Co-authored-by: Paul Dagnelie <[email protected]>
Closes openzfs#17372
behlendorf pushed a commit that referenced this pull request Jun 17, 2025
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Jun 17, 2025
spauka pushed a commit to spauka/zfs that referenced this pull request Aug 30, 2025