Only interrupt active disk I/Os in failmode=continue #17372
Conversation
failmode=continue is in a sorry state. Originally designed to fix a very specific problem, it causes crashes and panics for most people who end up trying to use it. At this point, we should either remove it entirely, or try to make it more usable. With this patch, I choose the latter.

While the feature is fundamentally unpredictable and prone to race conditions, it should be possible to get it to the point where it can at least sometimes be useful for some users. This patch fixes one of the major issues with failmode=continue: it interrupts even ZIOs that are patiently waiting in line behind stuck IOs.

Signed-off-by: Paul Dagnelie <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
To be clear: this is about zfs_deadman_failmode=continue (tunable/modparam), not failmode=continue (pool property); that is, spa->spa_deadman_failmode vs spa->spa_failmode.

(Important to me, because I'm in the middle of rehabilitating failmode=continue proper, via #17355 and beyond. So if we are bumping into each other, we should have a chat about that. But I don't think we are!)
Property lawyering aside, this is good for what it is! My gut feel is that at least some of this should be "inside" the queue, but that's vague and this function obviously needs more work than this, so we good. Good stuff!
It may crash less often, as you say, but damn it is dirty! I'd vote for removal. Every I/O sent to the block layer must complete. Period. Otherwise we are looking at an indefinite number of problems, including random memory corruptions, etc.
Do we know if this is actually still useful for anybody? This code was originally added long ago as an ugly last resort to deal with some broken SCSI error handling observed in a now ancient and unsupported Linux kernel. Things should be better now; personally I haven't needed to set this in many years. Unless we have a real need, I'd be inclined to remove it as well.
The reason I filed the PR is that we had a customer who ran into a case where their disks dropped an I/O on the floor. The pool hung waiting for the I/O to complete for multiple hours, so it was clearly not going to complete. While forcing a reset would not have been the end of the world, it would have been better if they could have forced the I/O to retry and kept the system running, even if it resulted in a memory leak of the I/O.

I would prefer not to remove the functionality, personally. It's true that it's not super stable, and it's impossible to guarantee that it will work correctly. But in the cases where it comes up, it is reasonable for the user to want to attempt something, rather than just letting the system hang or panic. That said, I do think that we should either try to improve it or get rid of it. I think this patch makes it usable enough to be worth keeping, but I am open to the argument that it's not even worth attempting.
If some disk is misbehaving, it should be detached or disabled in some way, so that its HBA can complete all stuck I/Os. Just making ZFS issue new I/Os does not mean they won't get stuck either. At the same time, if the previous I/Os proceed, they might read/write memory that is no longer allocated, or even worse, repurposed. And if they complete at some point, ZFS may no longer be waiting for them, so we'll get some double completion. Considering this is a pretty isolated chunk of code, I don't think I mind this patch going in, but I personally won't touch it even with a 10-foot pole.
Ok, well since this functionality is still clearly handy for its original purpose, I agree let's improve it as much as we reasonably can. It'll never be entirely safe, but sometimes you're just left with no good options.
failmode=continue is in a sorry state. Originally designed to fix a very specific problem, it causes crashes and panics for most people who end up trying to use it. At this point, we should either remove it entirely, or try to make it more usable. With this patch, I choose the latter.

While the feature is fundamentally unpredictable and prone to race conditions, it should be possible to get it to the point where it can at least sometimes be useful for some users. This patch fixes one of the major issues with failmode=continue: it interrupts even ZIOs that are patiently waiting in line behind stuck IOs.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Rob Norris <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Co-authored-by: Paul Dagnelie <[email protected]>
Closes #17372
Sponsored by: [Klara, Inc., Wasabi Technology, Inc.]
Motivation and Context
failmode=continue is in a sorry state. Originally designed to fix a very specific problem, it causes crashes and panics for most people who end up trying to use it. At this point, we should either remove it entirely, or try to make it more usable.
Description
With this patch, I choose the latter. While the feature is fundamentally unpredictable and prone to race conditions, it should be possible to get it to the point where it can at least sometimes be useful for some users. This patch fixes one of the major issues with failmode=continue: it interrupts even ZIOs that are patiently waiting in line behind stuck IOs.

Advancing a ZIO from the VDEV_IO_START stage to the VDEV_IO_DONE stage causes a problem if the IO hadn't actually been issued yet, since the IO_DONE stage unconditionally removes the zio from the active list. That fails if the IO hasn't been added to the active list.
To prevent this, we only wake IOs that are actually active leaf vdev IOs (see the sketch below). Any other I/O will get ignored; this does reduce the scope of what the continue failmode can address, but I think it does so in a helpful way. Feedback on other ideas is welcome, though.
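As a rough illustration of that check, here is a minimal sketch in the style of the ZFS kernel code. It is not the actual diff; the helper name and the exact guard are assumptions, but it shows the shape of the test: only a leaf-vdev ZIO that was really issued (and is therefore on the vdev queue's active list) may be interrupted and advanced to VDEV_IO_DONE.

```c
#include <sys/zio.h>
#include <sys/zio_impl.h>
#include <sys/vdev_impl.h>

/*
 * Illustrative sketch only, not the merged change.  The idea is that the
 * deadman path would call a helper like this before zio_interrupt(), so
 * that only active leaf-vdev I/Os are woken.
 */
static boolean_t
zio_deadman_interruptible(zio_t *zio)
{
	vdev_t *vd = zio->io_vd;

	/* Only leaf vdev I/Os were actually handed to a disk. */
	if (vd == NULL || !vd->vdev_ops->vdev_op_leaf)
		return (B_FALSE);

	/*
	 * Only a ZIO sitting in VDEV_IO_START was issued and placed on the
	 * vdev queue's active list.  Anything still waiting in line behind
	 * the stuck I/O is not on that list, and advancing it to
	 * VDEV_IO_DONE would try to remove it from a list it was never
	 * added to.
	 */
	if (zio->io_stage != ZIO_STAGE_VDEV_IO_START)
		return (B_FALSE);

	return (B_TRUE);
}
```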
How Has This Been Tested?

I used mdadm to create virtual devices, and then built a pool on top of them. I then used mdadm suspend to cause the IOs to one of the devices to hang. I then set the failmode to continue after the deadman messages started to appear in the log. Without the patch, the system reliably kernel panics. With the patch, the IOs are reexecuted.
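For reference, a hedged sketch of that kind of reproduction. It uses dmsetup rather than mdadm as one way to stall I/O to a single backing device; the device name ("stuck-leg") is hypothetical, and only the zfs_deadman_failmode tunable shown is the real knob being exercised here.

```sh
# Illustrative reproduction sketch; "stuck-leg" is a made-up dm device
# backing one vdev of a test pool.

# 1. Stall all I/O to one leg of the pool.
dmsetup suspend --noflush stuck-leg

# 2. Wait for deadman warnings in dmesg, then switch the deadman failure
#    mode so that hung, active leaf-vdev I/Os are interrupted and retried.
echo continue > /sys/module/zfs/parameters/zfs_deadman_failmode

# 3. Resume the device. With the patch the interrupted I/Os are
#    re-executed; without it the system reliably panics.
dmsetup resume stuck-leg
```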