ZFS corruption related to snapshots post-2.0.x upgrade #12014

Comments
Two other interesting tidbits... When I do the reboot after this issue occurs, the mounting of the individual zfs datasets is S L O W. Several seconds each, when that normally just flies by. After scrubbing, mounting is back to its normal speed. The datasets that have snapshot issues vary with each occurrence. Sometimes it's just one, sometimes many. But var is almost always included. (Though its parent, which has almost no activity ever, also is from time to time, so that's odd.)
Same symptoms here, more or less. See also issue #11688.
I also have the symptom with the corrupted snapshots, without kernel panics so far. So far it has only affected my Debian system with Linux 5.10 and zfs 2.0.3 (I've turned the server off for today; I can check the exact versions tomorrow). Also, while the system has the 2.0.3 zfs utils + module, the pool is still left on the 0.8.6 format. I wasn't able to execute …

On the corrupted system, after I got the mail from ZED, I manually ran a scrub at first, after which the … I've rebooted the server into an Ubuntu 20.10 live with zfs 0.8.4-1ubuntu11.

The errors didn't seem to affect the data on the zvols (all 4 affected snapshots are of zvols). The zvols are used as disks for VMs with ext4 on them. I have two other Ubuntu 21.04 based systems with zfs-2.0.2-1ubuntu5 which are not affected so far. However, they have their pools already upgraded to version 2. All are snapshotted with sanoid and have their datasets encrypted.
I'm seeing this too, on Ubuntu 21.04, also using zfs encryption. I have znapzend running, and it makes a lot of snapshots. Sometimes some of them are bad and can't be used (for example, attempting to send them to a replica destination fails). I now use the …

In the most recent case (this morning) I had something like 4300 errors (many more than I'd seen previously). There are no block-level errors (read/write/cksum). They're cleared after destroying the affected snapshots and scrubbing (and maybe a reboot, depending on .. day?)

Warning! Speculation below:
@jgoerzen Can you …

In my case, #11688 (which you already reference), I've discovered that rebooting "heals" the snapshot -- at least using the patchset I mentioned there.
I'll be glad to. Unfortunately, I rebooted the machine yesterday, so I expect it will be about a week before the problem recurs. It is interesting to see the discussion today in #11688. The unique factor about the one machine of mine that misbehaves is that it has encryption enabled. It wouldn't surprise me to see the same thing here, but I will of course wait for it to recur and let you know.
Hello @aerusso, the problem recurred over the weekend and I noticed it this morning. Unfortunately, the incident that caused it had already expired out of the …
It should be noted that my hourly snap/send stuff runs at 17 minutes past the hour, so that may explain this timestamp correlation. zpool status reported:
Unfortunately I forgot to attempt to do a …
So I think that answers the question. After a reboot but before a scrub, the …
I have similar symptoms, on an encrypted single-SSD Ubuntu 21.04 boot pool, using stock zfs from Ubuntu's repos. Deleting the affected snapshots and scrubbing previously cleared the errors, but on recurrence, repeated scrubbing (without deleting them) caused a deadlock. My system has ECC memory, so it's probably not RAM-related.
@cbreak-black Was there a system restart between the occurrence of the corrupted snapshot and the problems? Restarting has "fixed" this symptom for me (though you will need to scrub twice for the message to disappear, I believe). I have a suspicion that this may be a version of #10737, which has an MR under way there. The behavior I am experiencing could be explained by that bug (syncoid starts many …). I'm holding off on trying to bisect this issue (at least) until testing that MR. (And all the above is conjecture!)
@aerusso No, without a restart I got into the scrub-hang, and had to restart hard. Afterwards, the scrub finished, and several of the errors vanished. The rest of the errors vanished after deleting the snapshots and scrubbing again.
Can I join the club too? #10019
@InsanePrawn I can't seem to find commit 4d5b4a33d in any repository I know of (and apparently neither can GitHub). However, in your report you say this was a "recent git master", and the commit I'm currently betting on being guilty is da92d5c, which was committed in November of the previous year, so I can't use your data point to rule out my theory! Also, it sounds like you didn't have any good way to reproduce the error --- however, you were using a test pool. Compared to my reproduction strategy (which is just: turn my computer on, browse the web, check mail, etc.) it might be easier to narrow in on a test case (or might have been easier a year and a half ago, when this was all fresh). Anyway, if you have any scripts or ideas of what you were doing that caused this besides "snapshots being created and deleted every couple minutes", they might be useful too. (I already tried lots of snapshot creations and deletions during fio on several datasets in a VM.)
Yeah, idk why I didn't go look for the commit in my issue - luckily for us, that server (and pool; it does say yolo, but it's my private server's root pool - it's just that I won't cry much if it breaks; originally built due to then-unreleased crypto) and the git repo on it still exist. Looks like 4d5b4a33d was two systemd-generator commits by me after 610eec4.
FWIW the dataset the issue appeared on was an empty filesystem dataset (maybe a single small file inside) that had snapshots taken in quick intervals (somewhere between 30s and 5m) without actual fs activity, in parallel with a few (5-15) other similarly empty datasets. The pool is a raidz2 on 3.5" spinning SATA disks.

Edit: Turns out the dataset also still exists; the defective snapshot however does not anymore. I doubt that's helpful?
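For concreteness, a minimal sketch of that workload on a throwaway pool (pool and dataset names here are made up, not the originals):

```sh
#!/bin/sh
# Rough recreation of the reported workload: several near-empty datasets
# snapshotted in parallel at short intervals, with no real fs activity.
POOL=tank                      # assumption: use a disposable test pool
for i in $(seq 1 10); do
    zfs create -p "$POOL/empty$i"
done
while true; do
    ts=$(date -u +%Y%m%d%H%M%S)
    for i in $(seq 1 10); do
        zfs snapshot "$POOL/empty$i@auto-$ts" &
    done
    wait
    sleep 60                   # the report mentions 30s to 5m intervals
done
```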
@InsanePrawn Does running the zrepl workload reproduce the bug on 2.0.5 (or another recent release)? I don't think the snapshot is terribly important --- unless you're able to really dig into it with zdb (which I have not developed sufficient expertise to do). Rather, I think it's the workload, hardware setup, and (possibly, but I don't understand the mechanism at all) the dataset itself. Encryption also is a common theme, but that might just affect the presentation (i.e., there's no MAC to fail in the unencrypted, unauthenticated case). Getting at …
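For anyone who does want to poke at a suspect snapshot with zdb, two common starting points (pool and dataset names are placeholders; these are general-purpose inspection flags, not a procedure specific to this bug):

```sh
# Dump object and block-pointer detail for the snapshot (each extra 'd'
# increases verbosity):
zdb -ddddd tank/somedataset@badsnap

# Traverse the pool, printing block statistics and verifying checksums:
zdb -bbcc tank
```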
I've since added redundancy to my pool (it's now a mirror with two devices) and disabled autotrim. The snapshot corruption still happens. I still don't know what is causing it. And I also don't know whether the corruption happens when creating the snapshot and only later gets discovered (when I try to zfs send the snapshots), or whether snapshots get corrupted some time between creation and sending.
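One sketch of a way to narrow that down, assuming (as the reports above suggest) that a full non-raw send is enough to trip the error on a bad snapshot -- run it right after snapshot creation and again later:

```sh
#!/bin/sh
# Hypothetical probe: fully read a snapshot via send and log the result,
# so the time at which corruption becomes visible can be bracketed.
SNAP="$1"                      # e.g. tank/data@znapzend-...
if zfs send "$SNAP" > /dev/null 2>&1; then
    echo "$(date -Is) OK  $SNAP"
else
    echo "$(date -Is) BAD $SNAP (zfs send failed)"
fi
```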
@cbreak-black Can you enable the all-debug.sh ZEDlet, and put the temporary directory somewhere permanent (i.e., not the default of …)? This will get the output of …

I'll repeat this here: if anyone gets me a reliable reproducer on a new pool, I have no doubt we'll be able to solve this in short order.
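For reference, enabling that zedlet looks roughly like this; the paths are typical but distro-specific, so treat them as assumptions:

```sh
# Link the shipped all-debug.sh zedlet into zed's config directory
# (on Debian/Ubuntu the source may live under /usr/lib/zfs-linux/zed.d).
ln -s /usr/libexec/zfs/zed.d/all-debug.sh /etc/zfs/zed.d/

# Redirect the debug log somewhere that survives a reboot, instead of /tmp.
echo 'ZED_DEBUG_LOG="/var/log/zed.debug.log"' >> /etc/zfs/zed.d/zed.rc

systemctl restart zfs-zed.service
```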
Just mentioning here that we saw this on TrueNAS 12.0-U5 with OpenZFS 2.0.5 as well -- see #11688 (comment) for our story.
Since I don't see anyone mentioning it here yet, #11679 contains a number of stories about the ARC getting confused when encryption is involved and, in a very similar looking illumos bug linked from there, eating data at least once. |
@gamanakis Nope, I'm not using raw (-w).
it's present in v2.1.1 as well:
@aerusso you wrote that da92d5c may be the cause of this issue. My workstation at work panics after a couple of days and I need to reset it. Could you provide a branch of 2.1.1 with this commit reverted (the revert causes merge conflicts I can't fix myself), so I could test whether the machine no longer crashes?
@phreaker0 Unfortunately, the bug that da92d5c introduced (#10737) was fixed by #12299, which I believe is present in all maintained branches now. It does not fix #11688 (which I suspect is the same as this bug). I'm currently running 0.8.6 on Linux 5.4.y, and am hoping to wait out this bug (I don't have a lot of time right now, or for the foreseeable future). But if you have a reliable reproducer (or a whole lot of time) you could bisect while running 5.4 (or some other pre-5.10 kernel). I can help anyone who wants to do that. If we can find the guilty commit, I have no doubt this can be resolved.
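For anyone taking up that bisect, the rough shape would be something like the following (build and install steps are distro-specific and abbreviated here; the tags are real openzfs release tags):

```sh
git clone https://github.com/openzfs/zfs.git && cd zfs
git bisect start
git bisect good zfs-0.8.6      # last version reported symptom-free above
git bisect bad zfs-2.0.0       # first release line showing the corruption

# At each bisect step: build, install, reboot into the test build,
# run the snapshot/send workload, then tell git the verdict.
sh autogen.sh && ./configure && make -j"$(nproc)"
git bisect good                # or: git bisect bad
```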
I updated two of my machines (which experienced the issue quite heavily) with zfs 2.3.2 patched with the commit gamanakis@ab60df7. @HankB @pcd1193182 @gamanakis and everyone else who helped, thanks for your hard work!
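For anyone wanting to build the same thing, a sketch of one way to assemble it (the short hash is the one referenced above; the remote layout and reachability of that commit are assumptions):

```sh
git clone https://github.com/openzfs/zfs.git && cd zfs
git checkout zfs-2.3.2
git remote add gamanakis https://github.com/gamanakis/zfs.git
git fetch gamanakis
git cherry-pick ab60df7        # the fix commit mentioned above
# then build and install as usual for your distro
```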
Bisecting identified the redacted send/receive as the source of the bug for issue #12014. Specifically, the call to dsl_dataset_hold_obj(&fromds) had been replaced by dsl_dataset_hold_obj_flags(), which passes a DECRYPT flag and creates a key mapping. A subsequent dsl_dataset_rele_flags(&fromds) is missing and the key mapping is not cleared. This may be inadvertently used, which results in arc_untransform failing with ECKSUM in:

arc_untransform+0x96/0xb0 [zfs]
dbuf_read_verify_dnode_crypt+0x196/0x350 [zfs]
dbuf_read+0x56/0x770 [zfs]
dmu_buf_hold_by_dnode+0x4a/0x80 [zfs]
zap_lockdir+0x87/0xf0 [zfs]
zap_lookup_norm+0x5c/0xd0 [zfs]
zap_lookup+0x16/0x20 [zfs]
zfs_get_zplprop+0x8d/0x1d0 [zfs]
setup_featureflags+0x267/0x2e0 [zfs]
dmu_send_impl+0xe7/0xcb0 [zfs]
dmu_send_obj+0x265/0x360 [zfs]
zfs_ioc_send+0x10c/0x280 [zfs]

Fix this by restoring the call to dsl_dataset_hold_obj(). The same applies for to_ds: here replace dsl_dataset_rele(&to_ds) with dsl_dataset_rele_flags(). Both leaked key mappings will cause a panic when exporting the sending pool or unloading the zfs module after a non-raw send from an encrypted filesystem.

Contributions-by: Hank Barta <[email protected]>
Contributions-by: Paul Dagnelie <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Richard Yao <[email protected]>
Reviewed-by: Rob Norris <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: George Amanakis <[email protected]>
Closes #12014
Closes #17340
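The last paragraph of that commit message implies a minimal trigger for the panic on unpatched modules; as a sketch (dataset names assumed, and since this panics pre-fix, only try it on a throwaway pool):

```sh
# Assumes an encrypted dataset tank/enc with its key loaded.
zfs snapshot tank/enc@s1
zfs send tank/enc@s1 > /dev/null   # non-raw send from an encrypted fs
zpool export tank                  # leaked key mapping panics here pre-fix
```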
This check is currently limited to detecting mismatches that occur in the same stack frame. It does not detect mismatches across stack frames.

Signed-off-by: Richard Yao <[email protected]>
Were you using raw send?
No.
System information
Describe the problem you're observing
Since upgrading to 2.0.x and enabling crypto, every week or so, I start to have issues with my zfs send/receive-based backups. Upon investigating, I will see output like this:
Of note, the <0xeb51> is sometimes a snapshot name; if I zfs destroy the snapshot, it is replaced by this tag.

Bug #11688 implies that zfs destroy on the snapshot and then a scrub will fix it. For me, it did not. If I run a scrub without rebooting after seeing this kind of zpool status output, I get the following in very short order, and the scrub (and eventually much of the system) hangs:

However I want to stress that this backtrace is not the original cause of the problem, and it only appears if I do a scrub without first rebooting.
After that panic, the scrub stalled -- and a second error appeared:
I have found that the solution to this issue is to reboot into single-user mode and run a scrub. Sometimes it takes several scrubs, maybe even with some reboots in between, but eventually it will clear up the issue. If I reboot before scrubbing, I do not get the panic or the hung scrub.
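Spelled out as commands (pool name is illustrative; zpool wait is available on 2.0 and later):

```sh
# From single-user mode, after the reboot:
zpool scrub tank
zpool wait -t scrub tank       # block until the scrub completes
zpool status -v tank           # check whether the errors have cleared
# Sometimes several passes (with reboots in between) are needed.
```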
I run this same version of ZoL on two other machines, one of which runs this same kernel version. What is unique about this machine?
I made a significant effort to rule out hardware issues, including running several memory tests and the built-in Dell diagnostics. I believe I have ruled that out.
Describe how to reproduce the problem
I can't at will. I have to wait for a spell.
Include any warning/errors/backtraces from the system logs
See above
Potentially related bugs
arc_buf_destroy appears in "silent corruption for thousands files gives input/output error but cannot be detected with scrub - at least for openzfs 2.0.0" #11443. The behavior described there has some parallels to what I observe. I am uncertain from the discussion what that means for this.