Description
Among experienced zfs users and developers, it seems to be conventional wisdom that zfs native encryption is not suitable for production usage, particularly when combined with snapshotting and zfs send/recv. There is a long-standing data corruption issue with many firsthand user reports:
openzfs/zfs#12014
openzfs/zfs#11688
(Also see the issues linked from those)
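For readers unfamiliar with the feature combination in question, the reports above generally involve a workflow along these lines. This is a minimal sketch only; the pool and dataset names are hypothetical, and it is not claimed to be a reliable reproducer:

```sh
# Create an encrypted dataset (pool/dataset names are hypothetical)
zfs create -o encryption=on -o keyformat=passphrase tank/secure

# ...write data, then snapshot it...
zfs snapshot tank/secure@backup1

# Replicate with a non-raw (decrypted) send; this is the path in which
# users report checksum errors appearing on snapshots of the source dataset
zfs send tank/secure@backup1 | zfs recv backup/secure

# Later, incremental sends of subsequent snapshots
zfs snapshot tank/secure@backup2
zfs send -i @backup1 tank/secure@backup2 | zfs recv backup/secure
```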
--
Update 2025-05-31: At least 2 bugs in non-raw send with encryption were found and fixed. They will be included in zfs 2.2.8 and zfs 2.3.3, which are not yet released at the time of this writing. See the following:
Issue: openzfs/zfs#12014
2.2.8-staging branch commit - openzfs/zfs@b144b16
2.3.3-staging branch commit - openzfs/zfs@f28c685
Whether these fixes resolve all corruption issues with zfs encryption will need to be assessed over a longer period of time, but this is very promising and exciting.
--
Added 2025-04-19: A native encryption + snapshot corruption issue can be consistently reproduced using the following repro scripts and repro steps. The author of the reproducer also bisected it, with the following conclusion. Unfortunately, the particular commit where the issue was introduced (openzfs/zfs@30af21b) is quite large, and it does not bring as much immediate clarity to the cause or a fix as one might hope. The issue is being tracked at openzfs/zfs#12014.
Additionally, if you join #zfs or #zfsonlinux on libera.chat and mention that you're having an issue with zfs native encryption, you'll be met with advice from developers that zfs native encryption is simply not reliable.
Should warnings be added to the documentation sections and/or the zfs command output that mention native encryption, stating that this combination of features (native encryption + send/recv) is known to be unsuitable for production usage? As it stands, there don't appear to be any warnings, and it seems inappropriate to guide new zfs users down a path toward potential data corruption, or, at best, unscheduled reboots and scrubs. I have drafted a warning message below; it can of course be adjusted and is just here to get the ball rolling:
Begin message
ZFS has a known issue where using "zfs send" from an encrypted dataset may result in checksum errors being reported on snapshots within that dataset.
Please note:
- In many configurations and workloads, this problem does not occur at all, even when "zfs send" is used from an encrypted dataset; many users run zfs encryption together with zfs send without issue.
- It is not understood precisely which combinations of hardware, software configuration, and workload cause the issue to manifest.
- In some cases, when the issue occurs, the checksum errors can be eliminated by rebooting and scrubbing the affected pool twice, but it is not known with certainty that this always succeeds.
- In some cases, the issue can be avoided entirely by using a "raw" zfs send instead of an unencrypted zfs send, but it is not known with certainty that this always avoids the issue.
If you are considering using zfs encryption along with snapshot send/receive in use cases where unscheduled reboots and/or unscheduled scrubs are not acceptable, you may wish to thoroughly test your software and hardware configuration with your workload before putting it into production. If this is not practical, it may be best to explore other options for data encryption until this known issue is rectified. For more information, see the following GitHub issue:
#494
End Message
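For clarity, the two mitigations mentioned in the draft above correspond to commands along these lines. This is a sketch only; the pool and dataset names are hypothetical, and, as the draft says, neither step is guaranteed to help in every case:

```sh
# Mitigation 1: use a raw send (-w / --raw), which transmits the blocks
# still encrypted instead of decrypting them on the sending side
zfs send -w tank/secure@backup1 | zfs recv backup/secure

# Mitigation 2: if checksum errors have already been reported, some users
# report that rebooting and then scrubbing the affected pool twice clears them
zpool scrub tank
# ...wait for the first scrub to finish, check the results, then scrub again...
zpool status -v tank
zpool scrub tank
```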
Update (Feb 2024):
I received some feedback that this was not sufficiently substantiated, so for additional context, here is a Reddit comment from a zfs developer and contributor:
I have a strange little testbed next to me that reproduces one of the issues over 50% of the time you test it. Depending on which problem, sometimes this is "just" a kernel panic, sometimes it mangles your key settings so you need something custom and magic to let you reach in and fix it, sometimes it writes records that should not have been allowed in an encrypted dataset and then errors out trying to read them again. (To pick three examples.) (The illumos folks reported permanent data loss from what looks like a similar bug to one on OpenZFS, but that's not exactly the same code, so YMMV how worried that makes you.)
In addition, there is the constant stream of user reports in the issues referenced above.
I think there is already an understanding that this issue may be very difficult to fix, but in the meantime I am only suggesting that lay users such as myself would benefit from documentation and zfs command-level warnings against using these features in production until this is resolved.