FreeBSD: Correct _PC_MIN_HOLE_SIZE #17750

dag-erling · 2025-09-15T18:41:50Z

Motivation and Context

The actual minimum hole size on ZFS is variable, but we always report SPA_MINBLOCKSIZE, which is 512. This may lead applications to believe that they can reliably create holes at 512-byte boundaries and waste resources trying to punch holes that ZFS ends up filling anyway.

Description

In zfs_pathconf(), if the vnode is a regular file, return its block size; if it is a directory, return the dataset record size; if it is neither, return EINVAL.

In zfsctl_common_pathconf(), always return EINVAL for _PC_MIN_HOLE_SIZE.

How Has This Been Tested?

Tested in FreeBSD 16.0-CURRENT.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Quality assurance (non-breaking change which makes the code more robust against bugs)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

amotin · 2025-09-15T19:23:08Z

While SPA_MINBLOCKSIZE is obviously over-optimistic, it is really the smallest hole size that might theoretically exist on ZFS. I wonder how exactly reporting 1 value may help any app to behave better, other than trying to hardcode something, which is not great.

I wonder if apps could actually request it more specific. As I see, pathconf(3) allows requests for both files and directories. If application really like to know about a specific file, it should ask it for a specific file. In that case we could report correct value of a block size of the specific object. If not that, we could report dataset's record size, which should be more or less accurate in most cases.

dag-erling · 2025-09-15T19:36:32Z

As I see, pathconf(3) allows requests for both files and directories. If application really like to know about a specific file, it should ask it for a specific file. In that case we could report correct value of a block size of the specific object. If not that, we could report dataset's record size, which should be more or less accurate in most cases.

That sounds reasonable, I'll give it a shot. But I wonder if we should still report 1 in the ctldir case.

dag-erling · 2025-09-15T19:42:43Z

Does this look reasonable?

	case _PC_MIN_HOLE_SIZE:
		if (vp->v_type == VREG) {
			zp = VTOZ(vp);
			sa_object_size(zp->z_sa_hdl, &blksize, &nblocks);
			*valp = (int)blksize;
			return (0);
		}
		if (vp->v_type == VDIR) {
			*valp = (int)vp->v_mount->mnt_stat.f_iosize;
			return (0);
		}
		return (EINVAL);

amotin · 2025-09-15T19:43:06Z

But I wonder if we should still report 1 in the ctldir case.

I don't remember us having any files in the ctldir that could potentially have holes, so I wonder if we should just delete that part instead.

amotin · 2025-09-15T19:59:37Z

Does this look reasonable?

Not bad. I just wonder what are the best APIs to use here. From actual functionality side though, for files of only one block we should report maximum of the file block size and the dataset record size, since the file block size may increase (but never decrease) if the file grow, while for a file of one block holes reporting is not really productive.

dag-erling · 2025-09-16T11:59:28Z

I don't remember us having any files in the ctldir that could potentially have holes, so I wonder if we should just delete that part instead.

Does the ctldir even ever contain any files? afaict the only entries it contains are ., .., and snapshot, so we may as well return EINVAL.

dag-erling · 2025-09-16T12:09:30Z

[...] for files of only one block we should report maximum of the file block size and the dataset record size, [...]

Isn't that just the record size?

I've been trying various things and it looks like files smaller than the record size can contain holes only if the entire file is a hole. As soon as there is even one nonzero byte anywhere in the file, the entire file gets allocated. So I'm starting to think the dataset record size is the correct value in all cases.

amotin · 2025-09-16T13:26:47Z

Does the ctldir even ever contain any files?

No, I don't think it does now, but I think it was discussed at some point to expose some information/controls that way.

So I'm starting to think the dataset record size is the correct value in all cases.

Dataset record size might change, but that does not affect already existing files with more than one block or with a block size already bigger than the new value. So the couple more conditions could give better result.

kevans91 · 2025-09-17T14:23:14Z

Does this look reasonable?

Not bad. I just wonder what are the best APIs to use here. From actual functionality side though, for files of only one block we should report maximum of the file block size and the dataset record size, since the file block size may increase (but never decrease) if the file grow, while for a file of one block holes reporting is not really productive.

I note that I paper over this in our version of openrsync today in a way that kind of sucks, too. See, e.g., https://gist.github.com/kevans91/87ff85f9d85cf8c6f93369928a5bdb74

root@ifrit:/tmp # ./a.out
blksz: 131072
blksz: 4096

When the file is freshly created we report an st_blksize of the recordsize, which shrinks down as the file is actually written (and scales up to the recordsize, as you noted), but that's not really helpful. The rsync protocol means that I don't get to see holes in the original file as a hint to try and speed things up, so I have to check all incoming blocks at possible hole-boundaries for opportunities to create holes. So:

With the current _PC_MIN_HOLE_SIZE of 512, I do a boatload more memcmp than I need to for files that will grow to be larger than the recordsize
If I try to use the larger of that and st_blksize, it's also a crapshoot since the blksize scales with the filesize, so I end up in largely the same situation

For us, --sparse is worded in such a way that we're not going to be in trouble if we miss a chance to punch holes smaller than the recordsize and a larger MIN_HOLE_SIZE would be somewhat handy, performance-wise, since we don't really have an API to get the minimum hole size as a function of a large size (and we don't really want to have to ftruncate-up, grab the min_hole_size, then ftruncate back down to avoid a discrepancy in behavior from reference rsync).

dag-erling · 2025-10-02T11:08:05Z

ping

amotin · 2025-10-02T14:09:32Z

ping

@dag-erling I don't see any reaction on my proposal to take into account dataset recordsize for files of one block. I mean something like this:

        if (nblocks < 2)
                blksize = MAX(blksize, vp->v_mount->mnt_stat.f_iosize)

dag-erling · 2025-10-03T12:35:53Z

@amotin are you certain nblocks is correct here? I tested this on a small file expecting to see MIN_HOLE_SIZE equal to the record size, but:

% getconf MIN_HOLE_SIZE /usr/src/COPYRIGHT
6144
% stat -f%k,%b,%z /usr/src/COPYRIGHT    
6144,17,6070

Looking at the code, the value returned in nblocks by sa_object_size() appears to be in units of 512 bytes, rather than in actual file system blocks (see dmu_object_size_from_db() in module/zfs/dmu.c).

amotin

are you certain nblocks is correct here?

Not any more. I should have checked the function semantics, but I jumped to conclusion. Sorry.

I wonder what if instead of this similar to code below we just get zp, and similar to the original code in zfs_write() use zp->z_size <= zp->z_blksz?

The actual minimum hole size on ZFS is variable, but we always report SPA_MINBLOCKSIZE, which is 512. This may lead applications to believe that they can reliably create holes at 512-byte boundaries and waste resources trying to punch holes that ZFS ends up filling anyway. * In the general case, if the vnode is a regular file, return its current block size, or the record size if the file is smaller than its own block size. If the vnode is a directory, return the dataset record size. If it is neither a regular file nor a directory, return EINVAL. * In the control directory case, always return EINVAL. Signed-off-by: Dag-Erling Smørgrav <[email protected]>

amotin

If this works for you, I don't have objections.

behlendorf · 2025-10-06T23:47:18Z

Likewise. @dag-erling can you just confirm this latest version works for you.

dag-erling · 2025-10-07T08:44:53Z

Likewise. @dag-erling can you just confirm this latest version works for you.

I get the record size in all cases. I'm not sure how to create a situation where a file is larger than its own block size but the block size is not equal to the record size.

amotin · 2025-10-07T13:53:40Z

I'm not sure how to create a situation where a file is larger than its own block size but the block size is not equal to the record size.

You'd need to increase recordsize to lets say 1MB, create 1MB file, reduce recordsize back to 128K, read _PC_MIN_HOLE_SIZE to get 1MB.

Or in opposite, reduce recordsize to lets say 32KB, create 64KB file, increase recordsize back to 128K, read _PC_MIN_HOLE_SIZE to get 32KB.

A bit more interesting case might be if you increase recordsize to lets say 1MB, create 768K file, reduce recordsize back to 128K, read _PC_MIN_HOLE_SIZE to get 768KB, append the file to 2MB, read _PC_MIN_HOLE_SIZE to get 1MB. The question whether we would like to round the first output to the 1MB in advance I guess may be discussed.

dag-erling · 2025-10-07T14:54:28Z

Thanks, I didn't realize the record size was a modifiable property, I thought it was derived from the pool geometry. I'm satisfied that the current logic produces the desired result:

# zfs create zroot/test
# cd /test
# ll
total 0
# zfs set recordsize=1048576 zroot/test
# dd if=/dev/zero of=file bs=256k count=3
3+0 records in
3+0 records out
786432 bytes transferred in 0.000949 secs (828857922 bytes/sec)
# getconf MIN_HOLE_SIZE file
1048576
# zfs set recordsize=131072 zroot/test
# getconf MIN_HOLE_SIZE file
786432
# dd if=/dev/zero bs=256k count=5 >>file
5+0 records in
5+0 records out
1310720 bytes transferred in 0.001538 secs (852230871 bytes/sec)
# getconf MIN_HOLE_SIZE file
1048576
# zfs set recordsize=32768 zroot/test
# dd if=/dev/zero of=small bs=64k count=1
1+0 records in
1+0 records out
65536 bytes transferred in 0.000446 secs (147025106 bytes/sec)
# zfs set recordsize=131072 zroot/test
# getconf MIN_HOLE_SIZE small
32768

dag-erling · 2025-10-08T13:47:20Z

Thanks, any chance of merging this into FreeBSD stable/15 before 15.0?

amotin · 2025-10-08T13:52:10Z

@dag-erling For that it would need to be merged into ZFS 2.4 branch to be a part of the next RC. I don't see a big problem.

dag-erling force-pushed the des/min_hole_size branch from 4174951 to a748c90 Compare September 15, 2025 18:42

behlendorf added the Status: Code Review Needed Ready for review and testing label Sep 15, 2025

dag-erling force-pushed the des/min_hole_size branch from a748c90 to 9212c0b Compare September 16, 2025 13:33

dag-erling changed the title ~~FreeBSD: Return 1 for _PC_MIN_HOLE_SIZE~~ FreeBSD: Correct _PC_MIN_HOLE_SIZE Sep 16, 2025

dag-erling force-pushed the des/min_hole_size branch from 9212c0b to 1bfe710 Compare October 2, 2025 11:08

dag-erling force-pushed the des/min_hole_size branch from 1bfe710 to 059ae68 Compare October 2, 2025 16:47

amotin approved these changes Oct 2, 2025

View reviewed changes

behlendorf approved these changes Oct 2, 2025

View reviewed changes

amotin requested changes Oct 3, 2025

View reviewed changes

dag-erling force-pushed the des/min_hole_size branch from 059ae68 to 4381f21 Compare October 3, 2025 17:09

amotin approved these changes Oct 6, 2025

View reviewed changes

behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Oct 6, 2025

amotin merged commit 6e5b836 into openzfs:master Oct 8, 2025
23 of 26 checks passed

FreeBSD: Correct _PC_MIN_HOLE_SIZE #17750

FreeBSD: Correct _PC_MIN_HOLE_SIZE #17750

Conversation

dag-erling commented Sep 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation and Context

Description

How Has This Been Tested?

Types of changes

Checklist:

Uh oh!

amotin commented Sep 15, 2025

Uh oh!

dag-erling commented Sep 15, 2025

Uh oh!

dag-erling commented Sep 15, 2025

Uh oh!

amotin commented Sep 15, 2025

Uh oh!

amotin commented Sep 15, 2025

Uh oh!

dag-erling commented Sep 16, 2025

Uh oh!

dag-erling commented Sep 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

amotin commented Sep 16, 2025

Uh oh!

kevans91 commented Sep 17, 2025

Uh oh!

dag-erling commented Oct 2, 2025

Uh oh!

amotin commented Oct 2, 2025

Uh oh!

dag-erling commented Oct 3, 2025

Uh oh!

amotin left a comment

Choose a reason for hiding this comment

Uh oh!

amotin left a comment

Choose a reason for hiding this comment

Uh oh!

behlendorf commented Oct 6, 2025

Uh oh!

dag-erling commented Oct 7, 2025

Uh oh!

amotin commented Oct 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dag-erling commented Oct 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

dag-erling commented Oct 8, 2025

Uh oh!

amotin commented Oct 8, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

dag-erling commented Sep 15, 2025 •

edited

Loading

dag-erling commented Sep 16, 2025 •

edited

Loading

amotin commented Oct 7, 2025 •

edited

Loading

dag-erling commented Oct 7, 2025 •

edited

Loading