Conversation

@dag-erling (Contributor) commented Sep 15, 2025

Motivation and Context

The actual minimum hole size on ZFS is variable, but we always report SPA_MINBLOCKSIZE, which is 512. This may lead applications to believe that they can reliably create holes at 512-byte boundaries and waste resources trying to punch holes that ZFS ends up filling anyway.

Description

In zfs_pathconf(), if the vnode is a regular file, return its block size; if it is a directory, return the dataset record size; if it is neither, return EINVAL.

In zfsctl_common_pathconf(), always return EINVAL for _PC_MIN_HOLE_SIZE.
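
For the control directory, the change amounts to something like this sketch (paraphrased, not quoted from the patch; the surrounding switch may be laid out differently):

	case _PC_MIN_HOLE_SIZE:
		/* ctldir entries never contain holes. */
		return (EINVAL);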

How Has This Been Tested?

Tested in FreeBSD 16.0-CURRENT.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

@behlendorf added the Status: Code Review Needed label Sep 15, 2025
@amotin (Member) commented Sep 15, 2025

While SPA_MINBLOCKSIZE is obviously over-optimistic, it really is the smallest hole size that might theoretically exist on ZFS. I wonder how exactly reporting a value of 1 would help any application behave better, other than by hardcoding something, which is not great.

I wonder if applications could actually request something more specific. As I see it, pathconf(3) allows requests for both files and directories. If an application really wants to know about a specific file, it should ask for that specific file; in that case we could report the correct block size of that specific object. Otherwise, we could report the dataset's record size, which should be more or less accurate in most cases.
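
For illustration, a userland query might look like this sketch (not from the PR; the /tank/data paths are hypothetical, and the comments describe the behavior proposed here, with fpathconf(2) reporting the file's own block size and pathconf(2) on a directory reporting the dataset record size):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	long hole;
	int fd;

	/* Per-file query: would report the file's current block size. */
	fd = open("/tank/data/file", O_RDONLY);
	if (fd >= 0) {
		hole = fpathconf(fd, _PC_MIN_HOLE_SIZE);
		printf("file min hole size: %ld\n", hole);
		close(fd);
	}

	/* Per-directory query: would report the dataset record size. */
	hole = pathconf("/tank/data", _PC_MIN_HOLE_SIZE);
	printf("dir min hole size: %ld\n", hole);
	return (0);
}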

@dag-erling (Contributor, Author) commented

As I see it, pathconf(3) allows requests for both files and directories. If an application really wants to know about a specific file, it should ask for that specific file; in that case we could report the correct block size of that specific object. Otherwise, we could report the dataset's record size, which should be more or less accurate in most cases.

That sounds reasonable; I'll give it a shot. But I wonder if we should still report 1 in the ctldir case.

@dag-erling (Contributor, Author) commented

Does this look reasonable?

	case _PC_MIN_HOLE_SIZE:
		if (vp->v_type == VREG) {
			/* Regular file: report its current block size. */
			zp = VTOZ(vp);
			sa_object_size(zp->z_sa_hdl, &blksize, &nblocks);
			*valp = (int)blksize;
			return (0);
		}
		if (vp->v_type == VDIR) {
			/* Directory: report the dataset record size. */
			*valp = (int)vp->v_mount->mnt_stat.f_iosize;
			return (0);
		}
		return (EINVAL);

@amotin (Member) commented Sep 15, 2025

But I wonder if we should still report 1 in the ctldir case.

I don't remember us having any files in the ctldir that could potentially have holes, so I wonder if we should just delete that part instead.

@amotin (Member) commented Sep 15, 2025

Does this look reasonable?

Not bad. I just wonder what the best APIs to use here are. On the functionality side, though, for files of only one block we should report the maximum of the file block size and the dataset record size, since the file block size may increase (but never decrease) if the file grows, while for a file of only one block, hole reporting is not really productive.
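
One possible shape for that, building on the snippet above (a sketch only; MAX() comes from sys/param.h, and whether z_size and f_iosize are the right APIs here is exactly the open question):

	case _PC_MIN_HOLE_SIZE:
		if (vp->v_type == VREG) {
			zp = VTOZ(vp);
			sa_object_size(zp->z_sa_hdl, &blksize, &nblocks);
			/*
			 * A file that still fits in one block may see its
			 * block size grow (up to the record size) as the
			 * file grows, so report at least the record size.
			 */
			if (zp->z_size <= blksize)
				blksize = (uint32_t)MAX(blksize,
				    vp->v_mount->mnt_stat.f_iosize);
			*valp = (int)blksize;
			return (0);
		}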

@dag-erling (Contributor, Author) commented

I don't remember us having any files in the ctldir that could potentially have holes, so I wonder if we should just delete that part instead.

Does the ctldir even ever contain any files? AFAICT the only entries it contains are ., .., and snapshot, so we may as well return EINVAL.

@dag-erling (Contributor, Author) commented Sep 16, 2025

[...] for files of only one block we should report the maximum of the file block size and the dataset record size, [...]

Isn't that just the record size?

I've been trying various things and it looks like files smaller than the record size can contain holes only if the entire file is a hole. As soon as there is even one nonzero byte anywhere in the file, the entire file gets allocated. So I'm starting to think the dataset record size is the correct value in all cases.
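
One way to reproduce that observation (a sketch of the kind of test I ran, not the exact one; /tank/tst is a hypothetical path, and the dataset record size is assumed to be 128k):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	off_t hole;
	int fd;

	/* Create a 64k file, smaller than the 128k record size. */
	fd = open("/tank/tst", O_RDWR | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return (1);
	ftruncate(fd, 65536);
	fsync(fd);
	/* The file is entirely sparse: the first hole is at offset 0. */
	hole = lseek(fd, 0, SEEK_HOLE);
	printf("before write: first hole at %jd\n", (intmax_t)hole);

	/* A single nonzero byte allocates the file's only block... */
	pwrite(fd, "x", 1, 0);
	fsync(fd);
	/* ...so the first hole is now the implicit one at EOF (65536). */
	hole = lseek(fd, 0, SEEK_HOLE);
	printf("after write: first hole at %jd\n", (intmax_t)hole);
	close(fd);
	return (0);
}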

@amotin (Member) commented Sep 16, 2025

Does the ctldir even ever contain any files?

No, I don't think it does now, but I think it was discussed at some point to expose some information/controls that way.

So I'm starting to think the dataset record size is the correct value in all cases.

The dataset record size might change, but that does not affect already-existing files with more than one block, or with a block size already bigger than the new value. So a couple more conditions could give a better result.

The actual minimum hole size on ZFS is variable, but we always report
SPA_MINBLOCKSIZE, which is 512.  This may lead applications to believe
that they can reliably create holes at 512-byte boundaries and waste
resources trying to punch holes that ZFS ends up filling anyway.

* In the general case, if the vnode is a regular file, return its
  current block size.  If the vnode is a directory, return the dataset
  record size.  If it is neither a regular file nor a directory,
  return EINVAL.

* In the control directory case, always return EINVAL.

Signed-off-by: Dag-Erling Smørgrav <[email protected]>
@dag-erling changed the title from "FreeBSD: Return 1 for _PC_MIN_HOLE_SIZE" to "FreeBSD: Correct _PC_MIN_HOLE_SIZE" Sep 16, 2025
@kevans91 (Contributor) commented

Does this look reasonable?

Not bad. I just wonder what the best APIs to use here are. On the functionality side, though, for files of only one block we should report the maximum of the file block size and the dataset record size, since the file block size may increase (but never decrease) if the file grows, while for a file of only one block, hole reporting is not really productive.

I note that I paper over this in our version of openrsync today in a way that kind of sucks, too. See, e.g., https://gist.github.com/kevans91/87ff85f9d85cf8c6f93369928a5bdb74

root@ifrit:/tmp # ./a.out
blksz: 131072
blksz: 4096

When the file is freshly created we report an st_blksize of the recordsize, which shrinks down once the file is actually written (and scales back up to the recordsize as the file grows, as you noted), but that's not really helpful. The rsync protocol means that I don't get to see holes in the original file as a hint to try and speed things up, so I have to check all incoming blocks at possible hole boundaries for opportunities to create holes. So:

  • With the current _PC_MIN_HOLE_SIZE of 512, I do a boatload more memcmp than I need to for files that will grow to be larger than the recordsize
  • If I try to use the larger of that and st_blksize, it's also a crapshoot since the blksize scales with the filesize, so I end up in largely the same situation

For us, --sparse is worded in such a way that we're not going to be in trouble if we miss a chance to punch holes smaller than the recordsize, and a larger MIN_HOLE_SIZE would be somewhat handy, performance-wise, since we don't really have an API to get the minimum hole size as a function of a larger file size (and we don't really want to have to ftruncate up, grab the min hole size, then ftruncate back down, to avoid a discrepancy in behavior from reference rsync).
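
For the curious, the workaround amounts to something like this sketch (a hypothetical helper, not the gist's actual code); it assumes the destination file is being written sequentially from scratch, so extending it with ftruncate() at EOF leaves a hole instead of writing all-zero blocks:

#include <string.h>
#include <unistd.h>

/*
 * Write "len" bytes at the current offset of "fd", leaving a hole when
 * the buffer is entirely zero.  "holesz" would come from
 * fpathconf(fd, _PC_MIN_HOLE_SIZE).
 */
static int
write_sparse(int fd, const char *buf, size_t len, long holesz)
{
	static const char zeros[131072];
	off_t end;

	if (holesz > 0 && len >= (size_t)holesz && len <= sizeof(zeros) &&
	    memcmp(buf, zeros, len) == 0) {
		/* Seek past the zero run; ftruncate() makes it a hole. */
		end = lseek(fd, (off_t)len, SEEK_CUR);
		if (end == -1)
			return (-1);
		return (ftruncate(fd, end));
	}
	return (write(fd, buf, len) == (ssize_t)len ? 0 : -1);
}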
