Skip to content

implement strip_components option of std.tar in std.zip as per #20257 #20263

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

hippietrail
Copy link

Implements feature request #20257 bringing the strip_components option of std.tar over to std.zip

As per std.tar it removes 0 or more path components from the beginning of the path.

This is mostly used with 0 or 1 to handle the cases that all files are in the root of the archive vs all files are within a single directory in the root of the archive.

Like std.tar it returns BadFileName if too many components are removed leaving the empty string as the filename.

@andrewrk
Copy link
Member

Would you mind taking a look, @marler8997?

@marler8997
Copy link
Contributor

marler8997 commented Jun 13, 2024

I'm questioning the utility of this particular API. I can see how a value of 1 could be useful to strip the root directory, however, this would not only strip "the root directory" but also strip "all root directories", which seems like a pretty "niche" use case at best and at worst unintended behavior.

The current implementation of this doesn't seem to take into account what happens if/when you have the same filename underneath differing directories, and I imagine implementing this in a reasonable way (doesn't do something unexpected) could add a good amount of complexity as you'd then need to decide how to handle and distinguish between cases where you're extracting into an existing directory vs extracting files with conflicting names. I imagine applications would want to detect and assert an error if this happens instead, but this would then require tracking all files extracted (hard to do without an allocator). This also couples the result to an implementation detail, the order in which the extraction decides to create the files.

The zig repo itself doesn't need this feature even though it does support stripping the root directory. It does so by instead allowing the "extractZip" function to return either the top-level directory, or a subdirectory if it detects there is only 1. This is a good general approach since you can't know if there is in fact a single top-level directory beforehand. An application that wants to remove the top-level directory beforehand would need to be limited to only supporting zip files that contain a single top-level directory which is a more niche, but still plausible.

We could eliminate most of these decisions by simplifying the api into:

/// Strips the single root directory of the zip file.  Asserts error.ZipMultipleRoots if there is more than
/// one root directory.
strip_root_directory: bool,

Furthermore, if there was such a niche case that wanted to do something more complicated, applications can already implement their own version of the small 13-line extract function and use the lower-level API instead. This provides the flexibility to do things like stripping components, or maybe extracting only a subdirectory of the zip file, or performing custom transformations on the filenames and/or data.

@hippietrail
Copy link
Author

Thanks for having a look! I agree strip_root_directory: bool would make more sense. I added strip_components to match the behaviour in std.tar and the implementers there doubtlessly implemented it to match the behavior of the *unix tar commandline tool:

     --strip-components count
             Remove the specified number of leading path elements.  Pathnames
             with fewer elements will be silently skipped.  Note that the
             pathname is edited after checking inclusion/exclusion patterns
             but before security checks.

std.tar's extraction tests only care about the 0 and 1 cases too.

I just commented yesterday on the std.tar single file tars fail. issue that there's not really a need for std.tar or std.zip to have APIs that match the *nix tools in functionality or in name. In fact std.tar's implementation of strip_components already differs from that of *nix tar since std.tar fails when an entire path is stripped down to the empty string whereas *nix tar silently skips those and continues.

I'd say it makes more sense to align the options of std.tar and std.zip to each other than to align either with commandline tools anyway. Even the 'extract' functions have different names at the moment: pipeToFileSystem for tar and extract for zip.

@alexrp
Copy link
Member

alexrp commented Jan 26, 2025

Ping @marler8997 for thoughts on the above.

@marler8997
Copy link
Contributor

I think any comments I have would just be rehashing what's already been said. To summarize, I'm in favor of a strip_root_directory option, I think a more general strip components feature creates a complex set of edge cases to account for and decisions to be made, and the implemention that results in the behavior you'd expect would likely rival the size of the entirety of std.zip and cause the function to now require an allocator.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants