-
-
Notifications
You must be signed in to change notification settings - Fork 2.8k
std.tar: support pax headers and gnu_long{name,link} #15382
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
70464d7
to
f0da800
Compare
62eec46
to
de62a00
Compare
003656c
to
03da105
Compare
89081a2
to
47dda50
Compare
I have updated the testdata/ files. Previously they were .tar files and the testdata/ directory was around 136K. Now they are gzipped w/ -9 and total 6.9K (around 1/20 the size). Also, tests now use |
47dda50
to
cf081fa
Compare
6076ef9
to
a377e63
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything else here looks good.
ad95885
to
eb2ea02
Compare
5b68673
to
955ba4a
Compare
8aa9cfe
to
7767b61
Compare
I don't think the aarch64-macos-debug ci failure is related to this PR. Seems to be an OOM. From the log: zig test ReleaseSafe native: error: memory usage peaked at 4572626944 bytes, exceeding the declared upper bound of 4080218931 |
Yeah you're fine, fix is on master if you want to rebase. When that's the only error, all tests still passed. |
this patch converts all the crashes submitted by @squeek502 (in ziglang#15382 (comment)) to errors. HeaderIterator: * add bounds checks to PaxIterator.next() * convert several unsafe int casts to safe ones misc: * added FileType.tagName() for debugging which returns null for unnamed enum values * make unixTime() fallible
7767b61
to
348e327
Compare
@travisstaloch looks like you have some more merge conflicts - just thought I'd mention as I am very interested in this PR. |
-- 4/20/23 -- tar.zig: * add HeaderIterator() type and convert pipeToFileSystem() to use it. * add initial support for options.executable_bit_only. * initial windows support: * skip symlinks which require admin rights * workaround file.updateTimes() panic by truncating file times * add tests parseNumeric and parsePaxTime ported from https://go.dev/src/archive/tar/strconv_test.go. lib/std/compress/tar/testdata/ * copy a subset of tar files from https://go.dev/src/archive/tar/testdata * gzip them all with -9. results in around 95% file size reduction reader_test.zig: * validate headers against files from testdata/. a port of https://go.dev/src/archive/tar/reader_test.go. test_decompress.zig: * runs tar.pipeToFileSystem() on valid testdata/ files. ------------- std.tar: cleanups and fixes -- 4/22/23 -- tar.zig: * convert V7Header, UstarHeader, StarHeader, GnuHeader to have fields of byte arrays instead of accessor methods. * Header.getFormat() - optimize by using 64 bit compares rather than mem.eql(). * gnu_long{name,link} and pax headers - allow them to be longer than 512 bytes by adding an allocator param to pipeToFileSystem(). * symlinks: cleanup logic and skip if wasi * remove custom toPosixPath() wasn't necessary. * workaround header.name corruption issue after makeOpenPath() call and add notes about it. * skip setDirProperties() entrely. leave as a TODO. src/Package.zig: pass gpa to pipeToFileSystem() tests: minor cosmetic changes ------------- std.tar: fix memory errors -- 4/25/23 -- tar.zig: * previously HeaderIterator.header()'s v7 param was by value, causing it to return invalid pointers to locals for various string fields. now the param along with a few others have been made *const. this solves the perceived 'stack corruption' i thought i was seeing. because v7 is 512 bytes, it wasn't clobbered until a fn w/ a larger stack frame, dir.makeOpenPath -> dir.openDir -> os.toPosixPath, was called. * symlinks - check for and set flags.is_directory param ------------- std.tar: hardlinks, more cleanup -- 4/25/23 -- tar.zig: * properly handle hardlinks by copying files. * improve symlink is_directory check - replace dir.access() call with dir.openFile() + handle error.FileNotFound. * isValidPax(): optimize: replace mem.eql()s with ComptimeStringMap. tests: * add testdata/hardlink.tar.gz ------------- std.tar: more cleanup -- 4/25/23 -- tests: add testdata/dir-symlink.tar -------------
this patch converts all the crashes submitted by @squeek502 (in ziglang#15382 (comment)) to errors. HeaderIterator: * add bounds checks to PaxIterator.next() * convert several unsafe int casts to safe ones misc: * added FileType.tagName() for debugging which returns null for unnamed enum values * make unixTime() fallible
prevent crashes and return errors when: * any 'named type' header's file path contains a `NUL` character. * prevent std.bit_set assertion failure when `header.type` is outside the bounds of FileType.named_types_bitset.
* return error when header.size is negative via math.cast * readBlocks(): change 'size' param from u64 to usize to avoid unnecessary @intcast * std.math -> math
* builtins * mem.alignForward * fs.file.File.Kind
348e327
to
0289688
Compare
thanks for the heads up. just rebased and pushed. |
Thanks for keeping this up to date. I'm sorry for taking so long with the review but I want you to know that it's on my mind, definitely looking to get it landed before the release. |
Thanks for the update. No worries, I know there is a lot to do. |
Not helpful-I know, but wanted to mention this is the #1 most impactful thing for us in terms of being able to get rid of submodules in Mach. We have a handful of packages ready-to-go but we just can't depend on them yet due to std.tar not being able to extract the archive. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for working on this and diligently keeping it rebased on top of master.
This code is not up to standards. It is rife with code smells, it looks like it blindly ports Go logic rather than building a tar implementation on first principles, it abuses comptime logic and generics for no reason, which causes code bloat and reduces compilation speeds, and it regresses the API, adding an allocator requirement with no documented justification.
Furthermore, I see no discussion of the performance delta on the PR writeup.
I'm sorry, but this is not going to make it into 0.11.0.
As a path forward, I suggest to start with incremental improvements to std.tar, rather than a +2,000 line opinionated change that takes things in many different directions at once.
return inline for (std.meta.fields(FileType)) |f| { | ||
if (@intFromEnum(ft) == f.value) break f.name; | ||
} else null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be an inline switch instead.
/// merges key-value pair `kv` into hdr if its a valid PAX field. | ||
/// TODO merge PAX schilly xattrs | ||
pub fn mergePax(kv: [2][]const u8, hdr: *Header) !void { | ||
const k = kv[0]; | ||
const v = kv[1]; | ||
log.debug("mergePax k={s} v={s}", .{ k, v }); | ||
if (v.len == 0) return; | ||
|
||
pub fn is_ustar(header: Header) bool { | ||
return std.mem.eql(u8, header.bytes[257..][0..6], "ustar\x00"); | ||
const map = std.ComptimeStringMap(std.meta.FieldEnum(Header), .{ | ||
.{ Pax.path, .name }, | ||
.{ Pax.linkpath, .linkname }, | ||
.{ Pax.uname, .uname }, | ||
.{ Pax.gname, .gname }, | ||
.{ Pax.uid, .uid }, | ||
.{ Pax.gid, .gid }, | ||
.{ Pax.atime, .atime }, | ||
.{ Pax.mtime, .mtime }, | ||
.{ Pax.ctime, .ctime }, | ||
.{ Pax.size, .size }, | ||
}); | ||
|
||
if (map.get(k)) |field_enum| switch (field_enum) { | ||
.name => hdr.name = v, | ||
.linkname => hdr.linkname = v, | ||
.uname => hdr.uname = v, | ||
.gname => hdr.gname = v, | ||
.uid => hdr.uid = @truncate(try fmt.parseInt(i64, v, 10)), | ||
.gid => hdr.gid = @truncate(try fmt.parseInt(i64, v, 10)), | ||
.atime => hdr.atime = try parsePaxTime(v), | ||
.ctime => hdr.ctime = try parsePaxTime(v), | ||
.mtime => hdr.mtime = try parsePaxTime(v), | ||
.size => hdr.size = try fmt.parseInt(i64, v, 10), | ||
else => unreachable, | ||
} else { | ||
// TODO merge PAX schilly xattrs | ||
// log.debug("TODO handle pax header key={s}", .{k}); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's this doing here? Looks incomplete and it has log statements.
}, | ||
|
||
else => { | ||
hdr.merge(pax_hdr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this, why are we merging stuff?
hdr = try self.header(v7) orelse return null; | ||
|
||
format.setIntersection(hdr.fmt); | ||
log.debug("hdr={}", .{hdr}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unscoped log statement. Generally I think std lib code should not have these
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I'm not planning to leave these log statements in. I just didn't remove them yet anticipating more reviews and avoiding ci churn.
.normal; | ||
} | ||
|
||
// Set the final guess at the format. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are we guessing? don't guess
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand all the details the tar format. My understanding it that It is messy and attempts to support old extensions. Maybe someone else can help explain why this guessing is necessary.
Here are some references.
https://www.gnu.org/software/tar/manual/html_node/Standard.html
https://go.dev/src/archive/tar/common.go#L201
https://go.dev/src/archive/tar/format.go#L169
|
||
// Set the final guess at the format. | ||
if (format.contains(.ustar) and format.contains(.pax)) | ||
format.setIntersection(fmt_ustar); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what in the hell is going on here
@@ -600,7 +600,7 @@ fn unpackTarball( | |||
var decompress = try compression.decompress(gpa, br.reader()); | |||
defer decompress.deinit(); | |||
|
|||
try std.tar.pipeToFileSystem(out_dir, decompress.reader(), .{ | |||
try std.tar.pipeToFileSystem(gpa, out_dir, decompress.reader(), .{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
booooooo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. But if we want to support long file names, it is necessary to have an allocator. See @truemedian's comment #15382 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@travisstaloch it might be possible to use a buffer of size std.fs.MAX_NAME_BYTES
and read the path one-component-at-a-time (erroring if any component is > MAX_NAME_BYTES
long (when encoded as UTF-8) since it shouldn't be possible to create it on the filesystem anyway), using the Dir
of the last created component to create the next component and so on.
EDIT: Unless there's a reason to need to store the full path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MAX_NAME_BYTES
may work, but it needs to be able to reasonably handle windows systems (where such value is 260). You can't use the directory of the last component because tar files aren't necessarily laid out as a tree, you always have to create the file from the extraction root.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
260 is MAX_PATH
on Windows (and is irrelevant to Zig since we use extended-length path APIs exclusively). See std.os.windows.NAME_MAX
and std.fs.MAX_NAME_BYTES
.
I'm not sure what you mean by the second part. If each file in a tar has a filepath from the extraction root, then, for example:
some-long-component/some-long-component/<... repeated however many times ...>/foo.txt
I'm suggesting something like:
- Read into
[MAX_NAME_BYTES]u8
until either you hit a path separator or run out of room in the buffer (in which case, fatal error since the component won't be create-able on the filesystem) - Use a temporary buffer of one byte to check if this is the last component (read into it until you hit a non-path-separator or end-of-filepath)
- If this is not the last component, create/open it as a directory (using something like
Dir.makePath
with the extraction root'sDir
or the last created component'sDir
)- If this is the last component, create it as a file and move to the next path in the tar
- If there are more components to read, put the byte from the temporary buffer into the
[MAX_NAME_BYTES]u8
buffer and repeat from the start
Might need some refining but that's the idea (if there's not something I'm missing that makes it impossible).
test { | ||
_ = @import("tar/reader_test.zig"); | ||
_ = @import("tar/test_decompress.zig"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like this file, this looks like a big kludge
You're right. It does blindly port Go logic. As much as I might want to help, I don't think I'm capable of creating a tar implementation from first principles. I'll step aside and let someone else work on this. Feel free continue with this PR or use anything from it in a different PR. |
Thank you and I apologize for the harsh words. I know a lot of people are counting on this improvement |
No worries. I support the desire for high quality code. And understand the desire for smaller PRs. I'll continue to do what I can to help with std.tar. I just don't feel confident in my ability to create a from scratch implementation. |
Understood. I've been working on the release notes and your name has come up quite a few times. This release is regardless going to be much better off because of you. |
tar.zig:
lib/std/compress/tar/testdata/
reader_test.zig:
test_decompress.zig: