std.tar: support pax headers and gnu_long{name,link} #15382

travisstaloch · 2023-04-21T05:33:54Z

tar.zig:

- add HeaderIterator() type and convert pipeToFileSystem() to use it.
- add tests parseNumeric and parsePaxTime ported from https://go.dev/src/archive/tar/strconv_test.go.
- add initial support for options.executable_bit_only.
- initial windows support:
  - skip symlinks which require admin rights
  - workaround file.updateTimes() panic by truncating file times

lib/std/compress/tar/testdata/

- copy a subset of tar files from https://go.dev/src/archive/tar/testdata

reader_test.zig:

- validate headers against files from testdata/. a port of https://go.dev/src/archive/tar/reader_test.go.

test_decompress.zig:

- runs tar.pipeToFileSystem() on valid testdata/ files.

travisstaloch · 2023-04-21T06:39:31Z

addresses, maybe closes #14310
closes #15342 #15222
supersedes #15228

travisstaloch · 2023-04-22T00:15:47Z

I have updated the testdata/ files. Previously they were .tar files and the testdata/ directory was around 136K. Now they are gzipped w/ -9 and total 6.9K (around 1/20 the size).

Also, tests now use @embedFile on fixed lists of .gz files rather than iterating the testdata/ directory and opening/reading files.

truemedian

Everything else here looks good.

travisstaloch · 2023-06-23T17:09:33Z

I don't think the aarch64-macos-debug ci failure is related to this PR. Seems to be an OOM.

From the log:

zig test ReleaseSafe native: error: memory usage peaked at 4572626944 bytes, exceeding the declared upper bound of 4080218931

jacobly0 · 2023-06-23T17:20:40Z

Yeah you're fine, fix is on master if you want to rebase. When that's the only error, all tests still passed.

this patch converts all the crashes submitted by @squeek502 (in ziglang#15382 (comment)) to errors.

HeaderIterator:
* add bounds checks to PaxIterator.next()
* convert several unsafe int casts to safe ones

misc:
* added FileType.tagName() for debugging, which returns null for unnamed enum values
* make unixTime() fallible
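
For context on the bounds checks mentioned above: a pax extended header is a sequence of records of the form `<length> <keyword>=<value>\n`, where the decimal length counts the whole record, including the length digits and the trailing newline. The sketch below is not the PR's PaxIterator.next(). `PaxRecord` and `parsePaxRecord` are hypothetical names, and the code only illustrates validating every index before slicing so that malformed input yields an error rather than a crash.

```zig
const std = @import("std");

// Hypothetical result type for this sketch; not part of std.tar.
const PaxRecord = struct {
    key: []const u8,
    value: []const u8,
    /// Total bytes this record occupies at the front of the input.
    len: usize,
};

/// Parses one "<length> <keyword>=<value>\n" record from the front of `buf`.
/// Every index is validated before it is used to slice.
fn parsePaxRecord(buf: []const u8) error{InvalidPaxRecord}!PaxRecord {
    const space = std.mem.indexOfScalar(u8, buf, ' ') orelse return error.InvalidPaxRecord;
    const len = std.fmt.parseInt(usize, buf[0..space], 10) catch return error.InvalidPaxRecord;
    // The declared length must fit in the buffer and leave room for at
    // least one byte after the space.
    if (len > buf.len or len < space + 2) return error.InvalidPaxRecord;
    const record = buf[space + 1 .. len];
    if (record[record.len - 1] != '\n') return error.InvalidPaxRecord;
    const body = record[0 .. record.len - 1];
    const eq = std.mem.indexOfScalar(u8, body, '=') orelse return error.InvalidPaxRecord;
    return PaxRecord{ .key = body[0..eq], .value = body[eq + 1 ..], .len = len };
}

test "parse a single pax record" {
    const r = try parsePaxRecord("30 mtime=1350244992.023960108\n");
    try std.testing.expectEqualStrings("mtime", r.key);
    try std.testing.expectEqualStrings("1350244992.023960108", r.value);
}
```

A record whose declared length exceeds the remaining input, or which lacks the space, `=`, or trailing newline, is rejected with `error.InvalidPaxRecord`.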

emidoots · 2023-07-03T22:48:36Z

@travisstaloch looks like you have some more merge conflicts - just thought I'd mention as I am very interested in this PR.

-- 4/20/23 --

tar.zig:
* add HeaderIterator() type and convert pipeToFileSystem() to use it.
* add initial support for options.executable_bit_only.
* initial windows support:
  * skip symlinks which require admin rights
  * workaround file.updateTimes() panic by truncating file times
* add tests parseNumeric and parsePaxTime ported from https://go.dev/src/archive/tar/strconv_test.go.

lib/std/compress/tar/testdata/
* copy a subset of tar files from https://go.dev/src/archive/tar/testdata
* gzip them all with -9. results in around 95% file size reduction

reader_test.zig:
* validate headers against files from testdata/. a port of https://go.dev/src/archive/tar/reader_test.go.

test_decompress.zig:
* runs tar.pipeToFileSystem() on valid testdata/ files.

-------------

std.tar: cleanups and fixes

-- 4/22/23 --

tar.zig:
* convert V7Header, UstarHeader, StarHeader, GnuHeader to have fields of byte arrays instead of accessor methods.
* Header.getFormat() - optimize by using 64 bit compares rather than mem.eql().
* gnu_long{name,link} and pax headers - allow them to be longer than 512 bytes by adding an allocator param to pipeToFileSystem().
* symlinks: cleanup logic and skip if wasi
* remove custom toPosixPath(); it wasn't necessary.
* workaround header.name corruption issue after makeOpenPath() call and add notes about it.
* skip setDirProperties() entirely. leave as a TODO.

src/Package.zig: pass gpa to pipeToFileSystem()

tests: minor cosmetic changes

-------------

std.tar: fix memory errors

-- 4/25/23 --

tar.zig:
* previously HeaderIterator.header()'s v7 param was by value, causing it to return invalid pointers to locals for various string fields. now the param along with a few others have been made *const. this solves the perceived 'stack corruption' i thought i was seeing. because v7 is 512 bytes, it wasn't clobbered until a fn w/ a larger stack frame, dir.makeOpenPath -> dir.openDir -> os.toPosixPath, was called.
* symlinks - check for and set flags.is_directory param

-------------

std.tar: hardlinks, more cleanup

-- 4/25/23 --

tar.zig:
* properly handle hardlinks by copying files.
* improve symlink is_directory check - replace dir.access() call with dir.openFile() + handle error.FileNotFound.
* isValidPax(): optimize: replace mem.eql()s with ComptimeStringMap.

tests:
* add testdata/hardlink.tar.gz

-------------

std.tar: more cleanup

-- 4/25/23 --

tests: add testdata/dir-symlink.tar

-------------
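
The by-value `v7` parameter problem described in the "fix memory errors" entry above can be illustrated with a reduced example. This is only a sketch: `V7Header` here is a hypothetical one-field struct and `headerName` is a made-up helper, not the PR's HeaderIterator.header(). It shows the `*const` form of the fix, where the returned slice points into the caller's header rather than into a possible local copy.

```zig
const std = @import("std");

// Hypothetical, reduced header with only the field this sketch needs.
const V7Header = struct {
    name: [100]u8,
};

/// Returns the NUL-terminated name as a slice into the caller's header.
/// Because `hdr` is a pointer, the slice stays valid for as long as the
/// pointed-to header does. With a by-value `hdr: V7Header` parameter the
/// slice could instead refer to a copy that is gone after the call returns.
fn headerName(hdr: *const V7Header) []const u8 {
    const end = std.mem.indexOfScalar(u8, &hdr.name, 0) orelse hdr.name.len;
    return hdr.name[0..end];
}

test "name slice aliases the caller's header" {
    var hdr = V7Header{ .name = [_]u8{0} ** 100 };
    const src = "a.txt";
    var i: usize = 0;
    while (i < src.len) : (i += 1) hdr.name[i] = src[i];

    const name = headerName(&hdr);
    try std.testing.expectEqualStrings("a.txt", name);
    // The slice starts at the first byte of the caller's own array.
    try std.testing.expect(name.ptr == @as([*]const u8, &hdr.name));
}
```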

prevent crashes and return errors when:
* any 'named type' header's file path contains a `NUL` character.
* prevent std.bit_set assertion failure when `header.type` is outside the bounds of FileType.named_types_bitset.
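
A minimal sketch of the first check above, assuming the path is already available as a byte slice. `checkName` and `error.InvalidName` are hypothetical names chosen for illustration; the PR's actual function and error names are not shown in this thread.

```zig
const std = @import("std");

/// Rejects names with an embedded NUL byte. Such a name cannot be
/// represented as a NUL-terminated path, so it is reported as an error
/// instead of being passed on.
fn checkName(name: []const u8) error{InvalidName}!void {
    if (std.mem.indexOfScalar(u8, name, 0) != null) return error.InvalidName;
}

test "names with embedded NUL are rejected" {
    try checkName("dir/file.txt");
    try std.testing.expectError(error.InvalidName, checkName("dir/fi\x00le.txt"));
}
```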

* return error when header.size is negative via math.cast
* readBlocks(): change 'size' param from u64 to usize to avoid unnecessary @intCast
* std.math -> math
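
The `math.cast` pattern named above can be sketched as follows. `checkedSize` and `error.InvalidSize` are hypothetical names; the point is only that `std.math.cast` returns null when the value does not fit the target type, so a negative size becomes an error instead of tripping a safety check.

```zig
const std = @import("std");

/// Converts a parsed, possibly negative size into an unsigned byte count.
fn checkedSize(size: i64) error{InvalidSize}!u64 {
    const n = std.math.cast(u64, size) orelse return error.InvalidSize;
    return n;
}

test "negative sizes become errors" {
    try std.testing.expectEqual(@as(u64, 512), try checkedSize(512));
    try std.testing.expectError(error.InvalidSize, checkedSize(-1));
}
```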

* builtins
* mem.alignForward
* fs.file.File.Kind

travisstaloch · 2023-07-04T01:17:32Z

> @travisstaloch looks like you have some more merge conflicts - just thought I'd mention as I am very interested in this PR.

thanks for the heads up. just rebased and pushed.

andrewrk · 2023-07-13T17:23:04Z

Thanks for keeping this up to date. I'm sorry for taking so long with the review but I want you to know that it's on my mind, definitely looking to get it landed before the release.

travisstaloch · 2023-07-13T23:51:45Z

Thanks for the update. No worries, I know there is a lot to do.

emidoots · 2023-07-22T15:02:38Z

Not helpful, I know, but I wanted to mention that this is the #1 most impactful thing for us in terms of being able to get rid of submodules in Mach. We have a handful of packages ready to go, but we just can't depend on them yet because std.tar is not able to extract the archive.

andrewrk

Thank you for working on this and diligently keeping it rebased on top of master.

This code is not up to standards. It is rife with code smells, it looks like it blindly ports Go logic rather than building a tar implementation on first principles, it abuses comptime logic and generics for no reason, which causes code bloat and reduces compilation speeds, and it regresses the API, adding an allocator requirement with no documented justification.

Furthermore, I see no discussion of the performance delta on the PR writeup.

I'm sorry, but this is not going to make it into 0.11.0.

As a path forward, I suggest to start with incremental improvements to std.tar, rather than a +2,000 line opinionated change that takes things in many different directions at once.

andrewrk · 2023-07-29T16:58:00Z

lib/std/tar.zig

+        return inline for (std.meta.fields(FileType)) |f| {
+            if (@intFromEnum(ft) == f.value) break f.name;
+        } else null;


Should be an inline switch instead.
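
A minimal sketch of what an inline switch looks like, under a simplifying assumption: the `FileType` below is a hypothetical, exhaustive stand-in with only four tags. The PR's FileType is described as having unnamed values, which would need additional handling that is not shown here.

```zig
const std = @import("std");

// Hypothetical, exhaustive stand-in for the PR's FileType.
const FileType = enum(u8) {
    normal = '0',
    hard_link = '1',
    symbolic_link = '2',
    directory = '5',
};

fn tagName(ft: FileType) []const u8 {
    return switch (ft) {
        // `inline else` is expanded into one prong per tag, so `tag` is
        // comptime-known inside the prong and can be given to @tagName.
        inline else => |tag| @tagName(tag),
    };
}

test "inline switch yields the tag name" {
    try std.testing.expectEqualStrings("directory", tagName(.directory));
}
```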

andrewrk · 2023-07-29T17:00:06Z

lib/std/tar.zig

+/// merges key-value pair `kv` into hdr if its a valid PAX field.
+/// TODO merge PAX schilly xattrs
+pub fn mergePax(kv: [2][]const u8, hdr: *Header) !void {
+    const k = kv[0];
+    const v = kv[1];
+    log.debug("mergePax k={s} v={s}", .{ k, v });
+    if (v.len == 0) return;

-    pub fn is_ustar(header: Header) bool {
-        return std.mem.eql(u8, header.bytes[257..][0..6], "ustar\x00");
+    const map = std.ComptimeStringMap(std.meta.FieldEnum(Header), .{
+        .{ Pax.path, .name },
+        .{ Pax.linkpath, .linkname },
+        .{ Pax.uname, .uname },
+        .{ Pax.gname, .gname },
+        .{ Pax.uid, .uid },
+        .{ Pax.gid, .gid },
+        .{ Pax.atime, .atime },
+        .{ Pax.mtime, .mtime },
+        .{ Pax.ctime, .ctime },
+        .{ Pax.size, .size },
+    });
+
+    if (map.get(k)) |field_enum| switch (field_enum) {
+        .name => hdr.name = v,
+        .linkname => hdr.linkname = v,
+        .uname => hdr.uname = v,
+        .gname => hdr.gname = v,
+        .uid => hdr.uid = @truncate(try fmt.parseInt(i64, v, 10)),
+        .gid => hdr.gid = @truncate(try fmt.parseInt(i64, v, 10)),
+        .atime => hdr.atime = try parsePaxTime(v),
+        .ctime => hdr.ctime = try parsePaxTime(v),
+        .mtime => hdr.mtime = try parsePaxTime(v),
+        .size => hdr.size = try fmt.parseInt(i64, v, 10),
+        else => unreachable,
+    } else {
+        // TODO merge PAX schilly xattrs
+        // log.debug("TODO handle pax header key={s}", .{k});
    }
+}


What's this doing here? Looks incomplete and it has log statements.

andrewrk · 2023-07-29T17:05:17Z

lib/std/tar.zig

+                    },
+
+                    else => {
+                        hdr.merge(pax_hdr);


what is this, why are we merging stuff?

andrewrk · 2023-07-29T17:05:48Z

lib/std/tar.zig

+                hdr = try self.header(v7) orelse return null;
+
+                format.setIntersection(hdr.fmt);
+                log.debug("hdr={}", .{hdr});


Unscoped log statement. Generally I think std lib code should not have these

travisstaloch · 2023-07-29

Thanks. I'm not planning to leave these log statements in. I just didn't remove them yet, anticipating more reviews and avoiding ci churn.

andrewrk · 2023-07-29T17:06:00Z

lib/std/tar.zig

+                                .normal;
+                        }
+
+                        // Set the final guess at the format.


why are we guessing? don't guess

travisstaloch · 2023-07-29

I don't understand all the details of the tar format. My understanding is that it is messy and attempts to support old extensions. Maybe someone else can help explain why this guessing is necessary.

Here are some references.

https://www.gnu.org/software/tar/manual/html_node/Standard.html
https://go.dev/src/archive/tar/common.go#L201
https://go.dev/src/archive/tar/format.go#L169
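
One concrete reason a reader ends up inferring the format: the header block only partly identifies it. A v7 header has no magic at all, ustar uses the magic `ustar\0` with version `00`, GNU uses `ustar ` followed by ` \0`, and pax headers carry the same magic as ustar and are only recognizable by a typeflag of `x` or `g`. The sketch below illustrates that and is not the PR's Header.getFormat(); `Format`, `detectFormat`, and `hasPaxTypeflag` are hypothetical names.

```zig
const std = @import("std");

// Hypothetical result type for this sketch.
const Format = enum { v7, ustar, gnu };

/// Classifies a 512-byte header block by the magic (offset 257, 6 bytes)
/// and version (offset 263, 2 bytes) fields. Anything without a known
/// magic is treated as v7, since v7 has no magic to check.
fn detectFormat(block: *const [512]u8) Format {
    const magic = block[257..263];
    const version = block[263..265];
    if (std.mem.eql(u8, magic, "ustar\x00") and std.mem.eql(u8, version, "00")) return .ustar;
    if (std.mem.eql(u8, magic, "ustar ") and std.mem.eql(u8, version, " \x00")) return .gnu;
    return .v7;
}

/// Pax extended headers share the ustar magic; only the typeflag byte at
/// offset 156 ('x' per-file, 'g' global) sets them apart.
fn hasPaxTypeflag(block: *const [512]u8) bool {
    return block[156] == 'x' or block[156] == 'g';
}

test "a zeroed block has no magic" {
    const block = [_]u8{0} ** 512;
    try std.testing.expectEqual(Format.v7, detectFormat(&block));
}
```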

andrewrk · 2023-07-29T17:06:20Z

lib/std/tar.zig

+
+                        // Set the final guess at the format.
+                        if (format.contains(.ustar) and format.contains(.pax))
+                            format.setIntersection(fmt_ustar);


what in the hell is going on here

andrewrk · 2023-07-29T17:08:18Z

src/Package.zig

@@ -600,7 +600,7 @@ fn unpackTarball(
    var decompress = try compression.decompress(gpa, br.reader());
    defer decompress.deinit();

-    try std.tar.pipeToFileSystem(out_dir, decompress.reader(), .{
+    try std.tar.pipeToFileSystem(gpa, out_dir, decompress.reader(), .{


travisstaloch · 2023-07-29

I agree. But if we want to support long file names, it is necessary to have an allocator. See @truemedian's comment #15382 (comment)

squeek502 · 2023-07-29

@travisstaloch it might be possible to use a buffer of size std.fs.MAX_NAME_BYTES and read the path one component at a time, using the Dir of the last created component to create the next component and so on. This would error if any component is longer than MAX_NAME_BYTES (when encoded as UTF-8), since it shouldn't be possible to create it on the filesystem anyway.

EDIT: Unless there's a reason to need to store the full path.

truemedian · 2023-07-29

MAX_NAME_BYTES may work, but it needs to be able to reasonably handle windows systems (where such value is 260). You can't use the directory of the last component because tar files aren't necessarily laid out as a tree; you always have to create the file from the extraction root.

squeek502 · 2023-07-29

260 is MAX_PATH on Windows (and is irrelevant to Zig since we use extended-length path APIs exclusively). See std.os.windows.NAME_MAX and std.fs.MAX_NAME_BYTES.

I'm not sure what you mean by the second part. If each file in a tar has a filepath from the extraction root, then, for example:

some-long-component/some-long-component/<... repeated however many times ...>/foo.txt

I'm suggesting something like:

1. Read into [MAX_NAME_BYTES]u8 until either you hit a path separator or run out of room in the buffer (in which case, fatal error since the component won't be create-able on the filesystem)
2. Use a temporary buffer of one byte to check if this is the last component (read into it until you hit a non-path-separator or end-of-filepath)
3. If this is not the last component, create/open it as a directory (using something like Dir.makePath with the extraction root's Dir or the last created component's Dir)
4. If this is the last component, create it as a file and move to the next path in the tar
5. If there are more components to read, put the byte from the temporary buffer into the [MAX_NAME_BYTES]u8 buffer and repeat from the start

Might need some refining but that's the idea (if there's not something I'm missing that makes it impossible).
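
The steps above can be sketched without touching the filesystem. This is an illustration under stated assumptions, not code from the PR: `ComponentIterator` is a hypothetical type, `max_name_bytes` is a fixed stand-in for std.fs.MAX_NAME_BYTES (whose real value is platform-dependent), and an in-memory slice stands in for the stream the long name would be read from. Creating directories and files from each component is left out.

```zig
const std = @import("std");

// Stand-in for std.fs.MAX_NAME_BYTES; the real value depends on the target.
const max_name_bytes = 255;

/// Yields one path component at a time, copied into a fixed buffer so that
/// no allocator is needed regardless of the total path length.
const ComponentIterator = struct {
    src: []const u8,
    pos: usize = 0,
    buf: [max_name_bytes]u8 = undefined,

    /// Returns the next component, or null at the end of the path. The
    /// returned slice points into `buf` and is overwritten by the next call.
    fn next(self: *ComponentIterator) error{NameTooLong}!?[]const u8 {
        // Skip any run of separators before the component.
        while (self.pos < self.src.len and self.src[self.pos] == '/') self.pos += 1;
        if (self.pos == self.src.len) return null;

        var len: usize = 0;
        while (self.pos < self.src.len and self.src[self.pos] != '/') : (self.pos += 1) {
            // A component that does not fit could not be created anyway.
            if (len == self.buf.len) return error.NameTooLong;
            self.buf[len] = self.src[self.pos];
            len += 1;
        }
        const out: []const u8 = self.buf[0..len];
        return out;
    }

    /// True if another component follows the one most recently returned,
    /// meaning that one names a directory rather than the final entry.
    fn hasMore(self: *const ComponentIterator) bool {
        var i = self.pos;
        while (i < self.src.len) : (i += 1) {
            if (self.src[i] != '/') return true;
        }
        return false;
    }
};

test "walk a path one component at a time" {
    var it = ComponentIterator{ .src = "some-dir/sub/foo.txt" };
    try std.testing.expectEqualStrings("some-dir", (try it.next()).?);
    try std.testing.expect(it.hasMore());
}
```

The buffer bounds the length of a single component only, so a path with many components still works as long as each one fits.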

andrewrk · 2023-07-29T17:13:30Z

lib/std/compress/tar.zig

+test {
+    _ = @import("tar/reader_test.zig");
+    _ = @import("tar/test_decompress.zig");
+}


I don't like this file, this looks like a big kludge

travisstaloch · 2023-07-29T19:45:09Z

> it looks like it blindly ports Go logic rather than building a tar implementation on first principles

You're right. It does blindly port Go logic. As much as I might want to help, I don't think I'm capable of creating a tar implementation from first principles. I'll step aside and let someone else work on this. Feel free to continue with this PR or use anything from it in a different PR.

andrewrk · 2023-07-29T20:29:50Z

Thank you and I apologize for the harsh words. I know a lot of people are counting on this improvement

travisstaloch · 2023-07-29T21:11:03Z

No worries. I support the desire for high quality code. And understand the desire for smaller PRs. I'll continue to do what I can to help with std.tar. I just don't feel confident in my ability to create a from scratch implementation.

andrewrk · 2023-07-29T21:48:17Z

Understood. I've been working on the release notes and your name has come up quite a few times. This release is regardless going to be much better off because of you.

travisstaloch changed the title ~~std.tar: support more features and add tests~~ std.tar: support pax headers and gnulong_{name,link} Apr 21, 2023

travisstaloch changed the title ~~std.tar: support pax headers and gnulong_{name,link}~~ std.tar: support pax headers and gnu_long{name,link} Apr 21, 2023

travisstaloch force-pushed the tar-addtests branch from 70464d7 to f0da800 April 21, 2023 05:40

travisstaloch force-pushed the tar-addtests branch 3 times, most recently from 62eec46 to de62a00 April 21, 2023 08:32

rohlem reviewed Apr 21, 2023 (lib/std/tar.zig, outdated, resolved)

travisstaloch force-pushed the tar-addtests branch 5 times, most recently from 003656c to 03da105 April 21, 2023 14:15

travisstaloch commented Apr 21, 2023 (lib/std/compress/tar/test_decompress.zig, outdated, resolved)

travisstaloch force-pushed the tar-addtests branch 2 times, most recently from 89081a2 to 47dda50 April 22, 2023 00:09

travisstaloch force-pushed the tar-addtests branch from 47dda50 to cf081fa April 22, 2023 02:14

truemedian suggested changes Apr 22, 2023 (lib/std/tar.zig, three threads, outdated, resolved)

travisstaloch force-pushed the tar-addtests branch from 6076ef9 to a377e63 April 22, 2023 23:37

truemedian suggested changes Apr 22, 2023 (lib/std/tar.zig, outdated, resolved)

travisstaloch force-pushed the tar-addtests branch 3 times, most recently from ad95885 to eb2ea02 April 23, 2023 00:10

truemedian reviewed Apr 23, 2023 (lib/std/tar.zig, outdated, resolved)

travisstaloch force-pushed the tar-addtests branch 5 times, most recently from 5b68673 to 955ba4a April 23, 2023 06:27

travisstaloch force-pushed the tar-addtests branch 3 times, most recently from 8aa9cfe to 7767b61 June 22, 2023 12:23

alichraghi mentioned this pull request Jun 22, 2023: Dusk impl Beyley/mach-dusk#2 (Closed, 1 task)

travisstaloch force-pushed the tar-addtests branch from 7767b61 to 348e327 June 23, 2023 17:26

squeek502 mentioned this pull request Jun 30, 2023: build.zig.zon file .hash does not include executable bit in practice #16272 (Closed)

andrewrk self-requested a review July 3, 2023 22:56

travisstaloch added 6 commits July 3, 2023 18:01

* std.tar: fix fuzzing crashes part 2 (52ef4ce)
* std.tar: fix fuzzing crashes part 3 (33d1f33)
* std.tar: chore - update to latest zig (dd21d6a)
* std.tar: chore - update to zig 0.11.0-dev.3910+689f3163a (0289688)

travisstaloch force-pushed the tar-addtests branch from 348e327 to 0289688 July 4, 2023 01:13

andrewrk requested changes Jul 29, 2023

travisstaloch closed this Jul 29, 2023

squeek502 mentioned this pull request Aug 27, 2023: std.tar: add support for file path in pax attributes #16990 (Merged)

travisstaloch mentioned this pull request Oct 9, 2023: make packages fetched by zig set the executable bit based on ELF header and shebang line #17463 (Closed)
