Skip to content

std.tar: support pax headers and gnu_long{name,link} #15382

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 6 commits into from

Conversation

travisstaloch
Copy link
Contributor

@travisstaloch travisstaloch commented Apr 21, 2023

tar.zig:

  • add HeaderIterator() type and convert pipeToFileSystem() to use it.
  • add tests parseNumeric and parsePaxTime ported from https://go.dev/src/archive/tar/strconv_test.go.
  • add initial support for options.executable_bit_only.
  • initial windows support:
    • skip symlinks which require admin rights
    • workaround file.updateTimes() panic by truncating file times

lib/std/compress/tar/testdata/

reader_test.zig:

test_decompress.zig:

  • runs tar.pipeToFileSystem() on valid testdata/ files.

@travisstaloch travisstaloch changed the title std.tar: support more features and add tests std.tar: support pax headers and gnulong_{name,link} Apr 21, 2023
@travisstaloch travisstaloch changed the title std.tar: support pax headers and gnulong_{name,link} std.tar: support pax headers and gnu_long{name,link} Apr 21, 2023
@travisstaloch
Copy link
Contributor Author

travisstaloch commented Apr 21, 2023

addresses, maybe closes #14310
closes #15342 #15222
supersedes #15228

@travisstaloch travisstaloch force-pushed the tar-addtests branch 3 times, most recently from 62eec46 to de62a00 Compare April 21, 2023 08:32
@travisstaloch travisstaloch force-pushed the tar-addtests branch 5 times, most recently from 003656c to 03da105 Compare April 21, 2023 14:15
@travisstaloch travisstaloch force-pushed the tar-addtests branch 2 times, most recently from 89081a2 to 47dda50 Compare April 22, 2023 00:09
@travisstaloch
Copy link
Contributor Author

travisstaloch commented Apr 22, 2023

I have updated the testdata/ files. Previously they were .tar files and the testdata/ directory was around 136K. Now they are gzipped w/ -9 and total 6.9K (around 1/20 the size).

Also, tests now use @embedFile on fixed lists of .gz files rather than iterating the testdata/ directory and opening/reading files.

Copy link
Contributor

@truemedian truemedian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything else here looks good.

@travisstaloch travisstaloch force-pushed the tar-addtests branch 3 times, most recently from ad95885 to eb2ea02 Compare April 23, 2023 00:10
@travisstaloch travisstaloch force-pushed the tar-addtests branch 5 times, most recently from 5b68673 to 955ba4a Compare April 23, 2023 06:27
@travisstaloch travisstaloch force-pushed the tar-addtests branch 3 times, most recently from 8aa9cfe to 7767b61 Compare June 22, 2023 12:23
@alichraghi alichraghi mentioned this pull request Jun 22, 2023
1 task
@travisstaloch
Copy link
Contributor Author

I don't think the aarch64-macos-debug ci failure is related to this PR. Seems to be an OOM.

From the log:

zig test ReleaseSafe native: error: memory usage peaked at 4572626944 bytes, exceeding the declared upper bound of 4080218931

@jacobly0
Copy link
Member

jacobly0 commented Jun 23, 2023

Yeah you're fine, fix is on master if you want to rebase. When that's the only error, all tests still passed.

travisstaloch added a commit to travisstaloch/zig that referenced this pull request Jun 23, 2023
this patch converts all the crashes submitted by @squeek502
(in ziglang#15382 (comment))
to errors.

HeaderIterator:
* add bounds checks to PaxIterator.next()
* convert several unsafe int casts to safe ones

misc:
* added FileType.tagName() for debugging which returns null for unnamed
enum values
* make unixTime() fallible
@emidoots
Copy link

emidoots commented Jul 3, 2023

@travisstaloch looks like you have some more merge conflicts - just thought I'd mention as I am very interested in this PR.

@andrewrk andrewrk self-requested a review July 3, 2023 22:56
-- 4/20/23 --
tar.zig:
* add HeaderIterator() type and convert pipeToFileSystem() to use it.
* add initial support for options.executable_bit_only.
* initial windows support:
  * skip symlinks which require admin rights
  * workaround file.updateTimes() panic by truncating file times
* add tests parseNumeric and parsePaxTime ported from
  https://go.dev/src/archive/tar/strconv_test.go.

lib/std/compress/tar/testdata/
* copy a subset of tar files from
  https://go.dev/src/archive/tar/testdata
* gzip them all with -9. results in around 95% file size reduction

reader_test.zig:
* validate headers against files from testdata/. a port of
  https://go.dev/src/archive/tar/reader_test.go.

test_decompress.zig:
* runs tar.pipeToFileSystem() on valid testdata/ files.
-------------

std.tar: cleanups and fixes

-- 4/22/23 --
tar.zig:
* convert V7Header, UstarHeader, StarHeader, GnuHeader to have fields of
  byte arrays instead of accessor methods.
* Header.getFormat() - optimize by using 64 bit compares rather than
  mem.eql().
* gnu_long{name,link} and pax headers - allow them to be longer than
  512 bytes by adding an allocator param to pipeToFileSystem().
* symlinks: cleanup logic and skip if wasi
  * remove custom toPosixPath() wasn't necessary.
  * workaround header.name corruption issue after makeOpenPath() call
    and add notes about it.
* skip setDirProperties() entrely. leave as a TODO.

src/Package.zig: pass gpa to pipeToFileSystem()

tests: minor cosmetic changes
-------------

std.tar: fix memory errors

-- 4/25/23 --
tar.zig:
* previously HeaderIterator.header()'s v7 param was by value, causing it
  to return invalid pointers to locals for various string fields.  now
  the param along with a few others have been made *const.  this solves
  the perceived 'stack corruption' i thought i was seeing.  because v7
  is 512 bytes, it wasn't clobbered until a fn w/ a larger stack frame,
  dir.makeOpenPath -> dir.openDir -> os.toPosixPath, was called.
* symlinks - check for and set flags.is_directory param
-------------

std.tar: hardlinks, more cleanup

-- 4/25/23 --
tar.zig:
* properly handle hardlinks by copying files.
* improve symlink is_directory check - replace dir.access() call with
  dir.openFile() + handle error.FileNotFound.
* isValidPax(): optimize: replace mem.eql()s with ComptimeStringMap.

tests:
* add testdata/hardlink.tar.gz
-------------

std.tar: more cleanup

-- 4/25/23 --
tests: add testdata/dir-symlink.tar
-------------
this patch converts all the crashes submitted by @squeek502
(in ziglang#15382 (comment))
to errors.

HeaderIterator:
* add bounds checks to PaxIterator.next()
* convert several unsafe int casts to safe ones

misc:
* added FileType.tagName() for debugging which returns null for unnamed
enum values
* make unixTime() fallible
prevent crashes and return errors when:
* any 'named type' header's file path contains a `NUL` character.
* prevent std.bit_set assertion failure when `header.type` is outside
  the bounds of FileType.named_types_bitset.
* return error when header.size is negative via math.cast
* readBlocks(): change 'size' param from u64 to usize to avoid
  unnecessary @intcast
* std.math -> math
* builtins
* mem.alignForward
* fs.file.File.Kind
@travisstaloch
Copy link
Contributor Author

@travisstaloch looks like you have some more merge conflicts - just thought I'd mention as I am very interested in this PR.

thanks for the heads up. just rebased and pushed.

@andrewrk
Copy link
Member

Thanks for keeping this up to date. I'm sorry for taking so long with the review but I want you to know that it's on my mind, definitely looking to get it landed before the release.

@travisstaloch
Copy link
Contributor Author

Thanks for the update. No worries, I know there is a lot to do.

@emidoots
Copy link

emidoots commented Jul 22, 2023

Not helpful-I know, but wanted to mention this is the #1 most impactful thing for us in terms of being able to get rid of submodules in Mach. We have a handful of packages ready-to-go but we just can't depend on them yet due to std.tar not being able to extract the archive.

Copy link
Member

@andrewrk andrewrk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this and diligently keeping it rebased on top of master.

This code is not up to standards. It is rife with code smells, it looks like it blindly ports Go logic rather than building a tar implementation on first principles, it abuses comptime logic and generics for no reason, which causes code bloat and reduces compilation speeds, and it regresses the API, adding an allocator requirement with no documented justification.

Furthermore, I see no discussion of the performance delta on the PR writeup.

I'm sorry, but this is not going to make it into 0.11.0.

As a path forward, I suggest to start with incremental improvements to std.tar, rather than a +2,000 line opinionated change that takes things in many different directions at once.

Comment on lines +51 to +53
return inline for (std.meta.fields(FileType)) |f| {
if (@intFromEnum(ft) == f.value) break f.name;
} else null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be an inline switch instead.

Comment on lines +280 to +317
/// merges key-value pair `kv` into hdr if its a valid PAX field.
/// TODO merge PAX schilly xattrs
pub fn mergePax(kv: [2][]const u8, hdr: *Header) !void {
const k = kv[0];
const v = kv[1];
log.debug("mergePax k={s} v={s}", .{ k, v });
if (v.len == 0) return;

pub fn is_ustar(header: Header) bool {
return std.mem.eql(u8, header.bytes[257..][0..6], "ustar\x00");
const map = std.ComptimeStringMap(std.meta.FieldEnum(Header), .{
.{ Pax.path, .name },
.{ Pax.linkpath, .linkname },
.{ Pax.uname, .uname },
.{ Pax.gname, .gname },
.{ Pax.uid, .uid },
.{ Pax.gid, .gid },
.{ Pax.atime, .atime },
.{ Pax.mtime, .mtime },
.{ Pax.ctime, .ctime },
.{ Pax.size, .size },
});

if (map.get(k)) |field_enum| switch (field_enum) {
.name => hdr.name = v,
.linkname => hdr.linkname = v,
.uname => hdr.uname = v,
.gname => hdr.gname = v,
.uid => hdr.uid = @truncate(try fmt.parseInt(i64, v, 10)),
.gid => hdr.gid = @truncate(try fmt.parseInt(i64, v, 10)),
.atime => hdr.atime = try parsePaxTime(v),
.ctime => hdr.ctime = try parsePaxTime(v),
.mtime => hdr.mtime = try parsePaxTime(v),
.size => hdr.size = try fmt.parseInt(i64, v, 10),
else => unreachable,
} else {
// TODO merge PAX schilly xattrs
// log.debug("TODO handle pax header key={s}", .{k});
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this doing here? Looks incomplete and it has log statements.

},

else => {
hdr.merge(pax_hdr);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this, why are we merging stuff?

hdr = try self.header(v7) orelse return null;

format.setIntersection(hdr.fmt);
log.debug("hdr={}", .{hdr});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unscoped log statement. Generally I think std lib code should not have these

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I'm not planning to leave these log statements in. I just didn't remove them yet anticipating more reviews and avoiding ci churn.

.normal;
}

// Set the final guess at the format.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we guessing? don't guess

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand all the details the tar format. My understanding it that It is messy and attempts to support old extensions. Maybe someone else can help explain why this guessing is necessary.

Here are some references.

https://www.gnu.org/software/tar/manual/html_node/Standard.html
https://go.dev/src/archive/tar/common.go#L201
https://go.dev/src/archive/tar/format.go#L169


// Set the final guess at the format.
if (format.contains(.ustar) and format.contains(.pax))
format.setIntersection(fmt_ustar);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what in the hell is going on here

@@ -600,7 +600,7 @@ fn unpackTarball(
var decompress = try compression.decompress(gpa, br.reader());
defer decompress.deinit();

try std.tar.pipeToFileSystem(out_dir, decompress.reader(), .{
try std.tar.pipeToFileSystem(gpa, out_dir, decompress.reader(), .{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

booooooo

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. But if we want to support long file names, it is necessary to have an allocator. See @truemedian's comment #15382 (comment)

Copy link
Collaborator

@squeek502 squeek502 Jul 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@travisstaloch it might be possible to use a buffer of size std.fs.MAX_NAME_BYTES and read the path one-component-at-a-time (erroring if any component is > MAX_NAME_BYTES long (when encoded as UTF-8) since it shouldn't be possible to create it on the filesystem anyway), using the Dir of the last created component to create the next component and so on.

EDIT: Unless there's a reason to need to store the full path.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MAX_NAME_BYTES may work, but it needs to be able to reasonably handle windows systems (where such value is 260). You can't use the directory of the last component because tar files aren't necessarily laid out as a tree, you always have to create the file from the extraction root.

Copy link
Collaborator

@squeek502 squeek502 Jul 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

260 is MAX_PATH on Windows (and is irrelevant to Zig since we use extended-length path APIs exclusively). See std.os.windows.NAME_MAX and std.fs.MAX_NAME_BYTES.

I'm not sure what you mean by the second part. If each file in a tar has a filepath from the extraction root, then, for example:

some-long-component/some-long-component/<... repeated however many times ...>/foo.txt

I'm suggesting something like:

  • Read into [MAX_NAME_BYTES]u8 until either you hit a path separator or run out of room in the buffer (in which case, fatal error since the component won't be create-able on the filesystem)
  • Use a temporary buffer of one byte to check if this is the last component (read into it until you hit a non-path-separator or end-of-filepath)
  • If this is not the last component, create/open it as a directory (using something like Dir.makePath with the extraction root's Dir or the last created component's Dir)
    • If this is the last component, create it as a file and move to the next path in the tar
  • If there are more components to read, put the byte from the temporary buffer into the [MAX_NAME_BYTES]u8 buffer and repeat from the start

Might need some refining but that's the idea (if there's not something I'm missing that makes it impossible).

Comment on lines +1 to +4
test {
_ = @import("tar/reader_test.zig");
_ = @import("tar/test_decompress.zig");
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like this file, this looks like a big kludge

@travisstaloch
Copy link
Contributor Author

it looks like it blindly ports Go logic rather than building a tar implementation on first principles

You're right. It does blindly port Go logic. As much as I might want to help, I don't think I'm capable of creating a tar implementation from first principles. I'll step aside and let someone else work on this. Feel free continue with this PR or use anything from it in a different PR.

@andrewrk
Copy link
Member

Thank you and I apologize for the harsh words. I know a lot of people are counting on this improvement

@travisstaloch
Copy link
Contributor Author

No worries. I support the desire for high quality code. And understand the desire for smaller PRs. I'll continue to do what I can to help with std.tar. I just don't feel confident in my ability to create a from scratch implementation.

@andrewrk
Copy link
Member

Understood. I've been working on the release notes and your name has come up quite a few times. This release is regardless going to be much better off because of you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants