ZARR: reference stores online (within zip) #11867
Comments
Solved by #11870 (again confirming that any non-Python implementation is a second-class citizen, as numcodecs is pretty much a requirement for Zarr support). For the rest, where is the spec for this sort of Zarr format?
I'll find out; I'm still exploring blindly here but getting more clarity. Super excited that this even works at all, and being able to do this remotely gives a huge amount of leverage. Thanks!!
Happy to answer questions about the parquet format of kerchunk references. It is not a valid zarr format except when interpreted in the context of references (kerchunk, fsspec.implementations.reference and virtualizarr in python-land). I was not aware of anyone using it outside of my bubble! https://fsspec.github.io/kerchunk/spec.html#parquet-references
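For readers coming from Python, a minimal sketch of opening such a reference store; fsspec, zarr and xarray are assumed installed, and the paths here are placeholders:

```python
import fsspec
import xarray as xr

# "fo" may point at a JSON reference file or at a directory of parquet
# references; fsspec.implementations.reference handles both layouts.
fs = fsspec.filesystem(
    "reference",
    fo="references/",         # placeholder: local dir of parquet references
    remote_protocol="https",  # protocol of the referenced target bytes
)
ds = xr.open_zarr(fs.get_mapper(), consolidated=False)
print(ds)
```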
Depends on what you call "working". The GDAL Zarr driver sees the metadata from the .zmetadata file, but if you attempt to read array values, it will "obviously" fail.
Ok, well, one step at a time I guess. Still confused whether this is a format for Zarr or just for Python, but it's not going away. You gave some guidance à la "vsikerchunk" a while back; this is a long haul for me and I very much appreciate your efforts, aligning to the spec and these recent fixes 👌
It was certainly developed in Python, but I think there is public demand for some of this new momentum around zarr in general. You may be following the advent of icechunk and virtualizarr. It is fair to say that some of the technology is still settling, but this particular reference storage format has been used successfully for a while now. It is appropriate for the case where references stored in JSON would be too large to handle easily.
There is huge momentum! That's why I'm such a pest. I should have asked in the Zarr community meeting ... it was too early for me. If anyone needs evidence, see this video; NASA has committed to holus-bolus publication and unpacking of all the crazy old formats (Ayush Nag, presented by James Gallagher at 1:36:12): https://www.youtube.com/watch?v=T6QAwJIwI3Q
What I don't understand is how to get this into the spec so that there's a clear pathway to implementation. Will Icechunk be the standard? I think so; icechunk doing the work of fsspec is a huge step IMO, but maybe that's an uncrossable line for a C++ project.
I should clarify, that the "spec" for zarr is well settled. What we're talking about is an IO layer abstraction which provides this spec to the reader. Icechunk is another of these; I cannot tell if it will eventually take the place of "simple zarr" (data chunks and metadata files in a directory tree) or kerchunk or something else. But every platform needs some code for each storage mechanism, even the ZIP we started off with above, never mind parquet-references.
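For concreteness, a sketch of the "simple zarr" and ZIP IO layers side by side, assuming zarr-python 2.x (filenames are placeholders):

```python
import zarr

# The same zarr "spec" read through two different storage mechanisms:
g_dir = zarr.open_group(zarr.DirectoryStore("data.zarr"), mode="r")
g_zip = zarr.open_group(zarr.ZipStore("data.zarr.zip", mode="r"), mode="r")

# Both stores expose an identical key/value view (".zgroup", "sst/.zarray",
# "sst/0.0.0", ...), so the reading code never changes; only the store does.
```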
As the fsspec person, can I ask why? Is it the prospect of something fsspec-like in Rust (which was thoroughly ignored 2 years ago)?
Wasn't casting shade on fsspec - but anything that's only available in Python and not in a core library is a liability IMO. I'm sure fsspec is excellent, but I don't rely on Python much; it feels to me like a huge abstraction that needs to exist as a core library, for cross-language use (I assume cross-language use is a core need in communities, same as cross-platform). (And indeed the VSI facilities in GDAL seem analogous to what fsspec is.)
Yes, there is certainly a lot of commonality; but no one runs GDAL for remote file/bytes management unless they also have some raster format loading requirements. Anyway, not to spam the thread too much...
I actually do use GDAL for remote file/bytes all the time: it's a great way to create a COG from an online netCDF and push it up to S3 from memory (a classic lambda op), and I don't know otherwise how to lazy-load a Zarr from an online zip, just as a couple of examples. This is now fixed as far as it can be.
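A sketch of those two uses, assuming a GDAL build with the netCDF and Zarr drivers; URLs, the subdataset name, and the bucket are placeholders:

```python
from osgeo import gdal

# COG from an online netCDF, pushed to S3 without touching local disk:
src = gdal.Open('NETCDF:"/vsicurl/https://example.com/sst.nc":analysed_sst')
gdal.Translate("/vsis3/my-bucket/sst_cog.tif", src, format="COG")

# Lazy open of a Zarr store inside an online zip (only metadata is fetched):
ds = gdal.OpenEx('ZARR:"/vsizip//vsicurl/https://example.com/store.zarr.zip"',
                 gdal.OF_MULTIDIM_RASTER)
```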
Support for Kerchunk JSON and Parquet reference stores was added in #12099.
The confusion here is between the zarr spec for serializing bytes and the so-called "native zarr" format, which consists of a specific layout of data on disk. I also found this confusing, so I wrote out a clearer explanation here: zarr-developers/zarr-python#2956.
This is not really true. Maybe it could, but Icechunk doesn't use obstore, and definitely doesn't use fsspec.
Icechunk's rust client will get C bindings (see earth-mover/icechunk#381), at which point you could call it directly from GDAL.
My honest opinion is that this kerchunk zarr parquet format was an experiment that should now be superseded by Icechunk's model. Icechunk has all the same advantages of scalability, and is also able to reference data inside archival formats, but it is more performant, has massive new features such as transactions and version control, is still serverless, and can be made available to other languages through bindings. The backwards-compatibility story is provided by VirtualiZarr's ability to cheaply convert existing kerchunk-formatted references into Icechunk.
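A rough sketch of that conversion path; the virtualizarr and icechunk APIs are evolving quickly, so treat the exact names as approximate and the paths as placeholders:

```python
import icechunk
from virtualizarr import open_virtual_dataset

# Read existing kerchunk references (JSON or parquet) as a virtual dataset.
vds = open_virtual_dataset("combined.json", filetype="kerchunk")

# Write the same references into a fresh Icechunk repo and commit them.
storage = icechunk.local_filesystem_storage("/tmp/icechunk-repo")
repo = icechunk.Repository.create(storage)
session = repo.writable_session("main")
vds.virtualize.to_icechunk(session.store)
session.commit("import kerchunk references")
```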
It doesn't?? This got lost in translation. Why does it need a completely different IO from zarr and/or why is this community developing obstore, then?
To be sure: kerchunk with parquet was a very successful experiment, particularly on dask. Settling on a long-term standard is a good thing for everyone, but I would not push this line too strongly, especially since many, many use cases will be bandwidth-limited with fsspec too.
Has this been done for any super-large parquet reference collections?
Icechunk uses the Rust object_store crate, not obstore or fsspec.
We already showed that it is more performant. It approaches saturation of the network connection, and sustains an insane number of requests, so it literally cannot get much faster. But what's more important than speed is reliability, and using a common layer in rust is better from that perspective too.
I don't think anyone has tried to convert a very large set of existing parquet references, but the ability to read them into virtualizarr does already exist, and we have dealt with pretty large manifests in virtualizarr and icechunk already.
What are the storage details in Icechunk? A new format? We're bound to a Rust library? (It's amusing given the obsession with avoiding dependency on legacy libraries here, but ho hum.) Personally I like the Parquet storage; it should just be a table, though, with groups and array index and the references, so it's obvious and easy to use in other ways. It's weird to have a (Hive) table but then rely on an assumed structural ordering, but that mash of JSON and arrays is a really strong smell in python, for sure. I honestly have no idea what the Icechunk format is; it just seems opaque, which seems a shame given the straightforwardness of trad and virtual Zarr.
@mdsumner Icechunk is a new format, but it's a worthy tradeoff, because all of its safe ACID database-like transactions, versioning, and git-like branches are impossible without it. It's not just another array file format; it's really a fundamentally new type of serverless OLAP database, made fully possible only recently by the addition of "put-if-not-exists" to AWS S3 and other object storage provider APIs. None of these other formats, not even Parquet (which is analogous to traditional Zarr, not to Icechunk), can possibly provide those benefits. The only way they could do so would be to have a specification for a system of which parquet files and additional commit-tracking files should be updated in what circumstances, and at that point you would just be defining a new format. I feel like the community still doesn't understand just how big a step forward Icechunk is, and how much cool stuff will become possible now.
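A sketch of the "put-if-not-exists" primitive referred to above, assuming a recent boto3 against S3; bucket and key are placeholders:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
try:
    # If-None-Match: * makes the PUT fail if the key already exists, which
    # is what lets two writers race safely on a commit-pointer object.
    s3.put_object(Bucket="my-bucket", Key="refs/branch.main",
                  Body=b"snapshot-id", IfNoneMatch="*")
except ClientError as e:
    if e.response["Error"]["Code"] == "PreconditionFailed":
        print("lost the race: another commit landed first")
```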
I wondered the same thing, and yes that is ironic. But at least it's easy to bind to Rust from other languages (often via C). Icechunk still has an open spec anyway, so you could write an icechunk client in another language if you really wanted to.
The on-disk representation is a bit more opaque, and that is a trade-off, but we are getting enormous features in return. It still keeps the straightforwardness of the key-value store interface, pluggable codecs, and zarr extensions regardless.
With respect, we are talking about a very large number of references and parquet storage, and I am not aware of a comparison for such a scenario. Nevertheless, on the performance chart you link to: you are seeing the effect of threading, whereas Python is bound by the GIL for IO (compare the CPU times reported for …).
Feature description
A feature request, or question - can we access this ZARR reference store (in Parquet) without downloading?
I thought this might work (it doesn't appear to need auth for the zip itself):
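A hypothetical reconstruction of the attempt, chaining /vsicurl/ inside /vsizip/ and handing the result to the ZARR driver (the URL is a placeholder for the PO.DAAC zip):

```python
from osgeo import gdal

url = "https://example.com/mur_reference_store.zip"  # placeholder URL
ds = gdal.OpenEx(f'ZARR:"/vsizip//vsicurl/{url}"', gdal.OF_MULTIDIM_RASTER)
print(ds)  # None here is the failure this issue is about
```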
Downloading it first and then reading with /vsizip works:
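A sketch of the local path, with the downloaded filename as a placeholder:

```python
from osgeo import gdal

ds = gdal.OpenEx('ZARR:"/vsizip/mur_reference_store.zip"',
                 gdal.OF_MULTIDIM_RASTER)
rg = ds.GetRootGroup()
print(rg.GetMDArrayNames())
```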
but, with a caveat: some vars are missing (maybe an install/config problem on my side)
These experimental parquet/Zarr stores are listed here: https://podaac.jpl.nasa.gov/dataset/MUR-JPL-L4-GLOB-v4.1