ZARR: reference stores online (within zip) #11867
Comments
Solved by #11870 (again confirming that any non-Python implementation is a second-class citizen, as numcodecs is pretty much a requirement for Zarr support). For the rest, where is the spec for this sort of Zarr format?
I'll find out; I'm still exploring blindly here but getting more clarity. Super excited that this even works at all, and being able to do this remotely gives a huge amount of leverage. Thanks!!
Happy to answer questions about the parquet format of kerchunk references. It is not a valid zarr format except when interpreted in the context of references (kerchunk, fsspec.implementations.reference and virtualizarr in python-land). I was not aware of anyone using it outside of my bubble! https://fsspec.github.io/kerchunk/spec.html#parquet-references
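For readers coming from Python, a minimal sketch of opening such a reference store; fsspec, zarr and xarray are assumed installed, and the paths here are placeholders:

```python
import fsspec
import xarray as xr

# "fo" may point at a JSON reference file or at a directory of parquet
# references; fsspec.implementations.reference handles both layouts.
fs = fsspec.filesystem(
    "reference",
    fo="references/",         # placeholder: local dir of parquet references
    remote_protocol="https",  # protocol of the referenced target bytes
)
ds = xr.open_zarr(fs.get_mapper(), consolidated=False)
print(ds)
```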
Depends on what you call "working". The GDAL Zarr driver sees the metadata from the .zmetadata file, but if you attempt to read array values, it will "obviously" fail.
Ok, well, one step at a time I guess. Still confused whether this is a format for Zarr or just for Python, but it's not going away. You gave some guidance à la "vsikerchunk" a while back; this is a long haul for me and I very much appreciate your efforts, aligning to the spec and these recent fixes 👌
It was certainly developed in Python, but I think there is public demand for some of this new momentum around zarr in general. You may be following the advent of icechunk and virtualizarr. It is fair to say that some of the technology is still settling, but this particular reference storage format has been used successfully for a while now. It is appropriate for the case where references stored in JSON would be too large to handle easily.
There is huge momentum! That's why I'm such a pest. I should have asked in the Zarr community meeting ... it was too early for me. If anyone needs evidence, see this video; NASA has committed to holus-bolus publication and unpacking of all the crazy old formats (Ayush Nag, presented by James Gallagher at 1:36:12): https://www.youtube.com/watch?v=T6QAwJIwI3Q
What I don't understand is how to get this into the spec so that there's a clear pathway to implementation. Will Icechunk be the standard? I think so; icechunk doing the work of fsspec is a huge step IMO, but maybe that's an uncrossable line for a C++ project.
I should clarify, that the "spec" for zarr is well settled. What we're talking about is an IO layer abstraction which provides this spec to the reader. Icechunk is another of these; I cannot tell if it will eventually take the place of "simple zarr" (data chunks and metadata files in a directory tree) or kerchunk or something else. But every platform needs some code for each storage mechanism, even the ZIP we started off with above, never mind parquet-references.
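For concreteness, a sketch of the "simple zarr" and ZIP IO layers side by side, assuming zarr-python 2.x (filenames are placeholders):

```python
import zarr

# The same zarr "spec" read through two different storage mechanisms:
g_dir = zarr.open_group(zarr.DirectoryStore("data.zarr"), mode="r")
g_zip = zarr.open_group(zarr.ZipStore("data.zarr.zip", mode="r"), mode="r")

# Both stores expose an identical key/value view (".zgroup", "sst/.zarray",
# "sst/0.0.0", ...), so the reading code never changes; only the store does.
```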
As the fsspec person, can I ask why? Is it the prospect of something fsspec-like in Rust (which was thoroughly ignored 2 years ago)?
Wasn't casting shade on fsspec - but anything that's only available in Python and not in a core library is a liability IMO. I'm sure fsspec is excellent, but I don't rely on Python much; it feels to me like a huge abstraction that needs to exist as a core library, for cross-language use (I assume cross-language use is a core need in communities, same as cross-platform). (And indeed the VSI facilities in GDAL seem analogous to what fsspec is.)
Yes, there is certainly a lot of commonality; but no one runs GDAL for remote file/bytes management unless they also have some raster format loading requirements. Anyway, not to spam the thread too much...
I actually do use GDAL for remote file/bytes all the time: it's a great way to create a COG from an online netCDF and push it up to S3 from memory (a classic lambda op), and I don't know otherwise how to lazy-load a Zarr from an online zip, just as a couple of examples. This is now fixed as far as it can be.
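A sketch of those two uses, assuming a GDAL build with the netCDF and Zarr drivers; URLs, the subdataset name, and the bucket are placeholders:

```python
from osgeo import gdal

# COG from an online netCDF, pushed to S3 without touching local disk:
src = gdal.Open('NETCDF:"/vsicurl/https://example.com/sst.nc":analysed_sst')
gdal.Translate("/vsis3/my-bucket/sst_cog.tif", src, format="COG")

# Lazy open of a Zarr store inside an online zip (only metadata is fetched):
ds = gdal.OpenEx('ZARR:"/vsizip//vsicurl/https://example.com/store.zarr.zip"',
                 gdal.OF_MULTIDIM_RASTER)
```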
Support for Kerchunk JSON and Parquet reference stores was added in #12099.
The confusion here is between the zarr spec for serializing bytes and the so-called "native zarr" format, which consists of a specific layout of data on disk. I also found this confusing, so I wrote out a clearer explanation here: zarr-developers/zarr-python#2956.
This is not really true. Maybe it could, but Icechunk doesn't use obstore, and definitely doesn't use fsspec.
Icechunk's rust client will get C bindings (see earth-mover/icechunk#381), at which point you could call it directly from GDAL.
My honest opinion is that this kerchunk zarr parquet format was an experiment that should now be superseded by Icechunk's model. Icechunk has all the same advantages of scalability, and is also able to reference data inside archival formats, but it is more performant, has massive new features such as transactions and version control, is still serverless, and can be made available to other languages through bindings. The backwards-compatibility story is provided by VirtualiZarr's ability to cheaply convert existing kerchunk-formatted references into Icechunk.
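A rough sketch of that conversion path; the virtualizarr and icechunk APIs are evolving quickly, so treat the exact names as approximate and the paths as placeholders:

```python
import icechunk
from virtualizarr import open_virtual_dataset

# Read existing kerchunk references (JSON or parquet) as a virtual dataset.
vds = open_virtual_dataset("combined.json", filetype="kerchunk")

# Write the same references into a fresh Icechunk repo and commit them.
storage = icechunk.local_filesystem_storage("/tmp/icechunk-repo")
repo = icechunk.Repository.create(storage)
session = repo.writable_session("main")
vds.virtualize.to_icechunk(session.store)
session.commit("import kerchunk references")
```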
It doesn't?? This got lost in translation. Why does it need a completely different IO from zarr and/or why is this community developing obstore, then?
To be sure: kerchunk with parquet was a very successful experiment, particularly on dask. Settling on a long-term standard is a good thing for everyone, but I would not push this line too strongly, especially since many, many use cases will be bandwidth-limited with fsspec too.
Has this been done for any super-large parquet reference collections?
Icechunk uses the Rust object_store crate, not obstore or fsspec.
We already showed that it is more performant. It approaches saturation of the network connection, and sustains an insane number of requests, so it literally cannot get much faster. But what's more important than speed is reliability, and using a common layer in rust is better from that perspective too.
I don't think anyone has tried to convert a very large set of existing parquet references, but the ability to read them into virtualizarr does already exist, and we have dealt with pretty large manifests in virtualizarr and icechunk already.
What are the storage details in Icechunk? A new format? We're bound to a Rust library? (It's amusing given the obsession with avoiding dependency on legacy libraries here, but ho hum.) Personally I like the Parquet storage; it should just be a table, though, with groups and array index and the references, so it's obvious and easy to use in other ways. It's weird to have a (Hive) table but then rely on an assumed structural ordering, but that mash of JSON and arrays is a really strong smell in python, for sure. I honestly have no idea what the Icechunk format is; it just seems opaque, which seems a shame given the straightforwardness of trad and virtual Zarr.
@mdsumner Icechunk is a new format, but it's a worthy tradeoff, because all of its safe ACID database-like transactions, versioning, and git-like branches are impossible without it. It's not just another array file format; it's really a fundamentally new type of serverless OLAP database, made fully possible only recently by the addition of "put-if-not-exists" to AWS S3 and other object storage provider APIs. None of these other formats, not even Parquet (which is analogous to traditional Zarr, not to Icechunk), can possibly provide those benefits. The only way they could do so would be to have a specification for a system of which parquet files and additional commit-tracking files should be updated in what circumstances, and at that point you would just be defining a new format. I feel like the community still doesn't understand just how big a step forward Icechunk is, and how much cool stuff will become possible now.
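A sketch of the "put-if-not-exists" primitive referred to above, assuming a recent boto3 against S3; bucket and key are placeholders:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
try:
    # If-None-Match: * makes the PUT fail if the key already exists, which
    # is what lets two writers race safely on a commit-pointer object.
    s3.put_object(Bucket="my-bucket", Key="refs/branch.main",
                  Body=b"snapshot-id", IfNoneMatch="*")
except ClientError as e:
    if e.response["Error"]["Code"] == "PreconditionFailed":
        print("lost the race: another commit landed first")
```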
I wondered the same thing, and yes that is ironic. But at least it's easy to bind to Rust from other languages (often via C). Icechunk still has an open spec anyway, so you could write an icechunk client in another language if you really wanted to.
The on-disk representation is a bit more opaque, and that is a trade-off, but we are getting enormous features in return. It still keeps the straightforwardness of the key-value store interface, pluggable codecs, and zarr extensions regardless.
With respect, we are talking about a very large number of references and parquet storage, and I am not aware of a comparison for such a scenario. Nevertheless, on the performance chart you link to: you are seeing the effect of threading, whereas Python is bound by the GIL for IO (compare the CPU times reported for …).
Feature description
A feature request, or question - can we access this ZARR reference store (in Parquet) without downloading?
I thought this might work (it doesn't appear to need auth for the zip itself):
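A hypothetical reconstruction of the attempt, chaining /vsicurl/ inside /vsizip/ and handing the result to the ZARR driver (the URL is a placeholder for the PO.DAAC zip):

```python
from osgeo import gdal

url = "https://example.com/mur_reference_store.zip"  # placeholder URL
ds = gdal.OpenEx(f'ZARR:"/vsizip//vsicurl/{url}"', gdal.OF_MULTIDIM_RASTER)
print(ds)  # None here is the failure this issue is about
```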
Downloading it first and then reading with /vsizip works:
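A sketch of the local path, with the downloaded filename as a placeholder:

```python
from osgeo import gdal

ds = gdal.OpenEx('ZARR:"/vsizip/mur_reference_store.zip"',
                 gdal.OF_MULTIDIM_RASTER)
rg = ds.GetRootGroup()
print(rg.GetMDArrayNames())
```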
but, with a caveat: some vars are missing (maybe an install/config problem on my side)
These experimental parquet/Zarr stores are listed here: https://podaac.jpl.nasa.gov/dataset/MUR-JPL-L4-GLOB-v4.1