Summary
ZEP 8 (zarr-developers/zeps#48) proposes a standardized URL syntax for specifying Zarr nodes across implementations. Implementing it in zarr-python would improve interoperability between Zarr tools and make it easier to share dataset locations across storage backends.
Motivation
- A standardized URL syntax ensures consistent dataset access across different implementations.
- The proposed syntax allows for nested storage mechanisms, including ZIP files and hierarchical Zarr groups.
- Improved integration with libraries like xarray, which could leverage this for seamless dataset loading.
Proposed Implementation
I would encourage us to handle this in two phases.
Phase 1
Implement handling of ZEP 8 URLs for the stores implemented in zarr-python. Practically, this requires parsing URLs into Store, Path, and ZarrFormat components.
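To make the parsing step concrete, here is a minimal sketch that splits a URL on `|` and pulls out the store segments, the node path, and the Zarr format. The names `Segment`, `ParsedURL`, and `parse_zep8_url` are hypothetical, and the grammar is only what can be inferred from the examples in this issue (the `zarr2` scheme is an assumption); the ZEP 8 text itself is the authority on the real syntax.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Segment:
    """One '|'-separated piece of the URL (hypothetical helper type)."""
    scheme: str
    path: str


@dataclass(frozen=True)
class ParsedURL:
    """Store chain, node path, and Zarr format (hypothetical result type)."""
    store_segments: tuple
    path: str
    zarr_format: Optional[int]


# Assumption: "zarr2" exists alongside the "zarr3" scheme shown in this issue.
_FORMAT_SCHEMES = {"zarr2": 2, "zarr3": 3}


def split_zep8_url(url: str) -> list:
    """Split a URL into Segment objects, one per '|'-separated part."""
    segments = []
    for index, part in enumerate(url.split("|")):
        if index == 0 and "://" in part:
            # Root segment is an ordinary URL such as gs://bucket/key.
            scheme, _, path = part.partition("://")
        else:
            # Adapter segments look like "scheme" or "scheme:path".
            scheme, _, path = part.partition(":")
        if not scheme:
            raise ValueError(f"empty scheme in segment {part!r}")
        segments.append(Segment(scheme, path))
    return segments


def parse_zep8_url(url: str) -> ParsedURL:
    """Separate the store chain from a trailing zarr2/zarr3 segment, if any."""
    segments = split_zep8_url(url)
    path, zarr_format = "", None
    if segments[-1].scheme in _FORMAT_SCHEMES:
        last = segments.pop()
        path, zarr_format = last.path, _FORMAT_SCHEMES[last.scheme]
    return ParsedURL(tuple(segments), path, zarr_format)
```

Under these assumptions, `parse_zep8_url("gs://my-bucket/my-archive.zip|zip:inner-dir|zarr3:subgroup")` yields two store segments (`gs` and `zip`), the path `subgroup`, and format 3. Turning the store segments into actual Store objects is left out on purpose.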
Phase 2
Implement an entry point mechanism for third-party stores (e.g. Icechunk). This would allow external Store developers to provide a function that generates a Store from a URL.
Example Usage in xarray
With ZEP 8 URL syntax, an xarray.Dataset could be opened as follows:
import xarray as xr

dataset = xr.open_dataset(
    "gs://my-bucket/my-archive.zip|zip:inner-dir|zarr3:subgroup",
    engine="zarr",
)
This URL specifies:
- A ZipStore (my-archive.zip) stored in Google Cloud Storage (gs://my-bucket/)
- A directory inside the ZIP archive (inner-dir) that holds the Zarr v3 store
- A nested hierarchy inside the Zarr store (subgroup)
Another example:
dataset = xr.open_dataset(
    "https://foobar.com/my-archive|icechunk|zarr3",
    engine="zarr",
)
This URL specifies:
- An Icechunk Store available over HTTPS
Additional Considerations
- Backward compatibility with existing fsspec URLs.
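As a rough illustration of how the two URL styles might coexist, the sketch below routes only strings containing the `|` separator to ZEP 8 handling and leaves everything else to the existing fsspec path. The function name `is_zep8_url` is hypothetical, and the rule is an assumption: it relies on fsspec chaining using `::` rather than `|`, and it would misclassify an fsspec URL that happens to contain a literal `|`.

```python
def is_zep8_url(url: str) -> bool:
    """Heuristic: treat a URL as ZEP 8 only if it uses the '|' separator.

    Pipe-free URLs (including fsspec chains written with '::') fall through
    to the existing fsspec handling, so current behavior is unchanged.
    """
    return "|" in url
```

Whether a pipe-free URL should also be accepted as a single-segment ZEP 8 URL is an open design question that this heuristic sidesteps.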
cc @jbms