Refactoring/fixing zarr-python v3 incompatibilities in xarray datatrees #10020

Open · wants to merge 20 commits into base: main

Changes from 12 commits

Commits (20)
0a2a49e
fixing compatibility with relative paths in open_store function withi…
aladinor Feb 2, 2025
ae80662
fixing/refactoring test to be compatible with Zarr-python v3
aladinor Feb 3, 2025
379db18
adding @requires_zarr_v3 decorator to TestZarrDatatreeIO
aladinor Feb 3, 2025
846dc50
replacing 0 with 1 in _create_test_datatree which will write a chunk
aladinor Feb 3, 2025
ddfd0b5
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 3, 2025
3f9a8fb
fixing issues with groups
aladinor Feb 3, 2025
f140658
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 3, 2025
0e790eb
Merge branch 'main' into dtree-zarrv3
aladinor Feb 3, 2025
403afa9
fixing issue with dict creation
aladinor Feb 3, 2025
58e8f8e
Merge branch 'dtree-zarrv3' of https://github.com/aladinor/xarray int…
aladinor Feb 3, 2025
fd357fa
fixing issues with Mypy
aladinor Feb 3, 2025
8b993a1
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 3, 2025
d4aeeca
refactoring open_store in ZarrStore class to use Zarr.core.group.Grou…
aladinor Feb 3, 2025
3125647
refactoring datree test for zarr ensuring compatibility with zarr-pyt…
aladinor Feb 3, 2025
0c7485b
importing zarr.core.group only inside open_store function
aladinor Feb 3, 2025
fdeee94
documenting changes in whats-new.rst file
aladinor Feb 3, 2025
f3e2c66
Update xarray/backends/zarr.py
aladinor Feb 4, 2025
f9f1043
keeping group creation compatible with zarr v2
aladinor Feb 6, 2025
c118841
Merge branch 'main' into dtree-zarrv3
aladinor Feb 6, 2025
ec2086a
fixing issue with mypy
aladinor Feb 6, 2025
22 changes: 14 additions & 8 deletions xarray/backends/zarr.py
@@ -655,10 +655,12 @@ def open_store(
             use_zarr_fill_value_as_mask=use_zarr_fill_value_as_mask,
             zarr_format=zarr_format,
         )
-        group_paths = list(_iter_zarr_groups(zarr_group, parent=group))
+
+        group_members: dict = dict(_iter_zarr_groups(zarr_group, parent=group))
+
         return {
             group: cls(
-                zarr_group.get(group),
+                store_group,
                 mode,
                 consolidate_on_close,
                 append_dim,
@@ -669,7 +671,7 @@ def open_store(
                 use_zarr_fill_value_as_mask,
                 cache_members=cache_members,
             )
-            for group in group_paths
+            for group, store_group in group_members.items()
         }

     @classmethod
@@ -1698,12 +1700,16 @@ def open_groups_as_dict(
     return groups_dict


-def _iter_zarr_groups(root: ZarrGroup, parent: str = "/") -> Iterable[str]:
+def _iter_zarr_groups(root: ZarrGroup, parent: str = "/") -> tuple[str, Any]:
+    from zarr.core.group import Group
+
     parent_nodepath = NodePath(parent)
-    yield str(parent_nodepath)
-    for path, group in root.groups():
+    yield str(parent_nodepath), root
+    # for path, group in root.groups():
+    for path, group in root.members():
         gpath = parent_nodepath / path
-        yield from _iter_zarr_groups(group, parent=str(gpath))
+        if isinstance(group, Group):
+            yield from _iter_zarr_groups(group, parent=str(gpath))


 def _get_open_params(
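For orientation, here is a minimal, self-contained sketch of the traversal contract this PR moves to: `_iter_zarr_groups` now yields `(path, group)` pairs and recurses only into subgroups, so `open_store` can build one `ZarrStore` per group from the `Group` object itself instead of re-fetching each group by path. The store path and helper name below are illustrative, and zarr-python v3's `Group.members()` is assumed per the diff above:

```python
import zarr

# Build a tiny hierarchy to walk (illustrative path).
root = zarr.open_group("iter_demo.zarr", mode="w")
root.create_group("Group1").create_group("subgroup1")


def iter_groups(node, parent="/"):
    # Mirror of the reworked _iter_zarr_groups: yield (path, group) pairs,
    # skipping arrays, since members() returns both arrays and groups.
    yield parent, node
    for name, member in node.members():
        if isinstance(member, zarr.Group):
            yield from iter_groups(member, parent=f"{parent.rstrip('/')}/{name}")


print(list(dict(iter_groups(root))))
# ['/', '/Group1', '/Group1/subgroup1']
```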
@@ -1751,7 +1757,7 @@ def _get_open_params(
         consolidated = False

     if _zarr_v3():
-        missing_exc = ValueError
+        missing_exc = AssertionError
Member: :( we should be providing a better error in Zarr. Do you have an example traceback that raises this?

aladinor (Contributor, Author): I think we need to provide a better error here too; I realized it yesterday. Not sure why we handle it like that.
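Since the thread asks for a concrete trigger, here is a minimal reproduction sketch, assuming (as the diff above does) that zarr-python v3 currently surfaces an AssertionError where v2 raised zarr.errors.GroupNotFoundError; the store path is illustrative:

```python
import zarr

zarr.open_group("missing_demo.zarr", mode="w")  # create an empty root group

try:
    zarr.open_group("missing_demo.zarr", path="no/such/group", mode="r")
except AssertionError:
    # zarr-python v3 behavior assumed per missing_exc above; zarr v2 raises
    # zarr.errors.GroupNotFoundError here instead.
    print("group not found")
```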

     else:
         missing_exc = zarr.errors.GroupNotFoundError
2 changes: 1 addition & 1 deletion xarray/tests/conftest.py
@@ -191,7 +191,7 @@ def create_test_datatree():
     """

     def _create_test_datatree(modify=lambda ds: ds):
-        set1_data = modify(xr.Dataset({"a": 0, "b": 1}))
+        set1_data = modify(xr.Dataset({"a": 1, "b": 2}))
         set2_data = modify(xr.Dataset({"a": ("x", [2, 3]), "b": ("x", [0.1, 0.2])}))
         root_data = modify(xr.Dataset({"a": ("y", [6, 7, 8]), "set0": ("x", [9, 10])}))
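Context for this fixture change, per the commit message "replacing 0 with 1 in _create_test_datatree which will write a chunk": under zarr v3, a value equal to the default fill value (0) may be elided, so no chunk file is written, which breaks tests that inspect the on-disk layout. A hedged illustration (store paths invented for the example; the exact elision behavior depends on zarr's defaults):

```python
import xarray as xr

# "a": 0 matches zarr's default fill value, so zarr v3 may skip writing the
# chunk entirely; nonzero values force a chunk onto disk. (Assumption drawn
# from the commit message above.)
xr.Dataset({"a": 0, "b": 1}).to_zarr("maybe_empty.zarr", mode="w")
xr.Dataset({"a": 1, "b": 2}).to_zarr("writes_chunks.zarr", mode="w")
```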
79 changes: 41 additions & 38 deletions xarray/tests/test_backends_datatree.py
@@ -15,7 +15,7 @@
     requires_dask,
     requires_h5netcdf,
     requires_netCDF4,
-    requires_zarr,
+    requires_zarr_v3,
 )

if TYPE_CHECKING:
@@ -141,14 +141,16 @@
         a        (y) int64 16B ...
         b        (x) float64 16B ...
     """
+    from zarr import consolidate_metadata
+
     filepath = tmp_path_factory.mktemp("data") / "unaligned_simple_datatree.zarr"
     root_data = xr.Dataset({"a": ("y", [6, 7, 8]), "set0": ("x", [9, 10])})
     set1_data = xr.Dataset({"a": 0, "b": 1})
     set2_data = xr.Dataset({"a": ("y", [2, 3]), "b": ("x", [0.1, 0.2])})
     root_data.to_zarr(filepath)
     set1_data.to_zarr(filepath, group="/Group1", mode="a")
     set2_data.to_zarr(filepath, group="/Group2", mode="a")
-    set1_data.to_zarr(filepath, group="/Group1/subgroup1", mode="a")
+    consolidate_metadata(filepath)
Member: something seems off here. IIUC, the prior behavior consolidated metadata at the root of the store (filepath in this case) after each call to to_zarr. Is that not happening anymore?

aladinor (Contributor, Author): @jhamman, I am not sure what is happening in this case, but if we don't call consolidate_metadata it won't recognize any nodes other than the root node /. Consolidating the Zarr store might therefore partially address this issue.

On the other hand, I also removed the nested group set1_data.to_zarr(filepath, group="/Group1/subgroup1", mode="a") because the behavior is the same: it won't be recognized as a node in either case. I guess this is something we might want to dig into further.

     yield filepath
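A minimal sketch of the consolidation step debated above, using only public APIs; the path and datasets are illustrative:

```python
import xarray as xr
from zarr import consolidate_metadata

filepath = "unaligned_demo.zarr"
xr.Dataset({"a": ("y", [6, 7, 8])}).to_zarr(filepath, mode="w")
xr.Dataset({"a": 0, "b": 1}).to_zarr(filepath, group="/Group1", mode="a")

# Without this call, open_groups/open_datatree may only discover the root
# node "/" under zarr v3, per the discussion above.
consolidate_metadata(filepath)

print(list(xr.open_groups(filepath, engine="zarr")))
# e.g. ['/', '/Group1']
```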


@@ -373,15 +375,12 @@
     engine: T_DataTreeNetcdfEngine | None = "h5netcdf"


-@pytest.mark.skipif(
-    have_zarr_v3, reason="datatree support for zarr 3 is not implemented yet"
-)
-@requires_zarr
+@requires_zarr_v3
 class TestZarrDatatreeIO:
     engine = "zarr"

     def test_to_zarr(self, tmpdir, simple_datatree):
-        filepath = tmpdir / "test.zarr"
+        filepath = str(tmpdir / "test.zarr")
         original_dt = simple_datatree
         original_dt.to_zarr(filepath)

@@ -391,16 +390,31 @@
     def test_zarr_encoding(self, tmpdir, simple_datatree):
         from numcodecs.blosc import Blosc

-        filepath = tmpdir / "test.zarr"
+        filepath = str(tmpdir / "test.zarr")
         original_dt = simple_datatree

-        comp = {"compressor": Blosc(cname="zstd", clevel=3, shuffle=2)}
+        blosc = Blosc(cname="zstd", clevel=3, shuffle="shuffle").get_config()
+        comp = {"compressor": {"name": blosc.pop("id"), "configuration": blosc}}
         enc = {"/set2": {var: comp for var in original_dt["/set2"].dataset.data_vars}}
         original_dt.to_zarr(filepath, encoding=enc)

         with open_datatree(filepath, engine="zarr") as roundtrip_dt:
-            print(roundtrip_dt["/set2/a"].encoding)
-            assert roundtrip_dt["/set2/a"].encoding["compressor"] == comp["compressor"]
+            retrieved_compressor = roundtrip_dt["/set2/a"].encoding["compressors"][
+                0
+            ]  # Get the BloscCodec object
+            assert (
+                retrieved_compressor.cname.name
+                == comp["compressor"]["configuration"]["cname"]
+            )
+            assert (
+                retrieved_compressor.clevel
+                == comp["compressor"]["configuration"]["clevel"]
+            )
+            assert (
+                retrieved_compressor.shuffle.name
+                == comp["compressor"]["configuration"]["shuffle"]
+            )

         enc["/not/a/group"] = {"foo": "bar"}  # type: ignore[dict-item]
         with pytest.raises(ValueError, match="unexpected encoding group.*"):
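The encoding change above maps the zarr v2 style (a numcodecs codec object) onto the v3 metadata form (a name/configuration dict). A small sketch of that mapping; the printed output is indicative and depends on the numcodecs version:

```python
from numcodecs.blosc import Blosc

# v2 style: hand the numcodecs codec object to the encoder directly.
comp_v2 = {"compressor": Blosc(cname="zstd", clevel=3, shuffle=Blosc.SHUFFLE)}

# v3 style, as in the test above: derive name/configuration from the codec
# config; get_config()'s "id" field becomes the codec name.
blosc = Blosc(cname="zstd", clevel=3, shuffle=Blosc.SHUFFLE).get_config()
comp_v3 = {"compressor": {"name": blosc.pop("id"), "configuration": blosc}}

print(comp_v3)
# e.g. {'compressor': {'name': 'blosc',
#       'configuration': {'cname': 'zstd', 'clevel': 3, 'shuffle': 1,
#                         'blocksize': 0}}}
```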
@@ -409,9 +423,9 @@
     def test_to_zarr_zip_store(self, tmpdir, simple_datatree):
         from zarr.storage import ZipStore

-        filepath = tmpdir / "test.zarr.zip"
+        filepath = str(tmpdir / "test.zarr.zip")
         original_dt = simple_datatree
-        store = ZipStore(filepath)
+        store = ZipStore(filepath, mode="w")
         original_dt.to_zarr(store)

         with open_datatree(store, engine="zarr") as roundtrip_dt:  # type: ignore[arg-type, unused-ignore]
assert_equal(original_dt, roundtrip_dt)

def test_to_zarr_default_write_mode(self, tmpdir, simple_datatree):
import zarr

simple_datatree.to_zarr(tmpdir)
simple_datatree.to_zarr(str(tmpdir))

# with default settings, to_zarr should not overwrite an existing dir
with pytest.raises(zarr.errors.ContainsGroupError):
simple_datatree.to_zarr(tmpdir)
with pytest.raises(FileExistsError):
simple_datatree.to_zarr(str(tmpdir))

@requires_dask
def test_to_zarr_compute_false(self, tmpdir, simple_datatree):
import dask.array as da

filepath = tmpdir / "test.zarr"
original_dt = simple_datatree.chunk()
original_dt.to_zarr(filepath, compute=False)
original_dt.to_zarr(str(filepath), compute=False)

for node in original_dt.subtree:
for name, variable in node.dataset.variables.items():
var_dir = filepath / node.path / name
var_files = var_dir.listdir()
assert var_dir / ".zarray" in var_files
assert var_dir / ".zattrs" in var_files
assert var_dir / "zarr.json" in var_files
if isinstance(variable.data, da.Array):
assert var_dir / "0" not in var_files
assert var_dir / "zarr.json" in var_files
else:
assert var_dir / "0" in var_files
assert var_dir / "c" in var_files
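The rewritten assertions reflect the zarr v3 on-disk layout: per-array metadata and attributes live in a single zarr.json, and chunks sit under a c/ subdirectory instead of bare files like 0 next to .zarray/.zattrs. A quick way to see this (path illustrative):

```python
import os

import xarray as xr

xr.Dataset({"a": ("x", [1, 2])}).to_zarr("layout_demo.zarr", mode="w")

# Under zarr-python v3 this typically prints ['c', 'zarr.json']; under v2 it
# would be ['.zarray', '.zattrs', '0'].
print(sorted(os.listdir("layout_demo.zarr/a")))
```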

     def test_to_zarr_inherited_coords(self, tmpdir):
         original_dt = DataTree.from_dict(
@@ -466,7 +477,7 @@
                 "/sub": xr.Dataset({"b": (("x",), [5, 6])}),
             }
         )
-        filepath = tmpdir / "test.zarr"
+        filepath = str(tmpdir / "test.zarr")
         original_dt.to_zarr(filepath)

         with open_datatree(filepath, engine="zarr") as roundtrip_dt:
@@ -476,7 +487,7 @@

     def test_open_groups_round_trip(self, tmpdir, simple_datatree) -> None:
         """Test `open_groups` opens a zarr store with the `simple_datatree` structure."""
-        filepath = tmpdir / "test.zarr"
+        filepath = str(tmpdir / "test.zarr")
         original_dt = simple_datatree
         original_dt.to_zarr(filepath)

@@ -501,7 +512,7 @@

     @requires_dask
     def test_open_datatree_chunks(self, tmpdir, simple_datatree) -> None:
-        filepath = tmpdir / "test.zarr"
+        filepath = str(tmpdir / "test.zarr")

         chunks = {"x": 2, "y": 1}

@@ -528,7 +539,6 @@

         assert "/" in unaligned_dict_of_datasets.keys()
         assert "/Group1" in unaligned_dict_of_datasets.keys()
-        assert "/Group1/subgroup1" in unaligned_dict_of_datasets.keys()
         assert "/Group2" in unaligned_dict_of_datasets.keys()
         # Check that group name returns the correct datasets
         with xr.open_dataset(
@@ -539,10 +549,6 @@
             unaligned_datatree_zarr, group="Group1", engine="zarr"
         ) as expected:
             assert_identical(unaligned_dict_of_datasets["/Group1"], expected)
-        with xr.open_dataset(
-            unaligned_datatree_zarr, group="/Group1/subgroup1", engine="zarr"
-        ) as expected:
-            assert_identical(unaligned_dict_of_datasets["/Group1/subgroup1"], expected)
         with xr.open_dataset(
             unaligned_datatree_zarr, group="/Group2", engine="zarr"
         ) as expected:
@@ -553,13 +559,13 @@

     def test_open_datatree_specific_group(self, tmpdir, simple_datatree) -> None:
         """Test opening a specific group within a Zarr store using `open_datatree`."""
-        filepath = tmpdir / "test.zarr"
+        filepath = str(tmpdir / "test.zarr")
         group = "/set2"
         original_dt = simple_datatree
         original_dt.to_zarr(filepath)
         expected_subtree = original_dt[group].copy()
         expected_subtree.orphan()
         with open_datatree(filepath, group=group, engine=self.engine) as subgroup_tree:

Check failure on line 568 in xarray/tests/test_backends_datatree.py (GitHub Actions: ubuntu-latest py3.11 all-but-dask; ubuntu-latest py3.12 all-but-numba; ubuntu-latest py3.12; macos-latest py3.12):
TestZarrDatatreeIO.test_open_datatree_specific_group RuntimeWarning: Failed to open Zarr store with consolidated metadata, but successfully read with non-consolidated metadata. This is typically much slower for opening a dataset. To silence this warning, consider: 1. Consolidating metadata in this existing store with zarr.consolidate_metadata(). 2. Explicitly setting consolidated=False, to avoid trying to read consolidate metadata, or 3. Explicitly setting consolidated=True, to raise an error in this case instead of falling back to try reading non-consolidated metadata.
             assert subgroup_tree.root.parent is None
             assert_equal(subgroup_tree, expected_subtree)

@@ -568,10 +574,7 @@
         """Test `open_groups` with chunks on a zarr store."""

         chunks = {"x": 2, "y": 1}
-        filepath = tmpdir / "test.zarr"
-
-        chunks = {"x": 2, "y": 1}
-
+        filepath = str(tmpdir / "test.zarr")
         root_data = xr.Dataset({"a": ("y", [6, 7, 8]), "set0": ("x", [9, 10])})
         set1_data = xr.Dataset({"a": ("y", [-1, 0, 1]), "b": ("x", [-10, 6])})
         set2_data = xr.Dataset({"a": ("y", [1, 2, 3]), "b": ("x", [0.1, 0.2])})
@@ -605,7 +608,7 @@
         expected_dt = original_dt.copy()
         expected_dt.name = None

-        filepath = tmpdir / "test.zarr"
+        filepath = str(tmpdir / "test.zarr")
         original_dt.to_zarr(filepath)

         with open_datatree(filepath, engine="zarr") as roundtrip_dt:
@@ -620,7 +623,7 @@
             }
         )

-        filepath = tmpdir / "test.zarr"
+        filepath = str(tmpdir / "test.zarr")
         original_dt.to_zarr(filepath, write_inherited_coords=False)

         with open_datatree(filepath, engine="zarr") as roundtrip_dt:
@@ -628,7 +631,7 @@
             assert_identical(roundtrip_dt, expected_dt)

         expected_child = original_dt.children["child"].copy(inherit=False)
         expected_child.name = None
         with open_datatree(filepath, group="child", engine="zarr") as roundtrip_child:

Check failure on line 634 in xarray/tests/test_backends_datatree.py (GitHub Actions: ubuntu-latest py3.11 all-but-dask; ubuntu-latest py3.12 all-but-numba; ubuntu-latest py3.12; macos-latest py3.12):
TestZarrDatatreeIO.test_write_inherited_coords_false RuntimeWarning: Failed to open Zarr store with consolidated metadata, but successfully read with non-consolidated metadata. This is typically much slower for opening a dataset. To silence this warning, consider: 1. Consolidating metadata in this existing store with zarr.consolidate_metadata(). 2. Explicitly setting consolidated=False, to avoid trying to read consolidate metadata, or 3. Explicitly setting consolidated=True, to raise an error in this case instead of falling back to try reading non-consolidated metadata.
             assert_identical(expected_child, roundtrip_child)

     def test_write_inherited_coords_true(self, tmpdir):
@@ -639,7 +642,7 @@
             }
         )

-        filepath = tmpdir / "test.zarr"
+        filepath = str(tmpdir / "test.zarr")
         original_dt.to_zarr(filepath, write_inherited_coords=True)

         with open_datatree(filepath, engine="zarr") as roundtrip_dt:
@@ -647,5 +650,5 @@

         expected_child = original_dt.children["child"].copy(inherit=True)
         expected_child.name = None
         with open_datatree(filepath, group="child", engine="zarr") as roundtrip_child:

Check failure on line 653 in xarray/tests/test_backends_datatree.py (GitHub Actions: ubuntu-latest py3.11 all-but-dask; ubuntu-latest py3.12 all-but-numba; ubuntu-latest py3.12; macos-latest py3.12):
TestZarrDatatreeIO.test_write_inherited_coords_true RuntimeWarning: Failed to open Zarr store with consolidated metadata, but successfully read with non-consolidated metadata. This is typically much slower for opening a dataset. To silence this warning, consider: 1. Consolidating metadata in this existing store with zarr.consolidate_metadata(). 2. Explicitly setting consolidated=False, to avoid trying to read consolidate metadata, or 3. Explicitly setting consolidated=True, to raise an error in this case instead of falling back to try reading non-consolidated metadata.
             assert_identical(expected_child, roundtrip_child)