Refactoring/fixing zarr-python v3 incompatibilities in xarray datatrees #10020

Open
wants to merge 20 commits into
base: main

Conversation

aladinor
Contributor

@aladinor aladinor commented Feb 3, 2025

Member

@jhamman jhamman left a comment

Thanks @aladinor for working on this! I'm happy to see that we are actually quite close to getting this working. I think the changes you made to the tests highlight that we have some work to do around consolidated metadata and possibly error handling in Zarr.

xarray/backends/zarr.py
@@ -1751,7 +1749,7 @@ def _get_open_params(
         consolidated = False
 
     if _zarr_v3():
-        missing_exc = ValueError
+        missing_exc = AssertionError
Member

:( we should be providing a better error in Zarr. Do you have an example traceback that raises this?

Contributor Author

I think we need to provide a better error here too. I noticed it yesterday. I'm not sure why we handle it like that.
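
For context, here is a minimal sketch of the pattern the diff above touches: a version-dependent exception type is caught and re-raised as a clearer error. It is an illustration only, not the code in xarray/backends/zarr.py. The names open_group_or_raise and fake_v3_opener are hypothetical stand-ins, and the assumption that a missing group surfaces as a bare AssertionError is taken from the diff, not verified against zarr-python.

```python
def open_group_or_raise(opener, path, missing_exc):
    """Call opener(path) and translate the backend's "missing" exception.

    missing_exc is the exception type the backend is assumed to raise when
    the group does not exist (the diff above switches it per Zarr version).
    """
    try:
        return opener(path)
    except missing_exc as err:
        # Re-raise with a message that names the path, keeping the cause.
        raise FileNotFoundError(f"No such group or store: '{path}'") from err


def fake_v3_opener(path):
    # Hypothetical stand-in for a zarr call: raises a bare AssertionError,
    # which carries no message and is hard for users to interpret.
    raise AssertionError


try:
    open_group_or_raise(fake_v3_opener, "missing.zarr", AssertionError)
except FileNotFoundError as err:
    message = str(err)
    cause = err.__cause__
```

Catching something as broad as AssertionError also risks hiding unrelated bugs, which is one reason a dedicated "not found" error raised by the library itself would be preferable.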

filepath = tmp_path_factory.mktemp("data") / "unaligned_simple_datatree.zarr"
root_data = xr.Dataset({"a": ("y", [6, 7, 8]), "set0": ("x", [9, 10])})
set1_data = xr.Dataset({"a": 0, "b": 1})
set2_data = xr.Dataset({"a": ("y", [2, 3]), "b": ("x", [0.1, 0.2])})
root_data.to_zarr(filepath)
set1_data.to_zarr(filepath, group="/Group1", mode="a")
set2_data.to_zarr(filepath, group="/Group2", mode="a")
set1_data.to_zarr(filepath, group="/Group1/subgroup1", mode="a")
consolidate_metadata(filepath)
Member

Something seems off here. IIUC, the prior behavior consolidated metadata at the root of the store (filepath in this case) after each call to to_zarr. Is that not happening anymore?

Contributor Author

@jhamman, I am not sure what is happening in this case, but if we don't call consolidate_metadata, only the root node / is recognized and none of the other nodes are. Consolidating the Zarr store might therefore partially address this issue.

On the other hand, I also removed the nested group write, set1_data.to_zarr(filepath, group="/Group1/subgroup1", mode="a"), because the behavior is the same: it is not recognized as a node in either scenario. I guess this is something we might want to dig into further.
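
One possible explanation, sketched as a toy model and not as zarr-python's actual implementation: if readers rely on a metadata snapshot stored at the root, any group written after the snapshot was taken stays invisible until the root is consolidated again. The dict-based store and the helpers node_key, write_group, consolidate, and list_groups are all hypothetical; only the zarr.json file name and the consolidated_metadata key loosely follow the Zarr v3 layout.

```python
META = "zarr.json"  # per-node metadata file name in the Zarr v3 layout


def node_key(path):
    """Map a group path such as '/Group1' to its metadata key."""
    return f"{path.strip('/')}/{META}".lstrip("/")


def write_group(store, path):
    """Record a group by writing its own metadata document."""
    store[node_key(path)] = {"node_type": "group"}


def consolidate(store):
    """Snapshot all non-root group documents into the root document."""
    children = {
        key[: -len("/" + META)]: dict(doc)
        for key, doc in store.items()
        if key != META and key.endswith("/" + META)
    }
    store[META]["consolidated_metadata"] = children


def list_groups(store, use_consolidated):
    """List child groups from the snapshot or by scanning every key."""
    if use_consolidated:
        return sorted(store[META].get("consolidated_metadata", {}))
    return sorted(
        key[: -len("/" + META)]
        for key in store
        if key != META and key.endswith("/" + META)
    )


store = {}
write_group(store, "/")
write_group(store, "/Group1")
consolidate(store)
write_group(store, "/Group2")  # written after the snapshot was taken

stale = list_groups(store, use_consolidated=True)
scanned = list_groups(store, use_consolidated=False)
consolidate(store)  # re-consolidating at the root picks up the new group
fresh = list_groups(store, use_consolidated=True)
```

In this model the stale listing misses Group2 until consolidate runs again, which would be consistent with the explicit consolidate_metadata(filepath) call making the extra nodes visible. Whether that is what actually happens in the test above still needs to be confirmed.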

@TomNicholas TomNicholas added topic-zarr Related to zarr storage library topic-DataTree Related to the implementation of a DataTree class labels Feb 4, 2025
@maxrjones maxrjones mentioned this pull request Feb 6, 2025
2 tasks
@aladinor
Contributor Author

aladinor commented Feb 6, 2025

Another issue I found, @jhamman, is that reopening our datatree while pointing to a specific group raises an error. Let me try to explain. Suppose we have this datatree:

<xarray.DataTree>
Group: /
│   Dimensions:  (x: 128, y: 256)
│   Dimensions without coordinates: x, y
│   Data variables:
│       w        (x) float64 1kB dask.array<chunksize=(128,), meta=np.ndarray>
│       z        (x, y) float64 262kB dask.array<chunksize=(64, 256), meta=np.ndarray>
├── Group: /b
│       Dimensions:  (y: 256, x: 128)
│       Dimensions without coordinates: y, x
│       Data variables:
│           B        (y, x) float64 262kB dask.array<chunksize=(128, 128), meta=np.ndarray>
├── Group: /a
│       Dimensions:  (x: 128, y: 256)
│       Dimensions without coordinates: x, y
│       Data variables:
│           A        (x, y) float64 262kB dask.array<chunksize=(64, 256), meta=np.ndarray>
└── Group: /c
    │   Dimensions:  (x: 128, y: 256)
    │   Dimensions without coordinates: x, y
    │   Data variables:
    │       w        (x) float64 1kB dask.array<chunksize=(128,), meta=np.ndarray>
    │       z        (x, y) float64 262kB dask.array<chunksize=(64, 256), meta=np.ndarray>
    └── Group: /c/d
            Dimensions:  (x: 128, y: 256)
            Dimensions without coordinates: x, y
            Data variables:
                G        (x, y) float64 262kB dask.array<chunksize=(64, 256), meta=np.ndarray>

It was saved using Zarr v3. When opening it back with consolidated=True and pointing to the /c group (or any other group),

dt_round_zarr = xr.open_datatree(
    "testv3_dt.zarr",
    consolidated=True,
    chunks={},
    group="/c",
)

it raises the following error:

Traceback (most recent call last):
  File "/media/alfonso/drive/Alfonso/python/xarray/refactoring.py", line 129, in <module>
    main()
  File "/media/alfonso/drive/Alfonso/python/xarray/refactoring.py", line 111, in main
    dt_round_zarr = xr.open_datatree(
                    ^^^^^^^^^^^^^^^^^
  File "/media/alfonso/drive/Alfonso/python/xarray/xarray/backends/api.py", line 1130, in open_datatree
    backend_tree = backend.open_datatree(
                   ^^^^^^^^^^^^^^^^^^^^^^
  File "/media/alfonso/drive/Alfonso/python/xarray/xarray/backends/zarr.py", line 1628, in open_datatree
    groups_dict = self.open_groups_as_dict(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/alfonso/drive/Alfonso/python/xarray/xarray/backends/zarr.py", line 1677, in open_groups_as_dict
    stores = ZarrStore.open_store(
             ^^^^^^^^^^^^^^^^^^^^^
  File "/media/alfonso/drive/Alfonso/python/xarray/xarray/backends/zarr.py", line 645, in open_store
    ) = _get_open_params(
        ^^^^^^^^^^^^^^^^^
  File "/media/alfonso/drive/Alfonso/python/xarray/xarray/backends/zarr.py", line 1798, in _get_open_params
    zarr_group = zarr.open_consolidated(store, **open_kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/alfonso/mambaforge/envs/xarray-tests/lib/python3.12/site-packages/zarr/api/synchronous.py", line 212, in open_consolidated
    sync(async_api.open_consolidated(*args, use_consolidated=use_consolidated, **kwargs))
  File "/home/alfonso/mambaforge/envs/xarray-tests/lib/python3.12/site-packages/zarr/core/sync.py", line 142, in sync
    raise return_result
  File "/home/alfonso/mambaforge/envs/xarray-tests/lib/python3.12/site-packages/zarr/core/sync.py", line 98, in _runner
    return await coro
           ^^^^^^^^^^
  File "/home/alfonso/mambaforge/envs/xarray-tests/lib/python3.12/site-packages/zarr/api/asynchronous.py", line 346, in open_consolidated
    return await open_group(*args, use_consolidated=use_consolidated, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/alfonso/mambaforge/envs/xarray-tests/lib/python3.12/site-packages/zarr/api/asynchronous.py", line 807, in open_group
    return await AsyncGroup.open(
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/alfonso/mambaforge/envs/xarray-tests/lib/python3.12/site-packages/zarr/core/group.py", line 553, in open
    return cls._from_bytes_v3(
           ^^^^^^^^^^^^^^^^^^^
  File "/home/alfonso/mambaforge/envs/xarray-tests/lib/python3.12/site-packages/zarr/core/group.py", line 611, in _from_bytes_v3
    raise ValueError(msg)
ValueError: Consolidated metadata requested with 'use_consolidated=True' but not found in 'c'.

I think this might have to do with consolidated metadata.
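
To make that suspicion concrete, here is a toy reproduction of the failure mode. It assumes the consolidated metadata lives only in the root document, so a subgroup opened directly has nothing consolidated of its own. The dict store and the helpers open_group and open_subgroup_via_root are hypothetical and only mimic the error message from the traceback; this is not how zarr-python or xarray is implemented.

```python
def open_group(store, path="", use_consolidated=False):
    """Toy reader: return the metadata document of the group at `path`."""
    name = path.strip("/")
    doc = store[f"{name}/zarr.json".lstrip("/")]
    if use_consolidated and "consolidated_metadata" not in doc:
        # Mirrors the message seen in the traceback above.
        raise ValueError(
            "Consolidated metadata requested with 'use_consolidated=True' "
            f"but not found in '{name}'."
        )
    return doc


def open_subgroup_via_root(store, path):
    """Possible direction: resolve the subgroup through the root snapshot."""
    root = open_group(store, "", use_consolidated=True)
    return root["consolidated_metadata"][path.strip("/")]


# Only the root document carries the consolidated snapshot (an assumption).
store = {
    "zarr.json": {
        "node_type": "group",
        "consolidated_metadata": {
            "c": {"node_type": "group"},
            "c/d": {"node_type": "group"},
        },
    },
    "c/zarr.json": {"node_type": "group"},
    "c/d/zarr.json": {"node_type": "group"},
}

try:
    open_group(store, "/c", use_consolidated=True)
    direct_error = None
except ValueError as err:
    direct_error = str(err)

via_root = open_subgroup_via_root(store, "/c")
```

Opening /c directly fails in this model, while going through the root succeeds. Whether xarray should resolve subgroups through the root's consolidated metadata, or fall back to a non-consolidated open, is an open design question here.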

Labels
topic-DataTree Related to the implementation of a DataTree class topic-zarr Related to zarr storage library
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DataTree roundtrip fails on None group lookup
3 participants