Skip to content

(fix): structured dtype fill value consolidated metadata #3015

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions changes/2998.bugfix.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Fix structured `dtype` fill value serialization for consolidated metadata
9 changes: 8 additions & 1 deletion src/zarr/core/group.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
from __future__ import annotations

import asyncio
import base64
import itertools
import json
import logging
@@ -358,7 +359,13 @@
d[f"{k}/{ZATTRS_JSON}"] = _replace_special_floats(attrs)
if "shape" in v:
# it's an array
d[f"{k}/{ZARRAY_JSON}"] = _replace_special_floats(v)
if isinstance(v.get("fill_value", None), np.void):
v["fill_value"] = base64.standard_b64encode(

Check warning on line 363 in src/zarr/core/group.py

Codecov / codecov/patch

src/zarr/core/group.py#L362-L363

Added lines #L362 - L363 were not covered by tests
cast(bytes, v["fill_value"])
).decode("ascii")
else:
v = _replace_special_floats(v)
d[f"{k}/{ZARRAY_JSON}"] = v

Check warning on line 368 in src/zarr/core/group.py

Codecov / codecov/patch

src/zarr/core/group.py#L367-L368

Added lines #L367 - L368 were not covered by tests
else:
d[f"{k}/{ZGROUP_JSON}"] = {
"zarr_format": self.zarr_format,
25 changes: 25 additions & 0 deletions tests/test_metadata/test_v2.py
Original file line number Diff line number Diff line change
@@ -316,3 +316,28 @@ def test_zstd_checksum() -> None:
arr.metadata.to_buffer_dict(default_buffer_prototype())[".zarray"].to_bytes()
)
assert "checksum" not in metadata["compressor"]


@pytest.mark.parametrize(
"fill_value", [None, np.void((0, 0), np.dtype([("foo", "i4"), ("bar", "i4")]))]
)
def test_structured_dtype_fill_value_serialization(tmp_path, fill_value):
group_path = tmp_path / "test.zarr"
root_group = zarr.open_group(group_path, mode="w", zarr_format=2)
dtype = np.dtype([("foo", "i4"), ("bar", "i4")])
root_group.create_array(
name="structured_dtype",
shape=(100, 100),
chunks=(100, 100),
dtype=dtype,
fill_value=fill_value,
)

zarr.consolidate_metadata(root_group.store, zarr_format=2)
root_group = zarr.open_group(group_path, mode="r")
assert (
root_group.metadata.consolidated_metadata.to_dict()["metadata"]["structured_dtype"][
"fill_value"
]
== fill_value
Comment on lines +338 to +342
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ilan-gold I'm a little confused by this test. If the fill value is a structured dtype scalar, then shouldn't the fill value that appears in metadata be base64 encoded? If so, shouldn't this check fail in that case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the python representation here is serlialized into the void data type. Wheter that is correct or not is a different story. Looking at it closely, I think this is either (a) expected, and the typing on ArrayV2Metadata is wrong or (b) the typing is right, and the behavior is wrong.

The type is: fill_value: int | float | str | bytes | None = 0

But the call to parse_fill_value yields numpy object:

try:
if isinstance(fill_value, list):
return np.array([tuple(fill_value)], dtype=dtype)[0]
elif isinstance(fill_value, tuple):
return np.array([fill_value], dtype=dtype)[0]
elif isinstance(fill_value, bytes):
return np.frombuffer(fill_value, dtype=dtype)[0]
elif isinstance(fill_value, str):
decoded = base64.standard_b64decode(fill_value)
return np.frombuffer(decoded, dtype=dtype)[0]
else:
return np.array(fill_value, dtype=dtype)[()]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(What I'm trying to say is that this dictionary is not the on-disk json, but a parsed version)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you know offhand what zarr-python 2 did here? (I can also check this later)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely not offhand, but just to understand, why does it matter? Is there a backwards compat concern with the in-memory python representation? TBH structured data types are much less essential to us than some of the other people who are raising these concerns from what it sounds like so I'm not super familiar with previous behavior. I don't think many people in our community use them, but they are in our CI and I like contributing so I make these PRs :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm touching a lot of data type representation code over in #2874 and I want to make sense of some of the test failures I'm seeing, and this was one of the tests that I tripped. I think your explanation makes sense (i.e., this is just the in-memory representation, and so the fill value should be the decoded version), sorry for the noise!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries @d-v-b always happy to help. Will be quick to report on the status of this all once 3.0.8 is out, but our tests show no errors with this feature at the moment.

Copy link
Contributor

@d-v-b d-v-b May 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for posterity, the specific thing in my PR that made this test fail was the use of to_dict. In #2874, I am making fill value encoding happen in the call to to_dict, instead of via a special JSON encoder (the status quo). So on my branch this test was comparing the JSON-serialized fill value against the in-memory version. I made the test pass by removing the to_dict step and directly comparing the metadata.fill_value attribute against the expected fill_value.

)