Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix nan fill value metadata comparison #2930

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

TomNicholas
Copy link
Member

Fix for #2929

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.rst
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Mar 24, 2025
@d-v-b
Copy link
Contributor

d-v-b commented Mar 24, 2025

Thanks for this! When we add datetimes, we will need to change this code again with a separate block for np.nat.

@@ -44,3 +46,23 @@ def from_dict(cls, data: dict[str, JSON]) -> Self:
"""

return cls(**data)

def __eq__(self, other: Any) -> bool:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overriding this means we lose the nice auto-generated dataclass __eq__ method, which prints out like this:

Matching attributes:
['shape',
 'data_type',
 'chunk_grid',
 'chunk_key_encoding',
 'codecs',
 'attributes',
 'dimension_names',
 'zarr_format',
 'node_type',
 'storage_transformers']
Differing attributes:
['fill_value']

Drill down into differing attribute fill_value:
  fill_value: np.float32(nan) != np.float32(nan)

But I don't see an easy way to keep that formatting and make it handle fill_value properly too...

@@ -233,7 +233,7 @@ class ArrayV3MetadataDict(TypedDict):
attributes: dict[str, JSON]


@dataclass(frozen=True, kw_only=True)
@dataclass(frozen=True, kw_only=True, eq=False)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should do this for ArrayV2Metadata too.

@@ -411,3 +411,20 @@ def test_dtypes(dtype_str: str) -> None:
else:
# return type for vlen types may vary depending on numpy version
assert dt.byte_count is None


def test_metadata_comparison_with_nan_fill_value():
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test could presumably be parametrized over both v2 and v3 metadata, but you don't appear to have any other tests which do that. Is there a reason for that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the two metadata documents are separate entities, so they get tested separately. if there's shared functionality that they both depend on, but isn't part of their shared base class, then that shared functionality should have its own test.

higher up in the stack we test the array classes against the two metadata varieties, because part of the job of the array classes is to abstract over the v2 v3 differences.

@d-v-b
Copy link
Contributor

d-v-b commented Mar 24, 2025

more broadly, np.isnan is picky about the types it accepts: np.isnan('im not nan') will raise a type error, for example. So I think some try... excepts might be necessary here.

@d-v-b
Copy link
Contributor

d-v-b commented Mar 24, 2025

just a thought but what if the fill_value was already JSON-encoded, i.e. instead of fill_value: <numpy scalar> we did fill_value: JSON-able thing. This would totally prevent the need to handle edge cases in __eq__

@TomNicholas
Copy link
Member Author

just a thought but what if the fill_value was already JSON-encoded

This would totally prevent the need to handle edge cases in eq

I like this idea - it seems what you actually care about is whether the representation of the metadata on-disk is equal, which is what the JSON-encoded version is. Would this involve changing Metadata's internal representation of the fill_value?

@d-v-b
Copy link
Contributor

d-v-b commented Mar 24, 2025

just a thought but what if the fill_value was already JSON-encoded
This would totally prevent the need to handle edge cases in eq

I like this idea - it seems what you actually care about is whether the representation of the metadata on-disk is equal, which is what the JSON-encoded version is. Would this involve changing Metadata's internal representation of the fill_value?

yes, it would require pushing the "treat this JSON value as a numpy scalar" job higher up in the stack, or we could add a method on the metadata classes.

FWIW, over in #2874 I'm giving each data type object a method that converts the JSON representation of a scalar to an in-memory representation. That's exactly what we would use here.

@TomNicholas
Copy link
Member Author

FWIW, over in #2874 I'm giving each data type object a method that converts the JSON representation of a scalar to an in-memory representation. That's exactly what we would use here.

Okay, I'll wait until that's in before returning to this then. I now have a stopgap fix in VirtualiZarr already anyway (zarr-developers/VirtualiZarr#502).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs release notes Automatically applied to PRs which haven't added release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants