Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support for datetime and timedelta dtypes (#2616) #2884

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

RJ-Vestrum
Copy link

My first contribution addressing #2616!

Summary of changes:
* Add support for the datetime dtypes
* Add support for the timedelta dtypes
* Update tests to validate the fill_values for for datetime and timedelta
* Towncrier file for changes

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.rst
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

    * Add support for the datetime dtypes
    * Add support for the timedelta dtypes
    * Add test to validate the fill_values for for datetime
    * Add test to validate the fill_values for for timedelta
    * Add towncrier file for changes
@d-v-b
Copy link
Contributor

d-v-b commented Mar 4, 2025

@RJ-Vestrum thanks so much for this work! We definitely want to have these datatypes in zarr, but there are few complications:

  • The Zarr v2 spec made support for datatimes and timedeltas very easy, but the zarr v3 spec makes this a bit harder, because it doesn't provide any guidance at all for adding new data types. ZEP9 (phase 1): add clarifications for extension naming zarr-specs#330 is an effort to fix this, but that discussion is ongoing. One certainty is that we will allow data types to be specified in metadata as a dict with the structure {"name": <dtype name>, "configuration": {...params}}. For parametric dtypes like datetime and timedelta, I think we should consider storing them in JSON like so:
"data_type": {"name": "numpy.timedelta64", "configuration": {"unit": "ns"}}

IMO this is a lot clearer than using a stringly-typed identifier like "datetime64[ns]". But it would also complicate your work in this PR

A second complication for your work is my current effort to completely refactor how dtypes work in zarr-python (see #2874). In that PR, I'm unifying v2 and v3 data type handling via a completely new set of python objects to model dtypes. If you're interested, I'd love to get your feedback / input on that work, because I haven't added datetimes or timedeltas yet. I'd be happy to explain how you could add those under the proposed framework in that PR if you're interested. Otherwise, there's a risk that the work in this PR would need to be re-done in the new framework, provided #2874 is successful.

Since there's so many moving parts right now (spec changes, multiple PRs), we should probably invest some effort in coordinating our efforts. let me know if you're interested!

@RJ-Vestrum RJ-Vestrum marked this pull request as draft March 4, 2025 17:19
@RJ-Vestrum
Copy link
Author

@d-v-b thanks for the detailed response and for explaining all the moving parts here! I really appreciate the context on both the spec changes and the dtype refactor in #2874.

This PR was mostly a workaround I had been using locally, and I wanted to share it in case it could contribute to the development process. I don’t have much time to coordinate right now, but I’ll take a look at #2874 and keep it in mind. I’m planning to revisit this in April, so if #2874 is finalized by mid-April, I can rework this PR to align with the new framework then.

I also fully agree that specifying data types in metadata rather than using stringly-typed identifiers will make things much clearer.

Thanks again for taking the time to lay this all out!

@rabernat
Copy link
Contributor

rabernat commented Mar 4, 2025

I'm curious...how specific is the on-disk storage of numpy.timedelta64 etc. to numpy? Is it actually just int64 data, plus a few parameters like units?

@d-v-b
Copy link
Contributor

d-v-b commented Mar 4, 2025

I'm curious...how specific is the on-disk storage of numpy.timedelta64 etc. to numpy? Is it actually just int64 data, plus a few parameters like units?

great question, I have no idea, but if the answer is "yes", then it's even more reason to think about an arrow-style dtype hierarchy where primitive types form the basis of logical types.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants