
1.0 release date? #66


Open
geospatial-jeff opened this issue May 8, 2025 · 14 comments

Comments

@geospatial-jeff

Is there an anticipated release date for the 1.0 GeoZarr specification? I ask because I would love to use GeoZarr in several of my applications, but I'm hesitant to align with a pre 1.0 standard, especially when the conversations in #63 and #65 make it seem like many aspects of the spec are still under construction and open for change.

@christophenoel

Yeah, it’s even a bit worse than that — most of the discussions since January 2024 have only just started to converge with the first major PR (#65). That sets the groundwork for agreeing on the structure. After that, we’ll finally be able to split up the topics, propose and debate the actual data model and encoding details, figure out priorities, and build a proper roadmap.
Realistically, I wouldn’t expect a 1.0 release before early 2026, at best — just my take.

@mdsumner

mdsumner commented May 12, 2025

It does really seem off the rails then. I think this is about having a transform and a crs: six numbers and a string. Even for the nominated special case where x,y are the first (or last, however you think about it) axes, and not as general as xarray PR 9543, that would be a huge step forward. I don't understand it; this project appears to be trying to define all possible datasets in the world. There's really nothing "geo" about a transform, and having a description of the coordinate system in use is only geo when you confine it to one idealized surface of one planet. I think that effort (transform + crs) does not belong here. For my part, I will PR the Zarr+STAC approach to GDAL, which would cover a lot of the existing gap.
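To make "six numbers and a string" concrete, here is a minimal sketch, assuming GDAL-style geotransform ordering; the names and values are illustrative only, not from any spec:

```python
# A transform and a crs: six affine parameters (GDAL-style geotransform
# order) plus a CRS string. Values and names are illustrative only.
transform = (440720.0, 60.0, 0.0, 3751320.0, 0.0, -60.0)  # (x0, dx, rx, y0, ry, dy)
crs = "EPSG:32611"  # the "string": any CRS identifier or WKT

def pixel_to_world(col, row, gt):
    """Map a (col, row) pixel index to world coordinates via the affine."""
    x = gt[0] + col * gt[1] + row * gt[2]
    y = gt[3] + col * gt[4] + row * gt[5]
    return x, y

print(pixel_to_world(0, 0, transform))  # -> (440720.0, 3751320.0)
```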

I don't disagree that "GeoZarr" as an idea has merit, it's just that Zarr+xarray really dropped the ball on implicit non-degenerate coordinates (until PR 9543). That effort has to go all the way to the core, and I always thought "geozarr" was an odd special-casing. (Sorry, really not criticizing the good hard work here, just it's not the core of the problem that I saw when I first encountered xarray, and I care about that more than this reinvention of existing tools and frameworks).

@christophenoel

Affine transforms are a minor point but still something a number of people want. For me, it’s simply a matter of agreeing on how to encode the six parameters in Zarr — we've seen at least three different proposals so far, and any of them would do. It’s the same with overviews: there are already existing specs and APIs, we just need to settle on one.
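As a rough illustration of what "agreeing on how to encode the six parameters" amounts to, here is one hypothetical attribute layout. The attribute names `affine` and `crs` are invented for this sketch; this is not the agreed GeoZarr encoding, nor any of the three proposals specifically:

```python
# One hypothetical way to carry the six affine parameters and the CRS
# as Zarr attributes -- purely illustrative, not a proposed encoding.
import zarr

root = zarr.open_group("example.zarr", mode="w")
band = root.zeros("band1", shape=(1024, 1024), chunks=(256, 256), dtype="f4")
band.attrs["affine"] = [440720.0, 60.0, 0.0, 3751320.0, 0.0, -60.0]
band.attrs["crs"] = "EPSG:32611"
```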

This project isn't trying to define every possible dataset in the world; that's a misunderstanding of the role of the requirement classes, which are optional and play only a supporting role in the spec (though there are plenty of use cases for which they would be useful, from my point of view).

The slow pace isn't about reinventing anything. It's just the reality of trying to reach consensus (selecting the right wheel) and finding people with the time and energy to write a solid, formal specification.

@mdsumner

mdsumner commented May 12, 2025

Ok, fair enough, I appreciate your response. These data exist and there are huge gaps, so software needs fixing. That's what I thought this was for, so I will stop expecting that and pursue the gaps themselves. I'm not disinterested in this project and effort as you clarify it, fwiw. (I always thought "GeoZarr" was a bit of a diversion anyway; the problems that I was motivated by are in Zarr+xarray as foundations that set expectations. The blithe adoption of the netcdf coordinate model and assuming everything is longlat - as a first approximation - is the core issue that affects many communities in a deep way.)

@christophenoel

Thanks for sharing your thoughts. If you have any concrete ideas or suggestions, please feel free to put them forward—it’d be really valuable. And if you’re up for explaining some of the issues in simpler terms, that would really help those of us who aren’t deep experts follow along better.

@csaybar

csaybar commented May 13, 2025

Hi @christophenoel , I just finished reading the GeoZarr specification and noticed it's heavily based on the CDM data model and the CF conventions. I was curious, since you're already adopting the CF conventions, why not also adopt the CF data model directly?

cf python package

From what I understand, the CF community promotes the use of its constructors over others, such as the CDM or ISO 19123 coverage model. The OGC CF-netCDF, which was proposed by the authors of CDM, offers a structure that is very similar to what GeoZarr aims to implement.

In that context, wouldn’t GeoZarr essentially be an application of the OGC CF-netCDF model using the Zarr format as the serialization layer?

@christophenoel

What makes you think it is not reusing the CF data model (which itself extends the CDM model)?

@christophenoel

GeoZarr also aims to provide affine transforms, overviews, STAC integration, and requirement classes for standardising some frequent types of assets. The GeoZarr data model aims only to unify additional constructs around CF/CDM.

@csaybar

csaybar commented May 13, 2025

What makes you think it is not reusing the CF data model (which itself extends the CDM model)?

Well, the paper that introduces the CF data model (https://gmd.copernicus.org/articles/10/4619/2017/gmd-10-4619-2017.html). Section 5 is basically a comparison of the CF data model with the CDM (Section 5.3) and other data models. There was nothing in the paper that made me think CF extends the CDM 🤔 . The CF data model is much simpler than the CDM. In the paper's conclusion:

We consider that our CF data model is simpler and more flexible than other such models, because it defines a small number of general constructs rather than many specialised ones.

If I’m not mistaken, the core idea behind the CF data model is to provide a single, unambiguous interpretation of the CF convention. Building additional layers or models on top of the CF data model may reintroduce multiple interpretations, precisely the issue the original CF model aimed to avoid, as emphasized again in the conclusion of the referenced paper.

We believe that the data model proposed here is a complete and correct description of CF, because we have yet to find a case for which our implementation in the cf-python library fails to represent or misrepresents a CF-compliant dataset.

With respect to the affine transform: aligning with the current GDAL implementation may result in non-CF-compliant metadata. There is always the option of implementing both, but I think that is non-CF-compliant too (you already have a construct for coordinates in both the CDM and CF). [Not sure about this, because GDAL has some smart tricks to guess the CRS and affine, if I'm not wrong.]
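For what it's worth, the CF way of saying the same thing is to spell the coordinates out as explicit 1-D variables rather than storing a transform. A minimal sketch, with made-up parameter values, of the coordinate arrays an affine would otherwise imply:

```python
# CF-style georeferencing: explicit 1-D coordinate variables instead of
# a stored transform. These are the values the affine would imply.
import numpy as np

x0, dx, y0, dy = 440720.0, 60.0, 3751320.0, -60.0  # illustrative parameters
x = x0 + dx * (np.arange(1024) + 0.5)  # cell-centre eastings
y = y0 + dy * (np.arange(1024) + 0.5)  # cell-centre northings
```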

Spatial overviews are more of a structural problem. Having a clear spatial overview strategy is great. In my opinion, this is the most valuable contribution of the GeoZarr spec. 👍
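For readers following along, overviews in Zarr generally mean a pyramid of progressively downsampled arrays in one group. A sketch of one possible layout, loosely modelled on the multiscales metadata used by OME-Zarr; the attribute structure here is simplified and illustrative, not a compliant encoding of any spec:

```python
# A hypothetical overview (multiscale) layout: one array per zoom level,
# each level halving the resolution of the one above. Illustrative only.
import zarr

root = zarr.open_group("scene.zarr", mode="w")
levels = [4096, 2048, 1024]  # full resolution plus two overviews
for level, size in enumerate(levels):
    root.zeros(str(level), shape=(size, size), chunks=(512, 512), dtype="u2")
root.attrs["multiscales"] = [{"datasets": [{"path": str(i)} for i in range(len(levels))]}]
```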

@christophenoel

Sorry, but I don’t have the time to explain all the background here (not even to work on the GeoZarr spec actually 😄 ). Just to note, we do have people involved who are active in the CF community and familiar with the model’s intent and scope, so that perspective is represented in the loop.

You may have a look at the CF data model, which basically shows how it maps to the NetCDF data model, which itself is a tailoring of the Unidata Common Data Model.
I will try to share a document soon that helps explain some of the core ideas.

@csaybar

csaybar commented May 13, 2025

You may have a look at the CF data model, which basically shows how it maps to the NetCDF data model, which itself is a tailoring of the Unidata Common Data Model.

I'm gonna be honest, I find the papers/specs on the CF and CDM data models quite difficult/boring to read, so there's a good chance I might be misunderstanding something. That said, I believe you are blurring the distinction between the CF and CDM data models and their conventions. I'm happy to be corrected if I'm off; I'm just pointing out how I currently understand it (in case this helps others).

  • The CF data model is, according to the authors, the right interpretation/implementation of the CF convention. The CF convention is a set of guidelines that defines how to structure and annotate data in NetCDF files ONLY.

Paper: Creating an explicit data model before the CF conventions were written would arguably have been preferable.

  • The CDM data model is more abstract and complex. It merges the netCDF, OPeNDAP, and HDF5 data models to create a common API for many types of scientific data. The CDM strongly recommends following the CF convention, not the data model.

I think most people conflate the CF convention and the CF data model, but they are two very different things. 🙆‍♀️
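To ground the distinction: the convention is prose rules for netCDF files, while the data model has a reference implementation, cfdm, whose Field construct is the model made concrete. A small sketch, assuming the cfdm package is installed; example_field() returns a built-in sample, so no file is needed:

```python
# The CF *data model* made concrete: cfdm reads CF-compliant netCDF into
# Field constructs; here a built-in example Field stands in for a file.
import cfdm

field = cfdm.example_field(0)  # a Field construct, the model's central concept
print(field)                   # summary of the data and its metadata constructs
print(field.coordinates())     # the field's coordinate constructs
```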

@geospatial-jeff
Author

@christophenoel thanks for all of the background context here, this is very helpful. I think it would be great to put some of this information in a top-level document in the repo (e.g. roadmap.md) so it's easier for new folks to discover.

Understanding the current status / maturity level of the spec has been my largest pain point over the past ~week as I've familiarized myself with it. The lack of clarity here makes it hard for newcomers to orient themselves. This information is scattered around various issues, PRs, and meeting notes; it would be much easier to consume if it were presented in one place.

I would be happy to take a shot at writing this, although I may be missing some important context. Please let me know if that would be useful! Thanks very much!

@christophenoel

@csaybar : I appreciate your effort to clarify the distinctions—it's always valuable to exchange views. That said, some points may reflect a partial reading of the underlying specifications, and the explanation risks coming across as somewhat presumptive, particularly given the depth and history of these models. A more measured approach might serve better when addressing complex, expert-level topics.

I would recommend the following reading:

Also, I suggest taking a deeper look at the draft spec here, in HTML or PDF.

@geospatial-jeff : Agreeing on a roadmap is indeed a point that has been in the queue for a long time now.

@csaybar

csaybar commented May 14, 2025

Hi @christophenoel,

Very sorry if my last message sounded pretentious; that wasn’t my intention.

My views are mostly based on Unidata’s Common Data Model and CF Data Model papers. I’ve also spent time reading the CF specifications and working with related Python libraries like cfdm and cf-python. I’m not an expert for sure, but I don't think I did a partial read either.

Thanks for sharing your thoughts. If you have any concrete ideas or suggestions, please feel free to put them forward—it’d be really valuable. And if you’re up for explaining some of the issues in simpler terms, that would really help those of us who aren’t deep experts follow along better.

I made a comment because it seemed like you wanted feedback on the GeoZarr draft ☝️. I just mentioned that a similar OGC spec already exists, proposed by the CDM authors: CF-netCDF 3.0. Not equal to GeoZarr, of course, just "similar".

Good luck with the ongoing work on GeoZarr!
