Spec initial draft #65
base: main
…t, Scope, Conventions, Terminology, Requirements Classes, Overview, and References
- Added Preface and Abstract aligned with GeoZarr charter purpose and value proposition.
- Defined Scope section highlighting model-based architecture and interoperability objectives.
- Introduced modular Requirements Classes with URIs (core, time, crs, geotransform, etc.).
- Provided formal definitions in Terms and Abbreviated Terms aligned with data model terminology.
- Adapted Conventions section for Zarr metadata, identifiers, link relations, and encoding usage.
- Added Overview section summarising modular structure, encoding support, and metadata design.
- Populated References section with normative citations (Zarr, CDM, NetCDF, CF, GDAL, STAC, etc.) following LNCS style.
This is great progress @christophenoel! Thanks for your work!
I read through and left numerous comments. Hopefully this feedback is helpful.
* AAAA
* BBBB
The *Core* requirements class defines the minimal compliance necessary to claim conformance with the GeoZarr Unified Data Model. It is intentionally open and permissive, supporting incremental adoption and broad compatibility with existing Zarr tools and data models based on the Unidata Common Data Model (CDM).
👍 well said
[example]
Here's an example of an example term.
A one-dimensional array whose values define the coordinate system for a dimension of one or more data variables. Typical examples include latitude, longitude, time, or vertical levels.
Would it be useful to link to CDM / CF conventions and / or state whether this definition is identical to those other definitions?
I agree.
A container for datasets, variables, dimensions, and metadata in Zarr. Groups may be nested to represent a logical hierarchy (e.g., for resolutions or collections).

==== metadata
==== metadata
==== attributes
This is what it's called in Zarr at least
From my point of view, metadata is a broader and more formal term (also used in NetCDF, GDAL, etc.) that encompasses all descriptive information about data—covering CF attributes, geospatial reference systems, temporal context, and links to external resources (e.g. STAC).
Attributes typically refer to key-value pairs attached to variables or groups in formats like NetCDF or Zarr. They are a subset of metadata, and more implementation-specific.
Concretely, while metadata relies heavily on attributes, it may also be defined by constructs on groups, and the metadata might be stored in a single attribute (JSON format), etc.
==== metadata

Structured information describing the content, context, and semantics of datasets, variables, and attributes. GeoZarr metadata includes CF attributes, geotransform definitions, and links to STAC metadata where applicable.
Structured information describing the content, context, and semantics of datasets, variables, and attributes. GeoZarr metadata includes CF attributes, geotransform definitions, and links to STAC metadata where applicable.
Attributes are structured information describing the content, context, and semantics of datasets, variables, and attributes, i.e. metadata. GeoZarr attributes include CF attributes, geotransform definitions, and links to STAC metadata where applicable.
=== Coordinate Variables

Coordinate variables (excluding GeoTransform Coordinates) define the geospatial or temporal context of data. They are represented as named arrays with metadata attributes.

Coordinate variables are represented as named 1D arrays aligned with corresponding dimensions.
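To make "named 1D arrays aligned with corresponding dimensions" concrete, here is a minimal sketch of what such a check could look like. The dictionary layout is hypothetical (it only borrows the `shape` and `dimension_names` keys from Zarr array metadata), and the function names are illustrative, not part of the draft.

```python
def is_coordinate_variable(array_name, array_meta):
    """True if the array is 1D and its single dimension carries the array's own name."""
    shape = tuple(array_meta["shape"])
    dims = list(array_meta["dimension_names"])
    return len(shape) == 1 and dims == [array_name]


def aligned_with(data_meta, coord_name, coord_meta):
    """True if the data variable uses dimension `coord_name` with the coordinate's length."""
    if not is_coordinate_variable(coord_name, coord_meta):
        return False
    dims = list(data_meta["dimension_names"])
    if coord_name not in dims:
        return False
    return data_meta["shape"][dims.index(coord_name)] == coord_meta["shape"][0]


# Hypothetical metadata for one coordinate variable and one data variable.
x = {"shape": [4], "dimension_names": ["x"]}
temperature = {"shape": [3, 2, 4], "dimension_names": ["time", "y", "x"]}
print(is_coordinate_variable("x", x), aligned_with(temperature, "x", x))
```

A 2D longitude array would fail the first check, which matches the CDM notion that such arrays are auxiliary coordinates rather than coordinate variables.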
We should distinguish that coordinate variables are not part of the Zarr spec itself. It's from CDM.
From my perspective it's clear, as there is a mapping (section) from each component of the data model to Zarr.
You will find the same sections for NetCDF, GeoTIFF, etc.
"title": "Example Dataset",
"summary": "Multidimensional Earth Observation data",
"institution": "Example Space Agency",
"Conventions": "CF-1.10"
Do we want to include a GeoZarr convention flag?
[cols="1,2,2"]
|===
|Structure |Zarr v2 |Zarr v3

|Zoom level groups | Subdirectories with `.zgroup` and `.zattrs` | Subdirectories with `zarr.json`, `node_type: group`

|Variables at each level | Zarr arrays (`.zarray`, `.zattrs`) in each group | Zarr arrays (`zarr.json`, `node_type: array`) in each group

|Global metadata | `multiscales` defined in parent `.zattrs` | `multiscales` defined in parent group `zarr.json` under `attributes`
|===
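As a rough illustration of the two layouts in the table, the sketch below reads a group's attributes (where `multiscales` would live) from either format. It assumes a store on the local filesystem, and the function name `read_group_attributes` is hypothetical. Only the file names and keys listed in the table are relied on.

```python
import json
from pathlib import Path


def read_group_attributes(group_dir):
    """Return the attributes of a Zarr group stored as a local directory.

    Zarr v3: attributes live under the "attributes" key of zarr.json.
    Zarr v2: attributes live in .zattrs next to .zgroup.
    """
    group_dir = Path(group_dir)
    v3_doc = group_dir / "zarr.json"
    if v3_doc.exists():
        doc = json.loads(v3_doc.read_text())
        if doc.get("node_type") != "group":
            raise ValueError(f"{group_dir} is not a Zarr v3 group")
        return doc.get("attributes", {})
    if (group_dir / ".zgroup").exists():
        zattrs = group_dir / ".zattrs"
        # A v2 group without .zattrs simply has no attributes.
        return json.loads(zattrs.read_text()) if zattrs.exists() else {}
    raise FileNotFoundError(f"no Zarr group metadata found in {group_dir}")
```

Calling `read_group_attributes(path).get("multiscales")` then behaves the same for both format versions, which is one argument for describing the layout in terms of groups and arrays only.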
Unnecessary to refer explicitly to Zarr format details here. Let's just talk about Groups and Arrays, which is universal to both format versions.
|Global metadata | `multiscales` defined in parent `.zattrs` | `multiscales` defined in parent group `zarr.json` under `attributes`
|===

Each multiscale group MUST define chunking (tiling) along the spatial dimensions (`X`, `Y`, or `lon`, `lat`). Recommended chunk sizes are 256×256 or 512×512.
Why?
- Chunks MUST be aligned with the tile grid (1:1 mapping between chunks and tiles)
- Chunk sizes MUST match the `tileWidth` and `tileHeight` declared in the TileMatrix
- Spatial dimensions MUST be clearly identified using `dimension_names` (v3) or `_ARRAY_DIMENSIONS` (v2)
- Spatial dimensions MUST be clearly identified using `dimension_names` (v3) or `_ARRAY_DIMENSIONS` (v2)
- Spatial dimensions MUST be clearly identified using the Array dimensions
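The three chunking requirements in the hunk above could be checked along these lines. This is only a sketch: the function name, the argument layout, and the default spatial dimension names `x` and `y` are assumptions, while `tileWidth` and `tileHeight` are the TileMatrix fields named in the draft.

```python
def check_chunk_alignment(dimension_names, chunk_shape, tile_width, tile_height,
                          x_dim="x", y_dim="y"):
    """Return a list of human-readable violations (an empty list means aligned)."""
    errors = []
    for dim in (x_dim, y_dim):
        if dim not in dimension_names:
            errors.append(f"spatial dimension '{dim}' is not among {list(dimension_names)}")
    if errors:
        return errors
    chunk_x = chunk_shape[dimension_names.index(x_dim)]
    chunk_y = chunk_shape[dimension_names.index(y_dim)]
    if chunk_x != tile_width:
        errors.append(f"chunk width {chunk_x} != tileWidth {tile_width}")
    if chunk_y != tile_height:
        errors.append(f"chunk height {chunk_y} != tileHeight {tile_height}")
    return errors


# A (band, y, x) array chunked as one band per 256x256 tile.
print(check_chunk_alignment(["band", "y", "x"], (1, 256, 256), 256, 256))
```

Because the check only needs the dimension names and the chunk shape, it works the same way regardless of which Zarr format version supplied them.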
Co-authored-by: Ryan Abernathey <[email protected]>
Don't know why it was closed
Thank you for kicking this off Christophe! Looking forward to discussing more.
|Supports encoding of data in projected coordinate systems and association with spatial reference metadata.
|`http://www.opengis.net/spec/geozarr/1.0/conf/projected`

|Spectral Bands
Are we becoming too specific by starting to spell out attributes for spectral bands?
I think it is too specific. Naming conventions for spectral bands are an issue across communities rather than just for Zarr. I think it would be better to define the standard outside GeoZarr and then provide references.
==== tile matrix set

A spatial tiling scheme defined by a hierarchy of zoom levels and consistent grid parameters (e.g., scale, CRS). Tile Matrix Sets enable spatial indexing and tiling of gridded data.
wondering if @abarciauskas-bgse or @maxrjones have any reflections on this.
|Specifies multiscale tiled layout using zoom levels and Tile Matrix Sets as per OGC API – Tiles.
|`http://www.opengis.net/spec/geozarr/1.0/conf/overviews`

|STAC Metadata Integration
Is someone currently owning the STAC integration?
|GeoTransform Metadata
|Enables affine spatial referencing via GDAL-compatible `GeoTransform` metadata and optional interpolation hints.
|`http://www.opengis.net/spec/geozarr/1.0/conf/geotransform`
I think we should start with a translation of the OGC GeoTIFF standard rather than basing GeoZarr v1.0 on GDAL's internal data structure (i.e., ModelTiepoint, ModelPixelScale, ModelTransformation).
I disagree on this point, and I think you're not referring to the GDAL data model, but to the encoding in GeoTIFF (ModelTiepoint, ModelPixelScale, ModelTransformation), which complicates the six coefficients of the affine transform.
I think the current approach with the greatest support is the extension/adaptation of CF to also support affine transformations.
This does indeed complicate things relative to GDAL: when it's not just a bounding box, it is sparse coordinates. Besides the special case of skew/rotation with a six-figure transform, Max has explained well how this is getting out of hand. Don't reinvent the warper API here 🙏
I don't understand why we would adopt an in-memory representation that only supports a subset of use-cases when we're dealing with serializing metadata and there is a long-established and widely used standard for serializing this information that works for a broader set of use cases. I'm arguing for GeoTIFF tags over GDAL.
I said similar here before I understood the landscape much: https://discourse.pangeo.io/t/example-which-highlights-the-limitations-of-netcdf-style-coordinates-for-large-geospatial-rasters/4140/16?u=michael_sumner I'll try to put together examples. Maybe I'm off base but tie points are a small case of non redundant coordinate arrays, it's not a raster so very cleanly out of scope imo.
> I disagree on this point, and I think you're not referring to the GDAL data model, but to the encoding in GeoTIFF (ModelTiepoint, ModelPixelScale, ModelTransformation), which complicates the six coefficients of the affine transform.

This is not true from my practical experience. I've been using the following code to convert GeoTIFF tags into a GDAL-compliant affine transformation for 5+ years and it hasn't broken once (and no issues/bugs reported either).
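    import affine

    # Method fragment: `self.ifds[0]` is the first (full-resolution) IFD.
    # Assumes a single tiepoint anchored at raster (0, 0) and no rotation/skew.
    gt = affine.Affine(
        self.ifds[0].ModelPixelScaleTag[0],   # pixel width
        0.0,                                  # row rotation
        self.ifds[0].ModelTiepointTag[3],     # x of the tiepoint in model space
        0.0,                                  # column rotation
        -self.ifds[0].ModelPixelScaleTag[1],  # pixel height (negated, north-up)
        self.ifds[0].ModelTiepointTag[4],     # y of the tiepoint in model space
    )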
That's not to say that this will never break, there are definitely cases where this wouldn't work, some of these have already been covered in the thread:
- @mdsumner mentions the skew/rotation case.
- Images with multiple tie points (`ModelTiePointTag`). The original maptools GeoTIFF spec (not the OGC one) states that:

> since the relationship between the Raster space and the model space will often be an exact, affine transformation, this relationship can be defined using one set of tiepoints and the "ModelPixelScaleTag", described below, which gives the vertical and horizontal raster grid cell size, specified in model units. If possible, the first tiepoint placed in this tag shall be the one establishing the location of the point (0,0) in raster space. However, if this is not possible (for example, if (0,0) is goes to a part of model space in which the projection is ill-defined), then there is no particular order in which the tiepoints need be listed. For orthorectification or mosaicking applications a large number of tiepoints may be specified on a mesh over the raster image. However, the definition of associated grid interpolation methods is not in the scope of the current GeoTIFF spec.

The key here is that the definition and interpretation of multiple tie points is not defined by the GeoTIFF spec. GDAL does have its own logic to interpret images with multiple (more than 6) tie points, and I'm sure other libraries have slightly different logic. I believe we should be building GeoZarr to cover the highest surface area possible, not to cover all potential use cases, because it is impossible to provide a representation of an affine transform that satisfies all potential edge cases that may appear across CF / GDAL / GeoTIFF (and whatever other data formats people load into GeoZarr).
> I think we should start with a translation of the OGC GeoTIFF standard rather than basing GeoZarr v1.0 on GDAL's internal data structure (i.e., ModelTiepoint, ModelPixelScale, ModelTransformation).

Back to the original question: I don't think it does anyone justice to debate whether we should be in compliance with GeoTIFF or GDAL's internal data model. The reality is that the concept of an "affine transform" was developed over the course of the 18th and 19th centuries and far predates the notion of raster data, GeoTIFF, or the GDAL data model. Let's just call it what it is! Translating to GeoZarr from both GeoTIFF and GDAL is valuable to see how well they map over.
> I think the current approach with the greatest support is the extension/adaptation of CF to support also affine transformation.

Why do we have to extend CF to add support for affine transforms to GeoZarr? Why can't we just say "geozarr should use affine transforms, and this is how the affine transform should be structured".
> This is not true from my practical experience. I've been using the following code to convert GeoTIFF tags into a GDAL compliant affine transformation

You have just confirmed that `ModelTiepointTag` and `ModelPixelScaleTag` are GeoTIFF, not GDAL.
Yes, basically the logic is:
- If `ModelTransformationTag` exists → derive the affine from the matrix.
- Else:
  - Use `ModelTiepointTag` for the origin.
  - Use `ModelPixelScaleTag` for the pixel size.
  - Adjust the origin if `PixelIsPoint` (shift by ½ pixel).
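The logic listed above can be sketched as a small function. This is an illustration under stated assumptions, not GDAL's or any library's actual implementation: the function name and arguments are hypothetical, only the first tiepoint is used, and the half-pixel shift is applied whichever tags supplied the transform. The output uses the six-value ordering (origin X, pixel width, rotation, origin Y, rotation, pixel height).

```python
def geotiff_tags_to_affine(model_transformation=None, tiepoint=None,
                           pixel_scale=None, pixel_is_point=False):
    """Derive a six-value affine from GeoTIFF tag values passed as plain sequences."""
    if model_transformation is not None:
        m = list(model_transformation)  # 4x4 matrix stored row-major (16 values)
        if len(m) != 16:
            raise ValueError("ModelTransformationTag must hold 16 values")
        gt = [m[3], m[0], m[1], m[7], m[4], m[5]]
    else:
        if tiepoint is None or pixel_scale is None:
            raise ValueError("need ModelTransformationTag, or tiepoint plus pixel scale")
        i, j, _k, x, y, _z = tiepoint[:6]        # first tiepoint only (assumption)
        sx, sy = pixel_scale[0], pixel_scale[1]
        # Move the origin back to raster (0, 0) if the tiepoint sits elsewhere.
        gt = [x - i * sx, sx, 0.0, y + j * sy, 0.0, -sy]
    if pixel_is_point:
        # Tag values refer to pixel centres: shift the origin by half a pixel.
        gt[0] -= 0.5 * gt[1] + 0.5 * gt[2]
        gt[3] -= 0.5 * gt[4] + 0.5 * gt[5]
    return gt


print(geotiff_tags_to_affine(tiepoint=(0, 0, 0, 100.0, 200.0, 0.0),
                             pixel_scale=(10.0, 10.0, 0.0)))
```

Images with many tiepoints on a mesh are deliberately not handled, since their interpretation is outside what the GeoTIFF text quoted earlier in this thread defines.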
What I'm arguing for is that the GeoZarr encoding should be formatted closer to GDAL syntax (below) than using GeoTiff attributes.
Affine = [a0, a1, a2, a3, a4, a5]
Where:
a0 = top left x (origin X)
a1 = pixel width (scale X)
a2 = rotation (typically 0)
a3 = top left y (origin Y)
a4 = rotation (typically 0)
a5 = pixel height (scale Y, usually negative)
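For readers less familiar with this ordering, a minimal sketch of applying the six values to a pixel position follows. The function name `pixel_to_model` and the sample numbers are illustrative only.

```python
def pixel_to_model(affine, col, row):
    """Map a (col, row) pixel position to model coordinates.

    `affine` is [a0, a1, a2, a3, a4, a5] in the ordering listed above.
    """
    a0, a1, a2, a3, a4, a5 = affine
    x = a0 + col * a1 + row * a2
    y = a3 + col * a4 + row * a5
    return x, y


# North-up grid with 10-unit pixels whose top-left corner is at (500000, 4600000).
affine = [500000.0, 10.0, 0.0, 4600000.0, 0.0, -10.0]
print(pixel_to_model(affine, 0, 0))
print(pixel_to_model(affine, 2, 3))
```

With both rotation terms at zero, the mapping reduces to an origin plus a per-axis scale, which is the common case discussed in this thread.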
I'm in favor of the draft proposed by CF recently: https://github.com/orgs/cf-convention/discussions/411
Maybe this is just a semantics issue then. `a0`, `a1`, `a2` (as you mention above) are not related to GDAL (or GeoTIFF). That is the definition of an affine transform, which GDAL happens to implement. While it's true to say this aligns with the GDAL data model, it's more accurate to say that "this is an affine transform".
To state that differently; GDAL did not come up with the idea of an affine transformation. The concept of an affine transform exists outside of GDAL/CF/GeoTIFF, or any of the other implementations that have been discussed here. Why not just call it what it is?
As I said in my earlier comments, I don't think it does the community any justice to spend our time debating whether an affine transform is more like GeoTIFF or more like GDAL when there is a lower-level definition of an "affine transform" which existed long before either of these two implementations.
|Multiscale Overviews
|Specifies multiscale tiled layout using zoom levels and Tile Matrix Sets as per OGC API – Tiles.
|`http://www.opengis.net/spec/geozarr/1.0/conf/overviews`
I do not think a conformance class for OGC API - Tiles should be included in GeoZarr v1.0 because it may be more complicated than necessary. The community is requesting support for simple factor of two downsampling, which would be provided by a Zarr translation of the OGC COG standard, but not storing OGC TMS (e.g., https://cloudnativegeo.slack.com/archives/C06HCP0KAA2/p1746035679500189). So I think we should start with the COG-translated conformance class (cc @vincentsarago @geospatial-jeff)
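To illustrate what "simple factor of two downsampling" implies for array shapes, here is a small sketch. The rounding rule (ceiling division) and the stopping threshold `min_size` are assumptions made for the example. They are not taken from the COG standard or from this draft.

```python
def overview_shapes(height, width, min_size=256):
    """List (height, width) of successive factor-of-two overviews.

    Halving stops once the larger side is no greater than `min_size`.
    """
    shapes = []
    h, w = height, width
    while max(h, w) > min_size:
        h, w = (h + 1) // 2, (w + 1) // 2  # ceiling division keeps edge pixels
        shapes.append((h, w))
    return shapes


print(overview_shapes(1024, 2048))  # [(512, 1024), (256, 512), (128, 256)]
```

Each level only needs its own shape, which is why this layout requires no tile matrix set definition to describe.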
@maxrjones from my point of view, this only means not reinventing the wheel and staying in line with the OGC spec. This has already been mostly covered in the draft clause_7d_format_pyramiding.adoc (and in discussions such as #30).
> I do not think a conformance class for OGC API - Tiles should be included in GeoZarr v1.0 because it may be more complicated than necessary.

I strongly agree with this. While overviews are used mostly in geospatial use cases for visualization purposes; there is nothing inherently geospatial about overviews. There is not a specification for overviews as far as I'm aware, but I'd argue (geo)TIFF is the reference example for overviews/tiling. GeoTIFF only stores geospatial information (geotiff tags) on the IFD holding native resolution data. GeoTIFF does not store any spatial information on IFDs that contain reduced resolution overviews.
I believe the community simply needs more time to figure out how overviews (in the geospatial sense) apply more generally to the zarr data model, and pushing this OGC API - Tiles conformance class into GeoZarr before it's well thought through is not beneficial to anyone. Overviews are also not important enough to block the release of GeoZarr v1.0.
I'd much rather give the community something to build off of by releasing an initial GeoZarr v1.0 spec without overview support. Then see how the tooling evolves to support overviews against the broader zarr visualization use-case (through projects like https://github.com/carbonplan/ndpyramid). It's much easier to circle back to add a tiling conformance class after the fact than refactor an existing conformance class that wasn't well thought through to begin with 😄
A conformance class is only the advertisement (`conformTo: overview`) that overviews are provided in the dataset. The abstract representation of overviews/multiscales should be defined in the abstract data model, and the concrete representation in the Zarr encoding of the model.
The overviews/multiscaling would indeed probably reuse an existing specification (GeoTIFF or Tile Matrix Set; it doesn't matter which one wins). A few drafts and demonstrations have been proposed, so I don't see why the community would need more time: as with any other topic, we should make progress on it during the year, provide examples, and see what suits best. The initial TMS-based draft looks great to me: #44
This is great progress, thank you @christophenoel! I left some comments and general feedback about minimizing the initial scope in #63 (comment).
great work
Each dataset node comprises the following core components, aligned with the Unidata Common Data Model (CDM) and Climate and Forecast (CF) Conventions:

- **Dimensions** – Named, integer-valued axes defining the extent of data variables. Examples include `time`, `x`, `y`, and `band`.
Is it the extent or just the length of the dimension? Here it is the coordinate variable that defines the extent (coverage domain).
- **Dimensions** – Named, integer-valued axes defining the extent of data variables. Examples include `time`, `x`, `y`, and `band`.
- **Coordinate Variables** – Arrays that supply coordinate values along dimensions, providing spatial, temporal, or contextual referencing. These may be scalar or higher-dimensional, depending on the referencing scheme.
- **Data Variables** – Multidimensional arrays representing physical measurements or derived products. Defined over one or more dimensions, these variables are associated with coordinate variables and annotated with metadata.
Maybe they are not part of the model below itself, but it is OK to distinguish them. In that case, auxiliary variables like grid_mapping/crs, which do not reference a dimension, may be missing.
Each dataset node comprises the following core components, aligned with the Unidata Common Data Model (CDM) and Climate and Forecast (CF) Conventions:

- **Dimensions** – Named, integer-valued axes defining the extent of data variables. Examples include `time`, `x`, `y`, and `band`.
- **Coordinate Variables** – Arrays that supply coordinate values along dimensions, providing spatial, temporal, or contextual referencing. These may be scalar or higher-dimensional, depending on the referencing scheme.
- each 'coordinate variable' is associated with one dimension
+ String name
+ int length
+ boolean isUnlimited
+ boolean isShared
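One way to picture the four fields above, with the two boolean flags treated as optional, is the sketch below. The class, the snake_case field names, and the default values are all assumptions made for illustration. The draft itself does not state which fields are optional or what their defaults would be.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    """Illustrative counterpart of the Dimension element in the class diagram."""
    name: str
    length: int
    is_unlimited: bool = False  # assumed default, not specified in the draft
    is_shared: bool = True      # assumed default, not specified in the draft

    def __post_init__(self):
        if self.length < 0:
            raise ValueError("dimension length must be non-negative")


time = Dimension("time", 12, is_unlimited=True)
x = Dimension("x", 4)
print(time, x)
```

Keeping `name` and `length` mandatory while defaulting the flags makes the required and optional parameters visible in the signature itself.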
Does it make sense to distinguish optional parameters? I think it is one of them.
All metadata attributes (for groups, coordinate variables and data variables) are recommended to conform to CF naming and typing conventions. Supported attributes include:

- `standard_name`, `units`, `axis`, `grid_mapping` (CF)
if grid_mapping is expected, auxiliary variables shall be as well
To help move forward quickly and divide the work by topic, a draft structure was created using three main sections, following the typical OGC format:
Each section is divided into multiple files (some of which have not yet been created) to ensure that separate pull requests (PRs) can be made for each future modification.
We may discuss in the next meeting whether to accept this PR as is (for the structure only) before starting to create additional PRs, or whether to first conduct an initial review.
The latest Editor's Draft version of the OGC GeoZarr Specification is found here in HTML or PDF. This document is generated automatically from the repository.