Support for Groups in CF-2.0 #4
Description
Use case
While developing the file format guidelines for the upcoming Sentinel 5-precursor ESA earth observation mission, I ran into some limitations of the CF-1.6 conventions.
The number of output fields in our data products is large. To help our users distinguish the main output fields from the support data, we want to use groups. The main data contains for instance a total ozone column, its precision and the main geolocation. A simple quality indicator is included as well. This should suffice for basic usage. For us (retrieval algorithm developers) and other advanced users more details are needed, such as detailed processing flags, intermediate results, column values of trace gases that are fitted in addition to the main parameter, model parameters to translate a slant column to a vertical column, the slant columns themselves, pixel corners, etc. We don't want to bother most users with these details, and have therefore put these variables in another group.
The problem
CF-1.6 does not currently support this: references from a variable in one group to a variable in another group are not defined. I give two possible solutions here as a starting point, and we will see where we end up. I have adopted one of these options as a stop-gap measure, but we are (within reason) flexible enough to support either.
Basic requirement for the solution
Variables that are linked to main variables, for instance via the 'ancillary_variables' attribute, but also in the 'bounds', 'coordinates' and probably other attributes as well, must use the same dimensions.
Reference structure
+ /PRODUCT
| /PRODUCT/scanline(scanline) (DIM)
| /PRODUCT/ground_pixel(ground_pixel) (DIM)
| /PRODUCT/corners(corners) (DIM)
| /PRODUCT/latitude(scanline, ground_pixel)
| /PRODUCT/longitude(scanline, ground_pixel)
| /PRODUCT/ozone_column(scanline, ground_pixel)
+ /PRODUCT/SUPPORT_DATA/processing_flags(scanline, ground_pixel)
| /PRODUCT/SUPPORT_DATA/latitude_bounds(scanline, ground_pixel, corners)
| /PRODUCT/SUPPORT_DATA/longitude_bounds(scanline, ground_pixel, corners)
Possible solution 1: Follow the scoping rules for dimensions
Follow the scoping rules for dimensions, and search the entire scope in which the dimensions of the main variable are visible. The netCDF-4 C++ interface provides nice options for this, although more convenient support may be added to that interface later on.
In the example, the /PRODUCT/latitude variable has an attribute 'bounds' with value 'latitude_bounds', while the /PRODUCT/SUPPORT_DATA/processing_flags variable has an attribute 'coordinates' with value 'latitude longitude'.
To find the actual variables, the application first finds the dimensions (using std::set<NcDim> netCDF::NcGroup::getDims(), with netCDF::NcGroup::ParentsAndCurrent as the search scope), then moves to the group where each dimension is defined (NcGroup NcDim::getParentGroup()), and finally finds the named variable within the scope of the dimension (using NcVar netCDF::NcGroup::getVar() with NcGroup::Location::ChildrenAndCurrent as the search scope). Other interfaces may make this pattern harder to implement, but I think that is only a temporary limitation.
Note that this places other restrictions on the file, such as the inability to use the same name for a variable in two different groups within the same dimension search scope. I'm not sure this is a restriction at all, but it is something to keep in mind.
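The lookup described above can be sketched in Python. This is only a toy model: the Group class and the helper functions are hypothetical stand-ins, not part of any netCDF library, and the choice to try the defining group of each dimension in turn is an assumption of this sketch.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Group:
    """Toy stand-in for a netCDF-4 group (hypothetical, not a library class)."""
    name: str
    parent: Group | None = None
    dims: set[str] = field(default_factory=set)
    # Maps a variable name to the names of its dimensions.
    variables: dict[str, tuple[str, ...]] = field(default_factory=dict)
    children: list[Group] = field(default_factory=list)

    def add_group(self, name: str) -> Group:
        child = Group(name, parent=self)
        self.children.append(child)
        return child

    def path(self) -> str:
        return "" if self.parent is None else f"{self.parent.path()}/{self.name}"


def defining_group(start: Group, dim: str) -> Group:
    """Walk from `start` towards the root until a group defines `dim`."""
    group = start
    while group is not None:
        if dim in group.dims:
            return group
        group = group.parent
    raise KeyError(dim)


def find_in_subtree(group: Group, var_name: str) -> str | None:
    """Search `group` and its descendants; return the full path of `var_name`."""
    if var_name in group.variables:
        return f"{group.path()}/{var_name}"
    for child in group.children:
        found = find_in_subtree(child, var_name)
        if found is not None:
            return found
    return None


def resolve(start: Group, var_name: str, attr_value: str) -> list[str]:
    """Resolve the space-separated names in an attribute of `start`/`var_name`."""
    # One search scope per dimension of the referring variable (assumption).
    scopes = [defining_group(start, dim) for dim in start.variables[var_name]]
    resolved = []
    for target in attr_value.split():
        for scope in scopes:
            found = find_in_subtree(scope, target)
            if found is not None:
                resolved.append(found)
                break
        else:
            raise KeyError(target)
    return resolved


# Rebuild the reference structure from this issue.
root = Group("")
product = root.add_group("PRODUCT")
product.dims |= {"scanline", "ground_pixel", "corners"}
product.variables.update({
    "latitude": ("scanline", "ground_pixel"),
    "longitude": ("scanline", "ground_pixel"),
    "ozone_column": ("scanline", "ground_pixel"),
})
support = product.add_group("SUPPORT_DATA")
support.variables.update({
    "processing_flags": ("scanline", "ground_pixel"),
    "latitude_bounds": ("scanline", "ground_pixel", "corners"),
    "longitude_bounds": ("scanline", "ground_pixel", "corners"),
})

print(resolve(product, "latitude", "latitude_bounds"))
print(resolve(support, "processing_flags", "latitude longitude"))
```

Because find_in_subtree returns the first match, two variables with the same name inside one dimension scope would be ambiguous, which is the restriction noted above.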
Possible solution 2: Use HDF-5 paths to point to linked variables
This solution is more explicit, and uses HDF-5 paths to explicitly point to the location of a linked variable.
In the example, the /PRODUCT/latitude variable has an attribute 'bounds' with value 'SUPPORT_DATA/latitude_bounds', while the /PRODUCT/SUPPORT_DATA/processing_flags variable has an attribute 'coordinates' with value '/PRODUCT/latitude /PRODUCT/longitude'.
To find the actual variables, some string manipulations are needed to find the group names, and then finding the variables is probably fairly straightforward.
Note that this solution uses the fact that the / character is used as a path separator in HDF-5 (and can therefore not occur in a variable- or group-name). This method puts a restriction on group names in that these should not contain spaces, as the lists of variables are space separated. A similar restriction is already in place (implicitly) on variable names in CF-1.6.
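As a minimal sketch of the string manipulation involved, the following Python uses the standard posixpath module, since HDF-5 paths share the / separator. The function name resolve_references is hypothetical, and resolving relative paths against the group that holds the referring variable is an assumption drawn from the example above.

```python
import posixpath


def resolve_references(group_path: str, attr_value: str) -> list[tuple[str, str]]:
    """Split a space-separated attribute into (group path, variable name) pairs.

    `group_path` is the absolute path of the group that holds the referring
    variable. Relative references are resolved against it (assumption);
    absolute references are used as given.
    """
    pairs = []
    for ref in attr_value.split():
        # posixpath.join discards `group_path` when `ref` is absolute.
        full = posixpath.join(group_path, ref)
        pairs.append(posixpath.split(full))
    return pairs


# 'bounds' attribute of /PRODUCT/latitude
print(resolve_references("/PRODUCT", "SUPPORT_DATA/latitude_bounds"))

# 'coordinates' attribute of /PRODUCT/SUPPORT_DATA/processing_flags
print(resolve_references("/PRODUCT/SUPPORT_DATA",
                         "/PRODUCT/latitude /PRODUCT/longitude"))
```

The attr_value.split() call is where a group or variable name containing a space would break the lookup.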
General note on variable names.
Within the S5P project we have put a restriction on 'element' names (groups, variables, attributes). NetCDF-4 allows an element name like "χ²" (\u03C7\u00B2). This is probably very good for human readability, but accessing the field from a program or script (non-interactively) is probably pretty hard. To get the string into this text file I went into an interactive python3 shell, and asked it to print("\u03C7\u00B2"), and those numbers were obtained from a website. Other computer systems may offer more convenient access.
We use the following restrictions:
- The names of NetCDF-4 elements must match the regular expression [a-zA-Z_][a-zA-Z0-9_]*. This means that the name of a NetCDF-4 element can be used as a variable name in most programming languages.
- The names of NetCDF-4 elements use underscores to separate parts within a name. An exception to this rule is formed by attributes whose name is specified by an external standard or recommendation, such as the CF metadata conventions.
- The names of variables are all lower case, with the exception of chemical species and abbreviations.
- The names of groups are all upper case.
- It is recommended to limit the names of elements to 40 characters or less.
- Element names that only differ in capitalization are not allowed.
- It is strongly recommended to ensure that names of variables are unique within a file.
The first limitation is for instance nice when using the (HDF-5) pytables interface for python, as it allows simple dot-notation to access variables in a file, but requires that all elements are valid python variable names. Adding a similar interface to the python netCDF4 package is on my (far too long) todo list.
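The regular-expression rule and the capitalization rule are easy to check mechanically. Here is a minimal Python sketch; the function names is_valid_name and case_collisions are hypothetical and not part of any S5P or netCDF tooling.

```python
import re

# The pattern from the first restriction, matched against the whole name.
NAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def is_valid_name(name: str) -> bool:
    """Return True if `name` matches the element-name pattern in full."""
    return NAME_RE.fullmatch(name) is not None


def case_collisions(names: list[str]) -> list[tuple[str, str]]:
    """Return pairs of distinct names that differ only in capitalization."""
    seen: dict[str, str] = {}
    collisions = []
    for name in names:
        key = name.lower()
        if key in seen and seen[key] != name:
            collisions.append((seen[key], name))
        else:
            seen.setdefault(key, name)
    return collisions


print(is_valid_name("ozone_column"))
print(is_valid_name("\u03C7\u00B2"))
print(case_collisions(["latitude", "Latitude", "longitude"]))
```

The 40-character limit is only a recommendation, so it is left out of is_valid_name here.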
Notes
See summary below. The variable name restrictions now have their own issue, #5.