Support for Groups in CF-2.0 #4
Comments
Have you considered not using groups and instead using an attribute to On 2014-10-20 3:32 AM, Maarten Sneep wrote:
Sincerely, Bob Simons The contents of this message are mine personally and |
I do not see how adding an attribute to differentiate resolves the issue of having a large number of variables and a desire to provide some (hierarchical) structure in that mountain of data. |
Hi Maarten: A few comments:
So I agree with your solution #2.
I think this should also be supported. Both are unambiguous and simple to implement.
Regards, |
Thanks John, I'll wait for a few more comments, and then rephrase the proposal. "full name" and "short name" are good names to use if these are used in the CDM. There are functions in the NetCDF interface to obtain the full name for a group but not a variable, but this is probably a rather simple set of calls anyway. I fully agree that whatever solution we come up with, the current short name use for variables in the same group shall be valid. I think I'll split off the variable name issue into a separate one. |
Discussion Summary
Use case & proposal by Maarten Sneep (MaartenSneepKNMI)
Use case
While developing the file format guidelines for the upcoming Sentinel 5-precursor ESA earth observation mission, I ran into some limitations of the CF-1.6 conventions.
The number of output fields in our data products is large. To help our users distinguish the main output fields from the support data, we want to use groups. The main data contains for instance a total ozone column, its precision and the main geolocation fields (as we have an observation swath, not a regular projection). A simple quality indicator is included at this level as well. This should suffice for basic usage. For us (retrieval algorithm developers) and other advanced users more details are needed, such as detailed processing flags, intermediate results, column values of trace gases that are fitted in addition to the main parameter, model parameters to translate a slant column to a vertical column, the slant columns themselves, pixel corners, fit quality parameters such as a χ² value, etc. We don’t want to bother most users with these details, and have therefore put these variables in another group.
The problem
The current CF-1.6 does not support this. References from a variable in one group to one in another are not supported. Based on early feedback in the issue discussions, I propose the solution described below for CF-2.0.
Basic requirements and limitations for the solution
The first is to avoid needless backward incompatibility, the second is to ensure that variables match each other.
Terminology
Key to the proposal are the variable names that appear in the attributes used to reference other variables. Referring to the “CDM Object Names” page, we have the following terminology:
Suggested solution
Use full names in the attribute attached to the source variable to describe the location of the referenced variable, if the source variable and the destination variable are in different groups. For variables in the same group the short name shall be used.
Notes
Example
Imagine the following set of groups, dimensions and variables.
To find the actual variables, some string manipulations are needed to find the group names, and then finding the variables is probably fairly straightforward. Apparently the Java-NetCDF interface already has such a call. |
Hello Maarten et al., Thanks to Aleksandr Jelenek for pointing me to this discussion.
Option 1 uses scoping rules to disambiguate which variables the Option 2, full paths, will lead to orphaned coordinates once The best argument for recommending Option 2 is its specificity--- The most elegant and resilient solution seems to be Option 1. I see some virtue of supporting Option 2 and Option 3, namely, Finally, Maarten noted that the existing API could be enhanced to Best, |
I'm not sure I would endorse the implicit scoping and make the explicit 'legal but not recommended', but it is an option. I prefer to be explicit. If you are going to copy variables, you'd better ensure that the correct support variables come along anyway. I'd rather create a broken file (that will fail to validate noisily) than a file that appears to be correct, but references the wrong data. |
I hope CF2 allows full names and short names in most cases. When short names are used, the convention would be that the most in-scope variable with the short name matches. When long names are used, then there is no ambiguity. Relative names (not beginning with a slash) could also be supported. Absolute paths are likelier to break in downstream processing when things are moved around. Implicit scoping is truer to the relational properties and inheritance implied by the CDM. I see no reason why both implicit scoping and absolute paths cannot both be CF-compliant. I advocate implicit scoping over explicit long names where possible because the former preserves more of an object oriented model while the latter seems fragile. I appreciate that when things break they should break noisily, however, that does not seem to me to warrant sacrificing the elegance of the extended CDM. |
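To make the short-name rule above concrete, here is a minimal Python sketch. It assumes that "most in-scope" means the nearest enclosing group, mirroring how netCDF-4 makes dimensions visible to child groups; the thread does not settle that definition. The `Group` class and the `find_in_scope` helper are hypothetical stand-ins, not part of any netCDF library.

```python
class Group:
    """Hypothetical stand-in for a netCDF-4 group; not a real library class."""

    def __init__(self, name, parent=None, variables=()):
        self.name = name
        self.parent = parent
        self.variables = set(variables)

    def path(self):
        # The root group has an empty path, so its variables become "/name".
        return "" if self.parent is None else self.parent.path() + "/" + self.name


def find_in_scope(group, short_name):
    """Return the full name of the nearest variable called short_name.

    Searches the given group first, then each ancestor up to the root.
    """
    current = group
    while current is not None:
        if short_name in current.variables:
            return current.path() + "/" + short_name
        current = current.parent
    raise KeyError(f"no variable {short_name!r} in scope of {group.path() or '/'}")


# Group and variable names follow the example in this issue; the extra
# root-level "latitude" is an assumption added to show shadowing.
root = Group("", variables=["latitude"])
product = Group("PRODUCT", root, ["latitude", "longitude"])
support = Group("SUPPORT_DATA", product, ["processing_flags", "latitude_bounds"])

nearest = find_in_scope(support, "latitude")
print(nearest)
```

Under this reading the `latitude` in `/PRODUCT` shadows the one in the root group, and a lookup of `latitude_bounds` starting from `/PRODUCT` fails, because child groups are never searched.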
There is one objection I have against following the scoping rules for dimensions. We are talking here about support variables. If you want to assign a hierarchy, then dimensions are higher up in the tree, because they are needed to define a variable. This order is regulated at the netCDF-4 level, and is fine. The variables that are linked via attributes are support variables: items like processing flags, or other ancillary variables. Conceptually I'd say these are below the main variable. Following the same scoping rules doesn't make sense to me. In many cases most users shouldn't have to look at processing flags, and you want these variables at a lower level (deeper into the file hierarchy). This is precisely opposite to what the scoping rules for dimensions dictate. On the other hand, variables like geolocations are pretty essential, and you want to have those easily visible. So I'm curious as to what scoping rules for implicit finding of variables are proposed. How do you deal with the potential appearance of variables with the same name in different groups? Implicit may be nice, but somewhere it should be defined what is meant (i.e. in CF-2, if we choose to go this route). I'm also curious about the reasoning behind your proposed rules, as both cases of hierarchy exist, and I don't think you can cover both implicitly. And finally: I don't see how fragility can be an issue: you will have to ensure that support variables are copied anyway and referenced correctly. With explicit referencing you'll notice errors far quicker. With implicit referencing you may accidentally refer to a variable that is present in the destination file, but is different from the intended support variable (same name, different contents, say the latitude and longitude from a different granule altogether). |
Hi Maarten, My suggestion is that CF2 recommend that support variables be Now let me address your questions in reverse order: Best, |
Fixed in cf-convention/cf-conventions#145 |
Use case
While developing the file format guidelines for the upcoming Sentinel 5-precursor ESA earth observation mission, I ran into some limitations of the CF-1.6 conventions.
The number of output fields in our data products is large. To help our users distinguish the main output fields from the support data, we want to use groups. The main data contains for instance a total ozone column, its precision and the main geolocation. A simple quality indicator is included as well. This should suffice for basic usage. For us (retrieval algorithm developers) and other advanced users more details are needed, such as detailed processing flags, intermediate results, column values of trace gases that are fitted in addition to the main parameter, model parameters to translate a slant column to a vertical column, the slant columns themselves, pixel corners, etc. We don't want to bother most users with these details, and have therefore put these variables in another group.
The problem
The current CF-1.6 does not support this. References from a variable in one group to one in another are not supported. I will give a few solutions here as a starting point, and we will see where we end up. I've chosen one of these options as a stop-gap measure, but we are (within reason) flexible enough to support either of them.
Basic requirement for the solution
Variables that are linked to main variables, for instance via the 'ancillary_variables' attribute, but also in the 'bounds', 'coordinates' and probably other attributes as well, must use the same dimensions.
Reference structure
Possible solution 1: Follow the scoping rules for dimensions
Follow the scoping rules for dimensions, and search all of the scope where the dimensions of the main variable can be used. The netCDF-4 C++ interface provides nice options for this, although more convenient support may be added to that interface later on.
In the example, the `/PRODUCT/latitude` variable has an attribute `bounds` with value `latitude_bounds`, while the `/PRODUCT/SUPPORT_DATA/processing_flags` variable has an attribute `coordinates` with value `latitude longitude`.
To find the actual variables, the application first finds the dimensions (using `std::set<NcDim> netCDF::NcGroup::getDims()`, with `netCDF::NcGroup::ParentsAndCurrent` as the search scope), then starts from the group where the dimension is defined (`NcGroup NcDim::getParentGroup()`), and finally finds the named variable within the scope of the dimension (using `NcVar netCDF::NcGroup::getVar()` with `NcGroup::Location::ChildrenAndCurrent` as the search scope). Other interfaces may make it harder to implement this pattern, but I think that is only a temporary limitation.
Note that this places other restrictions on the file, such as the inability to use the same name for a variable in two different groups within the same dimension search scope. I'm not sure this is a restriction at all, but it is something to keep in mind.
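The search pattern described above can be sketched in plain Python. This is only an illustration under stated assumptions: the `Group` class is a hypothetical in-memory stand-in rather than the netCDF C++ interface, the dimension names `scanline` and `ground_pixel` are invented for the example, and only one dimension of the main variable is used to pick the search scope.

```python
class Group:
    """Hypothetical stand-in for a netCDF-4 group; not the netCDF C++ API."""

    def __init__(self, name, parent=None, dimensions=(), variables=()):
        self.name = name
        self.parent = parent
        self.dimensions = set(dimensions)
        self.variables = set(variables)
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def path(self):
        return "" if self.parent is None else self.parent.path() + "/" + self.name


def defining_group(group, dim_name):
    """Search the current group and its parents for the dimension."""
    while group is not None:
        if dim_name in group.dimensions:
            return group
        group = group.parent
    raise KeyError(dim_name)


def find_below(group, short_name):
    """Collect matching variables in the group and all of its descendants."""
    matches = []
    if short_name in group.variables:
        matches.append(group.path() + "/" + short_name)
    for child in group.children:
        matches.extend(find_below(child, short_name))
    return matches


def resolve(source_group, dim_name, short_name):
    """Resolve a short name within the scope of one of the source's dimensions."""
    matches = find_below(defining_group(source_group, dim_name), short_name)
    if len(matches) != 1:
        # Duplicate names inside one dimension scope cannot be disambiguated.
        raise LookupError(f"{short_name!r}: expected 1 match, found {len(matches)}")
    return matches[0]


root = Group("")
product = Group("PRODUCT", root,
                dimensions=["scanline", "ground_pixel"],  # assumed names
                variables=["latitude", "longitude"])
support = Group("SUPPORT_DATA", product,
                variables=["latitude_bounds", "processing_flags"])

bounds = resolve(product, "scanline", "latitude_bounds")
coordinate = resolve(support, "scanline", "latitude")
print(bounds, coordinate)
```

The `LookupError` branch corresponds to the restriction noted above: two variables with the same name inside one dimension search scope make the reference ambiguous.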
Possible solution 2: Use HDF-5 paths to point to linked variables.
This solution is more explicit, and uses HDF-5 paths to explicitly point to the location of a linked variable.
In the example, the `/PRODUCT/latitude` variable has an attribute `bounds` with value `SUPPORT_DATA/latitude_bounds`, while the `/PRODUCT/SUPPORT_DATA/processing_flags` variable has an attribute `coordinates` with value `/PRODUCT/latitude /PRODUCT/longitude`.
To find the actual variables, some string manipulation is needed to find the group names; after that, finding the variables is probably fairly straightforward.
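The string manipulation mentioned above could look roughly like the following Python sketch. It assumes that a reference without a leading `/` is relative to the group of the source variable, as in the `bounds` example. The `VARIABLES` set and the `resolve_references` helper are hypothetical and stand in for a real file and library.

```python
import posixpath

# Hypothetical stand-in for the variables in a file, listed by full name.
VARIABLES = {
    "/PRODUCT/latitude",
    "/PRODUCT/longitude",
    "/PRODUCT/SUPPORT_DATA/latitude_bounds",
    "/PRODUCT/SUPPORT_DATA/processing_flags",
}


def resolve_references(source_full_name, attribute_value):
    """Turn a space-separated reference attribute into a list of full names."""
    source_group = posixpath.dirname(source_full_name)
    resolved = []
    for reference in attribute_value.split():
        if reference.startswith("/"):
            # Absolute path: use as given.
            full_name = posixpath.normpath(reference)
        else:
            # Relative path: interpret from the group of the source variable.
            full_name = posixpath.normpath(posixpath.join(source_group, reference))
        if full_name not in VARIABLES:
            # Fail noisily rather than silently dropping a broken reference.
            raise KeyError(f"{reference!r} (from {source_full_name}) not found")
        resolved.append(full_name)
    return resolved


bounds = resolve_references("/PRODUCT/latitude", "SUPPORT_DATA/latitude_bounds")
coords = resolve_references(
    "/PRODUCT/SUPPORT_DATA/processing_flags",
    "/PRODUCT/latitude /PRODUCT/longitude",
)
print(bounds, coords)
```

Splitting the attribute on whitespace is what makes spaces in group or variable names unusable with this scheme.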
Note that this solution uses the fact that the `/` character is used as a path separator in HDF-5 (and can therefore not occur in a variable or group name). This method puts a restriction on group names in that these should not contain spaces, as the lists of variables are space separated. A similar restriction is already in place (implicitly) on variable names in CF-1.6.
General note on variable names.
Within the S5P project we have put a restriction on 'element' names (groups, variables, attributes). NetCDF-4 allows an element name like "χ²" (\u03C7\u00B2). This is probably very good for human readability, but accessing the field from a program or script (non-interactively) is probably pretty hard. To get the string into this text file I went into an interactive python3 shell, and asked it to print("\u03C7\u00B2"), and those numbers were obtained from a website. Other computer systems may offer more convenient access.
We use the following restrictions:
Element names must match `[a-zA-Z_][a-zA-Z0-9_]*`. This means that the name of a NetCDF-4 element can be used as a variable name in most programming languages.
The first limitation is for instance nice when using the (HDF-5) pytables interface for python, as it allows simple dot-notation to access variables in a file, but requires that all elements are valid python variable names. Adding a similar interface to the python netCDF4 package is on my (far too long) todo list.
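A check against the pattern above takes only a few lines of Python. The helper names are hypothetical, and the second function is an extra assumption that goes beyond the stated restriction: it also rejects Python reserved words, which match the pattern but cannot be used with dot-notation.

```python
import keyword
import re

# Pattern quoted from the restriction above; fullmatch applies it to the whole name.
NAME_PATTERN = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def is_valid_element_name(name):
    """True if the name satisfies the naming restriction."""
    return NAME_PATTERN.fullmatch(name) is not None


def is_usable_as_python_attribute(name):
    """Stricter, hypothetical check: also reject Python reserved words."""
    return is_valid_element_name(name) and not keyword.iskeyword(name)


names = ["latitude_bounds", "\u03C7\u00B2", "2d_field", "class"]
valid = [n for n in names if is_valid_element_name(n)]
print(valid)
```

Here `class` passes the pattern but fails the stricter check, so an interface offering dot-notation access would still need to handle such names separately.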
Notes
See summary below. The variable name restrictions now have their own issue #5.