ZEP10: Generic extensions proposal #67

Open
wants to merge 18 commits into
base: main

Member

@joshmoore joshmoore commented May 16, 2025

This is a follow-on to ZEP9 (#65) since #66 limits the scope of ZEP9 solely to phase 1 such that it can be moved to accepted (since zarr-developers/zarr-specs#330 is merged and v3.1 released). This ZEP is equivalent to phase 2 of the original ZEP9 draft and introduces a top-level generic extensions field.

This ZEP will follow the process laid out in ZEP0 and invites votes from the newly refreshed @zarr-developers/implementation-council. This PR may be proactively merged as a draft, but will not be moved to "accepted" until the related PR on zarr-specs is voted on, merged, and v3.2 released.

Please see zarr-developers/zarr-specs#344 for detailed changes.

@alimanfoo
Member

Hi @joshmoore, just a process question, it would seem beneficial to get this PR merged asap so it becomes visible as a draft zep on the zeps website. Who needs to approve that, and what checks would need to be done at this stage to allow merging? E.g., does someone just need to check that the document has the right structure for a ZEP? If so, I'd be happy to approve.

draft/ZEP0010.md Outdated



#### Domain metadata (group)

This example goes against my own intuition regarding the extension vs. attribute distinction. I assumed that extensions would be for things that require integration into the zarr implementation itself (i.e. zarr-python, tensorstore, zarrs) while attributes are for things that are handled by higher level layers, either "user" code or some higher-level abstraction like ome-zarr built on top of zarr.

Using extensions essentially just as namespaced attributes means that either:

  • each implementation needs to have an implementation of the extension just to allow access to the attributes; or
  • implementations need to allow users to directly read and write must_understand: false extension metadata.

Member Author

I updated the example, @jbms, but didn't fully address your question. I do see code extensions, e.g., ome-zarr having access to the extension metadata as one of three general options:

  • in implementation
  • in plugin (i.e. internal entrypoint)
  • in wrapper (i.e. higher-level)

@jbms

jbms commented May 16, 2025

Hi @joshmoore, just a process question, it would seem beneficial to get this PR merged asap so it becomes visible as a draft zep on the zeps website. Who needs to approve that, and what checks would need to be done at this stage to allow merging? E.g., does someone just need to check that the document has the right structure for a ZEP? If so, I'd be happy to approve.

I know we've done that in the past for ZEPs but then it is actually harder to comment on it --- I'd need to open a separate issue for each comment.

"extensions": [
{
"name": "example.array-statistics",
"must_understand": false,

Here we run into the issue that we may want to distinguish between reading and writing. Reading the array without understanding the extension is fine, but writing to the array would invalidate the statistics. Perhaps we should take the opportunity here to introduce must_understand_for_reading and must_understand_for_writing. An extension marked plain must_understand: false would seem to be serving the same purpose as an attribute.

Member Author

Remembering your previous comment to this (...somewhere), I did consider it. I'm up for trying to include it now, but it will push the timeline, so there's a trade-off. As long as we're discussing it here, though, I was wondering if perhaps we don't make it an (extendable?!) object: must_understand: {"read": true, "write": false}.
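To make the two shapes under discussion concrete, here is a minimal Python sketch of how an implementation might normalize `must_understand` when it is either a boolean (as in the current draft) or a read/write object (the idea floated above). The helper name `normalize_must_understand` and the choice to default missing keys to true are assumptions for illustration, not part of the ZEP or of any implementation.

```python
def normalize_must_understand(value=True):
    """Return a (read, write) pair of booleans.

    Hypothetical helper: accepts the boolean form from the draft or the
    object form suggested in this thread. Missing keys default to True,
    mirroring the draft's default of must_understand=true (an assumption).
    """
    if isinstance(value, bool):
        # Plain boolean: the same requirement applies to reads and writes.
        return value, value
    if isinstance(value, dict):
        read = value.get("read", True)
        write = value.get("write", True)
        if not isinstance(read, bool) or not isinstance(write, bool):
            raise TypeError("'read' and 'write' must be booleans")
        return read, write
    raise TypeError("must_understand must be a boolean or an object")
```

Under this sketch, existing documents with a plain boolean keep their meaning, which is one way the object form could stay backwards-compatible.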


I realized also that there is some additional complexity here.

For storing these min/max values, I could imagine a common strategy would be to first write the array, then compute the min/max statistics. If the array gets updated later, then the min/max values would need to be recomputed afterwards. If the user is resizing and then writing to a new portion of the array, only the newly-written part needs to be examined to recompute the min/max. If the user modifies an existing portion of the array, though, then the entire array would need to be examined. Or maybe the user is okay assuming the min/max bounds have only increased, and therefore only the modified portion needs to be examined.
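The update strategies described above can be sketched in a few lines of Python. This is only an illustration of the reasoning, with hypothetical helper names: merging is exact when new data is appended, but after an overwrite it only yields bounds that are at least as wide as the true range, so an exact answer needs a full rescan.

```python
def merge_min_max(stats, new_values):
    """Widen stored (min, max) bounds using only newly written values.

    Exact when existing elements were not overwritten (append case);
    otherwise the result may be looser than the true range.
    """
    lo, hi = stats
    for v in new_values:
        lo = min(lo, v)
        hi = max(hi, v)
    return lo, hi


def recompute_min_max(all_values):
    """Full rescan, needed for exact bounds after overwriting elements."""
    values = list(all_values)
    return min(values), max(values)
```

For example, if the stored bounds are `(0, 10)` and the element holding `10` is overwritten with `5`, `merge_min_max((0, 10), [5])` still reports `(0, 10)`, while a rescan of the data `[0, 5]` reports `(0, 5)`.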

If this example.array-statistics extension is somehow marked "must understand for writing", then an existing zarr implementation would refuse to write to it. But in fact it is fine for an existing implementation to write to it as long as the statistics are updated later, possibly through some separate batch pipeline separate from the zarr implementation itself. That suggests that implementations should support some sort of override -- a list of extensions that can safely be ignored. For this type of extension, it would also be useful for implementations to provide an interface for directly modifying extension metadata just like attributes.
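A minimal sketch of the override idea, assuming extension objects carry `name` and `must_understand` as in the examples in this thread. The function `check_extensions`, its `ignore` parameter, and the exception class are hypothetical names, not an API of zarr-python or any other implementation.

```python
class UnsupportedExtensionError(Exception):
    """Raised when a required extension is neither supported nor ignored."""


def check_extensions(metadata, supported, ignore=()):
    """Fail by default on unknown must_understand extensions.

    `metadata` is a parsed zarr.json dict; `supported` and `ignore` are
    collections of extension names. Returns the names that were required
    and unsupported but explicitly ignored, so a caller could warn.
    """
    blocking, overridden = [], []
    for ext in metadata.get("extensions", []):
        name = ext["name"]
        required = ext.get("must_understand", True)  # draft default is true
        if not required or name in supported:
            continue
        if name in ignore:
            overridden.append(name)
        else:
            blocking.append(name)
    if blocking:
        raise UnsupportedExtensionError(
            "unsupported extensions: " + ", ".join(blocking)
        )
    return overridden
```

The default stays strict; the user has to name each extension they are willing to ignore, which keeps the opt-out explicit.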

This alternative would continue to reserve the top-level namespace for changes to
the core spec and, therefore, reduce pollution of the top-level namespace. Downsides include
that only a single use of each extension would be possible since the key is the extension
name and there would be no ordering of the extensions.

Since lack of an ordering is listed as a disadvantage, what do you see as a use case for ordering of extensions or listing the same extension more than once?
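One way to see the trade-off in the quoted text is to convert between the two layouts. The sketch below is illustrative only: the extension names and the helper `to_object_form` are hypothetical, and it simply shows that a name-keyed object cannot hold two entries for the same extension. It does not answer whether such repetition is actually needed.

```python
def to_object_form(extensions):
    """Convert a list of extension objects into a name-keyed mapping.

    Raises ValueError if a name repeats, since an object keyed by
    extension name can hold each name only once.
    """
    keyed = {}
    for ext in extensions:
        name = ext["name"]
        if name in keyed:
            raise ValueError(f"extension {name!r} listed more than once")
        keyed[name] = {k: v for k, v in ext.items() if k != "name"}
    return keyed


# Hypothetical extension names, used only for illustration.
single_use = [
    {"name": "example.a", "must_understand": False},
    {"name": "example.b", "configuration": {"k": 1}},
]
repeated = single_use + [{"name": "example.a", "configuration": {"k": 2}}]

keyed = to_object_form(single_use)
```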

@joshmoore
Member Author

@alimanfoo a process question, it would seem beneficial to get this PR merged asap so it becomes visible as a draft zep on the zeps website. Who needs to approve that, and what checks would need to be done at this stage to allow merging? E.g., does someone just need to check that the document has the right structure for a ZEP? If so, I'd be happy to approve.

For merging in the "Draft", yes, that suffices. From https://zarr.dev/zeps/active/ZEP0000.html#submitting-a-zep

"...The Zarr Steering Council and the Zarr Implementations Council will not unreasonably deny publication of a ZEP. Reasons for denying ZEP include duplication of effort, being technically unsound, not providing proper motivation or addressing backwards compatibility, or not taking care of Zarr CODE OF CONDUCT."


@jbms I know we've done that in the past for ZEPs but then it is actually harder to comment on it --- I'd need to open a separate issue for each comment.

I'm certainly all for leaving it open for a bit, especially for the discussion of the material that is only here (as @jbms has done above). I can manage having it open and synchronizing with the specs PR. That being said, if possible, I'd like to get it merged as a "Draft" and then will also keep updating it as necessary to stay in step with discussions on zarr-developers/zarr-specs#344

@d-v-b

d-v-b commented May 16, 2025

Hi @joshmoore, just a process question, it would seem beneficial to get this PR merged asap so it becomes visible as a draft zep on the zeps website. Who needs to approve that, and what checks would need to be done at this stage to allow merging? E.g., does someone just need to check that the document has the right structure for a ZEP? If so, I'd be happy to approve.

seconding @jbms, I rate the ability to discuss the ZEP as a single PR much higher than seeing it listed on the ZEP web site, so I would rather we keep this PR open until it's clear that all the questions have been answered.

draft/ZEP0010.md Outdated
Comment on lines 66 to 67
(the default) and the implementation does not support it, the dataset must not
be loaded and an appropriate error should be raised. For extensions with

@d-v-b d-v-b May 16, 2025

First, does dataset mean array | group, or just array? In any case it might be good to avoid the term "dataset" because hdf5 uses that to describe what zarr calls arrays.

Second, what does it mean to "load a dataset" in this context? I think it should be safe for implementations to access chunks / metadata documents in a read-only mode, and it seems like it should be safe for any implementation to open any metadata document in a mutable mode, but insofar as the metadata document provides essential instructions for interpreting chunks, then it seems like the chunks should be mutable if and only if all the metadata in the metadata document can be understood. But these are just my impressions, it would be good to see a more formal breakdown of these conditions.


An extension may affect the interpretation of the array in an important way. If an implementation opens the array without returning an error, then users may end up using incorrect data without realizing it. For example, there might be an extension that transposes or otherwise affects the indexing of the entire array. Failing by default is very important, but implementations could provide an option for the user to specify a list of extensions that may be safely ignored.


an extension-ignorant Zarr implementation could safely read a chunk as an opaque sequence of bytes, e.g., for the purpose of applying an additional level of compression on top of the chunk data... unless there is an extension that says that this type of compression is disallowed... In theory this can get rather complex, even if it wouldn't in practice. But I do think we need to define what "open a dataset" means here. It would be very odd (and easy to circumvent) if zarr-python was required by the spec to not read chunks as opaque bytes because of some field in a metadata document.


For example, there might be an extension that transposes or otherwise affects the indexing of the entire array.

Since this type of transformation can be expressed as a codec, it seems problematic it can also be defined via an extension.


That can't be expressed as a codec currently because codecs only apply to individual chunks.

However, your comment did give me an idea to write up something I'd thought about a bit previously...

zarr-developers/zarr-specs#346


A "generic extension" as proposed here is a way to gain implementation experience with a "formal extension" before actually adding it to the zarr spec.


is that the goal of this proposal? because there are other ways to do this without adding a new key to zarr metadata.


Personally I indeed would favor just using top-level metadata fields with namespaced names to avoid conflicts with metadata fields that are later added to the standard. That would avoid a syntactic mismatch between pre-standard and standard extensions.

Additionally an extension might alter or remove some existing standard attributes --- for example you might have an inline extension that specifies the array data inline in the metadata, in which case chunk_grid and codecs and fill_value are no longer specified. Or you might have a fill_value_array extension that allows specifying a non-uniform fill value (with broadcasting), in which case the normal fill_value field is excluded.

Syntactically isolating non-standard extensions all in an extensions array doesn't exactly fit with the fact that they may have unbounded effects.

For extensions that are really just namespaced attributes, putting them in an extensions array makes more sense but just keeping them in attributes would make even more sense in my opinion.


Syntactically isolating non-standard extensions all in an extensions array doesn't exactly fit with the fact that they may have unbounded effects.

this is one of my big concerns with this proposal. If we consider a field like data_type, the scope is very narrow. It has very limited interaction with the other metadata fields. Ideally it would have none! by contrast, the proposed extensions field has unlimited scope, and can express any relationship with any other metadata field. What are we getting for this complexity?

Member

The main goal of this ZEP is to allow general purpose extensions that are not limited by the scopes of existing extension points. These extensions are supposed to allow for a (mostly) decentralized evolution of the spec, meaning that no heavy-weight process is required to add functionality or alter behavior of arrays and groups.
We want people in the community to experiment. If an extension has enough traction among implementations and users it might become the basis for a new core zarr feature. There might also be a number of garbage extensions that go nowhere.

Personally I indeed would favor just using top-level metadata fields with namespaced names to avoid conflicts with metadata fields that are later added to the standard. That would avoid a syntactic mismatch between pre-standard and standard extensions.

Syntactically isolating non-standard extensions all in an extensions array doesn't exactly fit with the fact that they may have unbounded effects.

I think that would also be a reasonable choice. I almost think this boils down to aesthetics.

Additionally an extension might alter or remove some existing standard attributes --- for example you might have an inline extension that specifies the array data inline in the metadata, in which case chunk_grid and codecs and fill_value are no longer specified. Or you might have a fill_value_array extension that allows specifying a non-uniform fill value (with broadcasting), in which case the normal fill_value field is excluded.

It seems fine if an extension would mandate that existing core attributes be ignored. It might be a bit weird and could lead to composability issues. But we don't want to limit experimentation at this point.

For extensions that are really just namespaced attributes, putting them in an extensions array makes more sense but just keeping them in attributes would make even more sense in my opinion.

This is an interesting point and I think we need some more discussion around that.

associated with a specific (array) metadata field. Additional extension points
may be added by future ZEPs. Until that time, however, third-parties may want
to add arbitrary extension objects to either arrays or groups. This proposal
introduces a generic ``extensions`` field that serves as a container for such a

@d-v-b d-v-b May 16, 2025

why is the term "generic" used here? We are talking about extensions to array and group metadata documents, and therefore extensions to the array and group models. I would expect that array extensions will have very different constraints vs group extensions (like the question of inheritance, which only applies to groups), so it might be simpler and more direct to introduce that dichotomy early on, instead of framing this as if there's something "generic" about arrays and groups.

Member

I think we call them "generic" because they don't fit in the other extensions points (e.g. codecs, chunk_key_encoding, ...).


it might be simpler to denote these extensions as "array extensions" and "group extensions". That to me much more clearly conveys the thing being extended.

Member

That wouldn't be consistent with our use of "extension" elsewhere in the spec document.


Then it's possible that the word "extension" is used ambiguously in the rest of the spec document as well. In any case, one assumes that the contents of extensions for an array might be different from the contents of extensions for a group. So I would probably use terms like "array extensions" and "group extensions" to describe these two conditions. By contrast, on its face "generic extensions" conveys nothing to me -- new codecs could also be "generic" extensions.

Member

I think the term "extension" is pretty well defined here: https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html#extensions

Right now there are only extensions that fit in one of the extension points. With this ZEP we are permitting additional extensions that are not limited in scope. That is why we call them "generic".
Under that definition, codecs are not generic extensions because they fit in the codecs extension point.

In any case, one assumes that the contents of extensions for an array might be different from the contents of extensions for a group.

It is true that some extensions may only apply to arrays or groups. That would need to be denoted in the respective extension spec. However, fundamentally and syntactically, and therefore for the purpose of this ZEP, I don't think we need to treat them differently.

Comment on lines +106 to +108
Note that in this example of the extension is ``must_understand=true`` meaning
an implementation which does not support the ``example.offset`` extension
should raise an error.

when should that error be raised? when reading metadata, or when reading chunks?

Member

If the impl doesn't know the example.offset extension, it must fail when parsing the metadata.
It may fail with an out-of-bounds error when reading/writing data outside the domain. But that would be up to the specification for this extension to define.


If the impl doesn't know the example.offset extension, it must fail when parsing the metadata.

It seems to me that a zarr-compatible application should be able to say, for example, "this is an array with shape <shape>, but I can't load chunks for you because of <unknown extension>". Your suggesting that the metadata document should be effectively unreadable prevents this.
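The behavior described here could look roughly like the following Python sketch: metadata is parsed and the shape is exposed, but chunk IO is refused while an unknown required extension is present. The class `ArrayView`, the exception, and the use of a plain dict as the store are all assumptions made for illustration. This is one possible reading of the comment, and whether the spec should permit it is what the thread is debating.

```python
class ChunkAccessError(Exception):
    """Raised when chunk IO is blocked by an unknown required extension."""


class ArrayView:
    """Exposes metadata but refuses chunk IO if an extension is unknown."""

    def __init__(self, metadata, store, supported=()):
        self.shape = tuple(metadata["shape"])
        self._store = store  # plain dict standing in for a real store
        # Names of required extensions this (hypothetical) reader lacks.
        self.blocking = [
            ext["name"]
            for ext in metadata.get("extensions", [])
            if ext.get("must_understand", True) and ext["name"] not in supported
        ]

    def read_chunk_bytes(self, key):
        if self.blocking:
            raise ChunkAccessError(
                f"array with shape {self.shape}: cannot load chunks because "
                f"of unknown extension(s) {self.blocking}"
            )
        return self._store[key]
```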

Member

It seems to me that a zarr-compatible application should be able to say, for example, "this is an array with shape <shape>, but I can't load chunks for you because of <unknown extension>".

I think that would be a good implementation.


I think that would be a good implementation.

Since the behavior I described relies on reading the metadata without an error, this PR should clarify the distinction between reading metadata documents and other IO operations (e.g., reading chunks, in this example).


If you are purely displaying information to a user and including a warning that an unknown extension was encountered, then displaying whatever information can be heuristically extracted from the metadata successfully may be reasonable.

In general though if there is an unknown extension, you can't really make any assumptions about the meaning of the metadata and any programmatic use is problematic.

For example, the offset extension may mean that the upper bound of the array is no longer indicated by shape but by offset + shape, and the chunk grid starts at offset rather than (0, ...). Maybe there is some program that partitions zarr arrays according to the chunking and then hands off those zarr arrays to worker processes. If the partition program does not support the offset extension, but the worker program does support the offset extension, then the partition program will perform the partitioning incorrectly, but the worker processes may process them without errors, but not correctly aligned to the chunk grid.
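The partitioning mismatch can be shown with a one-dimensional toy case. The sketch assumes the hypothetical semantics from the paragraph above (the chunk grid starts at `offset` and the upper bound is `offset + shape`); the function name and the numbers are made up for illustration.

```python
def chunk_starts(shape, chunk, offset=0):
    """Start positions of chunks along one axis.

    Assumes the grid begins at `offset` and the axis ends at
    `offset + shape`. An extension-unaware program effectively uses
    offset=0.
    """
    return list(range(offset, offset + shape, chunk))


# A partitioner that ignores the extension versus a worker that honors it.
unaware = chunk_starts(shape=10, chunk=4)
aware = chunk_starts(shape=10, chunk=4, offset=2)
```

Here `unaware` is `[0, 4, 8]` and `aware` is `[2, 6, 10]`, so every partition boundary handed to the workers falls in the middle of a chunk as the workers see it, without any error being raised.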

Concretely, I'd say that if there is an unknown must_understand=true extension, zarr.open and similar interfaces should not appear to succeed and allow querying properties like the chunk grid, dtype, etc. unless the user explicitly opts into ignoring unknown extensions.

@d-v-b d-v-b May 22, 2025

In general though if there is an unknown extension, you can't really make any assumptions about the meaning of the metadata and any programmatic use is problematic.

I find this outcome concerning, as it amounts to fragmenting the zarr ecosystem.

@d-v-b

d-v-b commented May 16, 2025

I think this document should explain why the pre-existing attributes field is insufficient for the purposes of this ZEP.


## Proposal

To provide for more flexible, immediate, and de-centralized use cases, we

What I would find persuasive here would be a concrete example of something that people are trying to do with Zarr today that is blocked by the lack of the extensions field.

Member

I think the example section has a few examples.


I am familiar with the examples in the examples section. They are not concrete examples of something that people are trying to do with Zarr today that is blocked by the lack of the extensions field.

My feedback here is that this proposal should open with a real, concrete example that people will understand. I could not explain to someone what "flexible, immediate, decentralized use cases" means. By contrast, I could explain to someone what xarray, or nasa-gesdic, or ome-zarr are doing.


To make it more explicit, I would take one of these examples

introduce in a small number of words how they are using zarr, and then explain why the extensions field would solve a real problem for them. This would make it clear that this ZEP is an attempt to solve a real problem that people actually have, unlike the current reference to "flexible, immediate, decentralized use cases".

Member

I am familiar with the examples in the examples section. They are not concrete examples of something that people are trying to do with Zarr today that is blocked by the lack of the extensions field.

I disagree. All of these examples are concrete, but simplified, extension proposals that have been floated by the community.


the feedback you are getting is that, from my POV, those examples are not concrete enough, and are too simplified. I personally do not see value in changing the zarr spec to solve a hypothetical, simplified problem. Stating a real problem, and then showing how this proposal will solve that problem, is convincing.

@jbms

jbms commented May 16, 2025

I think this document should explain why the pre-existing attributes field is insufficient for the purposes of this ZEP.

For must_understand: true extensions, like specifying the array content inline, transposing the array, etc. an attribute would definitely not work. However, all of the examples given would work as attributes reasonably well.

joshmoore and others added 2 commits May 16, 2025 16:07
Co-authored-by: Davis Bennett <[email protected]>
Co-authored-by: Davis Bennett <[email protected]>
@d-v-b

d-v-b commented May 22, 2025

I think this document should explain why the pre-existing attributes field is insufficient for the purposes of this ZEP.

For must_understand: true extensions, like specifying the array content inline, transposing the array, etc. an attribute would definitely not work. However, all of the examples given would work as attributes reasonably well.

to be clear, the specific thing that would not work if all extensions were in attributes is that we could not prevent non-compliant implementations from accessing data. Extension-compliant implementations on the other hand would have no trouble reading extensions from attributes.

This makes me wonder: how important is it really to exclude non-compliant implementations from accessing (and possibly misinterpreting) data? I.e., how much weight should we assign to this feature? Are there real examples of negative outcomes from misinterpreting specialized zarr data? Or is this purely hypothetical?

@joshmoore
Member Author

Thanks for the feedback, all. I've pushed a number of clarification commits based on them, and tried to resolve the threads appropriately. I have ideas on further examples (esp. encryption as recently discussed on Zulip), but I'd very much welcome any others that may be floating around (as PRs, comments, etc.)

There are a few remaining conversations:

  • extensions array vs. one of the alternatives
  • clarifying boundaries between attributes and extensions usage
  • advanced must_understand semantics

These may be easier on a call to work toward consensus rather than extended back and forth here. Since the previous ZEP meeting spot was cancelled, I'd suggest we start with a one-off. Finding time this coming week (May 26+) may be difficult but two options are:

  • June 2, 2025 – 20:00–21:00 CEST
  • June 4, 2025 – 20:00–21:00 CEST

I'd still also like to encourage other implementer voices, @zarr-developers/implementation-council. To ensure everyone feels comfortable contributing, it might be helpful for those who have already shared their perspective to give others space to chime in without feeling the need to immediately respond or defend their points.

@LDeakin
Member

LDeakin commented May 25, 2025

Thanks Josh and Norman, this looks pretty great! My thoughts based on the PR and comments so far:

  • An extension array seems the most flexible, as it permits ordered / repeated extensions
  • The distinction between extensions and attributes is certainly not as simple as automated vs human. The way I see it:
    • An extension may affect chunk operations, metadata parsing, store operations, and other fundamental Zarr functionality in unforeseen ways. Such an extension requires support from a Zarr implementation like zarr-python, tensorstore, zarrs, etc.
    • An attribute may change how data/metadata is interpreted, and would be the responsibility of downstream libraries (like ome-zarr-py), but implementations could support them too. These would fit under the banner of ZEP04.
  • I'm conflicted on isolating reading/writing with must_understand, but leaning to keeping it true/false because
    • must_understand: true/false is backwards-compatible
    • must_understand: true clearly means an implementation must support reading/writing
    • must_understand: false in an extension implies to me that an implementation should support reading, but should not write unless it is actually aware of the extension and knows that it is okay to do so
      • An extension that never needs to be understood for reading or writing seems like it should just be an attribute

@jbms

jbms commented May 25, 2025

Can someone give an example of how order-dependent extensions might be used/specified?

It seems to me that it would be potentially confusing to have most extensions be order-independent but in certain cases have order dependence. It also seems like it would be quite challenging to specify the order-dependent behavior unless you can map the extension to some composable "interface" like a codec or storage transformer. But if there is such a composable interface, the extension should just be defined as specifying a list of "things" that conform to that interface, and that interface becomes a new extension point, e.g. {"name": "my_new_interface_list", "configuration": {"list": [{"name": "my_new_interface_item", "configuration": {...}}...]}}
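As a rough illustration of the "list of things conforming to one interface" pattern, the Python sketch below treats each item as a function on integers and composes the items in list order. The item names `example.scale` and `example.shift`, the registry, and `build_pipeline` are hypothetical; only the outer `my_new_interface_list` shape comes from the comment above.

```python
# Hypothetical item types sharing one "interface": a function int -> int.
ITEM_FACTORIES = {
    "example.scale": lambda cfg: lambda x: x * cfg["factor"],
    "example.shift": lambda cfg: lambda x: x + cfg["amount"],
}


def build_pipeline(extension):
    """Compose the items in extension["configuration"]["list"] in order."""
    steps = [
        ITEM_FACTORIES[item["name"]](item.get("configuration", {}))
        for item in extension["configuration"]["list"]
    ]

    def apply(value):
        for step in steps:
            value = step(value)
        return value

    return apply


scale = {"name": "example.scale", "configuration": {"factor": 2}}
shift = {"name": "example.shift", "configuration": {"amount": 1}}
scale_then_shift = build_pipeline(
    {"name": "my_new_interface_list", "configuration": {"list": [scale, shift]}}
)
shift_then_scale = build_pipeline(
    {"name": "my_new_interface_list", "configuration": {"list": [shift, scale]}}
)
```

With an input of 3, the first pipeline gives 7 and the second gives 8. The order dependence lives entirely inside one extension's configuration, so the top-level extensions themselves would not need to be ordered.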

Comment on lines +67 to +68
By adding a new field, the specification can assert restrictions that if added
to ``attributes``. would amount to a breaking change. If present, the
Member

I don't understand "By adding a new field, the specification can assert restrictions that if added
to attributes. would amount to a breaking change". I think an example could help a lot for people (like me) who've fallen behind in tracking spec/ZEP progress.

Member Author

The contract of "attributes" has been, 'you can put anything here'. The definition of extensions says that they follow a form and their names are registered in https://github.com/zarr-developers/zarr-extensions. Breaking here means that it's certainly fair to assume that there is V3 data today with attributes that would violate these assumptions.
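
As a hedged illustration of why this would be breaking, the sketch below checks values against an assumed extension form (an object with a string `name`, an optional `configuration` object, and an optional boolean `must_understand`). The function names are hypothetical, and the registry lookup is not modelled:

```python
from typing import Any, Dict, List


def is_extension_entry(value: Any) -> bool:
    """Check the assumed form: {"name": str, "configuration"?: dict, "must_understand"?: bool}."""
    if not isinstance(value, dict):
        return False
    name = value.get("name")
    if not isinstance(name, str) or not name:
        return False
    if not isinstance(value.get("configuration", {}), dict):
        return False
    return isinstance(value.get("must_understand", True), bool)


def nonconforming_keys(attributes: Dict[str, Any]) -> List[str]:
    """List attribute keys whose values would fail the form check."""
    return [key for key, value in attributes.items() if not is_extension_entry(value)]


# Free-form attributes of the kind that existing V3 data may already contain.
existing_attributes = {"offset": [0, 0], "notes": "anything goes here"}
print(nonconforming_keys(existing_attributes))
```

Imposing such a check on `attributes` after the fact would reject data that was valid when written, whereas a new field can require the form from the start.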

Comment on lines +128 to +131
The ``example.offset`` extension contains an array of the same order as the
shape of the containing array specifying which element of the array should be
considered as the origin, e.g., `[0, 0]`. This allows the reuse of subregions
of an array without the need to rewrite the data.
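
One possible reading of the `example.offset` text, sketched under the assumption that the offset names the stored element that acts as logical coordinate zero, so that stored index = logical index + offset. The helper `to_stored_index` is hypothetical and not part of any Zarr implementation:

```python
from typing import Sequence, Tuple


def to_stored_index(
    logical: Sequence[int], offset: Sequence[int], shape: Sequence[int]
) -> Tuple[int, ...]:
    """Map a logical coordinate to a stored index, assuming stored = logical + offset."""
    if not (len(logical) == len(offset) == len(shape)):
        raise ValueError("logical, offset and shape must have the same length")
    stored = tuple(l + o for l, o in zip(logical, offset))
    for index, size in zip(stored, shape):
        if not 0 <= index < size:
            raise IndexError(f"stored index {stored} is outside shape {tuple(shape)}")
    return stored


# With an offset of [2, 3], the logical origin is stored element (2, 3).
print(to_stored_index((0, 0), offset=(2, 3), shape=(10, 10)))
```

With an offset of `[0, 0]` the mapping is the identity, which matches the example value in the draft.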
Member

If someone were to come across a must_understand=True extension in a zarr.json in the wild, how would they find the information necessary for supporting it in their Zarr implementation?

Member Author

Look up the name in zarr-extensions, and then read the documentation associated there.

Member

Is it possible to make this ZEP self-describing, so that a user knows where to find that information without needing to also read the Zarr V3 spec?

Member Author

Hmmm.... I'm not sure I understand your question, @maxrjones.


### Alternatives for the `extensions` extension point

The current design allows having the same
Member

I don't understand "This MAY happen as part of the core spec adopting functionality of an extension.". Can you provide more details about the motivation for this warning?

Member Author

Let's use the "offset" example. Currently, someone could grab the zarr-extensions name and start building that extension. In the future, another ZEP could codify the name "offset" as top-level metadata, independently of the fact that it would then already be an extension.

Member

Why would the Zarr community allow codifying the name "offset" as top-level metadata in this case, since it seems that the functionality could be implemented by the extensions?

Member Author

I can retroactively give you an example. In Zarr v2, the dimension separator was a Store concern in zarr-python. While working on the benchmark in https://www.nature.com/articles/s41592-021-01326-w, I realized that for OME-Zarr we were going to need to require / for performance reasons, and there was no way to do that from the fileset alone. So I introduced dimension_separator into Zarr v2 as best I could. (Conceivably, you could call that v2.1.)

It took me 6 months to get that work done, which means my paper took that much longer to publish. Today, I would almost certainly have made an extension in order to not take on that burden immediately. It does, however, very much make sense in the main spec (as it is in v3), and so a follow-up ZEP would likely have been wanted to elevate the dim separator "extension" to a full member of the spec.

    ...,
    "extensions": [
        {
            "name": "example.offset",
Member

Is the example part of this name meaningful? I think it would be useful to either define in this document what the naming conventions are or link to the relevant external convention.

Member Author

Sorry, that's from that "Extensions naming" section added by ZEP9.

Comment on lines +193 to +198
    "multiscale": {
        "datasets": [
            "path/to/array/1",
            "path/to/array/2",
            "path/to/array/3"
        ]
Member

Could you provide some details about why using extensions is better than using attributes for this purpose? I'm asking this because it seems to serve a similar purpose as the NGFF spec or proposed GeoZarr spec which have been building an attributes-centric approach.

Member

@maxrjones maxrjones May 26, 2025

as an example of the motivation for this question and the Zarr v3 restriction, I'm trying to understand how this could be used for GeoZarr (if the standards working group decides to use this mechanism) which needs to support both Zarr formats. An example following my current understanding is drafted at https://github.com/maxrjones/geozarr-spec/blob/simple-translation/geozarr-spec.md#defining-primary-geospatial-convention.

Member Author

This example is trying to outline how those two communities might move their common data structures into a shared space. GeoZarr's similar structure was initially taken from zarr-developers/zarr-specs#50. Had we had the central registry at that time, we might have developed it in common.

Copy link

the value of a central registry seems like a separate question from why using extensions is better than using attributes.

Member

> This example is trying to outline how those two communities might move their common data structures into a shared space. GeoZarr's similar structure was initially taken from zarr-developers/zarr-specs#50. Had we had the central registry at that time, we might have developed it in common.

Does this mean that ZEP009 supersedes ZEP004?

> the value of a central registry seems like a separate question from why using extensions is better than using attributes.

My guess based on this discussion is that an extensions key allows the Zarr community to watch for naming conflicts between communities, serving a similar purpose as ZEP0004 but with a different mechanism. Is this correct?

Contributor

As the person who originated ZEP 4, I have come to feel it wasn't fully baked. I feel like the mechanism for identifying if a convention is present proposed there is quite flaky. I much prefer the STAC approach, i.e. explicit extensions, which is closer to what is being described here.

Copy link

> As the person who originated ZEP 4, I have come to feel it wasn't fully baked. I feel like the mechanism for identifying if a convention is present proposed there is quite flaky. I much prefer the STAC approach, i.e. explicit extensions, which is closer to what is being described here.

What I had in mind was explicitly registering individual top-level attribute names, e.g. tensorstore.dimension_units.

@maxrjones
Member

Is there anything in this proposal that motivates its restriction to Zarr specification 3 rather than both Zarr specification 2 and 3?

@normanrz
Member

> Is there anything in this proposal that motivates its restriction to Zarr specification 3 rather than both Zarr specification 2 and 3?

At least from my pov, there is no desire to further evolve the v2 specification. Extensibility was one major motivation of the v3 specification. I think it would be confusing to continue evolving both.
In most cases, v2 data can be upgraded to v3 with metadata-only updates.

@joshmoore
Member Author

> jbms (Jeremy Maitin-Shepard) 2 days ago
> Can someone give an example of how order-dependent extensions might be used/specified?

I've not come up with a compelling one, @jbms. My intuition is that there would be a chance for one member of the pipeline (extA) to be able to update some state (the metadata?) before a later one (extB). Practically, though, I don't see how extA could know enough about extB to inject itself into the list at the right point. So other than "high-priority" extensions which add themselves at the beginning and "low-priority" ones which add themselves at the end, I still don't have a concrete example.

But, generally, 👍 on a general "SequenceExtension" style that others can adopt. Perhaps this speaks to a "generic extensions conventions" (or "idioms") section.

@jbms

jbms commented May 27, 2025

I agree with what @LDeakin said about must_understand --- must understand for writing should always implicitly be true and must_understand applies only for reading. That simplifies things nicely.

I am not in favor of using extensions as an attribute namespace for things like ome-zarr that are logically layered on top of zarr itself and don't require changes/deep integration with the zarr implementation, for several reasons:

  • zarr implementations intended for general use will have to provide an API for users to directly read and write extensions metadata very similar to attributes
  • we need to complicate must_understand to indicate reading and writing separately.
  • it muddies the distinction between things that change the core zarr model itself and things purely layered on top.

Instead we should add an attribute section to the zarr-extensions repo (related to the zarr conventions proposal). While technically this is a breaking change in that currently no part of the attribute namespace is reserved for registered names, in practice we could use some prefix or other naming convention for registered attributes such that conflicts with existing uses are very unlikely.

As I see it, extensions do have a high cost as far as fragmenting the ecosystem and therefore should be introduced with care, mostly for things that could reasonably be added to the core spec also.

Co-authored-by: Sanket Verma <[email protected]>