Add Index.load() and Index.chunk() methods #8128

Draft
wants to merge 4 commits into
base: main

benbovy
Member

@benbovy benbovy commented Aug 31, 2023

  • Closes #xxxx
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

As mentioned in #8124, this gives custom Xarray indexes more control over what to do when the corresponding Dataset / DataArray load() and chunk() methods are called.

PandasIndex.load() and PandasIndex.chunk() always return self (no action required).

For a DaskIndex, we might want to return a PandasIndex (or another non-lazy index) from load() and rebuild a DaskIndex object from chunk() (rechunk).
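
A minimal, library-free sketch of the behaviour described above. The classes are toy stand-ins, not Xarray's Index or PandasIndex, and DaskIndex is hypothetical here; the list-based storage and the chunk_size argument are assumptions made only for illustration.

```python
from __future__ import annotations


class ToyIndex:
    """Toy stand-in for a base index: load() and chunk() default to no-ops."""

    def load(self) -> ToyIndex | None:
        return self

    def chunk(self, chunk_size: int) -> ToyIndex | None:
        return self


class ToyPandasIndex(ToyIndex):
    """In-memory index: nothing to load or rechunk, so the defaults apply."""

    def __init__(self, values: list[int]) -> None:
        self.values = list(values)


class ToyDaskIndex(ToyIndex):
    """Hypothetical lazy index that keeps its values in chunks."""

    def __init__(self, chunks: list[list[int]]) -> None:
        self.chunks = [list(c) for c in chunks]

    def _flatten(self) -> list[int]:
        return [v for c in self.chunks for v in c]

    def load(self) -> ToyPandasIndex:
        # Gather every chunk and hand back a non-lazy index.
        return ToyPandasIndex(self._flatten())

    def chunk(self, chunk_size: int) -> ToyDaskIndex:
        # Rebuild a new lazy index using the requested chunk size.
        flat = self._flatten()
        return ToyDaskIndex(
            [flat[i : i + chunk_size] for i in range(0, len(flat), chunk_size)]
        )
```

In this sketch, calling load() on the lazy index yields a different index type, while the in-memory index simply returns itself from both methods.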

@benbovy
Member Author

benbovy commented Apr 16, 2025

Index.compute() might be a possible alternative to Index.load() #6837.

@dcherian
Contributor

How would this work for compute?

For load, I could see

ds.xindexes["foo"].load()

but the pattern for compute is usually:

ds2 = ds.compute()

how would that translate?

@benbovy
Member Author

benbovy commented Apr 16, 2025

Index.load() has different semantics than Dataset.load(): it returns an index object that replaces the existing index when Dataset.load() is called. The returned index may be self (the index is simply propagated), a new instance, possibly of another type (e.g., a PandasIndex), or None (the index is dropped).

Index.load() (like the rest of the core Index API) is not intended to be an end-user-facing API; it is used internally by Dataset.load(), or by Dataset.compute() via Dataset.load().

In general the Index method names were chosen after the Dataset methods in which they are called, but maybe Index.compute() or another name would be less confusing here?
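
The replace-or-drop semantics described in this comment can be sketched without Xarray. Everything below is a toy model: the class names and the load_indexes() helper are hypothetical and only mimic what Dataset.load() might do with the value returned by each index.

```python
from __future__ import annotations


class KeepIndex:
    """Toy index whose load() propagates itself unchanged."""

    def load(self) -> KeepIndex:
        return self


class InMemoryIndex(KeepIndex):
    """Toy stand-in for a non-lazy index such as a PandasIndex."""


class ConvertingIndex:
    """Toy index that converts itself to another index type on load()."""

    def load(self) -> InMemoryIndex:
        return InMemoryIndex()


class DroppedIndex:
    """Toy index that asks to be dropped by returning None from load()."""

    def load(self) -> None:
        return None


def load_indexes(indexes: dict[str, object]) -> dict[str, object]:
    """Hypothetical helper: rebuild the index mapping from each load() result."""
    new_indexes: dict[str, object] = {}
    for name, index in indexes.items():
        result = index.load()
        if result is not None:  # None means "drop this index"
            new_indexes[name] = result
    return new_indexes
```

The caller never inspects the index type; it only acts on the returned object, which is what lets each index decide for itself how to be "loaded".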

@dcherian
Contributor

So if I was a user using CoordinateTransformIndex and I wanted to "load" the transformed values into memory, how would I do that?

@benbovy
Member Author

benbovy commented Apr 17, 2025

As an end-user you would only need to do ds.load() or ds.compute() and not care much about anything else.

It is up to the index to define how to "load" the coordinate values and maybe convert itself. For CoordinateTransformIndex I see three options:

  1. A 1D index may be converted into a PandasIndex
  2. An nD index may be dropped, so Dataset.load() will fall back to Variable.load() for loading the index coordinate data
  3. Add a CoordinateTransformIndex.__init__(lazy=True) option that will be used in CoordinateTransformIndex.create_variables() and that will determine the kind of variable to return

Option 3 probably makes the most sense if we still need to keep track of the underlying transform.
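
A rough sketch of what option 3 could look like, written without Xarray. ToyTransformIndex and LazyValues are hypothetical names, the transform is assumed to be a plain function from integer position to coordinate value, and the coordinate name "x" is chosen only for illustration.

```python
from __future__ import annotations

from typing import Callable


class LazyValues:
    """Computes coordinate values from the transform only when accessed."""

    def __init__(self, transform: Callable[[int], float], size: int) -> None:
        self.transform = transform
        self.size = size

    def __len__(self) -> int:
        return self.size

    def __getitem__(self, i: int) -> float:
        if not 0 <= i < self.size:
            raise IndexError(i)
        return self.transform(i)


class ToyTransformIndex:
    """Toy index whose lazy flag decides which kind of variable it creates."""

    def __init__(
        self, transform: Callable[[int], float], size: int, lazy: bool = True
    ) -> None:
        self.transform = transform
        self.size = size
        self.lazy = lazy

    def create_variables(self) -> dict[str, LazyValues | list[float]]:
        if self.lazy:
            return {"x": LazyValues(self.transform, self.size)}
        # Eager mode: materialize every value in memory.
        return {"x": [self.transform(i) for i in range(self.size)]}

    def load(self) -> ToyTransformIndex:
        # Keep track of the same transform, but switch to eager variables.
        return ToyTransformIndex(self.transform, self.size, lazy=False)
```

Here load() returns an index of the same type, so the underlying transform stays available after the values have been realized.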

@dcherian
Contributor

I'm not sure we should conflate the two.

For example, I could have a dataset with a bunch of chunked arrays and a CoordinateTransformIndex. I might want to load the data into memory, but not realize the lazy coordinates.

And conversely, I might want to realize the CoordinateTransform values (say I've subset to a small region), but not load any chunked arrays.

I guess (3) is an option, but it's a bit of "action-at-a-distance". What is the most explicit API we can come up with?

# assuming RasterIndex over 'x', 'y' dimensions
ds.xindexes.update({"x": ds.xindexes["x"].load()})  # in-place (seems like it has to be)

@benbovy
Member Author

benbovy commented Apr 17, 2025

I see. Would it be reasonable to add a Dataset.load(load_coords=False) option? And add a Dataset.coords.load() method for the case of loading the coordinates but not the data? This is not the most fine-grained approach but maybe that's enough for most cases?

> What is the most explicit API we can come up with?

I'd avoid ds.xindexes.update() because, in the long term, .xindexes might be reduced to a basic mapping of index objects (#9203 (comment)), whereas "loading" the index should also update the index coordinates.

Alternatively:

loaded_coords = xr.Coordinates.from_xindex(ds.xindexes["x"].load())

ds.coords.update(loaded_coords)
# or
ds = ds.assign_coords(loaded_coords)

@dcherian
Contributor

I like loaded_coords = xr.Coordinates.from_xindex(ds.xindexes["x"].load()) as the explicit API.

@benbovy
Member Author

benbovy commented Apr 25, 2025

Assuming a multi-coordinate index like RasterIndex over x/y dimensions, ds.xindexes["x"].load() may look confusing: what about "y"?

Some possible ways to make it less confusing:

  • In Xarray, update Indexes.__getitem__(self, key) such that key accepts a tuple. This would allow typing ds.xindexes[("x", "y")], which would basically return the same index as ds.xindexes["x"] or ds.xindexes["y"]

  • 3rd-party API such as ds.rasterix.raster_index.load() or ds.rasterix.load_raster_coords()
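
The tuple-key lookup from the first bullet could behave roughly as follows. ToyIndexes is a hypothetical stand-in for Xarray's Indexes mapping, and raising KeyError when the named coordinates do not share one index is an assumption of this sketch, not something settled in the discussion.

```python
from __future__ import annotations


class ToyIndexes:
    """Toy mapping from coordinate name to index object."""

    def __init__(self, indexes: dict[str, object]) -> None:
        self._indexes = dict(indexes)

    def __getitem__(self, key: str | tuple[str, ...]) -> object:
        if isinstance(key, tuple):
            if not key:
                raise KeyError(key)
            found = [self._indexes[name] for name in key]
            first = found[0]
            # Assumed rule: every named coordinate must share one index.
            if any(index is not first for index in found):
                raise KeyError(f"coordinates {key!r} do not share a single index")
            return first
        return self._indexes[key]
```

With a single index object registered under both "x" and "y", the lookups by "x", by "y", and by ("x", "y") all return that same object, which makes it explicit that the index covers both coordinates.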

Projects
Status: In progress

2 participants