What is your issue?
The default behavior of opening datasets lazily instead of loading them into memory urgently needs more documentation, or at least more extensive linking of the existing docs.
I have seen plenty of examples where the 'laziness' of the loading is not apparent to users.
The workflow commonly looks something like this:
- Open some 'larger-than-memory' dataset, e.g. from a cloud bucket, with `xr.open_dataset('gs://your/bucket/store.zarr', engine='zarr')`
- Print the dataset repr and see no indication that the data is not in memory (I am not sure if there is somewhere to check this that I might have missed?)
- Layer some calculations on top; at some point a computation is triggered and blows up the user's machine/kernel (see the sketch right after this list).
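To make the pitfall concrete, here is a minimal sketch of that workflow. The store path and the `temperature` variable are placeholders, and the comments describe the behavior as I currently understand it:

```python
import xarray as xr

# Hypothetical cloud-hosted zarr store; path and variable name are placeholders.
ds = xr.open_dataset("gs://your/bucket/store.zarr", engine="zarr")  # default chunks=None

# The repr shows dims, coords and dtypes, but gives no obvious hint that
# nothing has been read into memory yet.
print(ds)

# Indexing stays lazy thanks to xarray's internal lazy indexing classes ...
subset = ds["temperature"].isel(time=slice(0, 10))

# ... but without dask, the first actual computation (or .values / .load())
# reads the whole variable into a numpy array, which is where a
# larger-than-memory store kills the kernel.
climatology = ds["temperature"].mean("time")
```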
To start with, the docstring of `open_dataset` does not mention at all what happens when the default `chunks=None` is used! This could easily be fixed if more extensive text on lazy loading existed and were linked from there. I was also not able to find any more descriptive docs on this feature, although I might have missed something here.
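For example, the docstring could spell out what the different `chunks` values actually mean. This is just my understanding of the current behavior (a sketch; the exact chunking semantics of `chunks={}` / `chunks="auto"` may vary between versions and backends):

```python
import xarray as xr

path = "gs://your/bucket/store.zarr"  # placeholder

# chunks=None (the default): no dask; variables are wrapped in xarray's
# internal lazy indexing classes and become numpy arrays on first computation.
ds_lazy = xr.open_dataset(path, engine="zarr")

# chunks={}: use dask, following the engine's preferred (on-disk) chunking.
ds_dask = xr.open_dataset(path, engine="zarr", chunks={})

# chunks="auto": use dask with automatically chosen chunk sizes.
ds_auto = xr.open_dataset(path, engine="zarr", chunks="auto")
```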
Up until a chat I had with @TomNicholas today, I honestly did not understand why this feature even exists.
His explanation (below) was very good, however, and if something similar is not in the docs yet, it should probably be added.
> When you open a dataset from disk (/zarr), often the first thing you want to do is concatenate it and subset it by indexing (the concatenation may even happen automatically in open_mfdataset). If you do not have dask (or choose not to use dask), opening a file would load all its data into a numpy array. You might then want to concatenate with other numpy arrays from opening other files, but then subset to only some time steps (for example). If everything is immediately loaded as numpy arrays this would be extremely wasteful - you would load all these values into memory even though you're about to drop them by slicing.
>
> This is why xarray has internal lazy indexing classes - they lazily do the indexing without actually loading the data as numpy arrays. If instead you load with dask, then because dask does everything lazily, you get basically the same features. But the implementation of those features is completely different (dask's delayed array objects vs xarray's lazy indexing internal classes).
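Related to this: as far as I can tell, the only way to see which of the two lazy mechanisms is backing your data is `.chunks` (public, but dask only) or private attributes of the `Variable`, which are implementation details and may change between versions. A rough sketch, again with placeholder names:

```python
import xarray as xr

# Placeholder store and variable names.
ds_lazy = xr.open_dataset("store.zarr", engine="zarr")              # lazy indexing, no dask
ds_dask = xr.open_dataset("store.zarr", engine="zarr", chunks={})   # dask-backed

# Dask-backed data is easy to spot: .chunks is not None.
print(ds_dask["temperature"].chunks)  # e.g. ((12,), (720,), (1440,))
print(ds_lazy["temperature"].chunks)  # None -> not a dask array

# The lazy-indexing wrapper is only visible through private attributes
# (not something users should have to rely on):
print(type(ds_lazy["temperature"].variable._data))  # some lazy indexing wrapper class
print(ds_lazy["temperature"].variable._in_memory)   # False until the data is loaded
```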
I think overall this is a giant pitfall, particularly for xarray beginners, and it thus deserves some thought. While I am sure the choices made so far have large functional upsides, I wonder three things:
- How can we improve the docs to at least make this behavior more obvious?
- Is making the loading lazy by default actually a required choice?
- Is there a way we could indicate 'laziness' in the dataset repr? I do not know how to check whether a dataset is actually in memory at this point. Or maybe there is a clever way to raise a warning if the size of the dataset is larger than the system memory? (A rough sketch of such a check follows this list.)
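For that last point, something along these lines might already help. This is only a sketch, not an existing xarray API; it uses `psutil` to query system memory, and the threshold is arbitrary:

```python
import warnings

import psutil  # third-party; used only to query available system memory
import xarray as xr


def warn_if_larger_than_memory(ds: xr.Dataset, fraction: float = 0.5) -> None:
    """Hypothetical helper: warn when a dataset's uncompressed size exceeds
    a fraction of the currently available memory."""
    available = psutil.virtual_memory().available
    if ds.nbytes > fraction * available:
        warnings.warn(
            f"Dataset is ~{ds.nbytes / 1e9:.1f} GB uncompressed, but only "
            f"~{available / 1e9:.1f} GB of memory is available; consider "
            "opening it with dask (e.g. chunks={}) before computing."
        )


# Usage sketch with a placeholder store:
# ds = xr.open_dataset("gs://your/bucket/store.zarr", engine="zarr")
# warn_if_larger_than_memory(ds)
```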
Happy to work on this, since it is very relevant for many members of projects I work with. I first wanted to check whether there are existing docs that I missed.