Replies: 1 comment
- Have you tried writing to zarr instead of netCDF?
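A minimal sketch of what that could look like, assuming the processed result is still a lazy, dask-backed dataset (`my_processing_function` and the store paths here are hypothetical stand-ins):

```python
import xarray as xr

ds = xr.open_zarr("input.zarr")          # lazily opens dask-backed arrays
result = my_processing_function(ds)      # hypothetical; assumed to stay lazy

# Each dask chunk is written as its own zarr chunk, so peak memory should
# stay proportional to a handful of chunks rather than the whole dataset.
result.to_zarr("output.zarr", mode="w")
```

Zarr stores every chunk as a separate object, so chunks can be written independently as soon as they are computed, whereas the netCDF4/HDF5 backend typically serializes writes through a single file lock.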
-
I'm processing time series of satellite images (x, y, time), but my processing function operates only along time (each pixel is independent).
The processing function is quite complex (many functions spread across different modules) but, as far as I can tell, involves only xarray and numpy functions. Neither the whole input dataset nor the output dataset fits in memory, but the data are read from a zarr store with a very small chunk size in x and y, and the output dataset is written with to_netcdf. (I'm not calling load, compute or persist, of course.)
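For concreteness, here is a minimal sketch of the shape of such a pipeline; the variable name, the per-pixel function and the paths are hypothetical stand-ins, not the actual code:

```python
import numpy as np
import xarray as xr

def per_pixel(ts: np.ndarray) -> np.ndarray:
    # Hypothetical stand-in for the complex per-pixel algorithm:
    # takes a 1-D time series and returns a 1-D time series.
    return ts - np.nanmean(ts)

ds = xr.open_zarr("input.zarr")   # small chunks in x and y

out = xr.apply_ufunc(
    per_pixel,
    ds["reflectance"],            # hypothetical variable name
    input_core_dims=[["time"]],   # the function consumes the time axis...
    output_core_dims=[["time"]],  # ...and produces a new time axis
    vectorize=True,               # loop per_pixel over every (x, y) pixel
    dask="parallelized",          # keep everything lazy, one chunk at a time
    output_dtypes=[ds["reflectance"].dtype],
)

out.to_dataset(name="processed").to_netcdf("output.nc")
```

Note that dask="parallelized" requires each core dimension (time here) to fit in a single chunk, so the time axis must not be split across chunks.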
I expect very low memory usage, since my processing function operates only along time and the chunk size in x and y is very small. In the extreme case, using chunks={'x': 1, 'y': 1} should consume a bare minimum of memory, even if it is not very efficient (which I don't mind).
However, my script consumes all available memory whatever the chunk size, and stops with a MemoryError. I've reworked the algorithm to simplify the operations where I suspected a problem, but the real issue is that I can't identify where the problem is.
Is there any way to debug this kind of situation (which happens to me often), for instance by checking whether the calculation operates on chunks only, or if, when and why the whole dataset gets loaded, or whether many intermediate results are kept in memory? Is there a way to inspect dask internals to understand what's happening?
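A few standard dask inspection tools can help answer these questions; below is a sketch under stated assumptions (the store path, the variable name and `my_processing_function` are hypothetical). Each numbered step can be tried independently:

```python
import dask
import xarray as xr
from dask.diagnostics import ProgressBar, ResourceProfiler, visualize

ds = xr.open_zarr("input.zarr")         # hypothetical input store
result = my_processing_function(ds)     # hypothetical; should still be lazy

# 1. Check that the result is still chunked as expected: a variable whose
#    .chunks is None has silently been loaded into memory as numpy.
print(result["processed"].chunks)       # hypothetical variable name

# 2. Inspect the task graph size: a graph with millions of tasks can
#    exhaust memory on its own, before any chunk is even computed.
print(len(result["processed"].data.__dask_graph__()))

# 3. Profile memory while the computation actually runs.
delayed = result.to_netcdf("output.nc", compute=False)
with ResourceProfiler(dt=0.5) as rprof, ProgressBar():
    delayed.compute()
visualize([rprof])                      # bokeh plot of memory/CPU over time

# 4. Alternatively, run single-threaded so print statements or a debugger
#    inside the processing function see tasks in a deterministic order.
with dask.config.set(scheduler="synchronous"):
    delayed.compute()
```

With the distributed scheduler, the same information is available live through the dashboard of a dask.distributed Client.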