Open
Labels
enhancement: New feature or request
Description
Currently, DataSets are two NumPy arrays of axis variables and a dict of individual measurements as NumPy arrays, plus a few other things for metadata/units. This is very basic, and was essentially chosen as a 'minimum setup' that would be easy to move away from once the requirements became clearer. There are several better data representations, each with its own pros and cons:
- Astropy TimeSeries:
  - Pros:
    - These are designed to work with timeseries data, and have functions for manipulating it. They can save to an HDF5 file, including metadata like units and column names, and load from it.
  - Cons:
    - TimeSeries are saved as a single HDF5 Dataset within the file, rather than each column being a separate Dataset. This means you can't load single columns.
    - Astropy's HDF5 interface does not support chunked reading: you cannot request a subset of the file (e.g. from date 2018-02-01 to 2018-02-10). This means working with files too large for memory is impractical.
- Dask DataFrame/Xarray: Dask is a parallel and large-file library that wraps much of the functionality of Pandas.
  - Pros:
    - Allows for chunked reads from file, so files larger than memory are easy to work with.
    - Can parallelise operations relatively easily (not so relevant for this tool).
  - Cons:
    - High overhead to get started. Dask DataFrames do not support multidimensional arrays properly, meaning you need to use Dask with Xarray.
- Manually write code to read the CDF files etc. relevant to a time window:
  - Pros:
    - No need for resaving as searchable intermediary files.
  - Cons:
    - We're going to need to save intermediary files after pre-processing anyway.
    - Requires writing simpler code, but more of it, so it is more likely to have bugs.
    - Not portable; will require rewriting for new file types.
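
To make the "single column" and "date range" points above concrete, here is a rough sketch of the access pattern being asked for. It uses plain `h5py` rather than Astropy or Dask, which is a swapped-in technique and not one of the options listed. The function names (`write_columns`, `read_window`), the column names and the units are all hypothetical, and timestamps are assumed to be stored as integer seconds because HDF5 has no native datetime type.

```python
import os
import tempfile

import h5py
import numpy as np


def write_columns(path, time_s, columns, units):
    """Save the time axis and each measurement as its own HDF5 dataset."""
    with h5py.File(path, "w") as f:
        f.create_dataset("time", data=time_s)
        for name, values in columns.items():
            # chunks=True lets h5py pick a chunk size for partial reads.
            dset = f.create_dataset(name, data=values, chunks=True)
            dset.attrs["units"] = units[name]


def read_window(path, name, start_s, end_s):
    """Load one column for start_s <= time < end_s (times in epoch seconds)."""
    with h5py.File(path, "r") as f:
        time_s = f["time"][:]  # only the time axis is read in full
        lo, hi = np.searchsorted(time_s, [start_s, end_s])
        # Slicing an h5py dataset reads just that range from disk.
        return time_s[lo:hi], f[name][lo:hi], f[name].attrs["units"]


# Hypothetical hourly data covering January and February 2018.
time = np.arange("2018-01-01", "2018-03-01", dtype="datetime64[h]")
time_s = time.astype("datetime64[s]").astype("int64")
columns = {"bx": np.sin(np.arange(time.size) / 24.0), "by": np.zeros(time.size)}
units = {"bx": "nT", "by": "nT"}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.h5")
    write_columns(path, time_s, columns, units)
    start = np.datetime64("2018-02-01", "s").astype("int64")
    end = np.datetime64("2018-02-10", "s").astype("int64")
    window_time, window_bx, bx_units = read_window(path, "bx", start, end)
```

Only the `bx` values inside the window are read here; `by` is never touched. Whether this layout is worth hand-rolling, versus getting the same behaviour from Dask/Xarray, is exactly the open question.
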
I'll be honest, I think this bit is best left to someone with data engineering experience - e.g. an RSE in a future chunk of work.
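
For whoever picks this up, the current layout described at the top looks roughly like the sketch below. This is a simplified stand-in and not the actual class: the field names, the choice of time and frequency as the two axes, and the `time_window` helper are assumptions for illustration only.

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class DataSet:
    """Hypothetical stand-in: two axis arrays plus a dict of measurements."""

    time: np.ndarray       # first axis variable, assumed sorted
    frequency: np.ndarray  # second axis variable (name is an assumption)
    measurements: dict     # name -> array of shape (len(time), len(frequency))
    units: dict = field(default_factory=dict)

    def time_window(self, start, end):
        """Return a new DataSet restricted to start <= time < end."""
        bounds = np.asarray([start, end], dtype=self.time.dtype)
        lo, hi = np.searchsorted(self.time, bounds)
        return DataSet(
            time=self.time[lo:hi],
            frequency=self.frequency,
            measurements={k: v[lo:hi] for k, v in self.measurements.items()},
            units=dict(self.units),
        )


# Hypothetical hourly data on three frequency channels.
time = np.arange("2018-01-01", "2018-03-01", dtype="datetime64[h]")
frequency = np.array([1.0, 2.0, 4.0])
full = DataSet(
    time=time,
    frequency=frequency,
    measurements={"power": np.zeros((time.size, frequency.size))},
    units={"power": "dB"},
)
subset = full.time_window("2018-02-01", "2018-02-10")
```

Everything here lives in memory, which is the limitation the options above are meant to address.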