
Refactor DataSet backend #18

@smangham

Currently, a DataSet is two numpy arrays of axis variables plus a dict of individual measurements stored as numpy arrays, along with a few other fields for metadata/units. This is very basic, and was essentially chosen as a 'minimum setup' that would be easy to move away from once the requirements became clearer. There are several better data representations, each with its own pros and cons:

  • Astropy TimeSeries:
    • Pros:
      • These are designed to work with timeseries data, and have functions for manipulating it. They can be saved to and loaded from an HDF5 file, including metadata like units and column names.
    • Cons:
      • TimeSeries are saved as a single HDF5 Dataset within the file, rather than each column being a separate Dataset. This means you can't load single columns.
      • Astropy's HDF5 interface does not support chunked reading - you cannot request a subset of the file (e.g. from date 2018-02-01 to 2018-02-10). This means working with files too large for memory is impractical.
  • Dask DataFrame/XArray: Dask is a library for parallel and larger-than-memory computation that wraps much of the functionality of Pandas.
    • Pros:
      • Allows for chunked reads from file, so files larger than memory are easy to work with.
      • Can parallelise operations relatively easily (not so relevant for this tool)
    • Cons:
      • High overhead to get started. Dask DataFrames do not properly support multidimensional data, meaning you need to use Dask with XArray.
  • Manually write code to read the CDF files etc. relevant to a time window:
    • Pros:
      • No need for resaving as searchable intermediary files.
    • Cons:
      • We're going to need to save intermediary files after pre-processing anyway.
      • Requires writing simpler code, but more of it, so it is more likely to contain bugs.
      • Not portable, will require rewriting for new file types.
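
The two Astropy limitations above (no single-column loads, no reads of a date range) boil down to one access pattern that whichever backend is chosen would need to support. Below is a minimal sketch of that pattern. It uses plain per-column `.npy` files and NumPy memory mapping purely as a stand-in, not as a proposed backend. The function names, the column names `bx`/`by`, and the sorted float time axis are all assumptions for illustration:

```python
import tempfile
from pathlib import Path

import numpy as np


def save_columns(directory, time, columns):
    """Write the time axis and each measurement to its own .npy file."""
    directory = Path(directory)
    np.save(directory / "time.npy", time)
    for name, values in columns.items():
        np.save(directory / f"{name}.npy", values)


def load_window(directory, name, start, stop):
    """Return (time, values) with start <= time < stop for one column only."""
    directory = Path(directory)
    # mmap_mode="r" maps the files instead of reading them fully into memory.
    time = np.load(directory / "time.npy", mmap_mode="r")
    values = np.load(directory / f"{name}.npy", mmap_mode="r")
    # Assumption: the time axis is sorted, so a binary search finds the window.
    lo, hi = np.searchsorted(time, [start, stop])
    # np.array(...) copies only the requested slice out of the mapped files.
    return np.array(time[lo:hi]), np.array(values[lo:hi])


with tempfile.TemporaryDirectory() as tmp:
    time = np.arange(10, dtype=float)  # hypothetical: days since some epoch
    save_columns(tmp, time, {"bx": time * 2.0, "by": time * 3.0})
    # Only the "bx" file is opened; "by" is never touched.
    window_time, window_bx = load_window(tmp, "bx", 2.0, 5.0)
```

Here `load_window` only opens the requested column and copies out the rows inside the window, which is the behaviour the HDF5-via-Astropy route is missing.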

I'll be honest, I think this bit is best left to someone with data engineering experience - e.g. an RSE in a future chunk of work.
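
For whoever picks this up, the current layout described at the top of this issue amounts to roughly the sketch below. This is not the actual class: the name `DataSetSketch`, its field names, and the assumption that every measurement is gridded over both axes are all hypothetical, and are only meant to show what a replacement backend has to hold:

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class DataSetSketch:
    """Hypothetical stand-in for the current DataSet layout."""

    x: np.ndarray  # first axis variable
    y: np.ndarray  # second axis variable
    measurements: dict = field(default_factory=dict)  # name -> numpy array
    units: dict = field(default_factory=dict)  # name -> unit string

    def add(self, name, values, unit=""):
        """Store one measurement, checking it matches both axes."""
        values = np.asarray(values)
        # Assumption: measurements are 2D arrays of shape (len(x), len(y)).
        expected = (self.x.size, self.y.size)
        if values.shape != expected:
            raise ValueError(f"{name}: shape {values.shape} != {expected}")
        self.measurements[name] = values
        self.units[name] = unit


ds = DataSetSketch(x=np.arange(3.0), y=np.arange(2.0))
ds.add("flux", np.ones((3, 2)), unit="Jy")  # "flux" and "Jy" are made up
```

Everything lives in memory at once here, which is why the file-size and chunked-read points above matter for any replacement.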

Labels: enhancement
