Description
I have started testing out this approach on some of the Australian Integrated Marine Observing System (IMOS) datasets, which are stored in NetCDF format on S3 (e.g. s3://imos-data/IMOS/SRS/OC/gridded/aqua/P1D/2014/08/A.P1D.20140801T053000Z.aust.K_490.nc).
This dataset utilises gzip compression with the H5Z_FILTER_SHUFFLE filter applied. I have looked over the upstream code dependencies and would appreciate some advice on the pathway forward.
Currently the numcodecs GZip and Zlib codecs do not support a shuffle option. The shuffle code itself is pretty straightforward, but will likely be very slow in Python if not compiled: https://github.com/HDFGroup/hsds/blob/03890edfa735cc77da3bc06f6cf5de5bd40d1e23/hsds/util/storUtil.py#L43
numcodecs uses Cython for compiled code rather than Numba.
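For reference, the HDF5 shuffle filter just regroups the bytes of each fixed-size element by byte position, so a naive NumPy version (purely illustrative; the function names and signatures below are my own, not anything that exists in numcodecs) would look something like:

```python
import numpy as np

def shuffle(buf: bytes, itemsize: int) -> bytes:
    """Byte-shuffle a chunk the way H5Z_FILTER_SHUFFLE does:
    all first bytes of the elements, then all second bytes, and so on."""
    arr = np.frombuffer(buf, dtype=np.uint8)
    nelems = arr.size // itemsize
    return arr.reshape(nelems, itemsize).T.tobytes()

def unshuffle(buf: bytes, itemsize: int) -> bytes:
    """Reverse the shuffle: the buffer holds itemsize groups of
    nelems bytes each, so transpose back to element order."""
    arr = np.frombuffer(buf, dtype=np.uint8)
    nelems = arr.size // itemsize
    return arr.reshape(itemsize, nelems).T.tobytes()
```

Round-tripping holds for any buffer whose length is a multiple of the element size, i.e. `unshuffle(shuffle(data, 4), 4) == data`.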
I am keen to help get this sorted, and one possible way forward could be:
- Raise an issue and PR on numcodecs to implement shuffle/unshuffle as a Codec in C, with a Cython binding to Blosc's shuffle.h (a rough sketch of the Codec interface is below)
- Alter the call to create_dataset at https://github.com/intake/fsspec-reference-maker/blob/main/fsspec_reference_maker/hdf.py#L118 to pass the filters argument as a sequence of numcodecs filters (see the second snippet below)
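To make the first bullet concrete, here is a rough pure-Python sketch of what such a Codec could look like; the class name, codec_id and elementsize parameter are all placeholders, and the real implementation would call Blosc's compiled shuffle rather than NumPy:

```python
import numpy as np
from numcodecs.abc import Codec
from numcodecs.compat import ndarray_copy
from numcodecs.registry import register_codec


class HDF5Shuffle(Codec):
    """Placeholder byte-shuffle codec mirroring H5Z_FILTER_SHUFFLE.
    A real version would delegate to Blosc's compiled shuffle/unshuffle."""

    codec_id = 'hdf5shuffle'  # placeholder id

    def __init__(self, elementsize=4):
        self.elementsize = elementsize

    def encode(self, buf):
        # element-order bytes -> byte-position-order bytes
        arr = np.frombuffer(buf, dtype=np.uint8)
        nelems = arr.size // self.elementsize
        return arr.reshape(nelems, self.elementsize).T.tobytes()

    def decode(self, buf, out=None):
        # byte-position-order bytes -> element-order bytes
        arr = np.frombuffer(buf, dtype=np.uint8)
        nelems = arr.size // self.elementsize
        unshuffled = np.ascontiguousarray(arr.reshape(self.elementsize, nelems).T)
        return ndarray_copy(unshuffled, out)


register_codec(HDF5Shuffle)
```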
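And for the second bullet, the change would be along these lines; the surrounding names (h5obj, compressor, self._zroot) just stand in for whatever hdf.py actually uses around that create_dataset call:

```python
# Illustrative only: pass the shuffle through to zarr as a filter.
filters = []
if h5obj.shuffle:  # h5py exposes the shuffle flag on the dataset
    filters.append(HDF5Shuffle(elementsize=h5obj.dtype.itemsize))

za = self._zroot.create_dataset(
    h5obj.name,
    shape=h5obj.shape,
    dtype=h5obj.dtype,
    chunks=h5obj.chunks,
    compressor=compressor,
    filters=filters or None,  # zarr accepts a sequence of numcodecs codecs here
    overwrite=True,
)
```

Since zarr applies filters before the compressor on write and after it on read, keeping gzip as the compressor and putting the shuffle in filters should mirror the HDF5 pipeline.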
I'd appreciate advice from some of the more experienced devs here (@martindurant, @ajelenak and @rabernat): do you think this is a reasonable way forward?
Thanks for everyone's effort on this!