Description
I have started testing out this approach on some of the Australian Integrated Marine Observing System (IMOS) datasets, which are stored in NetCDF format on S3 (e.g. s3://imos-data/IMOS/SRS/OC/gridded/aqua/P1D/2014/08/A.P1D.20140801T053000Z.aust.K_490.nc).
This dataset utilises gzip compression with the H5Z_FILTER_SHUFFLE filter applied. I have looked over the upstream code dependencies and would appreciate some advice on the pathway forward.
Currently the numcodecs GZip and Zlib codecs do not support a shuffle option. The shuffle code itself is pretty straightforward, but will likely be very slow in Python if not compiled: https://github.com/HDFGroup/hsds/blob/03890edfa735cc77da3bc06f6cf5de5bd40d1e23/hsds/util/storUtil.py#L43
numcodecs uses Cython for compiled code rather than Numba.
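For reference, the HDF5 shuffle filter just regroups the bytes of each fixed-size element by byte position, so a naive NumPy version (purely illustrative; the function names and signatures below are my own, not anything that exists in numcodecs) would look something like:

```python
import numpy as np

def shuffle(buf: bytes, itemsize: int) -> bytes:
    """Byte-shuffle a chunk the way H5Z_FILTER_SHUFFLE does:
    all first bytes of the elements, then all second bytes, and so on."""
    arr = np.frombuffer(buf, dtype=np.uint8)
    nelems = arr.size // itemsize
    return arr.reshape(nelems, itemsize).T.tobytes()

def unshuffle(buf: bytes, itemsize: int) -> bytes:
    """Reverse the shuffle: the buffer holds itemsize groups of
    nelems bytes each, so transpose back to element order."""
    arr = np.frombuffer(buf, dtype=np.uint8)
    nelems = arr.size // itemsize
    return arr.reshape(itemsize, nelems).T.tobytes()
```

Round-tripping holds for any buffer whose length is a multiple of the element size, i.e. `unshuffle(shuffle(data, 4), 4) == data`.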
I am keen to help get this sorted, and one possible way forward could be:
- Raise an issue and PR on numcodecs to implement shuffle/unshuffle as a Codec in C, with a Cython binding to Blosc's shuffle.h (a rough sketch of the Codec interface is below)
- Alter the call to create_dataset at https://github.com/intake/fsspec-reference-maker/blob/main/fsspec_reference_maker/hdf.py#L118 to pass the filters argument as a sequence of numcodecs filters (see the second snippet below)
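To make the first bullet concrete, here is a rough pure-Python sketch of what such a Codec could look like; the class name, codec_id and elementsize parameter are all placeholders, and the real implementation would call Blosc's compiled shuffle rather than NumPy:

```python
import numpy as np
from numcodecs.abc import Codec
from numcodecs.compat import ndarray_copy
from numcodecs.registry import register_codec


class HDF5Shuffle(Codec):
    """Placeholder byte-shuffle codec mirroring H5Z_FILTER_SHUFFLE.
    A real version would delegate to Blosc's compiled shuffle/unshuffle."""

    codec_id = 'hdf5shuffle'  # placeholder id

    def __init__(self, elementsize=4):
        self.elementsize = elementsize

    def encode(self, buf):
        # element-order bytes -> byte-position-order bytes
        arr = np.frombuffer(buf, dtype=np.uint8)
        nelems = arr.size // self.elementsize
        return arr.reshape(nelems, self.elementsize).T.tobytes()

    def decode(self, buf, out=None):
        # byte-position-order bytes -> element-order bytes
        arr = np.frombuffer(buf, dtype=np.uint8)
        nelems = arr.size // self.elementsize
        unshuffled = np.ascontiguousarray(arr.reshape(self.elementsize, nelems).T)
        return ndarray_copy(unshuffled, out)


register_codec(HDF5Shuffle)
```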
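And for the second bullet, the change would be along these lines; the surrounding names (h5obj, compressor, self._zroot) just stand in for whatever hdf.py actually uses around that create_dataset call:

```python
# Illustrative only: pass the shuffle through to zarr as a filter.
filters = []
if h5obj.shuffle:  # h5py exposes the shuffle flag on the dataset
    filters.append(HDF5Shuffle(elementsize=h5obj.dtype.itemsize))

za = self._zroot.create_dataset(
    h5obj.name,
    shape=h5obj.shape,
    dtype=h5obj.dtype,
    chunks=h5obj.chunks,
    compressor=compressor,
    filters=filters or None,  # zarr accepts a sequence of numcodecs codecs here
    overwrite=True,
)
```

Since zarr applies filters before the compressor on write and after it on read, keeping gzip as the compressor and putting the shuffle in filters should mirror the HDF5 pipeline.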
I'd appreciate advice from some of the more experienced devs here (@martindurant, @ajelenak and @rabernat): do you think this is a reasonable way forward?
Thanks for everyone's effort on this!