Description
Geneticists and genomics scientists cannot natively read data from S3 when it is compressed using the most common format for that data type.
Tabular data in genetics and genomics is often compressed using the so-called "tabix" format. This is a block-compressed, gzip-compatible format (BGZF) that allows indexing into a file by genome position. While the file suffix is `.gz`, and `gunzip` can decompress such files locally (as can `pandas.read_csv` with `compression="gzip"`), AWS Data Wrangler cannot fetch these data from S3 (via `awswrangler.s3.read_csv`).
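A minimal sketch of the behavior described above; the paths, bucket name, and separator are hypothetical, and any bgzip-compressed table should show the same pattern:

```python
import awswrangler as wr
import pandas as pd

# Hypothetical paths; any bgzip/tabix-compressed table behaves the same way.
LOCAL = "variants.tsv.gz"                    # produced by `bgzip variants.tsv`
REMOTE = "s3://my-bucket/variants.tsv.gz"    # the same file uploaded to S3

# Works: pandas' gzip codec reads BGZF from local disk.
df_local = pd.read_csv(LOCAL, sep="\t", compression="gzip")

# Does not work: fetching the same file through AWS Data Wrangler
# fails to decode the BGZF stream correctly.
df_remote = wr.s3.read_csv(REMOTE, sep="\t", compression="gzip")
```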
What I'd like to see
I'd like to be able to get data via `awswrangler.s3.read_csv(S3_uri, compression="bgzip")`. While `gunzip` will decompress the file on local storage, `bgzip -d` is actually the preferred method. I believe that subtle differences between `gzip` and `bgzip` output corrupt the reading of the data from S3.
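For context, the difference is visible in the file header: `bgzip` writes standard gzip members that each carry an extra "BC" subfield recording the compressed block size. A rough check, assuming the single-subfield layout that htslib writes (the file path is hypothetical):

```python
def is_bgzf(path: str) -> bool:
    """Return True if the file starts with a BGZF (bgzip) block header."""
    with open(path, "rb") as fh:
        header = fh.read(18)
    return (
        len(header) == 18
        and header[:2] == b"\x1f\x8b"   # gzip magic bytes
        and header[3] & 0x04 != 0       # FEXTRA flag set
        and header[12:14] == b"BC"      # BGZF "BC" extra subfield
    )

# Example: is_bgzf("variants.tsv.gz") -> True for bgzip output,
# False for plain `gzip` output.
```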
Alternatives
Transferring a file to local storage (`aws s3 cp <uri> ./`) and then using the pandas `read_csv` function works, but involves an extra copy step.
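If copying to disk is undesirable, a possible in-memory alternative is to fetch the object with boto3 and let Python's `gzip` module decompress it, since a BGZF file is a sequence of concatenated gzip members. A sketch, assuming a hypothetical bucket and key and an object small enough to fit in memory:

```python
import gzip
import io

import boto3
import pandas as pd

# Hypothetical bucket/key; avoids the intermediate copy to local disk.
s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-bucket", Key="variants.tsv.gz")

# BGZF is a series of concatenated gzip members, which Python's gzip
# module decompresses transparently when reading sequentially.
with gzip.open(io.BytesIO(obj["Body"].read()), mode="rb") as fh:
    df = pd.read_csv(fh, sep="\t")
```

This handles sequential reads but gives up the random access by genome position that the tabix index provides.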
It is likely that the genomics community's use of `bgzip` for tabix-indexed files is idiosyncratic, but there is a large and growing number of users in this space. Adding support for this compression method would benefit the genetics field and the biopharma industry.