csv sniffer fails, when given only a single line of BED-like input #196

@sergpolly

https://github.com/mirnylab/cooler/blob/c9c718fdebccbda41ad10c47f700853f79ee3cd3/cooler/cli/balance.py#L181

I was trying to reuse this blacklist-ingesting code for the cooltools expected CLI tool, here: https://github.com/mirnylab/cooltools/blob/9294dae6dd19794e61bbb50773c1db04fb627398/cooltools/cli/compute_expected.py#L127

Here are some examples of its behavior (desired and undesired):

bed_content = "track=full-of-nonsense\nchr1\t9000000\t10000000\n"
ftmp = "black.tmp"
with open(ftmp,'w') as fp:
    fp.write(bed_content)

# try to read/sniff the file the same way the cooler-balance CLI does:
import csv
blacklist = ftmp
with open(blacklist, 'rt') as f:
    print(csv.Sniffer().has_header(f.read(1024)))

This yields True, as it should (I guess).

bed_content = "chr1\t9000000\t10000000\nchr2\t9000000\t10000000\n" yields False, as it should!

However, bed_content = "chr1\t9000000\t10000000", or the same with a trailing newline, bed_content = "chr1\t9000000\t10000000\n", yields True, which is very much undesired.

After reading the has_header source code (https://github.com/python/cpython/blob/607b1027fec7b4a1602aab7df57795fbcec1c51b/Lib/csv.py#L383), it becomes apparent that the sniffer decides whether a CSV has a header by looking at several rows: it infers a type (or string length) for each column from the rows after the first, then checks whether the first row fits that pattern. When there is only one row, there is nothing to compare against, so everything falls back to the default assumption, which is that the first row is a header.

@nvictus what should we do? Make a special case for a BED file with a single row?
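One possible shape for such a special case, as a rough sketch only: skip csv.Sniffer entirely and treat the first non-blank line as a header unless it looks like a BED record (at least three whitespace-separated fields, with integer start and end). The name `bed_sample_has_header` is hypothetical and is not part of cooler or cooltools; the rule itself is an assumption about what counts as "BED-like" here.

```python
def bed_sample_has_header(sample):
    """Guess whether a BED-like text sample starts with a header line.

    Hypothetical helper (not an existing cooler/cooltools function).
    Only the first non-blank line is inspected, so a single-row file
    is handled the same way as a multi-row one.
    """
    for line in sample.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) < 3:
            return True  # too few columns for a BED record
        try:
            # columns 2 and 3 of a BED record are integer coordinates
            int(fields[1])
            int(fields[2])
        except ValueError:
            return True
        return False
    return False  # empty sample: nothing to treat as a header


# The cases from this issue:
print(bed_sample_has_header("track=full-of-nonsense\nchr1\t9000000\t10000000\n"))  # True
print(bed_sample_has_header("chr1\t9000000\t10000000\nchr2\t9000000\t10000000\n"))  # False
print(bed_sample_has_header("chr1\t9000000\t10000000"))  # False
print(bed_sample_has_header("chr1\t9000000\t10000000\n"))  # False
```

It could be called in place of the sniffer, e.g. `bed_sample_has_header(f.read(1024))`. The trade-off is that it only works for BED-like input, whereas the sniffer is format-agnostic.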
