I was trying to reuse this blacklist-ingesting code for the cooltools expected CLI tool, here: https://github.com/mirnylab/cooltools/blob/9294dae6dd19794e61bbb50773c1db04fb627398/cooltools/cli/compute_expected.py#L127
Here are some examples of its behavior (desired and undesired):
bed_content = "track=full-of-nonsense\nchr1\t9000000\t10000000\n"
ftmp = "black.tmp"
with open(ftmp, 'w') as fp:
    fp.write(bed_content)
# trying to read/sniff - like in cooler-balance cli:
blacklist = ftmp
import csv
with open(blacklist, 'rt') as f:
    print(csv.Sniffer().has_header(f.read(1024)))
yields True
like it should (I guess)
bed_content = "chr1\t9000000\t10000000\nchr2\t9000000\t10000000\n"
yields False
- like it should!
However, a single-row file, bed_content = "chr1\t9000000\t10000000"
or with the trailing newline, bed_content = "chr1\t9000000\t10000000\n"
- yields True
- which is very much undesired ...
After reading the has_header source code (https://github.com/python/cpython/blob/607b1027fec7b4a1602aab7df57795fbcec1c51b/Lib/csv.py#L383), it becomes apparent that they "sniff" whether a CSV has a header based on several rows: the values in the rows after the first are compared against the first row, and only then do they decide whether the first row was a header. So when there is only 1 row, there is nothing to compare against and everything falls back to the default assumption, which is that the first row is a header...
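To make the single-row fallback concrete, here is a simplified paraphrase of the voting step as I read it. It is a sketch, not the CPython source: the function name first_row_looks_like_header is made up, and it assumes the rows are already split on tabs, so the delimiter sniffing that happens first is skipped entirely.

```python
def first_row_looks_like_header(header, data_rows):
    """Simplified paraphrase of the vote in csv.Sniffer.has_header.

    Hypothetical helper for illustration only; rows are assumed to be
    already split into fields.
    """
    # One slot per column; None means "no data row has been seen yet".
    column_types = {i: None for i in range(len(header))}

    for row in data_rows:
        if len(row) != len(header):
            continue  # rows with a different field count are ignored
        for col in list(column_types):
            try:
                complex(row[col])
                this_type = complex        # value parses as a number
            except (ValueError, OverflowError):
                this_type = len(row[col])  # otherwise remember its length
            if this_type != column_types[col]:
                if column_types[col] is None:
                    column_types[col] = this_type
                else:
                    del column_types[col]  # inconsistent column, drop it

    votes = 0
    for col, col_type in column_types.items():
        if isinstance(col_type, int):
            # Column held strings of a fixed length: compare header length.
            votes += 1 if len(header[col]) != col_type else -1
        else:
            try:
                col_type(header[col])  # with no data rows this is None(...)
            except (ValueError, TypeError):
                votes += 1  # header value does not fit the column type
            else:
                votes -= 1
    return votes > 0


single = ["chr1", "9000000", "10000000"]
print(first_row_looks_like_header(single, []))
print(first_row_looks_like_header(single, [["chr2", "9000000", "10000000"]]))
```

With no data rows every column type stays None, calling it raises TypeError, and each column therefore votes "header". That matches the True seen for the single-row file and the False for the two-row file.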
@nvictus what should we do? Make a special case for BED files with a single row?
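One possible shape for such a special case, as a rough sketch only. The name bed_sample_has_header is hypothetical (nothing like it exists in cooler or cooltools as far as I know), and it assumes that a BED data line has at least 3 tab-separated fields with digit-only start and end. Samples with more than one line are still handed to csv.Sniffer.

```python
import csv


def bed_sample_has_header(sample):
    """Guess whether a BED-like text sample starts with a header line.

    Hypothetical helper: single-line samples are checked against the
    BED record shape instead of being passed to csv.Sniffer.
    """
    lines = [line for line in sample.splitlines() if line.strip()]
    if not lines:
        return False  # empty sample, nothing that could be a header

    if len(lines) == 1:
        fields = lines[0].split("\t")
        # Assumption: chrom, start, end with digit-only start and end.
        looks_like_record = (
            len(fields) >= 3
            and fields[1].isdigit()
            and fields[2].isdigit()
        )
        return not looks_like_record

    # Two or more lines: keep the existing sniffing behavior.
    return csv.Sniffer().has_header(sample)


print(bed_sample_has_header("chr1\t9000000\t10000000\n"))
print(bed_sample_has_header("chrom\tstart\tend\n"))
```

In the CLI this would replace the direct has_header call on f.read(1024). A simpler alternative would be to apply the record-shape check to the first line regardless of how many lines follow and skip the sniffer altogether.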