Fix reading UTF-8 encoded sample names when char is signed
The trick used in bcf_hdr_parse_sample_line() to rapidly find tabs
and newlines could be defeated by UTF-8 characters outside the
Basic Latin range on platforms where "char" is signed (like x86):
their bytes (0x80 and above) become negative, so they compare as
less than '\n' and are mistaken for delimiters.
It's currently not clear if VCF intends to allow these, but the
4.3 specification does allow UTF-8 and it's easy enough to support.
Fix by casting to unsigned char when making the comparison.

Modifies formatcols.vcf to include a UTF-8 character for a
round-trip test.

Fixes samtools/bcftools#1408
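
To make the failure mode concrete, here is a minimal sketch of the two comparisons in isolation. The function names are hypothetical and not part of htslib, and the parameter is declared `signed char` explicitly so the behaviour is the same on every platform rather than depending on how plain `char` is defined.

```c
#include <stdbool.h>

/* Hypothetical helpers for illustration only; neither exists in htslib. */

/* Mirrors the original test. With a signed char, bytes 0x80..0xFF are
 * negative, so they are NOT greater than '\n' (10) and the scanner stops
 * on them as if they were a tab, newline or NUL. */
static bool keeps_scanning_signed(signed char c)
{
    return c > '\n';
}

/* Mirrors the fixed test. Viewing the same byte as unsigned char maps
 * 0x80..0xFF to 128..255, which is always greater than '\n'. */
static bool keeps_scanning_unsigned(signed char c)
{
    return (unsigned char) c > '\n';
}
```

For example, the UTF-8 lead byte 0xC3 has the value -61 as a signed char: the first function reports it as a stopping point, while the second lets the scan continue past it. Tabs, newlines and the terminating NUL stop the scan in both versions.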
daviesrob committed Feb 17, 2021
1 parent 10a6a8b commit 8127bfc
Showing 2 changed files with 2 additions and 2 deletions.
2 changes: 1 addition & 1 deletion test/formatcols.vcf
@@ -2,5 +2,5 @@
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1>
##FORMAT=<ID=S,Number=1,Type=String,Description="Text">
-#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1 S2 S3
+#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1 S3
1 100 a A T . . . S a bbbbbbb ccccccccc
2 changes: 1 addition & 1 deletion vcf.c
@@ -150,7 +150,7 @@ int HTS_RESULT_USED bcf_hdr_parse_sample_line(bcf_hdr_t *h, const char *str)
const char *p, *q;
// add samples
for (p = q = str;; ++q) {
-if (*q > '\n') continue;
+if ((unsigned char) *q > '\n') continue;
if (++i > 9) {
if ( bcf_hdr_add_sample_len(h, p, q - p) < 0 ) ret = -1;
}
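
The hunk above shows only part of the loop, so the following is a standalone sketch of how such a scan can walk a tab-separated `#CHROM` header line and pick out sample columns. `count_samples` is a hypothetical helper, not an htslib function, and the loop termination (stopping at NUL or newline, then advancing `p`) is an assumption about the part of the function the diff does not show. It counts samples instead of calling `bcf_hdr_add_sample_len()`.

```c
#include <stddef.h>

/* Hypothetical helper, not part of htslib: returns the number of sample
 * names (columns 10 and later) in a tab-separated "#CHROM ..." line. */
static int count_samples(const char *str)
{
    const char *p, *q;
    int i = 0;   /* number of fields completed so far */
    int n = 0;   /* number of sample columns seen */

    for (p = q = str;; ++q) {
        /* Fast path: any byte above '\n' is ordinary field content.
         * The cast keeps UTF-8 bytes (0x80..0xFF) on this path even
         * where plain char is signed. */
        if ((unsigned char) *q > '\n') continue;

        /* A delimiter ends the field [p, q); the first nine fields are
         * the fixed columns CHROM..FORMAT, the rest are samples. */
        if (++i > 9 && q > p) ++n;

        /* Assumed termination: stop at end of string or end of line. */
        if (*q == 0 || *q == '\n') break;
        p = q + 1;
    }
    return n;
}
```

Given a header with the nine fixed columns followed by three sample names, one of which contains a two-byte UTF-8 character, this sketch counts three samples. With the uncast comparison on a signed-char platform, the multi-byte character would instead split that name into pieces.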