From 229e998b93fe168dc107095d81b9a31ccc75483d Mon Sep 17 00:00:00 2001
From: John Marshall
Date: Thu, 4 May 2023 20:49:40 +1200
Subject: [PATCH] Allow for UTF-8 field values in header regular expression

Use `[:print:]` in the header regex and note that for ASCII it is
equivalent to `[ -~]` and that the aim is to forbid control characters.
Fixes #719.
---
 SAMv1.tex | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/SAMv1.tex b/SAMv1.tex
index 97b8e74c3..13eb5824c 100644
--- a/SAMv1.tex
+++ b/SAMv1.tex
@@ -69,6 +69,7 @@ \section{The SAM Format Specification}
 Unless explicitly specified elsewhere, all fields are encoded using 7-bit US-ASCII
 \footnote{Charset ANSI\_X3.4-1968 as defined in RFC1345.} in using the POSIX / C locale.
 Regular expressions listed use the POSIX / IEEE Std 1003.1 extended syntax.
+For brevity, named character classes are written as~{\tt [\cclass{class}]} without an additional pair of brackets.

 \subsection{An example}\label{sec:example}
 Suppose we have the following alignment with bases in lowercase
@@ -215,8 +216,10 @@ \subsection{The header section}
 each data field follows a format `{\tt TAG:VALUE}' where {\tt TAG} is a
 two-character string that defines the format and content of {\tt
 VALUE}. Thus header lines match {\tt
- /\char94@(HD|SQ|RG|PG)(\char92t[A-Za-z][A-Za-z0-9]:[
- -\char126]+)+\$/} or {\tt /\char94@CO\char92t.*/}.
+ /\char94@(HD|SQ|RG|PG)(\char92t[A-Za-z][A-Za-z0-9]:[\cclass{print}]+)+\$/}
+ or {\tt /\char94@CO\char92t.*/}.%
+\footnote{{\tt [\cclass{print}]} indicates that header field values contain printable characters, i.e.,~non-control characters.
+For fields limited to~ASCII, which is the majority, this is equivalent to~{\tt [ -\char126]}.}
 Within each (non-{\tt @CO}) header line, no field tag may appear more
 than once and the order in which the fields appear is not significant.
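
As a rough illustration of the header regular expression this patch changes, the Python sketch below checks a header line against both the previous ASCII-only class `[ -~]` and an approximation of the printable class. Python's `re` module has no POSIX named classes, so the class `[^\x00-\x1f\x7f-\x9f]` is an assumption standing in for `[[:print:]]` on decoded UTF-8 text, and it may not agree with every locale's definition of printable. The function name `is_valid_header_line` and its `ascii_only` parameter are hypothetical and not part of the specification.

```python
import re

# Field tag: a letter followed by a letter or digit, then a colon.
_TAG = r"[A-Za-z][A-Za-z0-9]:"

# Previous wording: field values restricted to printable ASCII, [ -~].
ASCII_HEADER = re.compile(r"@(HD|SQ|RG|PG)(\t" + _TAG + r"[ -~]+)+")

# Assumed stand-in for [[:print:]] on decoded text: any character
# except C0 controls (including TAB), DEL, and C1 controls.
PRINT_HEADER = re.compile(
    r"@(HD|SQ|RG|PG)(\t" + _TAG + r"[^\x00-\x1f\x7f-\x9f]+)+"
)

# Comment lines are only required to start with "@CO<TAB>".
COMMENT = re.compile(r"@CO\t.*")


def is_valid_header_line(line, ascii_only=False):
    """Return True if `line` (without its newline) matches a header pattern."""
    pattern = ASCII_HEADER if ascii_only else PRINT_HEADER
    # fullmatch plays the role of the ^...$ anchors in the spec's regex.
    if pattern.fullmatch(line):
        return True
    # The @CO pattern in the spec has no trailing anchor, so a prefix match suffices.
    return COMMENT.match(line) is not None
```

With this sketch, a line such as `"@RG\tID:x\tDS:café"` is accepted by default but rejected when `ascii_only=True`, while a value containing a control character is rejected in both modes.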