Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explicitly permit eg N+m in MM tag (PR#799) #799

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

jkbonfield
Copy link
Contributor

The text already states that an unmodified base of N means we count any base type, but base N code N in the table is a little misleading as to the intention. It was intended to mean any unspecified modification, in the same way C+C is any unspecified C mod, but in this case it's against all bases rather than a specific base type.

However that doesn't solve the issue of whether we can record specific mods against any "fundamental" source base. Clarified this by adding an extra line to the table and some text. (However note this doesn't necessarily imply downstream processing tools will not do any compatibility assessment and reject N+m when the SEQ base is a T.)

Fixes #785

Copy link

Changed PDFs as of 03505ac: SAMtags (diff).

@jmarshall
Copy link
Member

Previous sentence says “An unmodified base of ‘N’”; this new sentence says “A fundamental base of ‘N’”. Is there a distinction between unmodified and fundamental (if so, let's describe it)? Otherwise, let's use one or the other.

The text already states that an unmodified base of N means we count
any base type, but base N code N in the table is a little misleading
as to the intention.  It was intended to mean any unspecified
modification, in the same way C+C is any unspecified C mod, but in
this case it's against all bases rather than a specific base type.

However that doesn't solve the issue of whether we can record specific
mods against any "fundamental" source base.  Clarified this by adding
an extra line to the table and some text.  (However note this doesn't
necessarily imply downstream processing tools will not do any
compatibility assessment and reject N+m when the SEQ base is a T.)

Fixes samtools#785
Copy link

Changed PDFs as of e7a4254: SAMtags (diff).

@jkbonfield
Copy link
Contributor Author

I reviewed the change again and decided "fundamental" base is the clearer term.

I note the table below this has unmodified base, but I'm unsure whether that should be changed or not. For the language "C+hm," it is the "C" here we refer to as "fundamental", given unmodified is a confusing term to use when describing a base that is infact modified. However for the table of from-this to-that unmodified may be easier to understand and we're not explicitly referring to a string or serialisation syntax for that table.

I'm not overly happy with the first paragraph of "from the primary unmodified sequence as originally reported by the sequencing instrument", given the sequencing instrument is likely the thing stating this is modified. However typically it does that in stages, recording the unmodified base type in FASTA/FASTQ and a modification as a side-channel. I want to say as recorded in the SAM SEQ column as that forbids things like h, m, etc, but then it gets mired in reverse-complementing confusions.

I think for now it's better, but not perfect. Ideas welcomed, although maybe best dealt with in another PR if it's a big rewrite.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Needs review
Development

Successfully merging this pull request may close these issues.

MM tag preferred format for TAPS data
2 participants