Add option to use UMI information from BAM files (MI or RX tag) #14

Open
gringer opened this issue Mar 24, 2024 · 0 comments

gringer commented Mar 24, 2024

With the new PCB111 and PCB114 kits, ONT are using a new strand-switch primer that incorporates a UMI tag. I've now worked out how I can slot that UMI information into a SAM file (onto the RX tag) after mapping. Can this be used by oarfish to improve gene/transcript quantification?

[collating additional detail for readers; I assume the developers are mostly aware of this already]


For clarification on the MI/RX distinction, here are the relevant sections from the SAM optional fields (tags) specification PDF:

  • MI:Z:str - Molecular Identifier. A unique ID within the SAM file for the source molecule from which this
    read is derived. All reads with the same MI tag represent the group of reads derived from the same
    source molecule.
  • RX:Z:sequence+ - Sequence bases from the unique molecular identifier. These could be either corrected or
    uncorrected. Unlike MI, the value may be non-unique in the file. Should be comprised of a sequence of
    bases. In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the
    recommended implementation concatenates all the barcodes with a hyphen (‘-’) between the different
    barcodes. If the bases represent corrected bases, the original sequence can be stored in OX (similar to OQ storing
    the original qualities of bases.)
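To make the tag layout concrete, here is a minimal sketch of pulling MI and RX out of a plain-text SAM alignment line using only the Python standard library. The function names (`sam_tags`, `umi_parts`) and the example record are made up for illustration; this is not how oarfish reads tags, and a real implementation would more likely go through a BAM library.

```python
def sam_tags(line):
    """Return optional fields from one SAM alignment line as {TAG: (type, value)}."""
    fields = line.rstrip("\n").split("\t")
    tags = {}
    for field in fields[11:]:  # optional fields start at column 12
        tag, typ, value = field.split(":", 2)
        tags[tag] = (typ, value)
    return tags


def umi_parts(line):
    """Return the UMI sequence(s) stored in RX, split on the hyphen separator."""
    tags = sam_tags(line)
    if "RX" not in tags:
        return []
    return tags["RX"][1].split("-")


# Hypothetical record: a dual-UMI read with both RX and MI set.
example = "read1\t0\ttx1\t1\t60\t4M\t*\t0\t0\tACGT\t!!!!\tRX:Z:ACGACCGG-GACAAGCG\tMI:Z:42"
print(sam_tags(example)["MI"])  # ('Z', '42')
print(umi_parts(example))       # ['ACGACCGG', 'GACAAGCG']
```

Note that RX carries sequence only, so two reads with the same RX value are not guaranteed to come from the same molecule, whereas two reads with the same MI value are.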

For most UMIs it is common to use the combination of the sequence and the transcript to define a molecule, because the available diversity of UMIs tends to be quite low. As an example, Rhapsody uses 8bp UMIs (4^8 = 65,536 possible UMIs); 10x uses 10bp or 12bp UMIs (4^10 = 1,048,576 or 4^12 = 16,777,216 possible UMIs respectively). For Illumina short read runs with hundreds of millions to billions of reads per sample, the likelihood of a random clash of UMI sequences is quite high, hence the need to combine UMI + transcript to form a molecule identifier.
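A rough way to put numbers on "quite high" is a birthday-problem style estimate. The sketch below assumes UMIs are drawn uniformly at random and independently per molecule, which real libraries only approximate, so treat the output as an order-of-magnitude guide. The function names are mine, not from any existing tool.

```python
import math


def umi_space(length, alphabet_size=4):
    """Number of possible UMIs with `length` variable positions."""
    return alphabet_size ** length


def clash_probability(n_molecules, n_umis):
    """Probability that one given molecule shares its UMI with at least one
    other molecule, assuming uniform, independent UMI assignment."""
    if n_molecules < 2:
        return 0.0
    # 1 - (1 - 1/N)^(n-1), computed with log1p/expm1 for numerical stability
    return -math.expm1((n_molecules - 1) * math.log1p(-1.0 / n_umis))


def expected_distinct(n_molecules, n_umis):
    """Expected number of distinct UMIs seen among n_molecules molecules."""
    return -n_umis * math.expm1(n_molecules * math.log1p(-1.0 / n_umis))


for label, space in [("8bp", umi_space(8)), ("10bp", umi_space(10)), ("12bp", umi_space(12))]:
    for n in (10**6, 10**8):
        print(f"{label}: {space:>10,} UMIs, {n:>11,} molecules, "
              f"P(clash) = {clash_probability(n, space):.3f}")
```

These are whole-sample numbers. Restricting the comparison to molecules from the same transcript shrinks `n_molecules` dramatically, which is why UMI + transcript works as a molecule identifier even with short UMIs.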

The available diversity for UMIs attached to nanopore reads is often a bit higher because adding a few more bases is cheap on the sequencing side of things. For Nanopore's cDNA kits, the SSP UMI format is TTTVVVVTTVVVVTTVVVVTTVVVVTTTmGmGmG, where each V is a variable base (A, C or G, i.e. not T). The 16 V positions give 3^16 (i.e. 43,046,721) possible different UMIs, which means it might be possible to forgo the additional use of transcripts to define molecules for most runs (which typically have millions to tens of millions of reads per sample). That would be of benefit for an initial implementation of this correction in oarfish, because it would only require substituting the UMI for the read ID in the existing workflow, without having to mess around with base-level "is this the same molecule, or different" decisions.
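As an illustration of the "UMI in place of read ID" idea, here is a minimal sketch that pulls the 16 variable bases out of an exact match to the SSP pattern and then groups read IDs by UMI. Everything here is an assumption for illustration: the names (`SSP_UMI`, `variable_bases`, `group_by_umi`) are hypothetical, the matching is exact only (real nanopore reads would need error-tolerant matching), and I'm not assuming anything about how oarfish would actually consume the groups.

```python
import re
from collections import defaultdict

# TTTVVVVTTVVVVTTVVVVTTVVVVTTT, with V = A, C or G (IUPAC "not T")
SSP_UMI = re.compile(r"TTT([ACG]{4})TT([ACG]{4})TT([ACG]{4})TT([ACG]{4})TTT")


def variable_bases(sequence):
    """Return the 16 variable bases of the first exact SSP UMI match, or None."""
    match = SSP_UMI.search(sequence)
    if match is None:
        return None
    return "".join(match.groups())


def group_by_umi(records):
    """Group read IDs by UMI; `records` is an iterable of (read_id, umi) pairs."""
    groups = defaultdict(list)
    for read_id, umi in records:
        groups[umi].append(read_id)
    return dict(groups)


umi_a = "TTT" + "ACGA" + "TT" + "CCGG" + "TT" + "GACA" + "TT" + "AGCG" + "TTT"
umi_b = "TTT" + "GGGA" + "TT" + "CACA" + "TT" + "AGAG" + "TT" + "CCCA" + "TTT"
reads = [("read1", umi_a), ("read2", umi_b), ("read3", umi_a)]

groups = group_by_umi((rid, variable_bases(seq)) for rid, seq in reads)
print(groups)
print("mean reads per UMI:", len(reads) / len(groups))
```

With reads-per-UMI close to 1, as in the runs described below, most groups would hold a single read and the collapse would change very little.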


This is a "nice-to-have", but not essential for me for a few reasons:

  • UMIs are not present on some nanopore cDNA protocols (e.g. direct cDNA ligation sequencing)
  • Tools that detect UMIs in nanopore reads (during demultiplexing / strand orientation) are rare
  • I'm not seeing UMIs being saturated on nanopore runs with abundant RNA and 1-5M reads per sample; the mean number of reads per UMI is close to 1
  • I have no idea if UMI normalisation would make any substantial difference in the oarfish results

I've attached a BAM file with 907 mapped reads with UMIs (in the RX tag) from a mouse immune cell sample (CD11b-hi; Nb+), which may be useful in working out how to implement this.

mm2_RXsubset_BP04_oriented_vs_M34-t.bam.gz
