Add option to use UMI information from BAM files (MI or RX tag) #14

gringer · 2024-03-24T01:56:13Z

With the new PCB111 and PCB114 kits, ONT are using a new strand-switch primer that incorporates a UMI tag. I've now worked out how I can slot that UMI information into a SAM file (onto the RX tag) after mapping. Can this be used by oarfish to improve gene/transcript quantification?

[collating additional detail for readers; I assume the developers are mostly aware of this already]

For clarification on the MI/RX distinction, here are the relevant entries from the SAM optional fields (tags) specification PDF:

MI:Z:str - Molecular Identifier. A unique ID within the SAM file for the source molecule from which this read is derived. All reads with the same MI tag represent the group of reads derived from the same source molecule.

RX:Z:sequence+ - Sequence bases from the unique molecular identifier. These could be either corrected or uncorrected. Unlike MI, the value may be non-unique in the file. Should be comprised of a sequence of bases. In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the recommended implementation concatenates all the barcodes with a hyphen (‘-’) between the different barcodes. If the bases represent corrected bases, the original sequence can be stored in OX (similar to OQ storing the original qualities of bases.)
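To make the tag layout concrete, here is a minimal sketch of pulling MI/RX values out of a single SAM alignment line using only the Python standard library. The function names (`sam_tags`, `umi_parts`) and the example read are hypothetical, made up for illustration; this is not how oarfish reads its input.

```python
def sam_tags(line: str) -> dict:
    """Return the optional TAG:TYPE:VALUE fields of one SAM alignment line as {TAG: VALUE}."""
    fields = line.rstrip("\n").split("\t")
    tags = {}
    for field in fields[11:]:  # the first 11 columns of a SAM record are mandatory
        # maxsplit=2 keeps any ':' characters inside the value intact
        tag, _type, value = field.split(":", 2)
        tags[tag] = value
    return tags


def umi_parts(tags: dict) -> list:
    """Split an RX value on '-' (the recommended separator for multiple UMIs); [] if RX is absent."""
    rx = tags.get("RX")
    return rx.split("-") if rx else []


# Hypothetical record with two hyphen-joined UMIs in the RX tag
example = "read1\t0\ttx1\t1\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0\tRX:Z:ACGG-CCAG"
print(umi_parts(sam_tags(example)))
```

For BAM input, a library such as pysam or noodles would do the record decoding; the tag semantics stay the same.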

For most UMI protocols it is common to define a molecule by the combination of UMI sequence and transcript, because the available diversity of UMIs tends to be quite low. As an example, Rhapsody uses 8bp UMIs (4^8 = 65,536 possible UMIs); 10x uses 10-12bp UMIs (4^10 = 1,048,576 and 4^12 = 16,777,216 possible UMIs respectively). For Illumina short read runs with hundreds of millions to billions of reads per sample, the likelihood of a random clash of UMI sequences is quite high, hence the need to combine UMI + transcript to form a molecule identifier.
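The diversity argument can be checked with a quick back-of-envelope calculation. The sketch below assumes every UMI is equally likely and ignores sequencing errors and PCR duplicates, which is a simplification; the function names and the molecule counts in the loop are illustrative, not measurements.

```python
import math


def umi_space(length: int, alphabet_size: int = 4) -> int:
    """Number of distinct UMIs of a given length over a given alphabet."""
    return alphabet_size ** length


def clash_fraction(n_molecules: int, n_umis: int) -> float:
    """Expected fraction of molecules that share a UMI with at least one other
    molecule, assuming UMIs are drawn uniformly at random."""
    if n_molecules < 2:
        return 0.0
    if n_umis <= 1:
        return 1.0
    # 1 - (1 - 1/N)^(n-1), evaluated with log1p/expm1 for numerical stability
    return -math.expm1((n_molecules - 1) * math.log1p(-1.0 / n_umis))


# Illustrative molecule counts only
for label, space, n in [
    ("8bp, 4 bases", umi_space(8), 100_000_000),
    ("12bp, 4 bases", umi_space(12), 100_000_000),
    ("16 variable bases, 3 bases", umi_space(16, 3), 1_000_000),
]:
    print(f"{label}: {space:,} UMIs, clash fraction {clash_fraction(n, space):.3f}")
```

Restricting the count to molecules from a single transcript shrinks `n_molecules` considerably, which is why pairing UMI with transcript rescues low-diversity designs.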

The available diversity for UMIs attached to nanopore reads is often a bit higher, because adding a few more bases is cheap on the sequencing side of things. For Nanopore's cDNA kits, the SSP UMI format is TTTVVVVTTVVVVTTVVVVTTVVVVTTTmGmGmG, where each V position (A, C or G) provides the diversity. With 16 V positions this gives 3^16 (i.e. 43,046,721) possible different UMIs, which means it might be possible to forgo the additional use of transcripts to define molecules for most runs (which typically have millions to tens of millions of reads per sample). That would benefit an initial implementation of this correction in oarfish, because it would only require substituting the UMI for the read ID in the existing workflow, without having to mess around with base-level "is this the same molecule, or different" decisions.
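As a rough illustration of the "UMI instead of read ID" idea, the sketch below pulls the 16 variable bases out of a sequence matching the SSP pattern and then keys reads on UMI where one is available. It is one possible reading of the proposal, not oarfish code: the function names and record layout are hypothetical, and it assumes exact pattern matches with no error tolerance. Whether an RX tag would hold the full spacer-containing sequence or only the variable bases depends on how it was written.

```python
import re
from collections import defaultdict

# TTT VVVV TT VVVV TT VVVV TT VVVV TTT, where V is A, C or G (IUPAC)
SSP_UMI = re.compile(r"TTT([ACG]{4})TT([ACG]{4})TT([ACG]{4})TT([ACG]{4})TTT")


def variable_bases(seq: str):
    """Return the 16 variable UMI bases if seq contains the SSP pattern, else None."""
    match = SSP_UMI.search(seq)
    return "".join(match.groups()) if match else None


def collapse_by_umi(records):
    """Group read IDs into putative molecules.

    records: iterable of (read_id, umi_or_None) pairs.
    Reads with a UMI are keyed on the UMI; reads without one stay separate.
    """
    molecules = defaultdict(list)
    for read_id, umi in records:
        key = ("umi", umi) if umi else ("read", read_id)
        molecules[key].append(read_id)
    return dict(molecules)


# Hypothetical reads: r1 and r2 carry the same UMI, r3 has none
umi = variable_bases("AAATTTACGGTTCAGCTTGGACTTCCAGTTTGGG")
molecules = collapse_by_umi([("r1", umi), ("r2", umi), ("r3", None)])
print(len(molecules), "molecules from 3 reads")
```

Keying reads without a UMI on their read ID keeps them in the analysis, so protocols that lack UMIs would behave exactly as they do now.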

This is a "nice-to-have", but not essential for me for a few reasons:

- UMIs are not present in some nanopore cDNA protocols (e.g. direct cDNA ligation sequencing)
- Tools that detect UMIs in nanopore reads (during demultiplexing / strand orientation) are rare
- I'm not seeing UMIs being saturated on nanopore runs with abundant RNA and 1-5M reads per sample; the mean number of reads per UMI is close to 1
- I have no idea if UMI normalisation would make any substantial difference in the oarfish results

I've attached a BAM file with 907 mapped reads with UMIs (in the RX tag) from a mouse immune cell sample (CD11b-hi; Nb+), which may be useful in working out how to implement this.

mm2_RXsubset_BP04_oriented_vs_M34-t.bam.gz
