Method
The proposed method was developed into a framework for three kinds of libraries:
- single end libraries, UMI in Read1.
- paired end libraries, UMI in Read1.
- paired end libraries, UMI in Read1 and Read2.
Each case follows its own version of the basic workflow described below. The cases differ in how the reads are separated and how the sequences are grouped. All of them, though, implement the same proposed read correction and UMI merging methods, described at the end of the page.
In the case of:
- single end libraries, each read is separated into two parts, UMI and sequence. The sequences are grouped by the UMI.
- paired end libraries and UMI in Read1, each read of the Read1 file is separated into two parts, UMI and sequence, while each read of the Read2 file remains intact, as it contains only the sequence. The sequences of the Read1 file are grouped by the UMI and their IDs are used to find their corresponding Read2 sequences.
- paired end libraries and UMI in both Read1 and Read2, each read of both files is separated into two parts, UMI and sequence. A new combined UMI12 is constructed by combining the two UMIs of each sequence. The sequences of the Read1 file are grouped by the UMI12 and their IDs are used to find their corresponding Read2 sequences.
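The three cases above can be sketched in a few lines. This is a minimal Python illustration, not the framework's own code. The names `split_read` and `group_by_umi` are hypothetical, and the sketch assumes that the UMI is a fixed-length prefix of the read and that UMI12 is the Read1 UMI followed by the Read2 UMI, neither of which is stated on this page.

```python
from collections import defaultdict


def split_read(read, umi_length):
    """Split a read into (UMI, sequence).

    Assumption: the UMI is the first `umi_length` bases of the read.
    """
    return read[:umi_length], read[umi_length:]


def group_by_umi(read1, umi_length, read2=None, umi_in_read2=False):
    """Group reads by UMI (or by the combined UMI12).

    read1, read2: dicts mapping read ID -> read string.
    Returns a dict mapping UMI -> list of (read_id, seq1, seq2_or_None).
    """
    groups = defaultdict(list)
    for read_id, raw1 in read1.items():
        umi, seq1 = split_read(raw1, umi_length)
        seq2 = None
        if read2 is not None:
            # The Read1 ID is used to find the corresponding Read2 read.
            seq2 = read2[read_id]
            if umi_in_read2:
                umi2, seq2 = split_read(seq2, umi_length)
                # Assumption: UMI12 is the Read1 UMI followed by the Read2 UMI.
                umi = umi + umi2
        groups[umi].append((read_id, seq1, seq2))
    return dict(groups)
```

Calling `group_by_umi(read1, 3)` covers the single end case, passing `read2` covers paired end libraries with the UMI in Read1, and adding `umi_in_read2=True` covers the UMI12 case.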
The framework takes as input fastq files and generates new fastq files, containing the corrected sequences.
The first phase is the automated reading of the fastq files in a working directory. The input files must have been prepared with the same library preparation protocol and must satisfy the input parameters described on the page Running the project.
For each library, the workflow consists of four main steps:
1. Data cleaning, by keeping only the UMIs that fulfil the minimum reads per UMI condition, set by the user.
2. Initial read correction of the sequences with the same UMI, using the method described below.
3. UMI merging, taking into account both the distance of the UMIs and the distance of the sequences generated by the second step.
4. Final read correction of the sequences that belong to the same group of merged UMIs, as created by the third step.
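As an illustration of the data cleaning step, the following Python sketch keeps only the UMIs that reach a minimum number of reads and returns the read counts that the UMI merging step later needs. The function name `clean_umis` and the dictionary layout are hypothetical, and "minimum reads per UMI" is read here as "at least `min_reads` reads".

```python
def clean_umis(groups, min_reads):
    """Drop UMIs supported by fewer than `min_reads` reads.

    groups: dict mapping UMI -> list of reads carrying that UMI.
    Returns (kept_groups, kept_counts), where kept_counts maps each
    surviving UMI to its number of reads.
    """
    counts = {umi: len(reads) for umi, reads in groups.items()}
    kept = {umi: reads for umi, reads in groups.items() if counts[umi] >= min_reads}
    kept_counts = {umi: counts[umi] for umi in kept}
    return kept, kept_counts
```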
The proposed method for read correction works at the nucleotide level: it estimates the base at each position of the sequence individually. For each position, it can be broken down into the following steps:
1. Calculation of the frequency and the mean quality of each base.
2. Setting of a criterion for each base, defined as the mean of the two values calculated in step 1.
3. Selection of the base whose criterion has the max value.
4. In case of a draw between the bases' criteria, selection of the one with the max quality value.
5. Setting of the new quality to the selected base's mean quality, as calculated in step 1.
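The steps above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the framework's implementation. The page does not say on which scales the frequency and the quality are averaged, so the sketch assumes the frequency is the percentage of reads showing the base (0 to 100) and the quality is the Phred score. It also assumes that a draw is broken by the higher mean quality. The function name `correct_reads` is hypothetical.

```python
from collections import defaultdict


def correct_reads(seqs, quals):
    """Build one corrected read from reads that share a UMI.

    seqs:  list of equal-length sequence strings.
    quals: list of equal-length lists of Phred scores, one list per read.
    Returns (corrected_sequence, corrected_qualities).
    """
    n_reads = len(seqs)
    out_bases, out_quals = [], []
    for pos in range(len(seqs[0])):
        # Collect the quality scores observed for each base at this position.
        per_base = defaultdict(list)
        for seq, qual in zip(seqs, quals):
            per_base[seq[pos]].append(qual[pos])

        best = None
        for base, scores in per_base.items():
            freq = 100.0 * len(scores) / n_reads  # assumption: percent of reads
            mean_q = sum(scores) / len(scores)    # mean Phred score of this base
            criterion = (freq + mean_q) / 2       # mean of the two values
            # Assumption: a draw on the criterion goes to the higher mean quality.
            key = (criterion, mean_q)
            if best is None or key > best[0]:
                best = (key, base, mean_q)

        out_bases.append(best[1])
        out_quals.append(best[2])  # new quality = mean quality of the chosen base
    return "".join(out_bases), out_quals
```

With a different scaling of the two values the selected base can change, so the scaling should be treated as a parameter of the sketch rather than a property of the method.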
The proposed method for UMI merging takes into account both the distance of the UMIs and the distance of the sequences. The distances are Hamming distances, calculated using the function Biostrings::stringDist.
It takes as input the sequences generated by the first read correction. As a result, each unique UMI (or UMI12 in case of a UMI in both Read1 and Read2) corresponds to one Read1 sequence (and one Read2 sequence, in case of paired end libraries). It also takes as input the read counts of each unique UMI, calculated in the initial data cleaning step.
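For readers unfamiliar with the measure, the Hamming distance between two strings of equal length is the number of positions at which they differ. The Python function below is only an illustrative stand-in for what the R function Biostrings::stringDist computes for a single pair of strings; the name `hamming` is hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))
```

For example, `hamming("ACGT", "ACGA")` counts one mismatch, at the last position.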
The method consists of the following steps:
1. Finding the UMI with the max number of counts and its corresponding Read1 and Read2 sequences.
2. Calculation of its base distance with the next UMI in line (in case of UMI12, it is separated into the two initial UMIs, which are checked individually).
3. If it fulfils the max UMI distance criterion, set by the user, it moves on to step 4. Otherwise, it goes back to step 2 and if there are no UMIs left, moves on to step 8.
4. Finding the corresponding Read1 and Read2 sequences of the tested UMI.
5. Calculation of their base distances with the max counts UMI Read1 and Read2 sequences.
6. If both of them fulfil the max sequence distance criterion, set by the user, it moves on to step 7. Otherwise, it goes back to step 2 and if there are no UMIs left, moves on to step 8.
7. Merging of the two UMIs, using the symbol "|". The UMI with the max read counts is considered the correct UMI.
8. Removal of the merged UMIs from the list of unique UMIs and return to step 1. If there are no UMIs left, the UMI merging procedure ends.
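The merging procedure above can be sketched as a greedy loop. The following Python code is one possible reading of the steps, not the framework's implementation. It assumes that all remaining UMIs are tested against the max counts UMI before the group is removed, that the max counts UMI is written first in the merged label, and that all UMIs and all sequences have equal lengths. The names `merge_umis`, `hamming` and `split_at` are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def merge_umis(counts, seqs, max_umi_dist, max_seq_dist, split_at=None):
    """Greedily merge UMIs that are close in both UMI and sequence.

    counts:   dict mapping UMI -> read count.
    seqs:     dict mapping UMI -> (read1_seq, read2_seq or None).
    split_at: length of the first UMI inside a UMI12, or None for a single UMI.
    Returns a list of (merged_label, member_umis), most abundant UMI first.
    """
    def umi_close(a, b):
        if split_at is None:
            return hamming(a, b) <= max_umi_dist
        # UMI12: check the two initial UMIs individually.
        return (hamming(a[:split_at], b[:split_at]) <= max_umi_dist
                and hamming(a[split_at:], b[split_at:]) <= max_umi_dist)

    def seqs_close(a, b):
        # Both Read1 and (if present) Read2 must satisfy the criterion.
        return all(hamming(s, t) <= max_seq_dist
                   for s, t in zip(seqs[a], seqs[b])
                   if s is not None and t is not None)

    remaining = sorted(counts, key=lambda umi: -counts[umi])
    groups = []
    while remaining:
        top = remaining[0]  # UMI with the max number of counts
        members = [top]
        for candidate in remaining[1:]:
            if umi_close(top, candidate) and seqs_close(top, candidate):
                members.append(candidate)
        # Assumption: the max counts UMI comes first in the "|" label.
        groups.append(("|".join(members), members))
        remaining = [umi for umi in remaining if umi not in members]
    return groups
```

Each returned group then feeds the final read correction, with the first member of the group treated as the correct UMI.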