High PolyA content in R2 reads after trimming. #180

AmrSaadeldin · 2023-11-19T20:58:17Z

Hello

I am working with paired-end human whole-genome bisulfite sequencing data. After using Trim Galore to remove adapters, I encountered an issue with the FastQC report for R2 reads across all samples, which indicates a high PolyA content. This is unexpected since Trim Galore usually removes PolyA sequences. My primary concern is whether to proceed with mapping, given that the R1 reads are fine and have passed all FastQC tests. Additionally, both R1 and R2 reads in all samples show no overrepresented sequences and meet the other FastQC criteria, except for the adapter content in R2 reads. I am seeking advice on how to address the PolyA content in R2 reads and whether it is advisable to move forward with the current data.

Below are the images before and after trimming.

The first image: This diagram shows the adapter content before trimming across the samples. Each line represents the adapter content for a specific sample. The blue lines indicate the Illumina adapters in the R1 and R2 samples, while the orange line represents the polyA content in one of the R1 samples. The remaining lines, colored red and light blue, correspond to the polyA content in all R2 samples.

The second image: This is the FastQC report showing adapter content after trimming with Trim Galore. Every line in this report represents polyA content. The orange-red line at the bottom shows the polyA content for one of the R1 samples. All other lines correspond to the polyA content in the R2 samples.

FelixKrueger · 2023-11-20T11:20:09Z

Hi @AmrSaadeldin ,

Thanks for the details. Here is my initial assessment of the situation:

- In its default mode, Trim Galore looks for adapter contamination, which has obviously worked as expected.
- It does not perform PolyA removal as a matter of course.
- The PolyA signal appears right at the very start of the sequences and continues to increase in a linear fashion with the read length. There are at least two possible scenarios:
  1. You really do see reads that are complete repetitions of A from start to end (which would probably be some kind of technical artefact?): these reads will almost certainly not align uniquely in the genome, so they would effectively get filtered out during the alignment step.
  2. I am not exactly sure how long a stretch has to be to count as poly-A in FastQC, but assuming your read length was 150 bp and looking at the plot, I would assume 10-12 bp. There are a number of positions in the genome that are a stretch of 10-12 As in a row, and if these were enriched for some reason you would see the PolyA value creep up (which might not really be poly-A in this case). If it is these genomic regions that are affected, there is a chance that the 150 bp sequences would map just fine.
- For peace of mind you could run a second round of trimming, like so (from the Trim Galore help text):
  A single base may also be given as e.g. -a A{10}, to be expanded to -a AAAAAAAAAA

Either way, I don't think the results would be very different.
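One way to tell the two scenarios apart before deciding on a second trimming round is to count how many R2 reads are A from start to end versus how many merely contain a run of As. The following is a minimal sketch, not part of Trim Galore or FastQC: the function names are hypothetical, and the 10 bp threshold is an assumption taken from the 10-12 bp guess above (FastQC's actual definition may differ).

```python
import gzip
import re

# Assumption: a "poly-A stretch" is 10 or more consecutive As.
# This mirrors the 10-12 bp guess above, not a documented FastQC value.
MIN_RUN = 10


def classify_read(seq, min_run=MIN_RUN):
    """Return 'all_A', 'has_run' or 'clean' for one read sequence."""
    seq = seq.upper()
    if seq and set(seq) <= {"A"}:
        return "all_A"      # scenario 1: the read is A from start to end
    if re.search("A{%d,}" % min_run, seq):
        return "has_run"    # scenario 2: a stretch of As inside the read
    return "clean"


def summarise(seqs, min_run=MIN_RUN):
    """Count reads per category for any iterable of sequences."""
    counts = {"all_A": 0, "has_run": 0, "clean": 0}
    for seq in seqs:
        counts[classify_read(seq, min_run)] += 1
    return counts


def fastq_sequences(path):
    """Yield the sequence line of each record in a plain or gzipped FASTQ file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for index, line in enumerate(handle):
            if index % 4 == 1:  # line 2 of every 4-line FASTQ record
                yield line.strip()
```

For a trimmed R2 file this could be called as `summarise(fastq_sequences("sample_R2_val_2.fq.gz"))`, where the file name is only a placeholder. A large `all_A` count would point towards the artefact scenario, while a large `has_run` count would point towards genomic A stretches.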

AmrSaadeldin · 2023-11-21T20:55:11Z

Hi @FelixKrueger, thank you so much for your help and your detailed observations.

Based on your insights, I am now contemplating whether to conduct another round of trimming using the A{10} parameter. However, I'm concerned that this might introduce bias in the data. Considering this, my inclination is towards proceeding directly with the mapping phase. I suspect that the sequences might either not map uniquely or not map at all, which, as you mentioned, could be due to technical artifacts or genomic stretches of A's.

Given these possibilities, do you think proceeding directly to mapping, without an additional trimming step, is a sound approach for our downstream analysis? Thank you again.

FelixKrueger · 2023-11-22T13:32:36Z

My gut feeling is that you should be fine to proceed as-is, but for your own peace of mind I would potentially run a test (maybe just on a single sample?) in parallel. If you can convince yourself that the effects are either undetectable or negligibly small, you should be well prepared to answer any questions in that direction. In theory, Read 2 is the read where the methylation state is encoded by G/A (and not C/T as for Read 1), so if there is some sort of technical bias that makes it through to the uniquely mapped stage (which I doubt), you would expect some more unmethylated calls at these positions.
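The single-sample test could be summarised along the following lines. This is only a sketch under stated assumptions: both runs are assumed to have already been reduced to a mapping of position to (methylated, unmethylated) call counts, and that input structure and the function names are hypothetical rather than the output format of any particular tool.

```python
def methylation_level(calls):
    """Overall fraction of methylated calls.

    calls: iterable of (methylated_count, unmethylated_count) pairs.
    """
    calls = list(calls)
    methylated = sum(m for m, _ in calls)
    unmethylated = sum(u for _, u in calls)
    total = methylated + unmethylated
    return methylated / total if total else float("nan")


def compare_runs(original, retrimmed):
    """Compare two runs over the positions covered in both.

    original, retrimmed: dicts mapping a position key, e.g. (chrom, pos),
    to a (methylated_count, unmethylated_count) tuple.
    Returns (level_original, level_retrimmed, difference).
    """
    shared = original.keys() & retrimmed.keys()
    level_a = methylation_level(original[p] for p in shared)
    level_b = methylation_level(retrimmed[p] for p in shared)
    return level_a, level_b, level_b - level_a
```

If poly-A artefacts in Read 2 were inflating unmethylated calls, the run with the extra trimming step would be expected to show a slightly higher methylation level. A difference close to zero would support proceeding without the second round.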

FelixKrueger self-assigned this Nov 20, 2023
