-
Notifications
You must be signed in to change notification settings - Fork 90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bad reads in tiling path #42
Comments
Does these reads consist of large number of homopolymer errors so hifiasm failed to correct them? |
The portion of the "bad read" appears to have low complexity, but not necessarily just homopolymers. This portion is always at the beginning of the read and only spans a ~100-200 bp. The reads at these loci are not getting heavily corrected. It looks the same with HiFi or corrected. Here is a zoomed in example |
Maybe the "bad read" clusters are just extreme examples of this edge error effect? There is a weird GA...TC thing going on at this sites? |
I guess all these bad reads of assembly should be there but cannot be corrected, right? |
yes, the contig assembly is correct, except for the consensus error brought in by the portion of the bad read |
Thanks a lot. In the above examples, are these reads roughly aligned to their corresponding positions at p_ctg? For example, if we extract read A from the position 5000 of a contig and align A back to the contig, is the alignment position of A still around 5000? |
Yes, the coordinates are the same. |
yes. All of these are reads from the p_ctg.gfa, and the bottom read is the one that agrees with consensus, but the rest of the reads from the tiling path do not. I think these are just extreme examples of other edge examples above. Let me see how this looks if I correct the consensus and re-align these reads |
I guess the rationales of all these examples are similar. For now hifiasm does not have a consensus step when generating contigs. I will try to add it this week. Thanks for point out this problem, which is very helpful for us. BTW, is it possible that you can share some small example data with us? If not I can also find examples by myself. Thank you again for your great help. |
OK, I'll gather up some small read sets that reproduce the problem. All the examples above are cases where the purity of the SNP/INDEL is >95%. I am not even trying to address the cases where it is ambiguous, such as the length of a homopolymer. Again, it is always adjacent to an edge read, so I don't think it is a generalized consensus issue but a contigging issue. Thanks! |
In the example above with 16bp deletion, none of the reads have that deletion, so how did arise? That's why I wondering if it is contigging issue, rather than consensus issue (at least for the other type of errors I am seeing). |
Nice catch! This does appear to be a generalized feature of deletion errors, preceding a read edge. I can find a lot of examples of this. However, I cannot see an obvious repeat unit in the case of an insertion prior to a read edge. insertion = AGAAGCTCATAAAAGCCAATGAAGTACCTTTGAACTTCCAGT Would it be helpful if I just send you pairs of reads that have INDELS adjacent to edge, including the "bad read" scenario that I was initially pursuing? |
Yean, that's already very helpful. Thank you a lot! I will let you know when I fix this problem. For the new example, it is hard to say. I need to have a look at some examples to see what really happens, Low complexity introduces weird things. |
@kevfengler227 We have made a new release v0.13 which is able to fix most such problems. Besides, the new version supports to output bad regions with low quality in BED format and their related reads (see '--pb-range'). I guess it might be useful for polishing. |
Awesome. Thanks! I have polished dozens of assemblies over the past few weeks. I'll do a side-by-side comparison with the latest version and let you know how it goes. How was the issue arising? |
I am not sure I fully understand. The issue seems to be limited to the ends of reads in the tiling path. This would imply that only the ends of reads cannot be fully corrected? I have not seen any of the above examples occurring in the middle of a read. Also, the reads in question look the same prior to correction. So perhaps the root issue is that large variations at the ends of reads are not being corrected? That is, the end of the read is just really bad. |
@kevfengler227 We have rereleased v0.13 to output BED files in default and changed the '--pb-range' to '--lowQ'. '--pb-range' is a little bit confused... |
Great! Thanks for helping us to improve hifiasm : ) |
So I still see examples where an INDEL error is adjacent to the end of the read that. This seems to be a different type of error than you previous addressed. The deletion is not in any of the reads, so I am not sure how it arises. Interestingly this consensus error represents a 24bp INDEL, so somehow an additional 24 bp from the start of the end of the read is being added upstream. |
Thanks a lot. Does lowQ.bed contain this 24bp INDEL? |
yes, the region adjacent to, but not including the INDEL has a score of "92" in the BED file. The read that is flanking the INDEL has a lot of mismatches, so it is definitely a poor read. But I still don't understand how such an INDEL could arise. This particular example seems more like a case of an error in converting reads in the tiling path into a consensus?? |
Actually, it would make more sense to show the corrected reads aligned to the assembly Here the lowQ region appears unrelated to the INDEL. It is associated with a long homopolymer of ambiguous length. The ambiguity of length does correspond to ~24 bp. When the tiling path of reads is converted to a consensus sequence are the actual coordinates used or are they estimated? Note that the read that is upstream of the INDEL does not contain the ambiguity of the other reads. |
Interesting...Thanks again. Does v0.13 has more such errors in comparison to v0.12? I guess there might be fewer consensus errors in v0.13. To fix this type of errors, probably hifiasm needs to perform the real consensus in the next version. Let me have a look on Maize assembly first. |
well, a consensus step wouldn't fix the ambiguous homopolymer issue. It is only the false INDEL offset I am hoping to fix because that seems like a bug. How could it arise? I don't know how that is being done. The way I do the polishing takes longer than the assembly, so hopefully that can be avoided, or hopefully you have a faster way in mind (I do minimap2 alignment followed by SNP/INDEL calling from mpileup and then changing FASTA). It is probably not the worth the computation time, but for a few assemblies I try to see how far I can push the consensus accuracy. No it has the same amount of this type of error in v13 and v12. But the false INDEL upstream of a tiling path read like this appear to be last systematic error I can find. Besides that there are just ambiguous homopolymers which cannot be fixed until HiFi improves. |
I see. Let me have a try with maize to see how far we can go. By the way, is it useful if hifiasm output the problematic regions so that we can only polish those small regions, instead of the whole assembly? In this case, the polishing time might be minor. |
maybe, but in this case the lowQ region appears to be the root cause of the INDEL offset, but it does not include the INDEL because it occurs upstream of the tiling path read. But if the underlying consensus issue cannot be fixed, then a localized polish sounds like a very interesting solution (if the region is expanded a bit beyond the lowQ region). |
You are referring to the B73 HiFi sample that PacBio released, right? I'll assemble that too and point you to some examples. |
Yeas, thanks a lot. Looking forward your examples. |
OK, I have the examples. Do you want to take this discussion off-line? What is your email? Thanks, |
Of course. My email is: [email protected]. |
So the problem solved finally? Best, |
I guess it is not too bad after v0.14. |
It appears that much of the error in hifiasm assemblies is associated with unsupported reads integrated into the tiling path. They are found by aligning the original HiFi reads or corrected reads to the assembly and looking for clusters of SNPs/INDELs. How could this be addressed within hifiasm? How does this read survive correction?
Thanks,
KF
The text was updated successfully, but these errors were encountered: