Manually specify where het and hom k-mer peaks are - hifiasm's guess incorrect. #55
Comments
Since the het rate is quite high, hifiasm incorrectly treats the het peak as the homozygous peak. This doesn't matter much for assembly, but it affects purge_dups a lot. One solution is to set |
Thanks for your message, @chhylp123. I gave it another run, this time specifying
primary contigs:
alternate contigs:
Not sure what to try at this point, but I'm interested to keep the conversation going if you all are interested in developing the assembler to handle such cases. |
The location of the hom/het k-mer peak has little to do with the doubled assembly size. As Haoyu said, try |
I'm not sure whether we are seeing a similar issue. I have a 240 Mbase plant genome that isn't assembling well: we're getting an N50 of about 110 kbase. The plant should have moderate to low heterozygosity and we have 11 Gbase of sequence, so the peak below at 46x looks like the homozygous peak, with no evident heterozygous peak. But it looks like hifiasm is picking up something in the noise as a spurious homozygous peak (maybe a bit of residual contamination?) and is calling "[M::ha_ft_gen] peak_hom: 6; peak_het: -1", which is possibly messing with the downstream assembly. I can't immediately see any way of overriding this.
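As a quick sanity check on the figures quoted above, the expected depth is simply total sequence divided by genome size. This is a minimal sketch: the function name is hypothetical, and it ignores the small gap between base coverage and k-mer coverage, which should be minor for long HiFi reads.

```python
def expected_coverage(total_bases: float, genome_size: float) -> float:
    """Approximate sequencing depth: total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size

# Figures quoted in the comment: 11 Gbase of reads, 240 Mbase genome.
cov = expected_coverage(11e9, 240e6)
print(f"expected coverage ~ {cov:.1f}x")
```

The result rounds to 46x, which agrees with reading the peak at 46 as the homozygous peak rather than the value of 6 reported in the log.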
|
What are your assembly stats for the primary and alternative assemblies, @plattsad ? https://github.com/conchoecia/fasta_stats |
With a vanilla configuration (only threads specified)...
|
Looks like we should let users manually set the homozygous peak. Do you think that would be helpful, @plattsad @conchoecia? I will release v0.14 today if you think it is an acceptable solution. |
I think that might be useful. As a test I manually set peak_hom and peak_het in htab.cpp to 46 and 23 respectively. The assembly with only threads set appeared considerably better:
|
@plattsad that is a great improvement after manually setting the het and hom peaks. Good idea trying that as a direct proof-of-concept. If you have Hi-C data then that should nicely scaffold into chromosomes.
@chhylp123 If you think it will increase the assembly quality, then yes it would be nice to have the option (even if it only affects
By the way, I used Hi-C to scaffold the output of my genome assembly above that was doubled in size, and ended up with a nice chromosome-scale diploid assembly. The heterozygosity seems like it is well over 5% in the animal that I am working with, including major indels and inversions between the haplotypes. |
@plattsad @conchoecia The new version of hifiasm (v0.14) incorporates a new option '--min-hist-cnt' to ignore noisy counts when analyzing the k-mer spectrum. Hope it will be helpful for this problem. |
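To illustrate why ignoring low counts can help, here is a toy peak caller. It is not hifiasm's actual algorithm. The histogram is synthetic (made-up numbers), `first_peak` and `min_mult` are names invented for this sketch, and treating the threshold as a floor on k-mer multiplicity is an assumption about what '--min-hist-cnt' does.

```python
import math

def toy_histogram():
    """Synthetic k-mer spectrum (made-up numbers, not real data)."""
    hist = {}
    for x in range(1, 81):
        errors = 1_000_000 / x**3                           # error k-mer tail
        genomic = 50_000 * math.exp(-((x - 46) ** 2) / 50)  # main peak near 46
        hist[x] = int(errors + genomic)
    hist[6] += 5_000  # artificial noise bump at multiplicity 6
    return hist

def first_peak(hist, min_mult=1):
    """Deliberately naive caller: first local maximum at multiplicity >= min_mult."""
    for x in sorted(hist):
        # Skip bins below the floor and bins without both neighbours.
        if x < min_mult or (x - 1) not in hist or (x + 1) not in hist:
            continue
        if hist[x - 1] < hist[x] >= hist[x + 1]:
            return x
    return None

hist = toy_histogram()
print(first_peak(hist))               # the noise bump at 6 is called
print(first_peak(hist, min_mult=10))  # the bump is skipped; the peak at 46 is called
```

In this toy setting a small bump at low multiplicity captures the naive caller, and raising the floor above the bump lets it reach the genuine peak.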
Hi @chhylp123 - I think that this is a good compromise to help the software determine where a good cutoff for the noise is. I take it to mean that |
Yes, I think so. Skipping the problematic peaks should work. |
Hi there, trying out my first assemblies with hifiasm with an animal whose genome is around 170 Mb, and is over 4% heterozygous. We know this from >100x coverage k-mer spectra from Illumina WGS reads.
The k-mer spectrum from the PacBio HiFi data tells the same story. I made a k-51 spectrum since I saw that is what hifiasm uses by default. The data are around 44x coverage, and you can see the peak from the het k-mers around 20 and the homozygous k-mers around 40.
The k-mer spectrum from hifiasm matches the above spectrum:
However, hifiasm gets the position of the homozygous peak incorrect, and calls it at 19, while it should be around 40. The result of running this assembly with --high-het was that the primary assembly was twice as big as it should have been (400Mb instead of 200Mb). So, both haplotypes are ending up in the same primary assembly.
The secondary assembly is too small:
Is there anything I can do to specify where hifiasm should expect the het and homozygous peaks? I think this would be helpful to others working with highly heterozygous species. Thanks so much!
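One rough way to flag this kind of miscall is to compare the reported homozygous peak with the expected coverage. The check below is a heuristic sketch, not something hifiasm provides. It assumes a diploid sample in which the het peak sits near half the homozygous depth, and the function name is hypothetical.

```python
def closer_to_het(called_peak: float, expected_cov: float) -> bool:
    """True if called_peak sits nearer expected_cov / 2 than expected_cov."""
    het_guess = expected_cov / 2
    return abs(called_peak - het_guess) < abs(called_peak - expected_cov)

# Numbers from the post: ~44x data, homozygous peak called at 19.
print(closer_to_het(19, 44))  # True: 19 looks like the het peak
print(closer_to_het(40, 44))  # False: 40 is consistent with the hom peak
```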