Hifiasm doesn't find heterozygous kmer peak correctly - consequently HiC mode fails #78
Comments
Misidentifying het/hom k-mer peak usually won't have a big effect.
How did you do the assembly? If you generate primary/alternate assembly without Hi-C, what is the size of the primary assembly and alternate assembly, respectively? |
Never mind. I see the numbers in #55: 420Mb for primary and 28Mb for alternate, if this is the same sample. Hifiasm currently doesn't handle this case well. We are aware of that and are trying to improve it. |
I will expose an option to let users manually set the hom peak as soon as possible. In Hi-C mode, hifiasm needs the homozygous peak to determine which unitigs are homozygous. In this example, all unitigs are identified as homozygous, so all of them are assigned to both haplotypes. #55 is caused by a stringent threshold in purge_dups, which also affects highly heterozygous samples like this one. I will also fix this problem in the next version. |
Hi Dr. Li and Dr. Cheng - thank you for your message. Yes, those are the right numbers from #55, 420Mb for primary and 28Mb for alternate. The way that I assembled the chromosome-scale diploid genome was to:
Let me know if you are interested in taking a look at this dataset or my assembly results. Thank you!
|
Sorry for the late reply. I will push a new version to the repo soon, and hopefully it will fix the purge_dups and Hi-C issues for your sample. |
Hey there, just thought I would give an update with the latest hifiasm release -- 0.15-r327. The software still identifies the het/homs peaks as -1/19 instead of the correct 19/38. The command was
|
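To make the reported 19/38 versus -1/19 distinction concrete, here is a minimal sketch of calling het and hom peaks from a k-mer histogram. It is not hifiasm's implementation: the function names, the `min_cov` and `tol` parameters, and the toy histogram are all assumptions made for illustration. The only idea taken from the thread is that in a diploid sample the hom peak should sit at roughly twice the het coverage, even when the het peak is the taller one.

```python
def find_peaks(hist, min_cov=5):
    """Return coverages that are local maxima of a k-mer histogram.

    hist maps coverage -> number of distinct k-mers. Coverages below
    min_cov are skipped to ignore the low-coverage error tail.
    """
    peaks = []
    for cov in sorted(c for c in hist if c >= min_cov):
        left = hist.get(cov - 1, 0)
        right = hist.get(cov + 1, 0)
        if hist[cov] > left and hist[cov] >= right:
            peaks.append(cov)
    return peaks


def classify_peaks(peaks, hist, tol=0.2):
    """Return (het, hom), requiring hom to be close to 2 * het.

    If no such pair exists, return (-1, tallest peak). That fallback only
    mimics the shape of the output quoted in the thread; it is an assumption.
    """
    for het in peaks:
        for hom in peaks:
            if hom > het and abs(hom - 2 * het) <= tol * het:
                return het, hom
    return -1, max(peaks, key=lambda c: hist[c])


# Toy histogram (hypothetical data): a tall het bump at 19x, a small hom bump at 38x.
hist = {
    c: max(0, 1000 - 100 * abs(c - 19)) + max(0, 200 - 40 * abs(c - 38))
    for c in range(1, 60)
}

peaks = find_peaks(hist)
het, hom = classify_peaks(peaks, hist)
print(peaks, het, hom)
```

A caller that simply took the tallest peak as the hom peak would label 19 as homozygous on this toy data, which is the kind of misidentification described above.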
You can manually set the hom peak with '--purge-cov'. It is hard for hifiasm to find the right peak in this case. Please note that it only affects phasing and graph cleaning, so there is no need to rerun the whole assembly workflow. |
By the way, could you please show the following number in the log file?
|
Hi @chhylp123 - Where in the k-mer spectrum should There were two instances of Good news on the genome assembly side of things. The assembly stats for v0.15-r327 were closer to the estimated genome size of 174 Mb (348 Mb for both haplotypes) (from k-mer spectrum):
For comparison, the assembly stats for the previous version of
Running the assembler v0.15-r327 without Hi-C
So, it looks like Hi-C mode may be working better than the other modes at the moment. |
Actually, hifiasm recalculates the peaks during the graph cleaning step, so they differ from the k-mer peaks shown in the log file. |
OK - I figured it was based on mapping coverage rather than k-mer coverage. I'll scaffold these up to chromosome scale and let you know what I find. |
But the assembly size looks a little large. I worry that hifiasm still misidentified some het regions as hom. Could you please set a higher hom peak with '--purge-cov', e.g. 50? Could you please also show the 'p_utg.gfa'? |
Sure, Here are the results below. Trial 4. Pretty much the same as trial 3. Column Hap1 + Hap2 Size (Mb) should be around 356 Mb (178 Mb * 2).
Here is I also A lot of the smaller scaffolds, however, have no Hi-C reads that map to them over hundreds of kb/ a Mb. This means that the assembler is outputting something that isn't real, or the scaffold is just hundreds of kb of repeats that are so information-poor that no reads could map. Because of the chunks of haplotypes that seem like they are dropped with Hi-C mode, I still think I will be better off by running |
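The size check used in these trials (Hap1 + Hap2 should land near twice the haploid estimate) can be written as a small helper. This is only a sketch: the function names and the 5% tolerance are assumptions, and the only inputs shown are the two haploid estimates quoted in the thread (178 Mb here, 174 Mb from the k-mer spectrum earlier).

```python
def expected_diploid_mb(haploid_mb):
    """Expected combined size of both haplotypes for a diploid genome."""
    return 2 * haploid_mb


def relative_excess(observed_total_mb, haploid_mb):
    """Fractional deviation of an observed Hap1 + Hap2 total from the expectation."""
    expected = expected_diploid_mb(haploid_mb)
    return (observed_total_mb - expected) / expected


def within_tolerance(observed_total_mb, haploid_mb, tol=0.05):
    """True if the observed total is within tol (an assumed 5%) of the expectation."""
    return abs(relative_excess(observed_total_mb, haploid_mb)) <= tol


# The two haploid estimates mentioned in the thread.
print(expected_diploid_mb(178))  # 356
print(expected_diploid_mb(174))  # 348
```

A clearly positive `relative_excess` would be consistent with het regions being treated as hom and duplicated into both haplotypes, which is the concern raised above.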
@conchoecia Sorry, I missed your reply. For scaffolding, do you mean that the balanced two haplotypes without Hi-C are better than the phased haplotypes with Hi-C, or that primary + alternate are better than the phased haplotypes with Hi-C? For Trial 2, did hifiasm find the right peaks during graph cleaning? |
@chhylp123 I mean that For trial 2, and all the other trials, hifiasm did not identify the hom peak. Let me know if there's any results that I could share that would be helpful to you! Cheers, Darrin |
@conchoecia Thanks a lot. We think the main challenge for phasing is that hifiasm cannot distinguish het regions from hom regions exactly. It seems to be more serious in your sample. Could you please share the hifiasm bin files with us for debugging? It would be very helpful for us. |
I sent the files via ftp to your email, @chhylp123. Thank you. |
@conchoecia For the small bubbles and tips in this local subgraph, is it possible to check whether they are assembly errors or somatic mutations? I checked some of the bubbles and found that one side has ordinary coverage while the other side has very low coverage. I guess that so many small bubbles and tips affect the final results. |
Hi @chhylp123 - can you possibly send me some sequences associated with this subgraph? It could be that it was bacterial, or some other bug's, contamination that I was able to remove after Hi-C scaffolding. Otherwise, sure, I can look in my assemblies to see if there are possible errors. Thank you- Darrin |
There are too many such bubbles. I wonder whether all the small bubbles and tips in the subgraph in the red circle are not real (it is the p_utg you shared with us). Besides, could you please run hifiasm with a smaller '-s'? With -s0.3 the size of each haplotype should be reduced to 240Mb. Since the het rate is quite high, we need a smaller '-s' to find overlaps with low similarity. |
You can run minimap2 to do self-alignment on top of hap1 or hap2. With '-s0.3', it seems only the unitigs in that complex subgraph have overlaps. I'm not sure if they are somatic mutations. Probably I was wrong.. |
@conchoecia I found that the main problem is that some of the unitigs have no overlaps. With -s0.3 (similarity of 30%), hifiasm outputs a haplotype 1 of 251Mb and a haplotype 2 of 240Mb. There are 53Mb of unitigs that do not have any overlaps, so they are partitioned into both haplotypes. Within those 53Mb, the contigs longer than 1Mb total 15Mb, and the longest one is a circle of 3.9Mb. I'm not sure what those are. They are probably contamination, or short contigs with a het rate so high that hifiasm cannot identify overlaps correctly. |
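The numbers in this comment can be broken down with a little arithmetic. The sketch below assumes that the 53Mb of overlap-free unitigs are counted in full in both reported haplotype sizes; the thread says they are partitioned into both haplotypes but does not spell out the accounting, so treat the results as approximate. The function name is hypothetical.

```python
def haplotype_breakdown(hap1_mb, hap2_mb, shared_mb):
    """Split reported haplotype sizes into shared and haplotype-specific parts.

    Assumes shared_mb is included once in each of hap1_mb and hap2_mb.
    """
    return {
        "hap1_only": hap1_mb - shared_mb,
        "hap2_only": hap2_mb - shared_mb,
        "non_redundant_total": hap1_mb + hap2_mb - shared_mb,
    }


# Sizes quoted in the comment above (-s0.3 run), in Mb.
breakdown = haplotype_breakdown(hap1_mb=251, hap2_mb=240, shared_mb=53)
print(breakdown)

# Of the 53 Mb shared, 15 Mb is in contigs longer than 1 Mb.
shared_small_mb = 53 - 15
print(shared_small_mb)
```

Under that assumption, about 198Mb and 187Mb are specific to haplotype 1 and haplotype 2 respectively, and 38Mb of the shared sequence sits in contigs of 1Mb or shorter.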
Hi @chhylp123 - Thank you for your time looking at this graph! I've been going through old, unclosed issues and found this one. I'll close it for now since a lot of time has lapsed, but will open it again if I can get back to this issue in this assembly. |
Hi Haoyu,
This is related to #55, where I noticed that hifiasm was not correctly identifying the heterozygous and homozygous peaks. The animal that I'm working with is very heterozygous (>4%), so the kmer spectrum has a very large het peak, and the hom peak is much smaller. I know that the animal is not polyploid, as I already have generated a diploid, chromosome-scale assembly for this individual.
I ran the new version of hifiasm with these parameters:
And I was seeing that hifiasm was misidentifying the heterozygous peak as homozygous:
At the end of the run, the .gfa files for hap1 and hap2 were both twice the size (~400Mbp) of the haploid genome size (~200Mbp). I think that for this to work correctly, the assembler will need to know which kmers are from the 1x (het) peak to get the correct haplotype binning.