hi-c issue - hap1 is too large, hap2 too small #85
Comments
For hifiasm, there are two known issues: 1) purge_dups may not be able to do sufficient purging, 2) Hi-C phasing is unbalanced. I have already fixed issue 1) and am still tuning our algorithm for Hi-C phasing. Hopefully I can fix issue 2) soon. If you think 1) is important, I can also push it to the GitHub repo right now. |
I am also using the Hi-C mode along with HiFi sequencing data, and I am getting a segmentation fault. I initially thought the error was caused by unusually high memory consumption, but could it be related to the issue above? Also, hifiasm produces no specific logs with further details on the segmentation fault. |
That would be helpful, appreciate it. |
Hi Haoyu, |
I will update the new version in a few hours, sorry for the delay. |
@tcb72 @xinghua1001 Please try the option '-l3' with GitHub HEAD (0.14-r313). '-s' can be used to further adjust the results. Note that the option '--high-het' is removed for now, since ordinary bin files without '--high-het', combined with '-l3', already work on my side. If you have bin files generated by '--high-het', '-l3' may also work. Hope '-l3' can fix the purging problems for your samples. |
@chhylp123 Thanks! Running now from scratch w/o Hi-C parameters -- will let you know in a few hours (so weird to say "few hours"... HiFi reads are the best). We got some new HiFi data in, so now we have approximately 35x coverage per haplotype. Running the same parameters as last time (-l2, --high-het), I got an assembly size of 120.997 Mbp in 93 scaffolds, and Hi-C revealed multiple massive misjoins (see below; that Hi-C matrix was produced using hifiasm p_ctg + purge_dups, which reduced the assembly to 110 Mbp in 36 scaffolds, and it still has clear duplication even after purge_dups). So I'll compare the -l3 assembly to the above. Should I test with Hi-C too, or will this commit not make a difference with Hi-C data? Best, Tom |
Primary contig stats using -l3: Total scaffolds/contigs: 96. Looks like it got rid of a lot of the duplication, but there are still some large misjoins. I know I can fix them in Juicebox, but I was wondering why this could be happening and whether there are parameters I can tweak to fix it? |
There are three parameters that might be helpful: --b-cov, --h-cov, --m-rate. These three parameters break contigs at potential misassemblies. But I guess manually breaking contigs using Hi-C would be more accurate, since the Hi-C heatmap does not show too many misassemblies. By the way, would it be possible for you to share the data with us? I'm also unsure why hifiasm sometimes introduces misassemblies. Thank you in advance. |
Maybe not. Anyway, we will release a new version with an updated Hi-C module soon; please give me a few days. |
Alternatively, you may check out shilpagarg/DipAsm#16, plus apply standalone purge_dups on pstools phased scaffolds. In our experiments, there should not be any issue of mis-joins. |
@shilpagarg I've checked out DipAsm/pstools before, unfortunately cannot install it bc our cluster currently doesn't support Docker containers. However, I asked them to install DeepVariant for a different project, and a lot of the other informaticians here want Docker support too, so they're trying to implement it. Hopefully I can try it out soon. My only concern is DeepVariant's performance on non-human samples -- any ideas? |
@tcb72 Do you have a screenshot of the r_utg graph in Bandage? What is the N50 of the unitigs in the r_utg graph? |
N50 of r_utg file is 1.605 Mbp with max scaffold of 5.515 Mbp. |
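For readers less familiar with the metric: N50 is the length L such that sequences of length L or longer together cover at least half of the total assembly. A minimal Python sketch of that definition follows; the function name `n50` and the toy lengths are illustrative assumptions, not taken from hifiasm or from the data in this thread.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Hypothetical helper for illustration only; not part of hifiasm.
    """
    if not lengths:
        raise ValueError("no lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy lengths (made up): total is 20, and 6 + 5 = 11 covers half of it.
toy_lengths = [6, 5, 4, 3, 2]
print(n50(toy_lengths))  # 5
```

In practice the lengths would come from the S lines of the r_utg GFA or from a FASTA index, which is outside the scope of this sketch.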
Thanks. Haplotypes have mostly been separated at the unitig level. Hi-C should work, but we need more time to improve it for non-human species. |
Hello,
Assembling a diploid alga, 200-210 Mbp (uncollapsed diploid size). Relatively high heterozygosity, estimated between 1.5% and 2%. Quite complex, with lots of large tandem repeats. We have approximately 18x (diploid coverage) HiFi data and very high coverage Hi-C data.
Running hifiasm with parameters -l2 -k21 --high-het yields a primary contig asm of 111.225 Mbp in 81 contigs, and the alternate contig asm is 91.132 Mbp in 380 contigs. The primary is still a bit duplicated, but generally the results are reasonable, and the duplicated regions are visible after running Juicer/3DDNA/Juicebox and can be removed.
I am excited to see Hi-C get implemented into hifiasm. We went ahead and tried it, but our results are a bit strange. Running the same exact parameters above but with the hic parameters added (including --enzyme GATC, which we weren't sure if necessary or not), we get a hap1 size of 198.769 Mbp in 174 contigs, hap2 size of 51.748 Mbp in 70 contigs, and r_utg size of 203.042 Mbp in 594 contigs. Obviously, the hap1 size is way too large, and the hap2 size is too small to be correct. Here's the log for that run:
Other runs we did:
hic params, --enzyme GATC, k21, no --high-het: hap1 size of 192.982 Mbp, hap2 size of 85.366 Mbp
hic params, --enzyme GATC, k23, no --high-het: hap1 size of 194.981 Mbp, hap2 size of 76.830 Mbp
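To make the imbalance in these runs concrete, here is a rough Python sketch that compares each reported hap1/hap2 pair against the expected per-haplotype size. The helper `haplotype_balance` is hypothetical, and the 205 Mbp diploid size is an assumption (the midpoint of the 200-210 Mbp estimate above), so the "fraction of expected" values are only approximate.

```python
def haplotype_balance(hap1_mbp, hap2_mbp, diploid_mbp):
    """Summarize how evenly an assembly is split between two haplotypes.

    Hypothetical helper for illustration; not part of hifiasm.
    """
    expected = diploid_mbp / 2  # expected size of one haplotype
    return {
        "hap1_to_hap2": hap1_mbp / hap2_mbp,
        "hap1_vs_expected": hap1_mbp / expected,
        "hap2_vs_expected": hap2_mbp / expected,
    }


# Assumption: midpoint of the 200-210 Mbp uncollapsed diploid estimate.
DIPLOID_MBP = 205.0

# hap1/hap2 sizes (Mbp) as reported for the three Hi-C runs.
runs = {
    "k21, --high-het": (198.769, 51.748),
    "k21, no --high-het": (192.982, 85.366),
    "k23, no --high-het": (194.981, 76.830),
}

for name, (hap1, hap2) in runs.items():
    stats = haplotype_balance(hap1, hap2, DIPLOID_MBP)
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

A balanced phasing would give a hap1-to-hap2 ratio near 1 and both haplotypes near 1.0 of the expected size; in all three runs hap1 is close to the full diploid size instead.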
Let me know if I can provide any more information to help you out. Other than that, I appreciate the development of hifiasm -- it's been fantastic for diploid genomes.
Best,
Tom