The assembled genome is twice as large as expected in HIC mode #114

Open
jinxin112233 opened this issue May 9, 2021 · 9 comments

@jinxin112233

Hi

We assembled the genome in Hi-C mode with the following command:
hifiasm -o NA12878.asm -t 40 --h1 read1.fq.gz --h2 read2.fq.gz HiFi-reads.fq.gz

Unfortunately, NA12878.asm.hic.hap1.p_ctg.gfa and NA12878.asm.hic.hap2.p_ctg.gfa are both 1.6G, while the genome size estimated by flow cytometry is about 800M. Do we need to purge, or do you have another suggestion?

Thanks
JX
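
File size on disk is only a rough proxy for assembly length. As a minimal sketch, the total sequence length of a GFA file can be summed directly, assuming the sequence sits in column 3 of each `S` line (this does not hold for the `*.noseq.gfa` files, which omit the sequence). The function name `gfa_total_bp` is hypothetical.

```shell
# Sum the lengths of all segment (S-line) sequences in a tab-separated GFA file.
# Assumption: column 3 holds the actual sequence, not the "*" placeholder.
gfa_total_bp() {
    awk -F '\t' '$1 == "S" { total += length($3) } END { print total + 0 }' "$1"
}

# Example call, using a file name from this thread:
# gfa_total_bp NA12878.asm.hic.hap1.p_ctg.gfa
```

Comparing this number for hap1, hap2, and p_utg against the expected genome size makes it easier to see which graph is inflated.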

@chhylp123
Owner

chhylp123 commented May 9, 2021

What's the size of p_utg.gfa? And what are the following two numbers?

[M::purge_dups] purge duplication coverage threshold:
[M::stat] # heterozygous bases: 6179497799; # homozygous bases: 480824972

@jinxin112233
Author

Hi
Sorry for our slow response. We re-ran hifiasm to get the log.
Here are the two numbers:
[M::stat] # heterozygous bases: 277085802; # homozygous bases: 1528254506

The size of NA12878.asm.hic.p_ctg.gfa is also 1.6G.
Here are all the file sizes:
17G NA12878.asm.ec.bin
61M NA12878.asm.hic.clean_d_utg.noseq.gfa
1.6G NA12878.asm.hic.hap1.p_ctg.gfa
1.3M NA12878.asm.hic.hap1.p_ctg.lowQ.bed
59M NA12878.asm.hic.hap1.p_ctg.noseq.gfa
1.6G NA12878.asm.hic.hap2.p_ctg.gfa
1.3M NA12878.asm.hic.hap2.p_ctg.lowQ.bed
58M NA12878.asm.hic.hap2.p_ctg.noseq.gfa
4.9G NA12878.asm.hic.lk.bin
1.6G NA12878.asm.hic.p_ctg.gfa
1.4M NA12878.asm.hic.p_ctg.lowQ.bed
59M NA12878.asm.hic.p_ctg.noseq.gfa
1.8G NA12878.asm.hic.p_utg.gfa
2.8M NA12878.asm.hic.p_utg.lowQ.bed
61M NA12878.asm.hic.p_utg.noseq.gfa
1.8G NA12878.asm.hic.r_utg.gfa
2.9M NA12878.asm.hic.r_utg.lowQ.bed
61M NA12878.asm.hic.r_utg.noseq.gfa
22G NA12878.asm.hic.tlb.bin
4.3G NA12878.asm.ovlp.reverse.bin
16G NA12878.asm.ovlp.source.bin

Thank you for your help
JX
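
The `[M::stat]` line in this post can be turned into a single ratio to show how much of the assembly hifiasm treated as homozygous. The sketch below assumes the log line has exactly the format quoted in this thread; the function name `hom_fraction` is hypothetical.

```shell
# Read a hifiasm log on stdin and print the fraction of bases labelled
# homozygous, based on the "[M::stat]" line format quoted in this thread.
hom_fraction() {
    sed -n 's/.*# heterozygous bases: \([0-9]*\); # homozygous bases: \([0-9]*\).*/\1 \2/p' |
        awk '($1 + $2) > 0 { printf "%.3f\n", $2 / ($1 + $2) }'
}

# The line reported above gives 0.847, i.e. about 85% of bases called homozygous.
printf '%s\n' '[M::stat] # heterozygous bases: 277085802; # homozygous bases: 1528254506' | hom_fraction
```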

@chhylp123
Owner

Hifiasm misidentified the hom peak, so it treated most regions as homozygous. Could you please reset the hom peak with "--purge-cov"? The value should be a little larger than the hom peak.

@chhylp123
Owner

Please note that the hic bin files need to be updated for the new hom peak.

@jinxin112233
Author

Hi
Should the command look like this?
hifiasm -o NA12878.asm -t40 --purge-cov 1628254506 --h1 read1.fq.gz --h2 read2.fq.gz HiFi-reads.fq.gz

I am not very familiar with hifiasm, and I don't really understand which hic bin files need to be updated. Where are these files?

Thank you for your help
JX

@chhylp123
Owner

chhylp123 commented May 11, 2021

What's the coverage of the dataset? Could you please show the k-mer histogram?

@jinxin112233
Author

jinxin112233 commented May 11, 2021

Hi
The coverage of the dataset is about 40X.
Here is the k-mer histogram generated with GenomeScope:
[image: Picture 1]

best
JX
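
In a diploid k-mer histogram the hom peak is the higher-multiplicity of the two main peaks. As a rough sketch, local maxima can be listed from a two-column histogram file (multiplicity, count), which is an assumed input format here, not something posted in this thread. The function name `kmer_peaks` is hypothetical, and noisy histograms will report extra minor peaks that need to be judged by eye.

```shell
# Print "multiplicity count" for each local maximum in a two-column k-mer
# histogram. The first row is skipped because low-multiplicity k-mers are
# usually dominated by sequencing errors.
kmer_peaks() {
    awk 'BEGIN { n = 0 }
         NR > 1 { x[n] = $1; y[n] = $2; n++ }
         END {
             for (i = 1; i < n - 1; i++)
                 if (y[i] > y[i-1] && y[i] >= y[i+1])
                     print x[i], y[i]
         }' "$1"
}

# Example call (the file name is a placeholder):
# kmer_peaks reads.histo
```

A value slightly above the multiplicity of the hom peak is then a candidate for "--purge-cov", in line with the advice earlier in the thread.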

@chhylp123
Owner

chhylp123 commented May 11, 2021

You could try:
hifiasm -o NA12878.asm -t40 --purge-cov 50 --h1 read1.fq.gz --h2 read2.fq.gz HiFi-reads.fq.gz

Please also delete *hic*bin before rerunning hifiasm.
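
In the file listing earlier in this thread, the names matching `*hic*bin` are NA12878.asm.hic.lk.bin and NA12878.asm.hic.tlb.bin. A minimal sketch of the cleanup step follows, assuming the bin files sit next to the other outputs and follow the `<prefix>.hic.*bin` naming seen in that listing (a narrower pattern than `*hic*bin`). The function name `clean_hic_bins` is hypothetical.

```shell
# Remove the Hi-C bin files for a given hifiasm output prefix, printing each
# file before deleting it. Other bin files (ec, ovlp) are left in place.
clean_hic_bins() {
    prefix=$1
    for f in "$prefix".hic.*bin; do
        [ -e "$f" ] || continue   # the glob matched nothing
        echo "removing $f"
        rm -- "$f"
    done
}

# Example call, using the prefix from this thread:
# clean_hic_bins NA12878.asm
```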

@jinxin112233
Author

Great! Let me try it.

Best wishes
JX
