partial contigs with half-reduced coverage ccs reads is misassembled? No, by contrast, it has assembled very long and novel tandem repeat sequences #50

Open
mydongan opened this issue Oct 27, 2020 · 11 comments

Comments

@mydongan

mydongan commented Oct 27, 2020

Hi chhylp123,
I have sequenced a diploid genome (repeat content >70%) with 25X coverage of HiFi reads. Luckily, I got excellent contigs with an N50 of 44 Mb using hifiasm 0.12.

Then I anchored the contigs to chromosomes with ALLMAPS using ~1500 high-quality genetic markers, and finally obtained 10 pseudomolecules. However, I found a 20-Mb region in chr7 that is supported neither by genetic markers nor by synteny with other homologous species.
image
image

Furthermore, I mapped the ccs reads to the final assembly and found that this 20-Mb region has about half the read coverage of the rest of the assembly.
image
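
A half-coverage region like this can also be located numerically rather than by eye. The following is a minimal sketch, assuming per-base depth has already been exported as three-column rows (contig, 1-based position, depth), which is the layout `samtools depth -a` writes. The function names, the window size, and the 20% tolerance are illustrative choices, not values from this thread.

```python
from collections import defaultdict


def windowed_mean_depth(depth_rows, window=1_000_000):
    """Average per-base depth in fixed-size windows.

    depth_rows: iterable of (contig, pos, depth) with 1-based positions.
    Returns {(contig, window_index): mean_depth}.
    """
    sums = defaultdict(int)
    counts = defaultdict(int)
    for contig, pos, depth in depth_rows:
        key = (contig, (pos - 1) // window)
        sums[key] += depth
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


def flag_half_coverage(means, expected, tolerance=0.2):
    """Return windows whose mean depth is close to expected / 2.

    tolerance is relative to the half-depth value (assumed cutoff).
    """
    half = expected / 2
    return sorted(
        key for key, mean in means.items()
        if abs(mean - half) <= tolerance * half
    )


if __name__ == "__main__":
    # Toy data: two windows of two bases each on a hypothetical contig.
    rows = [("chr7", 1, 25), ("chr7", 2, 25), ("chr7", 3, 12), ("chr7", 4, 13)]
    means = windowed_mean_depth(rows, window=2)
    print(flag_half_coverage(means, expected=25))
```

With 25X data, `expected=25` would flag windows averaging roughly 10 to 15X.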

I also mapped RNA-seq reads to the genome, and no reads covered this region. So I think this 20-Mb region may be misassembled.

However, this 20-Mb region is located in a single contig (108 Mb) that was constructed from several utgs (the two terminal utgs, utg000064l and utg000017l, are 29 Mb and 47 Mb long, respectively), and there is no obvious evidence to support breaking this contig.
image

Therefore, I am wondering whether there are other possible explanations for this assembly. Have you ever seen assembly regions covered at half depth before? Could it be high heterozygosity in this 20-Mb region?

Thanks!

Dong An

@mydongan mydongan changed the title probably misassembly contigs partial contigs with half-reduced coverage ccs reads is misassembled? Oct 27, 2020
@chhylp123
Owner

Could you please zoom in on the utg graph around this 20-Mb region? I'd like to see what the subgraph looks like. Also, could you please show the following numbers from the hifiasm log?

[M::ha_pt_gen] peak_hom: []; peak_het: []
[M::purge_dups] purge duplication coverage threshold: []

@mydongan
Author

mydongan commented Oct 28, 2020

Thanks! I aligned a 5 Mb sequence from the 20-Mb region to all utg FASTA sequences and found that it mapped to utg000017l (47 Mb).
image

image

The requested numbers from the hifiasm log are:
[M::ha_pt_gen] peak_hom: 25; peak_het: -1
[M::purge_dups] purge duplication coverage threshold: 31
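
For anyone collecting these values across several runs, the two log lines can be pulled out with a small parser. This is only a sketch built on the two line formats quoted in this thread. The function name is hypothetical, and reading `peak_het: -1` as "no heterozygous peak detected" is an assumption here, not something stated in the thread.

```python
import re

# Patterns copied from the log lines quoted in this thread.
PEAK_RE = re.compile(r"\[M::ha_pt_gen\] peak_hom: (-?\d+); peak_het: (-?\d+)")
PURGE_RE = re.compile(
    r"\[M::purge_dups\] purge duplication coverage threshold: (-?\d+)"
)


def parse_hifiasm_peaks(log_text):
    """Extract peak_hom, peak_het and the purge threshold from log text.

    Keys are omitted when the corresponding line is missing.
    """
    result = {}
    peaks = PEAK_RE.search(log_text)
    if peaks:
        result["peak_hom"] = int(peaks.group(1))
        result["peak_het"] = int(peaks.group(2))
    purge = PURGE_RE.search(log_text)
    if purge:
        result["purge_threshold"] = int(purge.group(1))
    return result


if __name__ == "__main__":
    log = (
        "[M::ha_pt_gen] peak_hom: 25; peak_het: -1\n"
        "[M::purge_dups] purge duplication coverage threshold: 31\n"
    )
    print(parse_hifiasm_peaks(log))
```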

@lh3
Collaborator

lh3 commented Oct 28, 2020

Based on the mapping of genetic markers, can you assign this 20Mb to other chromosomes?

@mydongan
Author

Thank you, Dr. Li! Very strangely, this 20-Mb region does not have any genetic markers.

@lh3
Collaborator

lh3 commented Oct 28, 2020

A few more things to try:

  • Blast pieces from this 20Mb region against the "nt" database and check the top hits.
  • Run RepeatMasker to check the repeat content.
  • When you map genetic markers, do you see any hits to this 20Mb or do most hits here have ambiguous mappings?

@mydongan
Author

Thank you very much for your suggestions!

  • I blasted 5 Mb retrieved from this 20Mb region against the nt database, and all of the top 10 hits are plant sequences from the same species as mine, so we can exclude contamination. Further, I also aligned this 5 Mb sequence to a high-quality reference genome (contig N50 47 Mb) and found that it partially mapped to many unanchored scaffolds.
  • I have done repeat annotation with EDTA, but only ran the LTR annotation step of the pipeline. I found that LTR density is lower in this 20 Mb region of chr7.
    image
  • Thank you for reminding me. I had filtered the markers and retained only uniquely mapped genetic markers, so I mistakenly concluded that this 20Mb region was covered by no genetic markers. I rechecked the markers and found that many markers map ambiguously to this region, and that they are located in different linkage groups.
  • Maybe this region is rDNA or another repeat element?
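
The marker recheck described above can be sketched as follows, assuming the marker alignments are available as unfiltered (marker_id, linkage_group, contig, position) records. The record layout, the function name, and the toy coordinates are all hypothetical. The point is only that markers with more than one hit are reported as ambiguous rather than dropped.

```python
from collections import defaultdict


def summarize_marker_hits(hits, contig, start, end):
    """Classify markers hitting contig:[start, end) as unique or ambiguous.

    hits: iterable of (marker_id, linkage_group, contig, pos) records,
    one per alignment, NOT pre-filtered for unique mapping.
    """
    hits = list(hits)
    hit_counts = defaultdict(int)
    for marker_id, _lg, _ctg, _pos in hits:
        hit_counts[marker_id] += 1

    unique, ambiguous, groups = set(), set(), set()
    for marker_id, lg, ctg, pos in hits:
        if ctg != contig or not (start <= pos < end):
            continue
        if hit_counts[marker_id] == 1:
            unique.add(marker_id)
        else:
            ambiguous.add(marker_id)
        groups.add(lg)
    return {
        "unique": sorted(unique),
        "ambiguous": sorted(ambiguous),
        "linkage_groups": sorted(groups),
    }


if __name__ == "__main__":
    toy_hits = [
        ("m1", "LG7", "chr7", 100),
        ("m2", "LG3", "chr7", 5000),
        ("m2", "LG3", "chr2", 900),
        ("m3", "LG5", "chr7", 6000),
        ("m3", "LG5", "chr7", 7000),
    ]
    print(summarize_marker_hits(toy_hits, "chr7", 4000, 8000))
```

A region with no unique markers but many ambiguous ones from several linkage groups matches the pattern reported here.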

@lh3
Collaborator

lh3 commented Oct 28, 2020

  • Is your sample an inbred diploid, i.e. two sets of nearly identical chromosomes?
  • You should check rDNA and centromere satellite in this 20Mb.
  • Run HiCanu and see how HiCanu assembles this region.

@chhylp123
Owner

chhylp123 commented Oct 28, 2020

  • If your sample is inbred, it would be better to disable purge_dups using '-l0'.
  • To find the unitigs in r_utg that correspond to this 20Mb region, a better way is to find the reads in this region (A-lines in p_ctg) and then grep for them in r_utg. I assume the region corresponds to the tangle between utg000017l and utg000018l. If it is a potential misassembly, the safe option is to cut the 20Mb region out of p_ctg at the boundaries of the tangle.
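
The A-line lookup suggested above could be scripted roughly as below. This sketch assumes the tab-separated A-line layout `A, contig, start on contig, strand, read name, read start, read end, tags...`, and it approximates each read's span on the contig as its start plus the aligned read length. The function names, contig names, and coordinates in the toy data are hypothetical.

```python
from collections import Counter


def reads_in_region(gfa_lines, contig, start, end):
    """Names of reads whose A-line placement on contig overlaps [start, end)."""
    names = set()
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7 or fields[0] != "A" or fields[1] != contig:
            continue
        ctg_start = int(fields[2])
        read_len = int(fields[6]) - int(fields[5])  # approximate span
        if ctg_start < end and ctg_start + read_len > start:
            names.add(fields[4])
    return names


def unitigs_containing(gfa_lines, read_names):
    """Count, per unitig, how many of read_names appear in its A-lines."""
    hits = Counter()
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 5 and fields[0] == "A" and fields[4] in read_names:
            hits[fields[1]] += 1
    return hits


if __name__ == "__main__":
    # Toy stand-ins for lines of prefix.p_ctg.gfa and prefix.r_utg.gfa.
    p_ctg = [
        "A\tptg000001l\t0\t+\tread1\t0\t10000\tid:i:1\tHG:A:a",
        "A\tptg000001l\t8000\t-\tread2\t0\t12000\tid:i:2\tHG:A:a",
        "A\tptg000001l\t50000\t+\tread3\t0\t9000\tid:i:3\tHG:A:a",
    ]
    r_utg = [
        "A\tutg000017l\t0\t+\tread1\t0\t10000\tid:i:1\tHG:A:a",
        "A\tutg000018l\t0\t+\tread2\t0\t12000\tid:i:2\tHG:A:a",
        "A\tutg000001l\t0\t+\tread3\t0\t9000\tid:i:3\tHG:A:a",
    ]
    region_reads = reads_in_region(p_ctg, "ptg000001l", 9000, 30000)
    print(unitigs_containing(r_utg, region_reads))
```

For real files, the two line lists would come from iterating over the opened p_ctg and r_utg GFA files. The unitigs with the most hits are the candidates to inspect in the graph.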

@mydongan
Author

mydongan commented Oct 29, 2020

Thanks, all!

Yes, it is an inbred diploid; heterozygosity was 0.232% in the genome survey analysis, and I assembled the genome using "-l0".

After repeat annotation, 85% of this region was annotated as the 180-bp knob repeat, a specific tandem repeat in plants.
image
This region was not assembled in previous studies, which shows that HiFi reads and hifiasm are very efficient and accurate for assembling long tandem repeats. Thank you all again!
Furthermore, I ran a nucmer self-alignment of utg000017l, and we can also see that the terminal 11 Mb are tandem repeats.
image
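
As a lightweight cross-check on the repeat unit size, the period of a tandem array can be estimated from the spacing between repeated k-mers. This is a simple stand-in technique, not what nucmer or the repeat annotation does, and the function name, `k`, and `max_period` are illustrative assumptions.

```python
import random
from collections import Counter


def estimate_repeat_period(seq, k=12, max_period=1000):
    """Estimate a tandem repeat period from k-mer recurrence distances.

    For every k-mer, record the distance to its previous occurrence and
    return the most frequent distance, or None if no k-mer recurs
    within max_period.
    """
    last_seen = {}
    gaps = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        prev = last_seen.get(kmer)
        if prev is not None and i - prev <= max_period:
            gaps[i - prev] += 1
        last_seen[kmer] = i
    if not gaps:
        return None
    return gaps.most_common(1)[0][0]


if __name__ == "__main__":
    # Synthetic array: a random 180-bp unit repeated 20 times.
    rng = random.Random(0)
    unit = "".join(rng.choice("ACGT") for _ in range(180))
    print(estimate_repeat_period(unit * 20))
```

On a slice of the terminal region of the unitig, a dominant distance near 180 would agree with the knob repeat annotation. Real arrays contain diverged copies, so the distribution of distances is worth inspecting as well as the single top value.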

However, I still do not understand why the ccs read coverage is reduced by half in this region.

@mydongan mydongan changed the title partial contigs with half-reduced coverage ccs reads is misassembled? partial contigs with half-reduced coverage ccs reads is misassembled? No, by contrast, it has assembled very long and novel tandem repeat sequences Oct 29, 2020
@lh3 lh3 mentioned this issue Jan 27, 2021
@lh3
Collaborator

lh3 commented Jan 27, 2021

As someone was referring to this issue, I have reread the thread. I am seeing:

  • A 20Mb region on the chr7 scaffold that has half of the expected coverage.
  • The first 5Mb in this 20Mb is located at the end of utg000017l.

If this description is right, this is not a contig misassembly. You have an inbred diploid genome. One possibility is that this region is diverged between the two haplotypes although the rest of the genome is nearly homozygous. The solution is to remove the diverged copy from the primary assembly. By the way, when you scaffolded the contigs, have you discarded prefix.a_ctg.gfa?

@mydongan
Author

Maybe you are right: this repeat region with half coverage may have diverged rapidly between the two haplotypes. Yes, I only used prefix.p_ctg.gfa for further assembly.
