Separating haplotypes that occur at lower coverage #92

Open
tcb72 opened this issue Apr 6, 2021 · 5 comments

@tcb72

tcb72 commented Apr 6, 2021

Hello again,

Same genome as last time (diploid algae, 2% heterozygosity). I've been playing around with both purging and not purging. Because the haplotypes are highly divergent, we get clear separation of haplotypes when Hi-C scaffolding the non-purged assembly via Juicer/3D-DNA (see the following):

image

In some areas of the genome, one of the haplotypes doesn't have as much coverage as the other. For example, if the coverage is 40x, there might be 4 reads which show a clear variant signal. This causes one of the chromosomes within a pair to be collapsed, as shown here:

image

And here's that same area with variants highlighted in Tablet (the HiFi reads aligned with minimap2, -ax asm20, no secondary alignments):

image

Even stranger, short-read alignment in that area shows about 50/50 coverage between the haplotypes, which is markedly different from the 90/10 coverage between haplotypes with the HiFi reads:

image

I'm not sure why the HiFi reads aren't capturing the variants to the degree the short reads are, but there's clearly some signal there in the HiFi alignment.

With respect to hifiasm, are there any parameters I can try to tweak to avoid collapsing large regions like this?

Best,

Tom

@chhylp123
Owner

If one haplotype doesn't have enough coverage, hifiasm cannot separate it. We will expose some options in the new version tomorrow; hopefully they will be helpful. By the way, could you please show the alignment results of the corrected HiFi reads? I'm curious whether the two haplotypes have already been collapsed during the error correction step.

@tcb72
Author

tcb72 commented Apr 8, 2021

@chhylp123

No, the haplotypes do not look collapsed after the error correction step. These are the error-corrected reads aligned to the same region:

image

Looking forward to the new release!

@StevenBai97

StevenBai97 commented Apr 8, 2021

Hi,
I have run into a similar problem. There are several blank crosses in the contigs, and I don't know the reason. The average coverage of the genome is about 68X. Should I break the contigs and remove the blank ones (as shown in the pictures)? If so, will some genes be lost from the genome?
I look forward to your suggestions. Thank you.

Original:
image

Revised:
image

@lh3
Collaborator

lh3 commented Apr 8, 2021

@tcb72 I agree hifiasm is likely collapsing that region. However, how to resolve that depends not only on the three reads you show in a 790bp region but also on the overlaps they have with reads from the correct haplotype. If the lack of coverage extends over a long region, hifiasm can do little. Also, the coverage on the correct haplotype is very low. It is not 90/10. It is 3 out of ~48 reads, or about 6.3%. If hifiasm aggressively resolved this case, it might introduce new problems elsewhere. For example, it is not uncommon to see somatic mutations or artifacts supported by a few reads. It is hard. As Haoyu said, he will expose more internal parameters, but it is likely that you will have to either increase coverage or fix it manually. You can also try HiCanu to see if it does a better job.
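To make the arithmetic above concrete, here is a minimal Python sketch that computes the minor-haplotype read fraction and, under a simple binomial model, how unlikely 3 of 48 reads would be if both haplotypes were sampled equally. The binomial model and the function names `minor_fraction` and `binom_cdf` are illustrative assumptions, not anything hifiasm computes internally.

```python
from math import comb


def minor_fraction(minor_reads: int, total_reads: int) -> float:
    """Fraction of reads supporting the lower-coverage haplotype."""
    return minor_reads / total_reads


def binom_cdf(k: int, n: int, p: float = 0.5) -> float:
    """P(X <= k) for X ~ Binomial(n, p).

    Assumption: each read is an independent draw from one of two
    haplotypes, with probability p of coming from the minor one.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# Numbers taken from the discussion: 3 supporting reads out of ~48.
frac = minor_fraction(3, 48)
p_low = binom_cdf(3, 48)

print(f"minor haplotype fraction: {frac:.2%}")  # 6.25%, i.e. about 6.3%
print(f"P(<= 3 of 48 reads | 50/50 sampling) = {p_low:.2e}")
```

The very small tail probability suggests that the imbalance is unlikely to be pure sampling noise around 50/50, which is consistent with treating it as a real coverage deficit rather than something an assembler parameter alone can compensate for.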

EDIT: I should have added that, of the three reads, the third is not corrected properly because hifiasm does not do the right job around a long deletion. There are probably only two reads informative to downstream assembly. This is also an important reason why high coverage helps: assemblers can tolerate sequencing/algorithmic imperfections at low frequency.

@StevenBai97 Yours is a distinct problem. The large white band is probably a repetitive region to which Hi-C reads can't be mapped uniquely. It is fine. I am not sure about the smaller one. Please create a new issue. It is confusing to have multiple problems discussed in one GitHub issue.

@tcb72
Author

tcb72 commented Apr 8, 2021

@StevenBai97 I agree with @lh3... in my experience, those large white bands are large repeat sections. Do not remove them!

@lh3 Yeah, I was being generous with the 90/10 estimate... Unrelated to hifiasm, but why would unequal coverage like this occur in certain sections of the genome? It extends for about 1 Mbp, which is a lot! This is the most egregious case; there's another chromosome where this collapsing occurs, but the region is not as large as this one. My first thought was library prep/sequencing bias, but the only HiFi bias I know of is homopolymer indels.

I had an idea but I'm not sure how viable it is/if it's bad practice: what if I duplicated these reads with the alternative haplotype to artificially get to 50/50 coverage? Of course this isn't the ideal solution...
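As a rough sense of scale for that idea, the sketch below computes how many copies of each minor-haplotype read would be needed to match the major haplotype's read count. The function name `copies_for_balance` is hypothetical, the counts (3 minor reads, 45 major reads out of ~48) come from the numbers quoted earlier in the thread, and this is only back-of-the-envelope arithmetic, not a recommended workflow.

```python
def copies_for_balance(minor_reads: int, major_reads: int) -> int:
    """Copies of each minor-haplotype read needed so that the minor
    read count is at least the major read count (ceiling division)."""
    if minor_reads <= 0:
        raise ValueError("need at least one minor-haplotype read")
    return -(-major_reads // minor_reads)


# ~48 reads in total, 3 of which support the lower-coverage haplotype.
copies = copies_for_balance(minor_reads=3, major_reads=45)
print(f"each minor read would appear {copies} times")
```

With these numbers every minor read would have to appear 15 times, so the balanced pileup would still contain only three independent observations of that haplotype.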

For what it's worth, HiCanu collapses the region as well.
