Metagenomics assembly with Hifiasm? #48

Open
JeanMainguy opened this issue Oct 13, 2020 · 5 comments

@JeanMainguy

Hi,

I am trying to assemble metagenomic HiFi reads. Is hifiasm suited for metagenome assembly? If so, do you have any recommended settings for that purpose?

Best,
Jean

@chhylp123
Owner

chhylp123 commented Oct 14, 2020

We are testing hifiasm on metagenomic datasets and making specific modifications for them. For now, though, it might not work as well as metagenomics-specific tools. I think the key point is how to sample metagenomic datasets, which the current hifiasm cannot do automatically.

@JeanMainguy
Author

Ok thank you very much!

@xfengnefx

@JeanMainguy I've made the testing fork public. You can try `hifiasm_meta -t32 -oasm --force-preovec --exp-graph-cleaning reads.fq.gz 2>STDERR.log`. This includes the read selection mentioned above and some other graph-cleaning routines (contig generation hasn't been updated yet).

The default thresholds are fairly arbitrary, since very few datasets are available. Set `--lowq-10` higher if it is dropping too many reads, and lower if overlap-error correction takes too long. Please refer to the README for the other switches. May I ask which datasets you are looking at?
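For anyone scripting repeated runs while tuning the threshold, here is a minimal Python sketch that assembles the command line quoted above. It only uses the flags mentioned in this thread (`-t`, `-o`, `--force-preovec`, `--exp-graph-cleaning`, `--lowq-10`). The function name `build_hifiasm_meta_cmd` and its parameters are hypothetical, and since this is a dev fork the flags themselves may change, so treat it as an illustration rather than a supported interface.

```python
import shlex


def build_hifiasm_meta_cmd(reads, prefix="asm", threads=32, lowq_10=None):
    """Return an argv list mirroring the command quoted in this thread.

    Hypothetical helper: the name and parameters are illustrative only.
    `lowq_10` is passed through to --lowq-10 when given; otherwise the
    tool's own default is used.
    """
    cmd = [
        "hifiasm_meta",
        f"-t{threads}",          # number of threads, as in -t32
        f"-o{prefix}",           # output prefix, as in -oasm
        "--force-preovec",       # flag taken verbatim from the thread
        "--exp-graph-cleaning",  # flag taken verbatim from the thread
    ]
    if lowq_10 is not None:
        # Per the thread: raise this if too many reads are dropped,
        # lower it if overlap-error correction takes too long.
        cmd += ["--lowq-10", str(lowq_10)]
    cmd.append(reads)
    return cmd


if __name__ == "__main__":
    # Print the command for inspection instead of running it.
    print(shlex.join(build_hifiasm_meta_cmd("reads.fq.gz", lowq_10=150)))
```

To actually launch the run, the list can be passed to `subprocess.run(cmd, stderr=log_file, check=True)` with `log_file` opened on `STDERR.log`. The value 150 is simply the one recalled later in this thread, not a recommendation.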

@JeanMainguy
Author

That's awesome, I'll try it as soon as possible. Thank you very much.
I am currently playing with the ATCC MSA-1003 mock datasets from the preprint "Highly accurate long-read HiFi sequencing data for five complex genomes" (https://www.biorxiv.org/content/10.1101/2020.05.04.077180v1), but I will soon have newly sequenced datasets from another mock community and from real environmental samples.

@xfengnefx

@JeanMainguy We also use ATCC MSA-1003 for development. I think you can push the read selection hard, since there is so much redundancy. Getting rid of about half of the reads was acceptable for recovering the backbones of the strains (`--lowq-10 150`, as I remember). For real datasets I'm not sure: we currently have access to only two, and they appear to be very different, e.g. in horizontal gene transfers. It will be interesting to see how the heuristics work on more samples.

By the way, please feel free to open issues on the fork if you have any questions. I'll try to keep the latest commit ready to go, but as a dev fork it is not yet documented or stable.
