Very slow processing sintax vsearch/2.28.1 #570

Open
msalamon2 opened this issue Aug 8, 2024 · 3 comments

Comments

@msalamon2

Hello,

I tried to run the new version of vsearch with sintax on a computing cluster, but processing was extremely slow despite the large amount of computing resources requested (40 cores with 4775 MB of memory per core). The input ASV FASTA file is 4.6 MB for 10,710 ASVs, and the reference database is the complete Eukaryote COI BOLD database (1.7 GB, 2,216,285 sequences).

vsearch ran for 13 days but only produced a 72.7 KB one-column file, which seems to indicate that only 6236 ASVs were processed. Below are the head and tail of the output file:

ASV_7
ASV_20
ASV_16
ASV_17
ASV_10
ASV_19
ASV_12
ASV_34
ASV_35
ASV_9
...
ASV_6228
ASV_6229
ASV_6230
ASV_6231
ASV_6232
ASV_6233
ASV_6234
ASV_6235
ASV_6236

Here is the .sh script used to run vsearch:

```bash
#!/bin/bash
#SBATCH --mem-per-cpu=4775M
#SBATCH --cpus-per-task=40
#SBATCH --time=48:00:00
#SBATCH --account=def-mcristes
#SBATCH --mail-user=[email protected]
#SBATCH --mail-type=ALL

module load StdEnv/2020 vsearch/2.28.1

# Run VSEARCH
vsearch --sintax ASVs_Malaise_traps_DADA2.fasta \
    --sintax_random \
    --db SINTAX_COI_v5.1.0ref.fasta \
    --tabbedout rdp_sintax_unoise3_COI.txt \
    --sintax_cutoff 0.8 \
    --strand both \
    --threads 40 \
    --log sintax_COI_MalaiseTraps_log.txt
```

I am unsure why the program was so slow. Could this be due to the very large reference database?

Thank you for your help,
Best wishes,
Mathilde Salamon

@torognes
Owner

torognes commented Aug 9, 2024

Hi Mathilde,

Thank you for reporting this issue.

Both the time used and the lack of any results for most of the sequences look very strange. I have therefore tried to reproduce your efforts and downloaded the SINTAX_COI_v5.1.0ref.fasta file from the https://github.com/terrimporter/CO1Classifier repository.

It seems like the problem is related to masking of the sequences in the database. By default, vsearch applies "soft masking" to the sequences in the database. That means that all lower-case letters are masked and not used during the initial stage of sequence comparison. This is described in the manual, but it is not mentioned for the sintax command, so we need to improve the documentation. Perhaps it should not even be applied by default for this command. Since the database seems to contain only lower-case letters for the nucleotide symbols, all of the sequences are completely masked, so no results are produced.

I am sorry that you have wasted 13 days of computation time (times 40 CPUs) on this. The good news is that this problem can be easily resolved by including the `--dbmask none` option on the command line. When I did this with 10710 randomly subsampled sequences from the same database, the whole run completed in under 10 minutes using 8 threads and less than 6 GB of memory on my MacBook, and the results looked reasonable.
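Applied to the command from the original report, the fix might look like the sketch below. The file names and options are the ones from the report, and the only option added is `--dbmask none`. The guard around the call is an illustrative assumption, so the snippet does nothing when vsearch or the input files are missing.

```bash
#!/bin/bash
# Sketch: the reported sintax options plus --dbmask none (the fix suggested above).
# File names are taken from the report; adjust them to your own data.
query="ASVs_Malaise_traps_DADA2.fasta"
db="SINTAX_COI_v5.1.0ref.fasta"

sintax_args=(
    --sintax "$query"
    --sintax_random
    --db "$db"
    --dbmask none    # do not mask the (all lower-case) database sequences
    --tabbedout rdp_sintax_unoise3_COI.txt
    --sintax_cutoff 0.8
    --strand both
    --threads 40
    --log sintax_COI_MalaiseTraps_log.txt
)

# Only run when vsearch and both input files are actually available.
if command -v vsearch >/dev/null 2>&1 && [ -f "$query" ] && [ -f "$db" ]; then
    vsearch "${sintax_args[@]}"
else
    echo "vsearch or input files not found; command not run" >&2
fi
```

Keeping the options in an array makes it easy to see that the masking option is the only difference from the original call.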

@torognes
Owner

torognes commented Aug 9, 2024

For a future release of vsearch we should consider:

  • Update documentation regarding soft-masking and sintax
  • Should soft-masking be applied at all (by default) for the sintax command?
  • Should a warning be issued when detecting fully masked sequences in the query or database, in general?
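Until such a warning exists, a database can be checked before a long run. The sketch below counts FASTA records whose sequence contains no upper-case letters at all, which are the records that would be completely masked when lower-case letters are treated as masked. `count_fully_lowercase` is a hypothetical helper written for illustration and is not part of vsearch.

```bash
#!/bin/bash
# Sketch: count FASTA records whose sequence has no upper-case letters.
# Prints two numbers: "<fully lower-case records> <total records>".
# count_fully_lowercase is a hypothetical helper, not a vsearch feature.
count_fully_lowercase() {
    # LC_ALL=C keeps the [A-Z] range strictly ASCII upper-case.
    LC_ALL=C awk '
        function finish_record() {
            if (seen) {
                total++
                # Non-empty sequence without any A-Z counts as fully lower-case.
                if (seq != "" && seq !~ /[A-Z]/) masked++
            }
        }
        /^>/ { finish_record(); seen = 1; seq = ""; next }   # header line
        { seq = seq $0 }                                     # sequence line(s)
        END { finish_record(); printf "%d %d\n", masked, total }
    ' "$1"
}
```

Calling `count_fully_lowercase SINTAX_COI_v5.1.0ref.fasta` prints the two counts. If they are equal, every sequence in the file is lower case, and `--dbmask none` is probably needed.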

@msalamon2
Author

Hi Torbjørn,

Thank you very much for your quick response and explanation, and for running the test; it was very insightful! I'm glad this is such an easy fix, because I was planning to use vsearch with sintax for all my databases.

Best wishes,
Mathilde Salamon
