Hi Vuthuyduong,
I have followed your pipeline for the classification of some ASVs. My reads are ITS1 extracted, and I’ve used the ITS1 extracted UNITE v10 database that you prepared (thank you!). However, most of my ASVs (~85%) are unassigned at the fungi level after classification. This doesn’t make sense to me, as the majority of the unassigned ASVs had coverage > 90% and similarity > 95% when I previously BLASTed them. And a good portion had coverage = 100% and similarity > 98%.
Below is the head of the “bestmatch” file and the “classified” and “classification” files. I have also included your script, which I slightly modified to run on the cluster and fit my project (perhaps I made an error here?), as well as the SLURM file.
There is one error in the SLURM file: “sh: ImportText.pl: command not found”. I think this is related to the krona.html file, which was not written.
Have I made an error, or do I not understand how the classification is supposed to work?
Thank you in advance for your help.
Luke
Script
# Constants and subdirectories
readonly THREADS=8
readonly REFERENCE_SEQUENCES="../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.fasta"
readonly BEST_MATCH="../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.unique.cutoffs.best.json"
readonly CLASSIFIER="../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.classification"
readonly QUERY_SEQUENCES="../../data/bioinformatics/08.ASVs/ASVs.fasta"
readonly OUTPUT="../../data/bioinformatics/09.Taxonomy"
log 'Starting at:'
# Search for the best matches of the sequences
python dnabarcoder/dnabarcoder.py search
-i $QUERY_SEQUENCES
-r $REFERENCE_SEQUENCES
-ml 50
# Assign the sequences to different taxonomic groups
python dnabarcoder/dnabarcoder.py classify
-i dnabarcoder/ASVs.unite2024ITS1_BLAST.bestmatch
-c $CLASSIFIER
-cutoffs $BEST_MATCH
# Move the classification files to the taxonomy subdirectory
mv dnabarcoder/ASVs.unite2024ITS1_BLAST.classified $OUTPUT/ASVs.unite2024ITS1_BLAST.classified.txt
mv dnabarcoder/ASVs.unite2024ITS1_BLAST.classification $OUTPUT/ASVs.unite2024ITS1_BLAST.classification.txt
log 'Finished at:'
SLURM
Starting at: Sat Jul 27 06:40:17 AEST 2024
Building a new DB, current time: 07/27/2024 06:44:02
New DB name: /data/group/frankslab/project/LFlorence/AusMycobiome/data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.blastdb
New DB title: ../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.fasta
Sequence type: Nucleotide
Deleted existing Nucleotide BLAST database named /data/group/frankslab/project/LFlorence/AusMycobiome/data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.blastdb
Keep MBits: T
Maximum file size: 3000000000B
FASTA-Reader: Ignoring invalid residues at position(s): On line 629439: 57
FASTA-Reader: Ignoring invalid residues at position(s): On line 629440: 1-7
Adding sequences from FASTA; added 1899789 sequences in 21.573 seconds.
makeblastdb -in ../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.fasta -dbtype 'nucl' -out ../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.blastdb
blastn -query ../../data/bioinformatics/08.ASVs/ASVs.indexed.fasta -db ../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.blastdb -task blastn-short -outfmt 6 -out ../../data/bioinformatics/08.ASVs/ASVs.unite2024ITS1.blastoutput -num_threads 96
The results are saved in file dnabarcoder/ASVs.unite2024ITS1_BLAST.bestmatch
sh: ImportText.pl: command not found
Number of classified sequences: 22362
The results are saved in file dnabarcoder/ASVs.unite2024ITS1_BLAST.classified and dnabarcoder/ASVs.unite2024ITS1_BLAST.classification.
The krona report and html are saved in files dnabarcoder/ASVs.unite2024ITS1_BLAST.krona.report and dnabarcoder/ASVs.unite2024ITS1_BLAST.krona.html.
Finished at: Sat Jul 27 20:39:15 AEST 2024
Bestmatch file
<style>
</style>
| ID |
ReferenceID |
BLAST score |
BLAST sim |
BLAST coverage |
| ASV_1;size=2362642 |
UDB05261093 |
1 |
1 |
178 |
| ASV_2;size=1037588 |
MW856689 |
1 |
1 |
132 |
| ASV_3;size=1412752 |
UDB01614261 |
0.5425 |
0.875 |
31 |
| ASV_4;size=2201923 |
UDB03085057 |
1 |
1 |
160 |
| ASV_5;size=3340601 |
UDB03119248 |
1 |
1 |
154 |
| ASV_6;size=823557 |
UDB01261625 |
0.9823000000000001 |
0.9823000000000001 |
112 |
| ASV_7;size=830877 |
MW214811 |
1 |
1 |
157 |
| ASV_8;size=4359501 |
UDB05107909 |
1 |
1 |
151 |
| ASV_9;size=408829 |
MZ016271 |
0.7246504 |
0.88372 |
41 |
| ASV_10;size=176701 |
UDB05818296 |
1 |
1 |
179 |
| ASV_11;size=162862 |
UDB04293913 |
1 |
1 |
141 |
| ASV_12;size=1535429 |
MT991106 |
1 |
1 |
146 |
| ASV_13;size=169945 |
UDB07371928 |
0.98324 |
0.98324 |
177 |
| ASV_14;size=130846 |
UDB02651623 |
0.9607800000000001 |
0.9607800000000001 |
50 |
| ASV_15;size=978833 |
UDB03975281 |
1 |
1 |
135 |
Classification file
<style>
</style>
| ID |
ReferenceID |
kingdom |
phylum |
class |
order |
family |
genus |
species |
rank |
score |
cutoff |
confidence |
| ASV_1;size=2362642 |
UDB05261093 |
unidentified |
unidentified |
unidentified |
unidentified |
unidentified |
unidentified |
unidentified |
|
1 |
N/A |
N/A |
| ASV_2;size=1037588 |
MW856689 |
unidentified |
unidentified |
unidentified |
unidentified |
unidentified |
unidentified |
unidentified |
|
1 |
N/A |
N/A |
| ASV_3;size=1412752 |
UDB01614261 |
unidentified |
unidentified |
unidentified |
unidentified |
unidentified |
unidentified |
unidentified |
|
0.5425 |
N/A |
N/A |
| ASV_4;size=2201923 |
UDB03085057 |
unidentified |
unidentified |
unidentified |
unidentified |
unidentified |
unidentified |
unidentified |
|
1 |
N/A |
N/A |
| ASV_5;size=3340601 |
UDB03119248 |
unidentified |
unidentified |
unidentified |
unidentified |
unidentified |
unidentified |
unidentified |
|
1 |
N/A |
N/A |
| ASV_6;size=823557 |
UDB01261625 |
unidentified |
unidentified |
unidentified |
unidentified |
unidentified |
unidentified |
unidentified |
|
0.9823000000000001 |
N/A |
N/A |
| ASV_7;size=830877 |
MW214811 |
unidentified |
unidentified |
unidentified |
unidentified |
unidentified |
unidentified |
unidentified |
|
1 |
N/A |
N/A |
| ASV_8;size=4359501 |
UDB05107909 |
unidentified |
unidentified |
unidentified |
unidentified |
unidentified |
unidentified |
unidentified |
|
1 |
N/A |
N/A |
| ASV_9;size=408829 |
MZ016271 |
unidentified |
unidentified |
unidentified |
unidentified |
unidentified |
unidentified |
unidentified |
|
0.7246504 |
N/A |
N/A |
| ASV_10;size=176701 |
UDB05818296 |
unidentified |
unidentified |
unidentified |
unidentified |
unidentified |
unidentified |
unidentified |
|
1 |
N/A |
N/A |
| ASV_11;size=162862 |
UDB04293913 |
unidentified |
unidentified |
unidentified |
unidentified |
unidentified |
unidentified |
unidentified |
|
1 |
N/A |
N/A |
| ASV_12;size=1535429 |
MT991106 |
unidentified |
unidentified |
unidentified |
unidentified |
unidentified |
unidentified |
unidentified |
|
1 |
N/A |
N/A |
| ASV_13;size=169945 |
UDB07371928 |
Fungi |
Ascomycota |
Pezizomycetes |
Pezizales |
Pezizales fam Incertae sedis |
Sphaerosoma |
unidentified |
genus |
0.98324 |
0.969 |
0.4812 |
| ASV_14;size=130846 |
UDB02651623 |
unidentified |
unidentified |
unidentified |
unidentified |
unidentified |
unidentified |
unidentified |
|
0.9607800000000001 |
N/A |
N/A |
| ASV_15;size=978833 |
UDB03975281 |
unidentified |
unidentified |
unidentified |
unidentified |
unidentified |
unidentified |
unidentified |
|
1 |
N/A |
N/A |
Classified file
<style>
</style>
| ID |
Given label |
Prediction |
Full classification |
Rank |
Cut-off |
Confidence |
ReferenceID |
BLAST score |
BLAST sim |
BLAST coverage |
| ASV_1;size=2362642 |
|
k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified |
N/A |
N/A |
UDB05261093 |
1 |
1 |
178 |
|
|
| ASV_2;size=1037588 |
|
k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified |
N/A |
N/A |
MW856689 |
1 |
1 |
132 |
|
|
| ASV_3;size=1412752 |
|
k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified |
N/A |
N/A |
UDB01614261 |
0.5425 |
0.875 |
31 |
|
|
| ASV_4;size=2201923 |
|
k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified |
N/A |
N/A |
UDB03085057 |
1 |
1 |
160 |
|
|
| ASV_5;size=3340601 |
|
k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified |
N/A |
N/A |
UDB03119248 |
1 |
1 |
154 |
|
|
| ASV_6;size=823557 |
|
k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified |
N/A |
N/A |
UDB01261625 |
0.9823000000000001 |
0.9823000000000001 |
112 |
|
|
| ASV_7;size=830877 |
|
k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified |
N/A |
N/A |
MW214811 |
1 |
1 |
157 |
|
|
| ASV_8;size=4359501 |
|
k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified |
N/A |
N/A |
UDB05107909 |
1 |
1 |
151 |
|
|
| ASV_9;size=408829 |
|
k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified |
N/A |
N/A |
MZ016271 |
0.7246504 |
0.88372 |
41 |
|
|
| ASV_10;size=176701 |
|
k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified |
N/A |
N/A |
UDB05818296 |
1 |
1 |
179 |
|
|
| ASV_11;size=162862 |
|
k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified |
N/A |
N/A |
UDB04293913 |
1 |
1 |
141 |
|
|
| ASV_12;size=1535429 |
|
k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified |
N/A |
N/A |
MT991106 |
1 |
1 |
146 |
|
|
| ASV_13;size=169945 |
Sphaerosoma |
k__Fungi;p__Ascomycota;c__Pezizomycetes;o__Pezizales;f__Pezizales_fam_Incertae_sedis;g__Sphaerosoma;s__unidentified |
genus |
0.969 |
0.4812 |
UDB07371928 |
0.98324 |
0.98324 |
177 |
|
| ASV_14;size=130846 |
|
k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified |
N/A |
N/A |
UDB02651623 |
0.9607800000000001 |
0.9607800000000001 |
50 |
|
|
| ASV_15;size=978833 |
|
k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified |
N/A |
N/A |
UDB03975281 |
1 |
1 |
135 |
|
|
Hi Vuthuyduong,
I have followed your pipeline for the classification of some ASVs. My reads are ITS1 extracted, and I’ve used the ITS1 extracted UNITE v10 database that you prepared (thank you!). However, most of my ASVs (~85%) are unassigned at the fungi level after classification. This doesn’t make sense to me, as the majority of the unassigned ASVs had coverage > 90% and similarity > 95% when I previously BLASTed them. And a good portion had coverage = 100% and similarity > 98%.
Below is the head of the “bestmatch” file and the “classified” and “classification” files. I have also included your script, which I slightly modified to run on the cluster and fit my project (perhaps I made an error here?), as well as the SLURM file.
There is one error in the SLURM file: “sh: ImportText.pl: command not found”. I think this is related to the krona.html file, which was not written.
Have I made an error, or do I not understand how the classification is supposed to work?
Thank you in advance for your help.
Luke
Script
# Constants and subdirectories
readonly THREADS=8
readonly REFERENCE_SEQUENCES="../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.fasta"
readonly BEST_MATCH="../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.unique.cutoffs.best.json"
readonly CLASSIFIER="../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.classification"
readonly QUERY_SEQUENCES="../../data/bioinformatics/08.ASVs/ASVs.fasta"
readonly OUTPUT="../../data/bioinformatics/09.Taxonomy"
log 'Starting at:'
# Search for the best matches of the sequences
python dnabarcoder/dnabarcoder.py search
-i $QUERY_SEQUENCES
-r $REFERENCE_SEQUENCES
-ml 50
# Assign the sequences to different taxonomic groups
python dnabarcoder/dnabarcoder.py classify
-i dnabarcoder/ASVs.unite2024ITS1_BLAST.bestmatch
-c $CLASSIFIER
-cutoffs $BEST_MATCH
# Move the classification files to the taxonomy subdirectory
mv dnabarcoder/ASVs.unite2024ITS1_BLAST.classified $OUTPUT/ASVs.unite2024ITS1_BLAST.classified.txt
mv dnabarcoder/ASVs.unite2024ITS1_BLAST.classification $OUTPUT/ASVs.unite2024ITS1_BLAST.classification.txt
log 'Finished at:'
SLURM
Starting at: Sat Jul 27 06:40:17 AEST 2024
Building a new DB, current time: 07/27/2024 06:44:02
New DB name: /data/group/frankslab/project/LFlorence/AusMycobiome/data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.blastdb
New DB title: ../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.fasta
Sequence type: Nucleotide
Deleted existing Nucleotide BLAST database named /data/group/frankslab/project/LFlorence/AusMycobiome/data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.blastdb
Keep MBits: T
Maximum file size: 3000000000B
FASTA-Reader: Ignoring invalid residues at position(s): On line 629439: 57
FASTA-Reader: Ignoring invalid residues at position(s): On line 629440: 1-7
Adding sequences from FASTA; added 1899789 sequences in 21.573 seconds.
makeblastdb -in ../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.fasta -dbtype 'nucl' -out ../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.blastdb
blastn -query ../../data/bioinformatics/08.ASVs/ASVs.indexed.fasta -db ../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.blastdb -task blastn-short -outfmt 6 -out ../../data/bioinformatics/08.ASVs/ASVs.unite2024ITS1.blastoutput -num_threads 96
The results are saved in file dnabarcoder/ASVs.unite2024ITS1_BLAST.bestmatch
sh: ImportText.pl: command not found
Number of classified sequences: 22362
The results are saved in file dnabarcoder/ASVs.unite2024ITS1_BLAST.classified and dnabarcoder/ASVs.unite2024ITS1_BLAST.classification.
The krona report and html are saved in files dnabarcoder/ASVs.unite2024ITS1_BLAST.krona.report and dnabarcoder/ASVs.unite2024ITS1_BLAST.krona.html.
Finished at: Sat Jul 27 20:39:15 AEST 2024
Bestmatch file
<style> </style>Classification file
<style> </style>Classified file
<style> </style>