#PANGEA+
A new implementation of the PANGEA pipeline for faster and more accurate metagenomics, with multiple classification methods and consensus analysis.
#Download
(LINUX):
wget https://github.com/Bioinfo-Tools/PANGEA-plus/tarball/master -O BioinfoTools_PANGEA-plus.tar.gz
(MAC):
curl -L https://github.com/Bioinfo-Tools/PANGEA-plus/tarball/master -o BioinfoTools_PANGEA-plus.tar.gz
#Extract the files:
tar -xvf BioinfoTools_PANGEA-plus.tar.gz
Set your working directory to the PANGEA-plus directory:
cd BioinfoTools_PANGEA-plus
export PANGEAWD=$PWD
#Install parallel BLAST (for High Performance Computing clusters)
cd $PANGEAWD/Classify/Runblast
sh install_MPI_blast.sh
#Trimming your input sequences
cd $PANGEAWD/Trim
perl trim2.3.pl -a ../input_A.txt -b ../input_B.txt -g 100
where:
-a raw Illumina input file, read 1
-b raw Illumina input file, read 2 (if any)
-g size of the gap between paired ends (if any)
-t truncate size (if any)
-q quality file (for FASTA input)
-qc quality cutoff value
-lc minimum length
Supported formats: FASTA, FASTQ and QSEQ.
Results will be saved in $PANGEAWD/output/trim2 folder
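Since the trimmer accepts three formats, it can help to confirm what an input file looks like before running it. The following is a minimal sketch, not part of PANGEA+: `detect_format` is a hypothetical helper that guesses the format from the first line only, and it assumes that QSEQ records are 11 tab-separated fields per line.

```shell
# Hypothetical helper (not part of PANGEA+): guess the format of a read file
# from its first line.
detect_format() {
    first_line=$(head -n 1 "$1")
    case "$first_line" in
        ">"*) echo "fasta" ;;   # FASTA headers start with ">"
        "@"*) echo "fastq" ;;   # FASTQ headers start with "@"
        *)
            # Assumption: a QSEQ record has 11 tab-separated fields.
            fields=$(printf '%s\n' "$first_line" | awk -F '\t' '{ print NF }')
            if [ "$fields" -eq 11 ]; then echo "qseq"; else echo "unknown"; fi
            ;;
    esac
}

# Usage: detect_format ../input_A.txt
```

The check is only a heuristic; a file that passes it can still be malformed further down.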
#Download / Install Blast
cd $PANGEAWD/Classify/Runblast
sh install_blast.sh
#Download NCBI database for classification
cd $PANGEAWD
wget ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz
gunzip nt.gz
#Format the database
$PANGEAWD/Classify/Runblast/makeblastdb -in $PANGEAWD/nt -out $PANGEAWD/nt -dbtype nucl
#Classify your sequences using parallel BLAST search
cd $PANGEAWD/Classify/Runblast
Example of parallel BLAST (MPI-blastn) executed in a PBS/Torque/Maui HPC cluster:
Use an example submission script available in $PANGEAWD/Scripts directory
*EDIT THE FILE submit_MPI-blast.job FIRST!
qsub $PANGEAWD/Scripts/submit_MPI-blast.job
where: input.fasta refers to your sequences after trimming.
For running parallel blast for multiple input files at the same time:
*EDIT THE FILE submit_multiple_MPI-blast.job FIRST! Follow the instructions in the file.
*Replace ./dir/ by your input directory and change the values of these parameters before running: "database="; "total_processes="; "nodes="
for i in ./dir/*.fasta; do qsub submit_multiple_MPI-blast.job -v in="$i",out="$i.txt",database=database_name,nodes=4,total_processes=16; done
where:
./dir/ is your input sequences directory
nodes= is the number of requested nodes
total_processes= is the total number of processes requested
database= is the name of the database
The output files will have the same names as your inputs, with a .txt suffix.
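Before submitting many jobs, it may be worth printing the commands the loop would run. The sketch below is a dry run under stated assumptions: `print_submissions` is a hypothetical helper (not shipped in $PANGEAWD/Scripts), and `database_name`, `4` and `16` are the placeholder values from the example above. It only echoes the `qsub` lines, so the variable list can be checked for stray spaces and the output names can be reviewed.

```shell
# Hypothetical dry-run helper: print one qsub command per *.fasta file
# instead of submitting it.
print_submissions() {
    dir=$1; database=$2; nodes=$3; total=$4
    for f in "$dir"/*.fasta; do
        [ -e "$f" ] || continue   # skip the unexpanded pattern when nothing matches
        echo "qsub submit_multiple_MPI-blast.job -v in=$f,out=$f.txt,database=$database,nodes=$nodes,total_processes=$total"
    done
}

# Prints nothing if ./dir holds no .fasta files.
print_submissions ./dir database_name 4 16
```

Each printed line names the output file as the input path plus `.txt`, matching the naming rule described above.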
Example using your own blastn installation:
export PATH=$PANGEAWD/Classify/Runblast:$PATH
blastn -query input.fasta -db database.formated -outfmt 6 -out blast_output.txt
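With `-outfmt 6`, blastn writes twelve tab-separated columns per hit: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue and bitscore. As a sketch of how that table can be screened before taxonomic parsing, the hypothetical helper `filter_hits` below keeps hits at or above a minimum percent identity and at or below a maximum e-value. The helper and any thresholds passed to it are illustrative assumptions, not part of PANGEA+.

```shell
# Hypothetical helper: keep BLAST tabular hits that pass both thresholds.
filter_hits() {
    # $1 = BLAST tabular file, $2 = minimum percent identity, $3 = maximum e-value
    # Inside the awk program, $3 is column 3 (pident) and $11 is column 11 (evalue).
    awk -F '\t' -v min_id="$2" -v max_e="$3" \
        '($3 + 0) >= (min_id + 0) && ($11 + 0) <= (max_e + 0)' "$1"
}

# Usage: filter_hits blast_output.txt 80 1e-20
```

Adding zero forces awk to compare the fields as numbers, which matters for e-values written in exponent notation.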
#Parse the taxonomic classification based on the NCBI taxonomy databases
Running NCBI-taxcollector:
cd $PANGEAWD/Tax_class
make all
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.tar.gz
tar -xvf taxdump.tar.gz
tar -xvf gi_taxid_nucl.dmp.tar.gz
./tax_class -c
perl NCBI-taxcollector-0.01.pl -f $PANGEAWD/parallel_output.txt -o $PANGEAWD/parallel_output_class.txt > report.txt
where:
parallel_output.txt is the MPI-blastn results file
parallel_output_class.txt is the parsed and classified output generated by this program
#Classify your sequences using RDP Classifier search
cd $PANGEAWD/Classify/RunRDP/
sh install_RDPClassifier.sh
java -Xmx1g -jar rdp_classifier-2.5.jar -q $PANGEAWD/input_trimmed.txt -o output_rdp.txt
Where: -q refers to the query file. -o refers to the output file.
#Classify your sequences using parallel SOAP Aligner search
Format your database:
cd $PANGEAWD/Classify/Runsoap/soap2.21release
./2bwt-builder $PANGEAWD/database.fasta
Run sequence search:
./soap -a $PANGEAWD/input_trimmed.fasta -D $PANGEAWD/database.fasta.index -o $PANGEAWD/output_soap.txt -p 8 -M 4
Where:
-D Prefix name for the reference index [*.index]
-a Query file, for SE read alignment or one end of PE reads
-b Query b file, one end of PE reads
-o Output file
-p Number of threads
-M INT Match mode for each read or the seed part of a read, which should not contain more than 2 mismatches [4]
0: exact match only
1: 1 mismatch match only
2: 2 mismatch match only
3: [gap] (coming soon)
4: find the best hits
#Run Consensus Analysis
cd $PANGEAWD/Consensus
perl Consensus_BLAST_SOAP_RDP-1.1.pl -b output_blast_class.txt -r output_rdp.txt -o output_consensus.txt
Where:
-b Classification results (BLAST) parsed by NCBI-taxcollector
-r Classification results (RDP)
-s Classification results (SOAP2)
-o Output file (txt)
The output should look like this:
S000008953 [0]Bacteria;[1]Firmicutes;[2]Bacilli;[3]Bacillales;[4]Bacillaceae;[5]Bacillus;[6]Bacillus_sp._8A18S6; 92.61 1435 81 23 29 1452 1 1421 0.0 2039
#Matches found: 4
S000010870 [0]Bacteria;[1]Firmicutes;[2]Bacilli;[3]Bacillales;[4]Bacillaceae;[5]Bacillus;[6]Bacillus_sp._8A18S6; 91.78 1435 90 26 49 1469 1 1421 0.0 1971
#Matches found: 4
S000014058 [0]Bacteria;[1]Firmicutes;[2]Bacilli;[3]Bacillales;[4]Bacillaceae;[5]Bacillus;[6]Bacillus_sp._8A18S6; 92.20 1435 88 22 29 1453 1 1421 0.0 2008
#Matches found: 4
S000016099 [0]Bacteria;[1]Firmicutes;[2]Bacilli;[3]Bacillales;[4]Bacillaceae;[5]Bacillus;[6]Bacillus_sp._8A18S6; 91.66 1438 86 29 49 1469 1 1421 0.0 1960
#Matches found: 4
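In the sample above, each lineage entry carries a rank index in brackets, from [0] (domain) down to [6] (species). A minimal sketch for pulling one rank out of such a file follows. `taxon_at_rank` is a hypothetical helper, and it assumes the columns are whitespace-separated with the lineage in the second column, as in the sample.

```shell
# Hypothetical helper: print "<read id> <taxon>" for a given rank index.
taxon_at_rank() {
    # $1 = consensus output file, $2 = rank index (0 = domain ... 6 = species)
    awk -v rank="$2" '
        /^#/ { next }                       # skip "#Matches found" lines
        {
            n = split($2, parts, ";")       # lineage entries are ";"-separated
            tag = "[" rank "]"
            for (i = 1; i <= n; i++)
                if (index(parts[i], tag) == 1)
                    print $1, substr(parts[i], length(tag) + 1)
        }' "$1"
}

# Usage: taxon_at_rank output_consensus.txt 5
```

For the first sample record, rank 5 gives `S000008953 Bacillus` and rank 0 gives `S000008953 Bacteria`.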
#Cluster your results by identity:
Example for 80% identity*:
perl $PANGEAWD/Megaclust/megaclust2.pl -i $PANGEAWD/output_consensus.txt -o $PANGEAWD/output_consensus.megaclust_80_hits.txt -b 100 -s 80 -e 1e-20
*More examples and automatic scripts at $PANGEAWD/Scripts
#Generate summary table for classified results:
Example for Domain level (80%) similarity*:
perl $PANGEAWD/Megaclustable/megaclustable.pl -m $PANGEAWD/output_consensus.megaclust_80_hits.txt -t 0 -o $PANGEAWD/results/megaclustable/DomainTable.txt
*Note: in the -m option you should list all the output files generated by the megaclust execution for every sample. More examples and automatic scripts at $PANGEAWD/Scripts.
The classification output should look like this:
1 2 3 4 5 6 7 8 9 10
Bacteria 479 4 32 7507 11977 13245 2129 11222 539 2411
Eukaryota 1 4 5 5 2 17 78 3 10 3
Archaea 1 0 0 0 0 0 0 0 0 1
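The table holds one row per taxon and one column per sample. As a quick sanity check, the rows can be summed across samples. The sketch below assumes the layout shown above (a first line of sample labels, then a taxon name followed by counts); `row_totals` is a hypothetical helper, not part of PANGEA+.

```shell
# Hypothetical helper: sum the counts of each taxon across all samples.
row_totals() {
    # $1 = megaclustable output; line 1 is assumed to hold the sample labels
    awk 'NR > 1 {
        total = 0
        for (i = 2; i <= NF; i++) total += $i
        print $1, total
    }' "$1"
}

# Usage: row_totals DomainTable.txt
```

For the table above this yields 49545 for Bacteria, 128 for Eukaryota and 2 for Archaea.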
#References