RGB EPP

Reference Genome based Exon Phylogeny Pipeline

License: GPL-2.0-only

Author: Guoyi Zhang

Requirements

External software

fastp
spades.py (provided by spades)
diamond
bowtie2
samtools
bcftools
exonerate (optional, only for --codon)
java
macse (default recognized path: /usr/share/java/macse.jar)
trimal

Internal software

sortdiamond (default recognized path: /usr/bin/sortdiamond)
delstop (default recognized path: /usr/bin/delstop)

Arguments

Details

    -c	--config	config file for software path (optional)
    -g	--genes		gene file path (optional, if -r is specified)
    -f	--functions	functions type (optional): all clean assembly 
      	           	 map postmap varcall consen codon align trim
    -h	--help		show this information
    -l	--list		list file path
    -m	--memory	memory settings (optional, default 16 GB)
    -r	--reference	reference genome path
    -t	--threads	threads setting (optional, default 8 threads)
    --codon		Only use the codon region (optional)
    --fastp		Fastp path (optional)
    --spades		Spades python path (optional)
    --diamond		Diamond path (optional)
    --sortdiamond	SortDiamond path (optional)
    --bowtie2		Bowtie2 path (optional)
    --samtools		Samtools path (optional)
    --bcftools		Bcftools path (optional)
    --exonerate		Exonerate path (optional)
    --macse		Macse jarfile path (optional)
    --delstop		Delstop path (optional)
    --trimal		Trimal path (optional)
    for example: ./RGBEPP -f all -l list -t 8 -r reference.fasta

Directories Design

.
├── 00_raw
├── 01_fastp
├── 02_spades
├── 03_bowtie2
├── 04_bam
├── 05_vcf
├── 06_consen
├── 07_macse
├── 08_trimal
├── list
├── gene
├── reference.aa.fasta
└── RGBEPP

Each numbered directory corresponds to one function.

00_raw should contain all raw fastq.gz data.
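The layout above can be prepared before the first run. This is a minimal sketch, assuming RGBEPP is started from the project root and tolerates directories that already exist (it may well create them itself). The project_dir name is hypothetical.

```shell
#!/bin/sh
# Sketch: create the working directories shown in the tree above.
# project_dir is a hypothetical name; point it at your own project root.
project_dir="${PROJECT_DIR:-rgbepp_project}"

for d in 00_raw 01_fastp 02_spades 03_bowtie2 04_bam 05_vcf 06_consen 07_macse 08_trimal; do
    mkdir -p "$project_dir/$d"
done

# Raw paired-end reads (*.fastq.gz) then go into 00_raw.
ls "$project_dir"
```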

Text Files

list is a text file naming every sample, one per line. If your raw data follow the pattern ${list_name}_R1.fastq.gz and ${list_name}_R2.fastq.gz, then ${list_name} is what you should put in the list file. On a Linux/Unix system, an easy way to generate it is the following command:

cd 00_raw
ls | sed 's@_R[12]\.fastq\.gz$@@' | sort -u > ../list
cd ..
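Because each sample is expected to have both an R1 and an R2 file, it can help to check the list against 00_raw before starting. The following is a sketch only: check_pairs is a hypothetical helper, not part of RGBEPP, and the demo runs on throwaway empty files.

```shell
#!/bin/sh
# Sketch: report samples in a list file that lack an R1 or R2 file.
# check_pairs is a hypothetical helper, not part of RGBEPP.
check_pairs() {
    list_file=$1
    raw_dir=$2
    missing=0
    while IFS= read -r sample; do
        [ -n "$sample" ] || continue            # skip blank lines
        for mate in R1 R2; do
            f="$raw_dir/${sample}_${mate}.fastq.gz"
            if [ ! -f "$f" ]; then
                echo "missing: $f"
                missing=$((missing + 1))
            fi
        done
    done < "$list_file"
    [ "$missing" -eq 0 ]                        # non-zero status if anything is missing
}

# Demo on throwaway data: sampleB has no R2 file.
demo=$(mktemp -d)
mkdir "$demo/00_raw"
touch "$demo/00_raw/sampleA_R1.fastq.gz" "$demo/00_raw/sampleA_R2.fastq.gz" \
      "$demo/00_raw/sampleB_R1.fastq.gz"
printf 'sampleA\nsampleB\n' > "$demo/list"
report=$(check_pairs "$demo/list" "$demo/00_raw") || echo "incomplete pairs found"
echo "$report"
```

On real data the call would be check_pairs list 00_raw from the project root.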

genes is a text file containing all gene names from the reference fasta file. On a Linux/Unix system, an easy way to generate it is the following command:

grep '^>' Reference.fasta | sed 's@^>@@' > genes

reference.aa.fasta can have any other name, but it must contain the reference genome's amino acid sequences in fasta format.

Process

RGBEPP functions

Function clean: Quality control + trimming (fastp)
Function assembly: de novo assembly (spades)
Function map: local alignment search of nucleotide sequences against the amino acid subject sequences (diamond, sortdiamond), then mapping of raw reads to their scaffold sequences (bowtie2)
Function postmap: sorting and marking the read alignments (samtools)
Function varcall: variant calling and filtering (bcftools)
Function consen: get consensus fasta file from vcf files (bcftools), then sort sequences based on gene name and taxa name (RGBEPP)
Function codon (optional): only extract the exon sequence (exonerate)
Function align: multiple sequence alignment based on codons (macse)
Function trim: trimming based on codons (trimal, delstop)

Argument requirements for functions

Functions   -g/--genes   -l/--list   -r/--reference
clean                    ✔
assembly                 ✔
map                      ✔           ✔
postmap                  ✔
varcall                  ✔
consen      ✔            ✔
codon       ✔                        ✔
align       ✔
trim        ✔
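To run the pipeline one function at a time, the table above can be encoded as a small lookup. The sketch below only prints the commands (a dry run). The flags_for helper is hypothetical, the file names list, genes and reference.fasta are placeholders, and the optional codon function is left out of the loop.

```shell
#!/bin/sh
# Sketch: print one RGBEPP call per function, with the flags the table marks as required.
# flags_for is a hypothetical helper; list, genes and reference.fasta are placeholder names.
flags_for() {
    case $1 in
        clean|assembly|postmap|varcall) echo "-l list" ;;
        map)        echo "-l list -r reference.fasta" ;;
        consen)     echo "-g genes -l list" ;;
        codon)      echo "-g genes -r reference.fasta" ;;
        align|trim) echo "-g genes" ;;
        *) echo "unknown function: $1" >&2; return 1 ;;
    esac
}

for stage in clean assembly map postmap varcall consen align trim; do
    # Dry run: echo the command instead of executing it.
    echo "./RGBEPP -f $stage $(flags_for "$stage")"
done
```

Removing the echo in the loop would execute each step in order; -f all remains the simpler choice when every function should run.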

Downstream process

concatenate sequences via SeqCombGo or catsequences or sequencematrix
coalescent / concatenated phylogeny

Inner software

sortdiamond

Usage: sortdiamond diamond_output.m8 generated.fasta sseq,qstart,qend,bitscore/evalue,qseq(optional, default 1,6,7,11,17, start from 0) bitscore/evalue(optional, default bitscore)

The indices are counted from 0, so with the defaults sseq is column 2, qstart is column 7, qend is column 8, and so on.

Diamond's default tabular output (--outfmt 6) does not contain qseq, so you must customize the field list of output format 6 to include it.
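Since the index argument counts fields from 0, it has to match whatever field list was given to diamond. The sketch below illustrates the mapping on a made-up hit line. It assumes a six-field layout (qseqid sseqid qstart qend bitscore qseq) chosen only for illustration, not the layout RGBEPP itself uses. pick_fields is a hypothetical helper and does not reproduce what sortdiamond does.

```shell
#!/bin/sh
# Sketch: show which fields a 0-based index list selects from one tabular hit line.
# pick_fields is a hypothetical helper for illustration; it is not sortdiamond.
pick_fields() {
    # $1: comma-separated 0-based indices; stdin: tab-separated lines
    awk -F '\t' -v idx="$1" '
        BEGIN { n = split(idx, want, ",") }
        {
            out = ""
            for (i = 1; i <= n; i++)
                out = out (i > 1 ? "\t" : "") $(want[i] + 1)   # 0-based -> awk 1-based
            print out
        }'
}

# A made-up hit with six fields: qseqid sseqid qstart qend bitscore qseq
line=$(printf 'contig1\tgeneA\t10\t99\t57.4\tATGGCC')

# With this layout the index argument would be 1,2,3,4,5 instead of the default.
picked=$(printf '%s\n' "$line" | pick_fields 1,2,3,4,5)
echo "$picked"
```

With such a layout, the third argument to sortdiamond would be 1,2,3,4,5 rather than the default 1,6,7,11,17.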

delstop

Usage: delstop <fasta_aa> <fasta_nt> --delete

Deletes the stop codons generated by macse. fasta_aa and fasta_nt should be macse output files; use --delete when the downstream software is trimal.

splitfasta

Usage: splitfasta sample.fasta

It always creates directories in the directory from which you run splitfasta and puts the split fasta files into them.
