This repository contains code for a computational workflow for RNA-Seq data preprocessing and data analysis using:
- Bash (for HPC environments; my earlier scripts)
- Nextflow
Both implementations make use of containerized environments using Docker or Singularity.
Docker images for the following tools are available on Docker Hub:
- STAR-RSEM
- FASTQC
- MultiQC
- TRIMGALORE
These images can be pulled with `docker pull amnahsid/rnaseq_analysis`. Please note that the combined image is quite large.
To get started with the Nextflow version of the pipeline, make sure you have the following installed:
- Java 8 or higher
- Nextflow
- Docker (recommended for macOS users; use Docker for Mac)
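Before running the pipeline, it can help to confirm that the prerequisites listed above are on your `PATH`. The snippet below is a minimal sketch of such a check; the helper name `check_tool` is hypothetical and not part of this repository.

```shell
# Report whether a required tool is available on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Check every prerequisite and count how many are absent.
missing=0
for tool in java nextflow docker; do
    check_tool "$tool" || missing=$((missing + 1))
done
echo "$missing required tool(s) missing"
```

If Singularity is used instead of Docker, swap `docker` for `singularity` in the loop.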
The sample sheet uses the same format as the nf-core RNAseq sample file. It must have four columns:
- sample < Sample name; can be any identifier of the user's choosing. Samples are not merged by this name (for example, to pool sequencing depth across runs) >
- fastq_1 < Path to read1.fastq file >
- fastq_2 < Path to read2.fastq file >
- strandedness < reverse, forward, auto >
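As an illustration of the four-column layout, the sketch below writes a small sample sheet and checks that every row has exactly four comma-separated fields. The sample names, FASTQ paths, and the file name `samplesheet.csv` are hypothetical, and the header row is an assumption based on the nf-core format.

```shell
# Write an example sample sheet (all sample names and paths are hypothetical).
cat > samplesheet.csv <<'EOF'
sample,fastq_1,fastq_2,strandedness
control_rep1,input/control_rep1_R1.fastq.gz,input/control_rep1_R2.fastq.gz,auto
treated_rep1,input/treated_rep1_R1.fastq.gz,input/treated_rep1_R2.fastq.gz,reverse
EOF

# Count the rows that do not have exactly four comma-separated fields.
bad_rows=$(awk -F',' 'NF != 4 { n++ } END { print n + 0 }' samplesheet.csv)
echo "rows with a wrong column count: $bad_rows"
```

A non-zero count points to a missing or extra column somewhere in the file.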
new_workflow/                  <- Working directory for analysis
├── annotation/                <- Genome annotation file (.GTF/.GFF)
├── genome/                    <- Host genome file (.FASTA)
├── input/                     <- Location of input RNAseq data
└── output/                    <- Data generated during processing steps
    ├── 1_initial_qc/          <- Quality control (FASTQC)
    ├── 2_trimmed_output/      <- Trimmed reads (TRIMGALORE)
    ├── 3_aligned_sequences/   <- Main alignment files for each sample (using STAR)
    ├── 4_final_counts/        <- Summarized gene counts across all samples
    ├── 5_multiQC/             <- Overall report of logs/QC for each step
    └── star_index/            <- Folder to store the indexed genome files from STAR/STAR-RSEM
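The layout above can be created in one step. This is a minimal sketch that assumes the working directory is named `new_workflow` and is created under the current directory; the workflow script may create some of these folders itself.

```shell
# Create the working-directory skeleton described above.
workdir="new_workflow"
for sub in annotation genome input \
           output/1_initial_qc output/2_trimmed_output \
           output/3_aligned_sequences output/4_final_counts \
           output/5_multiQC output/star_index; do
    mkdir -p "$workdir/$sub"
done
```

After this, place the genome FASTA in `genome/`, the GTF/GFF in `annotation/`, and the FASTQ files in `input/`.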
Example use:
> RNAseq_workflow.sh -g <109> <-p> -i <path_of_inputs> -d <analyses directory name> -o <path_of_outputs> -t <threads>
Options:
-g version of the human reference genome; default is GRCh38 (hg38) version 108; a newer version can be passed here, e.g. 109
-p include for single-end data (to do); default is paired-end data
-a sjdbOverhang value for STAR
-i path to the directory of input fastq or fastq.gz files
-m metadata CSV file in the format described above; must have four columns
-d directory name for output files; it will be created in the current working directory
-t average number of threads for each sample; must be an integer; default is 1
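For readers who want to see how flags like these are typically handled in Bash, here is a minimal `getopts` sketch. It is not the code of `RNAseq_workflow.sh`: the function name `parse_args` and all variable names are hypothetical, and only the options documented above are covered (the `-o` flag from the example is left out because it is not described in the list).

```shell
# Hypothetical option parser mirroring the documented flags; not the actual script.
parse_args() {
    genome_version=108   # documented default
    single_end=0         # paired-end unless -p is given
    threads=1            # documented default
    overhang="" input_dir="" metadata="" out_name=""
    OPTIND=1
    while getopts "g:pa:i:m:d:t:" opt; do
        case "$opt" in
            g) genome_version=$OPTARG ;;
            p) single_end=1 ;;
            a) overhang=$OPTARG ;;
            i) input_dir=$OPTARG ;;
            m) metadata=$OPTARG ;;
            d) out_name=$OPTARG ;;
            t) threads=$OPTARG ;;
            *) echo "unknown option" >&2; return 1 ;;
        esac
    done
    # -t must be a whole number (digits only).
    case "$threads" in
        ''|*[!0-9]*) echo "-t must be an integer" >&2; return 1 ;;
    esac
}

parse_args -g 109 -i ./input -m samplesheet.csv -d run1 -t 4
echo "genome=$genome_version threads=$threads single_end=$single_end"
```

Options that take a value are followed by a colon in the `getopts` string, while `-p` is a plain switch.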
STAR is the default and only aligner in the pipeline. It maps raw FastQ reads to the reference genome, which by default is Human GRCh38 version 108. The pipeline cannot be switched to alternative aligners such as HISAT2. However, the STAR index can be built through RSEM, which lets RSEM quantify the STAR alignments and report expression values as TPM (Transcripts Per Million) and FPKM (Fragments Per Kilobase of transcript per Million mapped fragments), with the aim of more accurate quantification.
The pipeline offers multiple options for raw feature counting: HTSeq-count, featureCounts, STAR quantMode, and RSEM. RSEM is the recommended choice because of its accuracy in quantifying gene expression levels, a preference that follows Trapnell et al., 2012 and Li and Dewey, 2011.
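The STAR and RSEM combination described above can be sketched with RSEM's own command-line tools. The commands below are an illustration only and may differ from what the pipeline actually runs; all file paths, the reference prefix, and the thread count are assumptions. They are printed rather than executed, since RSEM and STAR need to be installed (for example inside the container).

```shell
# Assumed paths; adjust to your own layout.
GENOME_FA="genome/genome.fa"
GTF="annotation/annotation.gtf"
REF_PREFIX="output/star_index/GRCh38"
THREADS=4

# Build a combined STAR + RSEM reference (run once per genome/annotation).
index_cmd="rsem-prepare-reference --gtf $GTF --star -p $THREADS $GENOME_FA $REF_PREFIX"

# Align with STAR and quantify with RSEM for one paired-end, gzipped sample.
quant_cmd="rsem-calculate-expression --star --star-gzipped-read-file --paired-end -p $THREADS sample_R1.fastq.gz sample_R2.fastq.gz $REF_PREFIX sample"

# Print the commands for review.
echo "$index_cmd"
echo "$quant_cmd"
```

With these arguments, RSEM writes `sample.genes.results` and `sample.isoforms.results`, which include TPM and FPKM columns alongside expected counts.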
Ensembl FASTA and GTF
- For more detailed information, please refer to the documentation available in the repository's wiki.
- Under development
- Indexed genomes to download, if needed