amnahsiddiqa/RNASEQ_processing

Minimal RNASEQ Workflow with BASH/Nextflow

About

This repository contains code for a computational workflow for RNA-Seq data preprocessing and data analysis using:

  • Bash (for HPC environments; my earlier scripts)
  • Nextflow

Both implementations make use of containerized environments using Docker or Singularity.

Docker/Singularity Images

Docker images for the following tools are available on Docker Hub:

  • STAR-RSEM
  • FASTQC
  • MultiQC
  • TRIMGALORE

The combined image can be pulled with: docker pull amnahsid/rnaseq_analysis. Please note that the combined image is quite large.
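As a rough sketch, the pull command for either container runtime can be written as below. The image name is the one given above; the function name `pull_command` is hypothetical, and the function only prints the command rather than running it, since the download is large. The `docker://` URI form is how Singularity pulls images from Docker Hub.

```bash
#!/usr/bin/env bash
# Sketch: print the pull command for the combined image on a given runtime.
# IMAGE comes from this README; pull_command is a hypothetical helper name.
IMAGE="amnahsid/rnaseq_analysis"

pull_command() {
    local runtime="$1"
    case "$runtime" in
        docker)      echo "docker pull ${IMAGE}" ;;
        singularity) echo "singularity pull docker://${IMAGE}" ;;
        *)           echo "unknown runtime: ${runtime}" >&2; return 1 ;;
    esac
}

pull_command docker
pull_command singularity
```

On an HPC system without Docker, the Singularity form is usually the one to copy and run.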

Nextflow Version (not yet public)

To get started with the Nextflow version of the pipeline, make sure you have the following installed:

Getting started

Create a sample metadata file

This uses the same format as the nf-core RNA-seq sample file and must have four columns:

  • sample <Sample name; can be any identifier of the user's choosing. Rows sharing a sample name are not merged here (for example, to combine runs for sequencing depth).>
  • fastq_1 <Path to read1.fastq file>
  • fastq_2 <Path to read2.fastq file>
  • strandedness <reverse, forward, or auto>
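A minimal sketch of such a file, with a small check of the header and rows, might look like this. The file name `samplesheet.csv`, the sample names, the FASTQ paths, and the helper `validate_samplesheet` are all made up for illustration; only the four column names and the three strandedness values come from the description above.

```bash
#!/usr/bin/env bash
# Hypothetical samplesheet: file name, sample IDs, and paths are placeholders.
samplesheet="samplesheet.csv"
cat > "$samplesheet" <<'EOF'
sample,fastq_1,fastq_2,strandedness
ctrl_rep1,input/ctrl_rep1_R1.fastq.gz,input/ctrl_rep1_R2.fastq.gz,reverse
treat_rep1,input/treat_rep1_R1.fastq.gz,input/treat_rep1_R2.fastq.gz,auto
EOF

# Hypothetical helper: checks the header, the column count, and the
# strandedness value of every data row.
validate_samplesheet() {
    local file="$1"
    local expected="sample,fastq_1,fastq_2,strandedness"
    local header
    header="$(head -n 1 "$file" | tr -d '\r')"
    if [ "$header" != "$expected" ]; then
        echo "bad header: $header" >&2
        return 1
    fi
    awk -F',' 'NR > 1 {
        if (NF != 4) {
            printf "line %d: expected 4 columns, got %d\n", NR, NF > "/dev/stderr"
            bad = 1
        } else if ($4 != "forward" && $4 != "reverse" && $4 != "auto") {
            printf "line %d: unknown strandedness %s\n", NR, $4 > "/dev/stderr"
            bad = 1
        }
    } END { exit bad }' "$file"
}

validate_samplesheet "$samplesheet" && echo "samplesheet OK"
```

The check does not test whether the FASTQ files exist; that would be a natural next step before launching a run.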

Create directory structure like this:

── new_workflow/                    <- Working directory for analysis
  │   └── annotation/               <- Genome annotation file (.GTF/.GFF)
  │  
  │   └── genome/                   <- Host genome file (.FASTA)
  │  
  │   └── input/                    <- Location of input  RNAseq data
  │  
  │   └── output/                   <- Data generated during processing steps
  │       ├── 1_initial_qc/         <- quality control (FASTQC)
  │       ├── 2_trimmed_output/     <- Trimmed reads and logs (Trim Galore)
  │       ├── 3_aligned_sequences/  <- Main alignment files for each sample (using STAR)
  │       ├── 4_final_counts/       <- Summarized gene counts across all samples
  │       ├── 5_multiQC/            <- Overall report of logs/QC for each step
  │   └── star_index/               <-  Folder to store the indexed genome files from STAR/STAR-RSEM
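The layout above can be created in one go. This is only a sketch: the directory names are copied from the tree, and the working directory is assumed to sit in the current directory.

```bash
#!/usr/bin/env bash
# Sketch: create the directory layout shown in the tree above.
workdir="new_workflow"

# Top-level folders for references, inputs, and the STAR index.
for d in annotation genome input star_index; do
    mkdir -p "$workdir/$d"
done

# One output folder per processing step.
for d in 1_initial_qc 2_trimmed_output 3_aligned_sequences 4_final_counts 5_multiQC; do
    mkdir -p "$workdir/output/$d"
done

find "$workdir" -type d | sort
```

Because `mkdir -p` is used, re-running the snippet on an existing layout is harmless.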

Arguments for config file

Example use:

> RNAseq_workflow.sh -g <109> <-p> -i <path_of_inputs> -d <analyses directory name> -o <path_of_outputs> -t <threads>
Options:
   -g    Ensembl release of the human reference genome; default is GRCh38 release 108; a newer release can be passed here, e.g. 109
   -p    include for single-end data (to do); default is paired-end data
   -a    sjdbOverhang value for STAR
   -i    path to directory of input fastq or fastq.gz files
   -m    metadata CSV file in the format described above; must have four columns
   -d    directory name for output files; it will be created in the current working directory
   -t    number of threads for each sample; must be an integer; default is 1
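To illustrate how flags like these are typically handled in Bash, here is a hypothetical `getopts` parser. It is not the code in RNAseq_workflow.sh: the function name and variable names are invented, and it covers only the options in the list above (the `-o` flag from the example line is left out because the list does not describe it).

```bash
#!/usr/bin/env bash
# Hypothetical option parser mirroring the documented flags;
# not taken from RNAseq_workflow.sh.
parse_args() {
    # Defaults as stated in the option list.
    genome_release=108
    paired=true
    threads=1
    overhang=""
    input_dir=""
    metadata=""
    outdir=""

    local opt
    OPTIND=1   # reset so the function can be called more than once
    while getopts "g:pa:i:m:d:t:" opt; do
        case "$opt" in
            g) genome_release="$OPTARG" ;;
            p) paired=false ;;            # single-end mode (marked "to do" above)
            a) overhang="$OPTARG" ;;
            i) input_dir="$OPTARG" ;;
            m) metadata="$OPTARG" ;;
            d) outdir="$OPTARG" ;;
            t) threads="$OPTARG" ;;
            *) echo "unrecognized or incomplete option" >&2; return 2 ;;
        esac
    done

    # -t must be a non-negative integer.
    case "$threads" in
        ''|*[!0-9]*) echo "-t must be an integer" >&2; return 2 ;;
    esac
}

parse_args -g 109 -i input -m samplesheet.csv -d run1 -t 4
echo "release=$genome_release paired=$paired input=$input_dir threads=$threads"
```

Validating `-t` right after parsing means a bad thread count fails before any long-running step starts.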

Alignment Option

STAR is the default and only aligner in the pipeline. It maps raw FASTQ reads to the reference genome, by default human GRCh38 (Ensembl release 108). Switching to an alternative aligner such as HISAT2 is not supported. The STAR index can, however, be built through RSEM, which allows RSEM to quantify the STAR alignments and report expression values in TPM (Transcripts Per Million) and FPKM (Fragments Per Kilobase of transcript per Million mapped reads).
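As a sketch of what building the STAR index through RSEM can look like, the helper below assembles an `rsem-prepare-reference` call and prints it without running it. The helper name, the file names, and the reference prefix are placeholders, not paths used by this repository. STAR's documentation suggests setting sjdbOverhang to the read length minus 1, with 100 as the default.

```bash
#!/usr/bin/env bash
# Dry-run sketch: print (do not execute) an RSEM command that builds
# a STAR index. All paths below are placeholders.
rsem_index_command() {
    local fasta="$1"
    local gtf="$2"
    local ref_prefix="$3"
    local overhang="${4:-100}"   # STAR's default sjdbOverhang
    local threads="${5:-1}"
    echo "rsem-prepare-reference --gtf ${gtf} --star --star-sjdboverhang ${overhang} -p ${threads} ${fasta} ${ref_prefix}"
}

rsem_index_command genome/genome.fa annotation/genes.gtf star_index/GRCh38 100 8
```

The printed command can be run inside the container once the FASTA and GTF files are in place.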

Quantification Options

The pipeline offers several options for raw feature counting: HTSeq-count, featureCounts, STAR's quantMode, and RSEM. RSEM is the recommended choice for its accuracy in quantifying gene expression levels. This preference is supported by the studies of Trapnell et al., 2012 and Li and Dewey, 2011, which highlight RSEM as a best practice for RNA-seq data analysis.
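For the RSEM route, a paired-end quantification call might be assembled as in the sketch below, which prints the command rather than running it. The helper name, the sample name, and all paths are placeholders. One assumption worth flagging: `rsem-calculate-expression` accepts `none`, `forward`, or `reverse` for `--strandedness`, so a sample file value of `auto` would need to be resolved to one of these first; this sketch simply rejects it.

```bash
#!/usr/bin/env bash
# Dry-run sketch: print (do not execute) an RSEM quantification command
# for one paired-end sample with gzipped FASTQ files. Paths are placeholders.
rsem_quant_command() {
    local r1="$1"
    local r2="$2"
    local ref_prefix="$3"
    local sample="$4"
    local strandedness="${5:-none}"
    local threads="${6:-1}"

    # RSEM has no "auto" setting; only these three values are accepted.
    case "$strandedness" in
        none|forward|reverse) ;;
        *) echo "unsupported strandedness for RSEM: ${strandedness}" >&2; return 1 ;;
    esac

    echo "rsem-calculate-expression --star --star-gzipped-read-file --paired-end --strandedness ${strandedness} -p ${threads} ${r1} ${r2} ${ref_prefix} ${sample}"
}

rsem_quant_command input/ctrl_rep1_R1.fastq.gz input/ctrl_rep1_R2.fastq.gz \
    star_index/GRCh38 output/4_final_counts/ctrl_rep1 reverse 8
```

RSEM writes its per-gene and per-isoform tables as `<sample>.genes.results` and `<sample>.isoforms.results`, each with expected count, TPM, and FPKM columns.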

Reference genome files

Ensembl FASTA and GTF

NOTES:
