ChimeraTE is a pipeline to detect chimeric transcripts derived from genes and transposable elements (TEs). It has two running Modes:
- Mode 1: chimeric transcript detection based on the positions of exons and TE copies in the genome sequence;
- Mode 2: chimeric transcript detection regardless of genomic position, allowing the detection of chimeras from TEs that are not present in the reference genome, but with lower sensitivity.
Installation is easiest with conda. If you don't have conda installed on your machine, please follow this tutorial.
Once you have installed conda, you need to enable the Bioconda channel with:
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict
Then, all dependencies to run ChimeraTE can be installed in a new conda environment by using the chimeraTE.yml file:
Download the repository from GitHub:
git clone https://github.com/OliveiraDS-hub/ChimeraTE.git
Change to the ChimeraTE folder:
cd ChimeraTE
Create the chimeraTE environment with all dependencies:
conda env create -f chimeraTE.yml
Activate the new environment:
conda activate chimeraTE
Note: We advise you to return your condarc config to the default with:
conda config --remove channels bioconda
conda config --remove channels conda-forge
conda config --set channel_priority false
As an alternative to conda, you can use Singularity v3.10.0+ to build a container with all dependencies for ChimeraTE.
If you don't have sudo permissions:
singularity build --fakeroot chimeraTE.simg singularity.def
If you have sudo permissions:
sudo singularity build chimeraTE.simg singularity.def
Then, to run ChimeraTE:
singularity exec chimeraTE.simg python3 chimTE_mode1.py --help
singularity exec chimeraTE.simg python3 chimTE_mode2.py --help
If you don't have conda or Singularity, you can install all dependencies manually, as an old-school bioinformatician would. Note that all of them must be available in your PATH.
- Python dependencies
- Software
In order to run ChimeraTE, the following files are required according to the running Mode:
Data | Mode 1 | Mode 2 | Mode 2 --assembly |
---|---|---|---|
Stranded paired-end RNA-seq - Fastq files | X | X | X |
Assembled genome - Fasta file with chromosome/scaffold/contig sequences | X | | |
Gene annotation - GTF file with gene annotations (UTRs, exons, CDS) | X | | |
TE annotation - GTF file with TE insertions | X | | |
Reference transcripts - Fasta file with reference transcripts | | X | X |
Reference TEs - Fasta file with reference TE insertions | | X | |
Dfam taxonomy OR fasta file with reference TE consensuses | | | X |
In Mode 1, chimeric transcripts are detected based on the genomic location of TE insertions and exons. Chimeras from this Mode can be classified as TE-initiated, TE-exonized, and TE-terminated transcripts. Mode 1 does not detect chimeric transcripts derived from TE insertions that are absent from the provided reference genome.
cd ChimeraTE/
python3 chimTE_mode1.py --help
ChimeraTE Mode 1: The genome-guided approach to detect chimeric transcripts with RNA-seq data.
Required arguments:
--genome Genome in fasta
--input Paired-end files and their respective group/replicate
--project Directory name with output data
--te GTF file containing TE information
--gene GTF file containing gene information
--strand Define the strandness direction of the RNA-seq. Two options:
"rf-stranded" OR "fwd-stranded"
Optional arguments:
--chimera Identify specific type of chimera: "TE-initiated" OR "TE-
exonized" OR "TE-terminated"
--window Upstream and downstream window size (default = 3000)
--replicate Minimum recurrency of chimeric transcripts between RNA-seq
replicates (default 2)
--coverage Minimum coverage (mean between replicates default 2 for
chimeric transcripts detection)
--fpkm Minimum fpkm to consider a gene as expressed (default 1)
--threads Number of threads (default 6)
--overlap Minimum overlap between chimeric reads and TE insertions (default 0.50)
--index Absolute path to pre-existing STAR index
The tab-delimited table provided with --input must have a specific format:
First column: Mate 1 from the paired-end data
Second column: Mate 2 from the paired-end data
Third column: Replicate/group name
mate1 | mate2 | rep |
---|---|---|
/home/user/ChimeraTE/mate1_control1.fastq.gz | /home/user/ChimeraTE/mate2_control1.fastq.gz | rep1 |
/home/user/ChimeraTE/mate1_control2.fastq.gz | /home/user/ChimeraTE/mate2_control2.fastq.gz | rep2 |
/home/user/ChimeraTE/mate1_control3.fastq.gz | /home/user/ChimeraTE/mate2_control3.fastq.gz | rep3 |
The table must not contain a header line, as in the example --input table at example_data/mode1/input_example.tsv
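Before launching a run, it can help to check that your --input table follows the layout described above. The snippet below is a minimal sketch of such a check, not part of ChimeraTE: the function name `read_input_table` is hypothetical, and it only verifies the three tab-separated columns, not whether the fastq files exist.

```python
import csv


def read_input_table(path):
    """Parse an --input style table: mate1 <TAB> mate2 <TAB> replicate, no header.

    Hypothetical helper for illustration; ChimeraTE does its own parsing.
    """
    rows = []
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        for line_no, fields in enumerate(reader, start=1):
            if not fields:  # csv.reader yields [] for blank lines; skip them
                continue
            if len(fields) != 3:
                raise ValueError(
                    f"line {line_no}: expected 3 tab-separated columns, got {len(fields)}"
                )
            mate1, mate2, rep = (field.strip() for field in fields)
            if not (mate1 and mate2 and rep):
                raise ValueError(f"line {line_no}: empty column")
            rows.append((mate1, mate2, rep))
    return rows
```

Calling `read_input_table("example_data/mode1/input_example.tsv")` should return one `(mate1, mate2, replicate)` tuple per line, and raise a `ValueError` if a line was separated with spaces instead of tabs.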
In many databases, the coordinates of TE insertions are provided as a RepeatMasker .out file. If you already have a .out file from RepeatMasker, you can convert it to .gtf on Linux with:
tail -n +4 RMfile.out | egrep -v 'Satellite|Simple_repeat|rRNA|Low_complexity|RNA|ARTEFACT' | awk -v OFS='\t' '{Sense=$9;sub(/C/,"-",Sense);$9=Sense;print $5,"RepeatMasker","similarity",$6,$7,$2,$9,".",$10}' > RMfile.gtf
If you don't have the .out file for your genome assembly, check out the util section.
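If you prefer to see what the one-liner above does step by step, the following Python sketch mirrors it: it skips the three header lines, drops the same repeat classes, and writes the same nine tab-separated columns. The function names are hypothetical and this is an illustration, not a script shipped with ChimeraTE. One small difference is labeled in the comments: lines with fewer than 10 fields are skipped here, whereas awk would print them with empty columns.

```python
import re

# Same pattern the shell one-liner passes to egrep -v (matched anywhere in the line).
EXCLUDED = re.compile(r"Satellite|Simple_repeat|rRNA|Low_complexity|RNA|ARTEFACT")


def rm_out_to_gtf_lines(lines):
    """Yield GTF-like records from the lines of a RepeatMasker .out file."""
    for index, line in enumerate(lines):
        if index < 3:  # equivalent to `tail -n +4`: skip the 3 header lines
            continue
        if EXCLUDED.search(line):
            continue
        fields = line.split()
        if len(fields) < 10:  # assumption: ignore incomplete lines (awk would not)
            continue
        # Field 9 is "+" or "C" (complement); replace the first "C" with "-" like awk's sub().
        strand = fields[8].replace("C", "-", 1)
        yield "\t".join(
            [fields[4], "RepeatMasker", "similarity", fields[5], fields[6],
             fields[1], strand, ".", fields[9]]
        )


def convert(out_path, gtf_path):
    """Read a .out file and write the converted records to gtf_path."""
    with open(out_path) as src, open(gtf_path, "w") as dst:
        for record in rm_out_to_gtf_lines(src):
            dst.write(record + "\n")
```

For example, `convert("RMfile.out", "RMfile.gtf")` would produce the same kind of file as the shell command.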
After installation, you can run ChimeraTE with the example data from the sampled RNA-seq from D. melanogaster used in our paper.
#Do not forget to activate your conda environment:
conda activate chimeraTE
#One-line
python3 chimTE_mode1.py --genome example_data/mode1/dmel_genome_sample.fa --input example_data/mode1/input_mode1.tsv --project example_mode1 --te example_data/mode1/dmel_TEs_sample.gtf --gene example_data/mode1/dmel_genes_sample.gtf --strand rf-stranded
#Multi-line
python3 chimTE_mode1.py --genome example_data/mode1/dmel_genome_sample.fa \
--input example_data/mode1/input_mode1.tsv \
--project example_mode1 \
--te example_data/mode1/dmel_5TEs_sample.gtf \
--gene example_data/mode1/dmel_5genes_sample.gtf \
--strand rf-stranded
If you have more than 6 threads available on your machine, you can use --threads to speed up the process.
The output files can be found at ChimeraTE/projects/$your_project_name. For instance, for the example data, you can find the output at ChimeraTE/projects/example_mode1. Inside this directory, you may find three tables:
- TE-initiated_final.ct
- TE-exonized_final.ct
- TE-terminated_final.ct
These tables list the chimeric transcripts, with the location of the genes and TE insertions generating the chimeras, as well as the corresponding coverage of chimeric reads (support). In the 7th column of TE-exonized_final.ct, you can find the position of the TE within the gene region (Embedded, Intronic, or Overlapped), as in the example below:
=========================> TE-initiated_final.ct <=========================
gene_id | gene_strand | gene_pos | TE_id | TE_strand | TE_pos | chim_reads |
---|---|---|---|---|---|---|
FBgn0031188 | - | X_RaGOO:21340686-21343686 | S2 | + | X_RaGOO:21341507-21342141 | 11.5 |
=========================> TE-exonized_final.ct <=========================
gene_id | gene_strand | gene_pos | TE_id | TE_strand | TE_pos | exonized_type | chim_reads |
---|---|---|---|---|---|---|---|
FBgn0285926 | - | X_RaGOO:10476773-10513188 | roo | - | X_RaGOO:10485868-10485985 | Embedded | 63.5 |
FBgn0052000 | + | 4_RaGOO:126456-137357 | 1360 | + | 4_RaGOO:133965-134061 | Overlapped | 4.5 |
FBgn0039923 | - | 4_RaGOO:761931-772400 | FB | - | 4_RaGOO:769101-769563 | Intronic | 91.0 |
=========================> TE-terminated_final.ct <=========================
gene_id | gene_strand | gene_pos | TE_id | TE_strand | TE_pos | chim_reads |
---|---|---|---|---|---|---|
FBgn0011747 | - | 4_RaGOO:106334-1093346 | G5 | - | 4_RaGOO:109144-109334 | 5.0 |
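When post-processing these tables, you will often need the coordinates as numbers rather than strings. The sketch below assumes that positions are always written as `sequence:start-end`, as displayed in the examples above; the function names are hypothetical and not part of ChimeraTE.

```python
def parse_position(pos):
    """Split 'sequence:start-end' into (sequence, start, end) with integer coordinates."""
    # rpartition keeps any ':' inside the sequence name on the left-hand side.
    chrom, _, span = pos.rpartition(":")
    start, _, end = span.partition("-")
    if not chrom or not start or not end:
        raise ValueError(f"unexpected position string: {pos!r}")
    return chrom, int(start), int(end)


def te_within_region(gene_pos, te_pos):
    """True if the TE interval lies entirely inside the gene_pos interval."""
    g_chrom, g_start, g_end = parse_position(gene_pos)
    t_chrom, t_start, t_end = parse_position(te_pos)
    return g_chrom == t_chrom and g_start <= t_start and t_end <= g_end


# Values copied from the TE-initiated example row above.
print(te_within_region("X_RaGOO:21340686-21343686", "X_RaGOO:21341507-21342141"))  # True
```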
Mode 2 is designed to identify chimeric transcripts without the reference genome, allowing the prediction of chimeras from both fixed and polymorphic TEs. In Mode 2, two alignments of the stranded RNA-seq reads are performed: (1) against transcripts; (2) against TE insertions. From these alignments, all reads supporting chimeric transcripts (chimeric reads) are identified. Chimeric reads come from read pairs whose two mates align as singletons to different references, one to transcripts and the other to TEs, or from read pairs that align concordantly in one of the alignments but as a singleton in the other. There is also an option to perform de novo transcriptome assembly with the --assembly parameter. This additional analysis checks whether gene transcripts contain TE-derived sequences.
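The two kinds of chimeric reads can be written down as a small decision rule. The sketch below is only one reading of the paragraph above, not the pipeline's actual code: the status labels ("concordant", "mate1", "mate2", "none") and the function name are hypothetical, and each read pair is assumed to have one status per alignment.

```python
SINGLETONS = {"mate1", "mate2"}  # only mate 1, or only mate 2, aligned


def is_chimeric_pair(transcript_status, te_status):
    """Decide whether a read pair supports a chimera, given its status in each alignment.

    Each status is one of: "concordant" (both mates aligned), "mate1" or "mate2"
    (a singleton: only that mate aligned), or "none" (neither mate aligned).
    """
    # Case 1: the mates are split, one singleton on transcripts and the OTHER mate on TEs.
    if transcript_status in SINGLETONS and te_status in SINGLETONS:
        return transcript_status != te_status
    # Case 2: concordant in one alignment, singleton in the other.
    if transcript_status == "concordant" and te_status in SINGLETONS:
        return True
    if te_status == "concordant" and transcript_status in SINGLETONS:
        return True
    return False
```

For instance, `is_chimeric_pair("mate1", "mate2")` and `is_chimeric_pair("concordant", "mate2")` both return `True`, while a pair that is concordant against transcripts and unaligned against TEs returns `False`.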
cd ChimeraTE/
python3 chimTE_mode2.py --help
ChimeraTE Mode 2: The genome-blinded approach to detect chimeric transcripts with RNA-seq data.
Required arguments:
--input Paired-end files and their respective group/replicate
--project Directory name with output data
--te Fasta file containing TE information
--transcripts Fasta file containing gene information
--strand Define the strandness direction of the RNA-seq. Two options:
"rf-stranded" OR "fwd-stranded"
Optional arguments:
--coverage Minimum coverage (mean between replicates default 2 for
chimeric transcripts detection)
--fpkm Minimum fpkm to consider a gene as expressed (default = 1)
--replicate Minimum recurrency of chimeric transcripts between RNA-seq
replicates (default = 2)
--threads Number of threads (default = 6)
--assembly Search for chimeric transcript with transcriptome assembly
with Trinity
--ref_TEs "species" database used by RepeatMasker (flies, human,
mouse, arabidopsis; or a built TE library in fasta format)
--ram Ram memory in Gbytes
(default = 8)
--overlap Minimum overlap between chimeric reads and TE insertions
(default 0.50)
--TE_length Minimum TE length to keep it from RepeatMasker output
(default = 80bp)
--identity Minimum identity between de novo assembled transcripts and
reference transcripts (default = 80)
Although the input files are simple fasta files, together with the paired-end RNA-seq reads, the sequence IDs for transcripts and TEs must follow a specific pattern. To make it easier to generate these formats, we provide util scripts to manage your data.
- In order to run ChimeraTE correctly, the transcripts fasta file must have a specific header pattern. Every ID must be composed of the isoform ID first, followed by the gene name. For instance, in D. melanogaster, the gene FBgn0263977 has two transcripts:
Tim17b-RA_FBgn0263977
Tim17b-RB_FBgn0263977
- Note that in these headers the isoform IDs "Tim17b-RA" and "Tim17b-RB" are separated from the gene name by "_". This is not a usual ID format, therefore we have developed auxiliary scripts ($FOLDER/ChimeraTE/util/) to convert native ID formats to the ChimeraTE format.
- transcripts_IDs_NCBI.sh (native IDs from NCBI to the ChimeraTE format)
- transcripts_IDs_ensembl.sh (native IDs from ENSEMBL to the ChimeraTE format)
- transcripts_IDs_FLYBASE.sh (native IDs from FLYBASE to the ChimeraTE format)
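To check your own fasta headers against this pattern, a few lines of Python are enough. This is a minimal sketch, not one of the util scripts: the function names are hypothetical, and it assumes the gene name itself contains no "_" (true for FlyBase IDs such as FBgn0263977), so the header is split on the last underscore.

```python
def make_header(isoform_id, gene_id):
    """Build a ChimeraTE-style ID: isoform ID, then "_", then gene name."""
    return f"{isoform_id}_{gene_id}"


def split_header(header):
    """Split a fasta header such as '>Tim17b-RA_FBgn0263977' into (isoform, gene)."""
    parts = header.lstrip(">").split()
    if not parts:
        raise ValueError("empty header")
    # Assumption: the gene name has no underscore, so split on the LAST one.
    isoform, sep, gene = parts[0].rpartition("_")
    if not sep or not isoform or not gene:
        raise ValueError(f"header does not match isoform_gene: {header!r}")
    return isoform, gene


print(split_header(">Tim17b-RA_FBgn0263977"))  # ('Tim17b-RA', 'FBgn0263977')
```

Running `split_header` over every line starting with ">" in your transcripts fasta is a quick way to spot headers that still use a native database format.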
After installation, you can run ChimeraTE Mode 2 with the example data from the sampled RNA-seq from D. melanogaster used in our paper.
#Do not forget to activate your conda environment:
conda activate chimeraTE
#One-line
python3 chimTE_mode2.py --input example_data/mode2/input_mode2.tsv --project example_mode2 --te example_data/mode2/dmel-sampled_TE-copies.fa --transcripts example_data/mode2/dmel-sampled_transcripts.fa --strand rf-stranded --assembly
#Multi-line
python3 chimTE_mode2.py --input example_data/mode2/input_mode2.tsv \
--project example_mode2 \
--te example_data/mode2/dmel-sampled_TE-copies.fa \
--transcripts example_data/mode2/dmel-sampled_transcripts.fa \
--strand rf-stranded \
--assembly
By default, Mode 2 runs with 6 threads and 8 GB of RAM; you can speed up the analysis by increasing these values with --threads and --ram, respectively.
NOTE: If you are not working with Drosophila data, do not forget to change the --ref_TEs parameter, providing either a Dfam taxonomy level to use with RepeatMasker or a fasta file with TE consensuses.
The output files can be found at ChimeraTE/projects/$your_project_name. For instance, for the example data, you can find the output at ChimeraTE/projects/example_mode2. Inside this directory, you may find three tables:
- chimreads_evidence_FINAL.tsv
In the "chimreads_evidence" table, you will find chimeric transcripts supported only by paired-end reads that mapped to both transcripts and TE sequences (singletons and concordant/singleton; check the manuscript's methods).
- transcriptome_evidence_FINAL.tsv
In the "transcriptome_evidence" table, you will find chimeras supported only by the transcriptome assembly method (if you have activated the --assembly option). This table provides the gene, the TE family, and the respective assembled transcript ID in which a TE sequence was found.
- double_evidence_FINAL.tsv
Finally, "double_evidence" is the list of chimeras predicted by both previous methods (strong evidence!), containing all information from both previous tables.
=========================> chimreads_evidence_FINAL.tsv <=========================
gene_id | TE_family | chim_reads | transcript_ID | transcript_FPKM |
---|---|---|---|---|
FBgn0058160 | DNAREP1 | 60.0 | CG40160-RH_FBgn0058160 | 62177.475 |
=========================> transcriptome_evidence_FINAL.tsv <=========================
gene_id | TE_family | transcript_ID | Trinity_transcripts | Identity_transcripts | trinity_length | ref_transcript_length | match_length | chim_reads |
---|---|---|---|---|---|---|---|---|
FBgn0286778 | HMSBEAGLE_I | CG46385-RA | TRINITY_DN87_c0_g1_i1; TRINITY_DN88_c0_g1_i1 | 97.992 | 741.5 | 5129.0 | 732.0 | 31.0 |
=========================> double_evidence_FINAL.tsv <=========================
gene_id | TE_family | chim_reads | masked_family | chim_reads_masked | ref_transcript_FPKM | Trinity_transcripts | Identity_transcripts | trinity_length | ref_transcript_length | match_length | ref_transcript_IDs |
---|---|---|---|---|---|---|---|---|---|---|---|
FBgn0001169 | ROO | 32.0 | ROO_I | 4.0 | 3011.5781 | TRINITY_DN13_c0_g1_i3; TRINITY_DN13_c0_g1_i2 | 100.0 | 604.5 | 4069.5 | 603.5 | H-RD; H-RB H-RD_FBgn0001169; H-RB_FBgn0001169; H-RA_FBgn0001169 |
Daniel S Oliveira, Marie Fablet, Anaïs Larue, Agnès Vallier, Claudia M A Carareto, Rita Rebollo, Cristina Vieira. ChimeraTE: A pipeline to detect chimeric transcripts derived from genes and transposable elements. Nucleic Acids Research, 2023. https://doi.org/10.1093/nar/gkad671
To report bugs or send us suggestions, you can open an issue on the GitHub repository.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.