-
Notifications
You must be signed in to change notification settings - Fork 14
How to prepare gene annotations for gprofile
Serghei Mangul edited this page Jul 30, 2016
·
5 revisions
This is internal document. Use it in case you are planing to prepare gene annotations for a new organism. Note human and mouse annotations are already prepared.
To extract the transcript names from gtf:
awk -F "transcript_id" '{print $2}' genes.gtf | awk -F "transcript_name" '{print $1}' | sed 's/"//g' | sed 's/;//' >transcripts.txt
To extract gene names from gtf:
awk -F "gene_id" '{print $2}' genes.gtf | awk -F "gene_name" '{print $1}' | sed 's/"//g' | sed 's/;//' >genes.txt
Merge them into a single file:
paste genes.txt transcripts.txt | awk '{print $1","$2}' | sort | uniq >genes_transcripts.txt
Please download UT3, UTR5, and CDS from here.
Prepare them in the correct format:
awk -F "_" '{print $1}' CDS_NCBIM37.bed | awk '{if($2<0) print}' >CDS_NCBIM37_v2.bed
awk -F "_" '{print $1}' UTR5_NCBIM37.bed | awk '{if($2<0) print}' >UTR5_NCBIM37_v2.bed
awk -F "_" '{print $1}' UTR3_NCBIM37.bed | awk '{if($2<0) print}' >UTR3_NCBIM37_v2.bed
Now let's make a new bed file with the information of the gene each element belong to:
python ../../../scripts/prepareBed.py UTR3_NCBIM37_v2.bed genes_transcripts.txt ../bedPrepared/geneCoordinates_NCBIM37.bed ../bedPrepared/UTR3_NCBIM37_prepared.bed m
python ../../../scripts/prepareBed.py UTR5_NCBIM37_v2.bed genes_transcripts.txt ../bedPrepared/geneCoordinates_NCBIM37.bed ../bedPrepared/UTR5_NCBIM37_prepared.bed m
python ../../../scripts/prepareBed.py CDS_NCBIM37_v2.bed genes_transcripts.txt ../bedPrepared/geneCoordinates_NCBIM37.bed ../bedPrepared/CDS_NCBIM37_prepared.bed m
Don’t let your unmapped reads go to waste
- Main
- About ROP Tutorial
- What is ROP?
- How ROP works?
- How to prepare unmapped reads
- How to customize tools used by ROP
- Unix Tutorial
- Get started
- Targeted analysis
- ROP analysis: one RNA-Seq sample
- How to run ROP for mouse
- ROP analysis via qsub
- ROP analysis of multiple samples via qsub array
- Immune profiling by ROP (ImReP)
- ImRep across multiple samples
- ROP input details
- ROP output details
- Source of every last read
- Additional options
- How to calculate immune diversity?
- How to run hyper editing pipeline?