AppendNCBIGeneDescriptionColumn(ExcelDataFilePath, ExcelDataFileEnsemblIDColumnName = "ensembl_gene_id")
Appends a column of NCBI gene descriptions to an Excel sheet with a column (named ExcelDataFileEnsemblIDColumnName) containing Ensembl IDs. Depends on EnsemblID2Entrez.
BioMartGOFilter.Nfurzeri(GO.CSV, CombineFruitFlyHomology = TRUE, CombineHumanHomology = TRUE, CombineMouseHomology = TRUE, CombineNematodeHomology = TRUE, CombineZebrafishHomology = TRUE)
Uses the biomaRt package to get all Nothobranchius furzeri genes annotated with the GO terms in GO.CSV [including all child terms (is_a, regulates, etc.)]. Setting CombineFruitFlyHomology/CombineHumanHomology/CombineMouseHomology/CombineNematodeHomology/CombineZebrafishHomology to TRUE supplements the annotations using gene homology information [to fly (Drosophila melanogaster)/human/mouse (Mus musculus)/nematode (Caenorhabditis elegans)/zebrafish (Danio rerio)]. Note that this only works for Ensembl 113 (released on October 18th, 2024) or later.
The output is a list in which the name of each element is the Ensembl ID of a N. furzeri gene and the content of each element is the GO term annotations of that gene (supplemented with homology information).
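As a rough illustration of how a list of this shape could be queried, the sketch below uses mock data. The Ensembl IDs are made up, the helper GenesWithTerm is hypothetical, and storing the annotations as GO term IDs is an assumption rather than confirmed behaviour of BioMartGOFilter.Nfurzeri.

```r
# Mock list: names are (made-up) Ensembl gene IDs, contents are GO term IDs.
GOList <- list(
  ENSNFUG00000000001 = c("GO:0006914", "GO:0016236"),
  ENSNFUG00000000002 = c("GO:0006412"),
  ENSNFUG00000000003 = c("GO:0016236", "GO:0006412")
)

# Hypothetical helper: Ensembl IDs of all genes annotated with a given GO term.
GenesWithTerm <- function(GOList, Term) {
  HasTerm <- vapply(GOList, function(Terms) Term %in% Terms, logical(1))
  names(GOList)[HasTerm]
}

PassedEnsemblIDVector <- GenesWithTerm(GOList, "GO:0016236")
print(PassedEnsemblIDVector)
```

A vector obtained this way could presumably be handed to EnsemblIDFilter through its PassedEnsemblIDVector argument.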
CorrelateOmics(ProteomicsDataFilePath, UniProtIDColumnName = "Protein IDs", To = "Ensembl", GeneNameColumnName = "Gene name", ProteomicsColumnsToCalculateMean, TranscriptomicsDataFilePath, TranscriptomicsColumnsToCalculateMean, RefreshGeneNames = TRUE)
plotCorrelateOmics(DataFrame, Alpha = 0.1, HighlightGeneNameRegex, HighlightAlpha = 1, HighlightColor = "#C40233", HighlightSize = 2.5)
CorrelateOmics links the proteomics data from ProteomicsDataFilePath and the transcriptomics data from TranscriptomicsDataFilePath for each gene. Only proteins/genes with a one-to-one mapping are included. If RefreshGeneNames is set to FALSE, the result is a data frame with 5 columns: "logTranscriptomicsMean", "logTranscriptomicsStdev", "logProteomicsMean", "logProteomicsStdev", and "GeneName" (copied directly from the GeneNameColumnName column in ProteomicsDataFilePath). If RefreshGeneNames is set to TRUE, EnsemblID2Entrez (see below) is used to re-download gene names from the NCBI Gene database, appending an additional CurrentEntrezGeneName column to the output data frame. The row names of the data frame are the corresponding Ensembl IDs. CorrelateOmics depends on UniProtKBAC2EnsemblID and passes the To input variable solely to UniProtKBAC2EnsemblID (i.e., To should be set to "WormBase" instead of the default "Ensembl" when working with C. elegans datasets).
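To give a feel for the documented output shape, here is a minimal sketch that builds a mock data frame with the five columns named above and computes a rank correlation between the two omics layers. All values, gene names, and row names are invented, and the choice of Spearman correlation is an assumption for illustration, not something CorrelateOmics itself is stated to do.

```r
# Mock data frame mimicking the documented columns (values are made up).
Omics <- data.frame(
  logTranscriptomicsMean  = c(1.0, 2.0, 3.0, 4.0),
  logTranscriptomicsStdev = c(0.1, 0.2, 0.1, 0.3),
  logProteomicsMean       = c(1.2, 1.9, 3.1, 3.8),
  logProteomicsStdev      = c(0.2, 0.1, 0.2, 0.2),
  GeneName                = c("geneA", "geneB", "geneC", "geneD"),
  row.names               = c("MOCK_ID_1", "MOCK_ID_2", "MOCK_ID_3", "MOCK_ID_4")
)

# Rank correlation between transcript and protein levels across genes.
Rho <- cor(Omics$logTranscriptomicsMean, Omics$logProteomicsMean,
           method = "spearman")

# Rows can be looked up by Ensembl ID because the IDs are the row names.
OneGene <- Omics["MOCK_ID_3", ]
print(Rho)
print(OneGene$GeneName)
```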
In rare cases when there are too many UniProtKB accession IDs in the proteomics data file, the line that executes the ID conversion step
Table <- UniProtKBAC2EnsemblID(paste(All.UniProtKB.Entries, collapse = ","))
can be modified as follows to split the job into smaller jobs of PackageSize entries each.
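# PackageSize (the number of accession IDs per job) must be defined beforehand.
Table <- c()
# seq_len() avoids the 1:0 pitfall when All.UniProtKB.Entries is empty.
for (i in seq_len(ceiling(length(All.UniProtKB.Entries) / PackageSize))) {
  # Entries of the i-th package; the last package may be shorter.
  Package.UniProtKB.Entries <- All.UniProtKB.Entries[((i - 1) * PackageSize + 1) :
                                                     min(i * PackageSize, length(All.UniProtKB.Entries))]
  # Convert this package and append its mappings to the overall table.
  Package.Table <- UniProtKBAC2EnsemblID(paste(Package.UniProtKB.Entries, collapse = ","))
  Table <- rbind(Table, Package.Table)
}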
plotCorrelateOmics plots the output data frame of CorrelateOmics. To highlight certain genes, specify them by their NCBI gene names using HighlightGeneNameRegex. The returned ggplot can be viewed interactively with plotly::ggplotly.
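Before plotting, it can help to check which genes a regular expression would select. The sketch below assumes that matching works like base R grepl on the gene-name column, which is an assumption about plotCorrelateOmics rather than documented behaviour. The gene names are mock examples.

```r
# Mock gene names; in practice these would come from the GeneName (or
# CurrentEntrezGeneName) column of the CorrelateOmics output.
GeneNames <- c("rpl3", "rps6", "atg5", "mrpl12")

# Candidate value for HighlightGeneNameRegex: names starting with "rpl" or "rps".
HighlightGeneNameRegex <- "^rp[ls]"

# Assumed matching rule: plain regular-expression matching via grepl().
Highlighted <- GeneNames[grepl(HighlightGeneNameRegex, GeneNames)]
print(Highlighted)
```

Anchoring the pattern with ^ keeps names such as "mrpl12" from being picked up unintentionally.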
EnsemblID2Entrez(EnsemblID, Output = c("Accession", "ID", "Description", "Name"))
Converts a single Ensembl ID to its corresponding NCBI Entrez accession(s)/ID(s)/description(s)/name(s) using the rentrez package. If the mapping exists, the output is a string; otherwise, the output is "". This works better than using biomaRt because the mapping is more complete, and, unlike org.*.eg.db, it works for all species.
Note: the default genome assembly of Nothobranchius furzeri on Ensembl is Nfu_20140520, while OrthoDB and the NCBI have opted for the newer UI_Nfuz_MZM_1.0 as the default. The UI_Nfuz_MZM_1.0 assembly has fewer unknown base pairs and more annotated genes owing to the long-read sequencing method, but the Nfu_20140520 assembly has a slightly higher BUSCO score and is based on the more commonly used GRZ-AD strain. This may cause differences in annotations. The UniProt reference proteome UP000694548 is also based on Nfu_20140520, making it convenient to correlate omics.
EnsemblIDFilter(ExcelDataFilePath, BioMartExportFilePaths = NA, PassedEnsemblIDVector = NA, ExcelDataFileEnsemblIDColumnName = "ensembl_gene_id", BioMartExportEnsemblIDColumnName, ReAdjustPValues = TRUE, PValueColumnName = "pvalue", AdjustedPValueColumnName = "padj")
Filters a Flaski RNAseq pipeline output Excel sheet (ExcelDataFilePath) based on a vector of desired Ensembl IDs that is either
- given in PassedEnsemblIDVector, or
- compiled from exported Ensembl BioMart TSV files whose paths are specified in BioMartExportFilePaths, if BioMartExportFilePaths is not NA (note: this overrides the input variable PassedEnsemblIDVector).
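The filter-then-readjust idea behind ReAdjustPValues can be sketched in a few lines of base R. The data below are mock values with made-up IDs, and the use of the Benjamini-Hochberg method is an assumption (the source does not say which correction EnsemblIDFilter applies).

```r
# Mock results table using the default column names from the signature above.
Results <- data.frame(
  ensembl_gene_id = c("G1", "G2", "G3", "G4", "G5"),
  pvalue          = c(0.001, 0.02, 0.03, 0.5, 0.04)
)
Results$padj <- p.adjust(Results$pvalue, method = "BH")

# Keep only the genes of interest.
PassedEnsemblIDVector <- c("G1", "G3", "G5")
Filtered <- Results[Results$ensembl_gene_id %in% PassedEnsemblIDVector, ]

# Re-adjust on the reduced set: fewer tests, hence a milder correction.
# Benjamini-Hochberg is assumed here for illustration.
Filtered$padj <- p.adjust(Filtered$pvalue, method = "BH")
print(Filtered)
```

Re-adjusting matters because the multiple-testing burden depends on how many genes remain after filtering.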
FindUniqueGenes.EnsemblID(TargetSpecies, CheckHomologySpecies = c("drerio", "kmarmoratus"))
Identifies genes of the TargetSpecies without a homolog in CheckHomologySpecies and returns a vector of their Ensembl IDs.
GOFilter(ExcelDataFilePath, GOVector, godir, GOTermColumnName = "GO_id", ReAdjustPValues = TRUE, PValueColumnName = "pvalue", AdjustedPValueColumnName = "padj")
Filters a Flaski RNAseq pipeline output Excel sheet (ExcelDataFilePath) based on the desired GO terms in GOVector [including all child terms (is_a, regulates, etc.) defined by godir].
Note: Ensembl BioMart provides built-in functionality to filter genes by GO term annotations (see the figure below; all child terms will also be included), which is preferable because a fresh download from Ensembl BioMart reflects the most up-to-date GO term annotations. See BioMartGOFilter.Nfurzeri and EnsemblIDFilter.
plotlog2ReadDistribution(ExcelDataFilePath, DataColumns)
Plots the smoothed empirical distribution function of all normalized reads (each gene in each sample; compiled from columns whose IDs/names are in DataColumns) to help determine a threshold to filter genes with valid expression and a meaningful fold-change. This step is helpful when picking genes for further functional studies but dispensable if only bioinformatic analyses (like a gene set enrichment analysis) are to be done.
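For intuition, an unsmoothed empirical distribution function of log2 reads can be obtained with base R ecdf. This is only a sketch on mock read counts: it skips the smoothing and the plotting that plotlog2ReadDistribution performs, and it uses no pseudocount, so zero reads would need separate handling.

```r
# Mock normalized reads pooled over all genes and samples (made-up values).
Reads <- c(1, 2, 4, 8, 16, 32, 64, 128)

# Empirical distribution function of the log2-transformed reads.
Log2ReadECDF <- ecdf(log2(Reads))

# Fraction of values at or below a candidate threshold of log2(reads) = 3.
FractionBelow <- Log2ReadECDF(3)
print(FractionBelow)
```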
samples.beeswarm(GeneNameRegex, ExcelDataFilePath, GeneNameColumnName = "gene_name", ColumnOffset = 2, Group1RepNum, Group2RepNum, GroupTags, Colours = c("black", "red"), Standardized = 1, Breaks = 10, PointSize = 0.75, LineWidth = 0.5, AsteriskSignificance = TRUE, PValueColumnName = "pvalue", PValueDigit = 2)
Plots a beeswarm plot of sample reads for gene(s) whose name(s) match the GeneNameRegex. If Standardized == 1 (or Standardized == 2), the sample reads of each gene will be standardized by the mean of sample reads of group 1 (or 2) of each gene; otherwise, no standardization will be performed. The plot is automatically saved as a time-tagged .png file in the working directory.
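The standardization rule can be sketched for a single gene as follows. The function StandardizeReads is a hypothetical helper, the read values are made up, and "standardized by the mean" is assumed here to mean division by the group mean.

```r
# Mock reads of one gene: three replicates in group 1, then three in group 2.
Reads <- c(10, 12, 8, 20, 22, 18)
Group1RepNum <- 3
Group2RepNum <- 3
Group <- rep(c(1, 2), times = c(Group1RepNum, Group2RepNum))

# Hypothetical helper: divide by the mean of the chosen reference group;
# any value other than 1 or 2 leaves the reads untouched.
StandardizeReads <- function(Reads, Group, Standardized) {
  if (Standardized %in% c(1, 2)) {
    Reads / mean(Reads[Group == Standardized])
  } else {
    Reads
  }
}

StandardizedReads <- StandardizeReads(Reads, Group, Standardized = 1)
print(StandardizedReads)
```

With group 1 as the reference, its replicates scatter around 1, which makes the shift of group 2 directly readable as a fold difference.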
SRX2SRR(SRXSheetFilePath, SRXColumnName = "SRX")
Converts a column (from an Excel sheet SRXSheetFilePath) of experiment numbers into corresponding run numbers (printed directly onto the console alongside the sequencing format employed). Note that this critical sequencing format info is not available in SRA_Accessions.tab (which allows batch searching using the corresponding SRP/PRJNA accession number directly) or SRA_Run_Members.tab (which allows batch searching using the corresponding SRP accession number) on the FTP site.
Recommended alternative: the SRA Run Selector tool provided by the NCBI (→ tutorial). We can get an overview of all the project's associated datasets by searching for the corresponding SRP/PRJNA/GSE accession number (see the figure attached below as an example for an overview of all datasets in Hussein et al., Developmental Cell, 2020).
UniProtKBAC2EnsemblID(UniProtKBAC.CSV, Wait = 5, To = "Ensembl")
UniProtKBAC2EnsemblID utilizes the UniProt REST API (which is more complete than biomaRt) to convert UniProtKB accession IDs in UniProtKBAC.CSV into the Ensembl IDs of corresponding genes (note: since Ensembl inherits the WormBase IDs for C. elegans genes, To should be set to "WormBase" instead of the default "Ensembl" when converting C. elegans genes). Once a job is submitted, its status is queried every Wait seconds until the job is finished. The downloaded output is then parsed into a matrix with two columns named uniprotsptrembl and ensembl_gene_id (consistent with BioMart). Each row corresponds to a mapping. If a protein/peptide is mapped to more than one Ensembl gene ID, multiple rows will share the same UniProtKB accession ID in column 1 but possess different Ensembl IDs in column 2.
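Given a mapping matrix of this shape, one-to-one mappings can be isolated with base R. The sketch below uses made-up accession and gene IDs, and the rule it applies (each accession and each gene ID occurs exactly once) is one plausible reading of "one-to-one", not necessarily the exact filter used by CorrelateOmics.

```r
# Mock mapping matrix with the two documented column names (IDs are made up).
Table <- matrix(
  c("P00001", "GENE_A",
    "P00002", "GENE_B",   # P00002 maps to two genes
    "P00002", "GENE_C",
    "P00003", "GENE_D",   # GENE_D is hit by two proteins
    "P00004", "GENE_D"),
  ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c("uniprotsptrembl", "ensembl_gene_id"))
)

# Flag rows whose accession ID and gene ID each occur only once in the table.
UniqueProtein <- !(Table[, 1] %in% Table[duplicated(Table[, 1]), 1])
UniqueGene    <- !(Table[, 2] %in% Table[duplicated(Table[, 2]), 2])

# drop = FALSE keeps the result a matrix even if a single row remains.
OneToOne <- Table[UniqueProtein & UniqueGene, , drop = FALSE]
print(OneToOne)
```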
volcano.ma(Data, PlotType = "ma", HighlightEnsemblIDs = NA, GeneNameColumnName = "gene_name", EnsemblIDColumnName = "ensembl_gene_id", log2FoldChangeColumnName = "log2FoldChange", abslog2FoldChangeThreshold = 1, abslog2FoldChangeLimit, baseMeanColumnName = "baseMean", log2baseMeanLowerLimit, log2baseMeanUpperLimit, AdjustedPValueColumnName = "padj", SignificanceThreshold = 0.01, negativelog10AdjustedPValueLimit, LineWidth = 0.25, Alpha = 1, NSAlpha = 0.1, UpColor = "#FFD300", DownColor = "#0087BD", HighlightColor = "#C40233", HighlightSize = 2.5, log2FoldChangeLabel, log2FoldChangeTickDistance = 1, log10AdjustedPValueTickDistance = 5)
Plots a volcano plot (PlotType = "volcano") or an MA plot (PlotType = "ma") and highlights genes with an Ensembl ID in HighlightEnsemblIDs. Points beyond the limits (defined by ±abslog2FoldChangeLimit, log2baseMeanLowerLimit, log2baseMeanUpperLimit, and negativelog10AdjustedPValueLimit; ignored if NA) will be coerced onto the border.
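Coercing out-of-range points onto the border amounts to clamping each coordinate to its limits. A minimal sketch with a hypothetical helper ClampToLimits and made-up fold changes follows; it assumes NA means "no limit", in line with the description above, and is not the function's actual implementation.

```r
# Hypothetical helper: clamp values to [Lower, Upper]; an NA limit is ignored.
ClampToLimits <- function(x, Lower = NA, Upper = NA) {
  if (!is.na(Lower)) x <- pmax(x, Lower)
  if (!is.na(Upper)) x <- pmin(x, Upper)
  x
}

# Mock log2 fold changes; two of them lie beyond the limit of +/-4.
log2FoldChange <- c(-6.2, -0.5, 0.3, 4.8)
abslog2FoldChangeLimit <- 4

Clamped <- ClampToLimits(log2FoldChange,
                         Lower = -abslog2FoldChangeLimit,
                         Upper = abslog2FoldChangeLimit)
print(Clamped)
```

Clamping keeps extreme genes visible at the edge of the plot instead of dropping them or stretching the axes.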
Note: when using ggplotly to view the plot, the axis titles should be adjusted first to avoid an error [for the volcano plot, use Plot <- Plot + xlab("log2(fold change)") + ylab("-log10(adjusted p)"); ggplotly(Plot); for the MA plot, use Plot <- Plot + xlab("log2(base mean)") + ylab("log2(fold change)"); ggplotly(Plot)].