forked from YuLab-SMU/MicrobiotaProcess
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathMicrobiotaProcess-biomaker-discovery.Rmd
219 lines (180 loc) · 11.1 KB
/
MicrobiotaProcess-biomaker-discovery.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
---
title: "biomarker discovery using MicrobiotaProcess"
author: |
| Shuangbin Xu
| School of Basic Medical Sciences, Southern Medical University
|
date: "`r Sys.Date()`"
bibliography: MicrobiotaProcess.bib
biblio-style: apalike
output:
prettydoc::html_pretty:
toc: true
theme: cayman
highlight: vignette
pdf_document:
toc: true
vignette: >
%\VignetteIndexEntry{ MicrobiotaProcess: biomarker discovery using MicrobiotaProcess.}
%\VignetteEngine{knitr::rmarkdown}
%\usepackage[utf8]{inputenc}
%\VignetteEncoding{UTF-8}
---
```{r, echo=FALSE, results="asis", message=FALSE, KnitrSetUp}
knitr::opts_chunk$set(tidy=FALSE,warning=FALSE,message=FALSE)
Biocpkg <- function (pkg){
sprintf("[%s](http://bioconductor.org/packages/%s)", pkg, pkg)
}
CRANpkg <- function(pkg){
cran <- "https://CRAN.R-project.org/package"
fmt <- "[%s](%s=%s)"
sprintf(fmt, pkg, cran, pkg)
}
```
```{r, echo=FALSE, results="hide", message=FALSE, Loadpackages}
library(tidyverse)
library(phyloseq)
library(ggtree)
library(treeio)
library(tidytree)
library(MicrobiotaProcess)
```
# 1. Biomarker discovery
Biomarker discovery has proven to be capacity to convert genomic data into clinical practice[@tothill2008novel; @banerjee2015computed]. And many reports have shown that the microbial communities can be used as biomarkers for human disease[@kostic2012genomic; @zhang2019leveraging; @yu2017metagenomic; @ren2019gut]. `MicrobiotaProcess` presents `diff_analysis` for the biomarker discovery. And It also provided the `ggdiffclade`, based on the `ggtree`[@yu2017ggtree; @yu2018two], to visualize the results of `diff_analysis`. The rule of `diff_analysis` is similar with the `LEfSe`[@Nicola2011LEfSe]. First, all features are tested whether values in different classes are differentially distributed. Second, the significantly different features are tested whether all pairwise comparisons between subclass in different classes distinctly consistent with the class trend. Finally, the significantly discriminative features are assessed by `LDA` (`linear discriminant analysis`) or `rf`(`randomForest`). However, `diff_analysis` is more flexible. The test method of two step can be set by user, and we used the general fold change[@wirbel2019meta] and `wilcox.test`(default) to test whether all pairwise comparisons between subclass in different classes distinctly consistent with the class trend. Moreover, `MicrobiotaProcess` implements more flexible and convenient tools, (`ggdiffclade`, `ggdiffbox`, `ggeffectsize` and `ggdifftaxbar`) to produce publication-quality figures. Here, we present several examples to demonstrate how to perform different analysis with `MicrobiotaProcess`.
## 1.1 colorectal cancer dataset.
```{r, error=FALSE, KosticCRCdata}
data(kostic2012crc)
kostic2012crc
#datatable(sample_data(kostic2012crc), options=list(scrollX=TRUE, scrollY="400px", scrollCollapse=TRUE))
kostic2012crc <- phyloseq::rarefy_even_depth(kostic2012crc,rngseed=1024)
table(sample_data(kostic2012crc)$DIAGNOSIS)
```
This datasets contained 86 Colorectal Carcinoma samples and 91 Control samples(remove the none sample information and low depth sample).In the research, they found the *Fusobacterium* sequences were enriched in carcinomas, confirmed by quantitative PCR and 16S rDNA, while the *Firmicutes* and *Bacteroidetes* phyla were depleted in tumors[@kostic2012genomic].
```{r, error=FALSE, KosticCRCdiff_analysis}
set.seed(1024)
diffres <- diff_analysis(obj=kostic2012crc, classgroup="DIAGNOSIS",
mlfun="lda",
filtermod="fdr",
firstcomfun = "kruskal.test",
firstalpha=0.05,
strictmod=TRUE,
secondcomfun = "wilcox.test",
subclmin=3,
subclwilc=TRUE,
secondalpha=0.01,
lda=3)
diffres
```
The results of `diff_analysis` is a `S4` class, contained the original feature datasets, results of first test, results of second test, results of `LDA` or `rf` assessed and the record of some arguments. It can be visualized by `ggeffectsize`. The horizontal ordinate represents the effect size (`LDA` or `MeanDecreaseAccuracy`), the vertical ordinate represents the feature of significantly discriminative. And the colors represent the classgroup that the relevant feature is positive.
```{r, fig.align="center", fig.height=5, fig.width=6, error=FALSE, KosticCRCplotEffectSize}
plotes <- ggeffectsize(obj=diffres) + scale_color_manual(values=c("#00AED7", "#FD9347"))
plotes
```
The results also can be visualized using `ggdiffbox`.
```{r, fig.align="center", fig.height=5, fig.width=7, error=FALSE, KosticCRCLDAtax}
plotes_ab <- ggdiffbox(obj=diffres, box_notch=FALSE, colorlist=c("#00AED7", "#FD9347"), l_xlabtext="relative abundance")
plotes_ab
```
If the `taxda` was provided, it also can be visualized by `ggdiffclade`. The colors represent the relevant features enriched in the relevant classgroup. The size of point colored represent the `-log10(pvalue)`.
```{r, fig.width=7, fig.height=7, fig.align="center", error=FALSE, KosticCRCdiffclade}
diffcladeplot <- ggdiffclade(obj=diffres,
alpha=0.3, size=0.2,
skpointsize=0.6,
taxlevel=3,
settheme=FALSE,
setColors=FALSE) +
scale_fill_manual(values=c("#00AED7", "#FD9347"))+
guides(color = guide_legend(keywidth = 0.1,
keyheight = 0.6,
order = 3,
ncol=1)) +
theme(panel.background=element_rect(fill=NA),
legend.position="right",
plot.margin=margin(0,0,0,0),
legend.spacing.y = unit(0.02, "cm"),
legend.title=element_text(size=7),
legend.text=element_text(size=6),
legend.box.spacing=unit(0.02,"cm"))
diffcladeplot
```
Moreover, the abundance of the features can be visualized by `ggdifftaxbar`. This will generate the figures in specific directory. And the horizontal ordinate of figures represent the sample of different classgroup, the vertical ordinate represent relative abundance of relevant features (sum is 1).
```r
ggdifftaxbar(obj=diffres, xtextsize=1.5, output="./kostic2012crc_biomarkder_barplot")
```
And we also provided `as.data.frame` to produce the table of results of `diff_analysis`.
```{r, KosticCRCdiffTab, error=FALSE}
crcdiffTab <- as.data.frame(diffres)
#datatable(crcdiffTab, options=list(scrollX=TRUE, scrollY="400px", scrollCollapse=TRUE))
```
As show in the results of `diff_analysis`, we also found *Fusobacterium* sequences were enriched in carcinomas, and *Firmicutes*, *Bacteroides*, *Clostridiales* were depleted in tumors. These results were consistent with the original article[@kostic2012genomic]. In addition, we also found *Campylobacter* were enriched in tumors, but the relative abundance of it is lower than *Fusobacterium*. And the species of *Campylobacter* has been proven to associated with the colorectal cancer[@He289; @wu2013dysbiosis; @amer2017microbiome].
## 1.2 a small subset of HMP dataset.
```{r, hmpdatasets}
data(hmp_aerobiosis_small)
# contained "featureda" "sampleda" "taxda" datasets.
#datatable(featureda, options=list(scrollX=TRUE, scrollY="400px", scrollCollapse=TRUE))
#datatable(sampleda, options=list(scrollX=TRUE, scrollY="400px", scrollCollapse=TRUE))
#datatable(taxda, options=list(scrollX=TRUE, scrollY="400px", scrollCollapse=TRUE))
dim(featureda)
dim(sampleda)
dim(taxda)
```
This dataset is from a small subset of the `HMP` 16S dataset[@Nicola2011LEfSe], contained 55 samples from 6 body sites. The dataset isn't `phyloseq` class, because `diff_analysis` also supported the matrix datasets as input. The `featureda` contained the relative abundance of different levels features. The `sampleda` contained the information of the samples. And the `taxda` contained the information of the hierarchical relationship of taxonomy. We set the `oxygen_availability` in `sampleda` as `classgroup`, and `body_site` also in `sampleda` as `subclass`.
```{r, hmpdiff_analysis, error=FALSE}
set.seed(1024)
hmpdiffres <- diff_analysis(obj=featureda,
sampleda=sampleda,
taxda=taxda,
alltax=FALSE,
classgroup="oxygen_availability",
subclass="body_site",
filtermod="fdr",
firstalpha=0.01,
strictmod=TRUE,
subclmin=3,
subclwilc=TRUE,
secondalpha=0.05,
ldascore=2)
hmpdiffres
```
```{r, fig.align="center", fig.height=7, fig.width=5.5, error=FALSE, hmpplotEffectSize}
hmpeffetsieze <- ggeffectsize(obj=hmpdiffres,
setColors=FALSE,
settheme=FALSE) +
scale_color_manual(values=c('#00AED7', '#FD9347', '#C1E168'))+
theme_bw()+
theme(strip.background=element_rect(fill=NA),
panel.grid=element_blank(),
strip.text.y=element_blank())
hmpeffetsieze
```
```{r, fig.align="center", fig.height=7, fig.width=7, error=FALSE, hmpplotEffectSizeTax}
hmpes_ab <- ggdiffbox(obj=hmpdiffres, colorlist=c("#00AED7", "#FD9347", '#C1E168'),
box_notch=FALSE, l_xlabtext="relative abundance(%)")
hmpes_ab
```
The explanation of figures refer to the previous section.
```{r, fig.width=7, fig.height=7, fig.align="center", error=FALSE, hmpdiffclade}
hmpdiffclade <- ggdiffclade(obj=hmpdiffres, alpha=0.3, size=0.2,
skpointsize=0.4, taxlevel=3,
settheme=TRUE,
setColors=FALSE) +
scale_fill_manual(values=c('#00AED7', '#FD9347', '#C1E168'))
hmpdiffclade
```
The explanation of figures refer to the previous section.
```r
ggdifftaxbar(obj=hmpdiffres, output="./hmp_biomarker_barplot")
```
Finally, we found the *Staphylococcus*, *Propionibacterium* and some species of *Actinobacteria* was enriched in `High_O2`, these species mainly live in high oxygen environment. Some species of *Bacteroides*, species of *Clostridia* and species of *Erysipelotrichi* was enriched in `Low_O2`, these species mainly inhabit in the gut of human. These results were consistent with the reality.
```{r, hmpdiffTab, error=FALSE}
hmpdiffTab <- as.data.frame(hmpdiffres)
#datatable(hmpdiffTab, options=list(scrollX=TRUE, scrollY="400px", scrollCollapse=TRUE))
```
# 2. Need helps?
If you have questions/issues, please visit [github issue tracker](https://github.com/YuLab-SMU/MicrobiotaProcess/issues).
# 3. Session information
Here is the output of `sessionInfo()` on the system on which this document was compiled:
```{r, echo=FALSE}
sessionInfo()
```
# 4. References