-
Notifications
You must be signed in to change notification settings - Fork 51
/
Copy pathLocus_Stats.Rmd
266 lines (210 loc) · 10.2 KB
/
Locus_Stats.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
```{r,message=FALSE,echo=FALSE}
html <- TRUE
library("knitcitations")
library("knitr")
cite_options(citation_format = "pandoc", max.names = 3, style = "html", hyperlink = "to.doc")
bib <- read.bibtex("bibtexlib.bib")
opts_chunk$set(tidy = FALSE, message = FALSE, warning = FALSE,
fig.width = 10, fig.height = 6, cache = TRUE)
if (html) opts_chunk$set(out.width = "700px", dpi = 300)
# use this to set knitr options:
# http://yihui.name/knitr/options #chunk_options
```
---
title: "Locus stats, heterozygosity, HWE"
subtitle: "*ZN Kamvar, SE Everhart, and NJ Grünwald*"
---
A rigorous population genetic analysis looks closely at the data to assess
quality and identify outliers or problems in the data such as erroneous allele
calls. This chapter focuses on analysis on a per-locus level. While there are
statistics that analyze populations across loci, it is important to analyze each
locus independently to make sure that one locus is not introducing bias or
spurious errors into the analysis.
> **Note:** Many of these statistics are specific to co-dominant data.
Locus summary statistics
-----
A quick way to assess quality of the data is to determine the number, diversity,
expected heterozygosity, and evenness of the alleles at each locus. As an
example, we will use data for the fungal-like protist *Phytophthora infestans*
from `r citep(bib["goss2014irish"])`. First, we'll use the function
`locus_table` to get all of the statistics mentioned above. For documentation on
this function type `?locus_table`. Here is a first look at each locus:
```{r, locus_table_stats, eval = FALSE}
library("poppr") # Make sure poppr is loaded if you haven't done so already.
library("magrittr") # We will also use magrittr for part of this chapter
data("Pinf") # P. infestans data set from Mexico and South America
locus_table(Pinf)
```
```{r, load_packages, echo = FALSE}
library("poppr") # Make sure poppr is loaded if you haven't done so already.
library("magrittr") # We will also use magrittr for part of this chapter
data("Pinf") # P. infestans data set from Mexico and South America
```
```{r, locus_table_stats_show, echo = FALSE, message = TRUE}
locus_table(Pinf)
```
We can see here that we have a widely variable number of alleles per locus and
that we actually have a single locus that only has two alleles, `r {x <-
poppr::locus_table(Pinf); rownames(x)[x[, 1] == 2]}`. This locus also has low
diversity, low expected heterozygosity and is very uneven in allele
distribution. This is a sign that this might be a phylogenetically uninformative
locus, where we have two alleles and one is occurring at a minor frequency. We
will explore analysis with and without this locus. Let's first see if
both of these alleles exist in both populations of this data set.
```{r, locus_table_pops, message = TRUE}
locus_table(Pinf, pop = "North America")
locus_table(Pinf, pop = "South America")
```
Phylogenetically uninformative loci
-----
We can see that the South American populations is fixed for one allele, thus
it would not be a bad idea to remove that locus from downstream analyses. We can
do this using the function `informloci`. This will remove loci that contain less
than a given percentage of divergent individuals (the default is $2/N$, where
$N$ equals the number of individuals in the data set).
```{r, informloci, message = TRUE}
nLoc(Pinf) # Let's look at our data set, note how many loci we have.
iPinf <- informloci(Pinf)
nLoc(iPinf) # Note that we have 1 less locus
```
So, how does this affect multi-locus based statistics? We can see immediately
that it didn't affect the number of multilocus genotypes, let's take a look at
the index of association:
```{r, iatest}
poppr(Pinf)
poppr(iPinf)
```
We can see that it increased ever so slightly for the "North America" and
"Total" populations, but not the "South America" population as expected given
the fixed alleles at locus P33.
Missing data
----
It is often important to asses the percentage of missing data. The *poppr*
function `info_table` will help you visualize missing data so that you can
assess how to treat these further using the function `missingno`. For this
example, we will use the nancycats data set as it contains a wide variety of
possibilities for missing data:
```{r, missing_table, fig.height=6, fig.width = 10, fig.cap = "Plot of missing data", message = TRUE}
data(nancycats)
info_table(nancycats, plot = TRUE)
```
Here we see a few things. The data set has an average of 2.34% missing data
overall. More alarming, perhaps is the fact that population 17 has not been
genotyped at locus fca45 at all and that locus fca8 shows missing data across
many populations. Many analyses in *poppr* can be performed with missing data in
place as it will be either considered an extra allele in the case of MLG
calculations or will be interpolated to not contribute to the distance measure
used for the index of association. If you want to specifically treat missing
data, you can use the function `missingno` to remove loci or individuals, or
replace missing data with zeroes or the average values of the locus.
Removing loci and genotypes
-----
When removing loci or genotypes, you can specify a cutoff representing the
percent missing to be removed. The default is `0.05` (5%).
```{r, fig.cap = "Plot of missing data", message = TRUE}
nancycats %>% missingno("loci") %>% info_table(plot = TRUE, scale = FALSE)
```
> **Advanced Users:** when `scale = TRUE`, the color scale will be set so that
the warmest color corresponds to the highest value.
We only removed two loci. If we wanted to make sure we removed everything, we
could set `cutoff = 0`.
```{r, message = TRUE}
miss <- nancycats %>% missingno("loci", cutoff = 0) %>% info_table(plot = TRUE)
```
Again, removing individuals is also relatively easy:
```{r, fig.cap = "Plot of missing data", message = TRUE}
miss <- nancycats %>% missingno("geno") %>% info_table(plot = TRUE)
miss <- nancycats %>% missingno("geno", cutoff = 0) %>% info_table(plot = TRUE)
```
The function `missingno` removes individuals based on the percent of missing
data relative to the number of loci. Let's remove all individuals with 2 missing
loci:
```{r, fig.height=6, fig.width = 10, fig.cap = "Plot of missing data"}
miss <- nancycats %>%
missingno("geno", cutoff = 2/nLoc(nancycats)) %>%
info_table(plot = TRUE)
```
We only found one individual in population 11.
<!--
Replacing missing data
----
Replacement of missing data occurs for each allele over all loci. It will
replace all missing data in your data set. There are two options: replacement of
missing data with zeroes, in fact recoding these as another allelic state, or
replacement of missing data with the average allele frequency observed.
Note that the first population in the data set has 20% missing data
at the first locus. Here is the un-replaced data for reference:
```{r}
nan1 <- popsub(nancycats, 1, drop = TRUE) # Dropping alleles not in that population.
tab(nan1[loc = "fca8"])
```
```{r, message = TRUE}
nanzero <- missingno(nancycats, "zero")
tab(popsub(nanzero, 1, drop = TRUE)[loc = "fca8"])
```
The `NA`s have been replaced with zeroes. Now let's look at what happens when we
replace with `"mean"`.
> Note: We get a warning with this command that says `@tab does not contain integers`.
> This is because the `genind` object should only contain counts of alleles, but
> replacing missing data with the mean will give decimal numbers (aka numeric).
```{r, message = TRUE, warning = TRUE}
nanmean <- missingno(nancycats, "mean")
tab(popsub(nanmean, 1, drop = TRUE)[loc = "fca8"])
```
Notice that there are a lot more alleles than there were originally. This is
because the procedure is performed over the entire data set, not by population.
Let's look at what happens if we perform the same routine on the subset data.
```{r, message = TRUE, warning = TRUE}
nan1 <- nancycats %>% popsub(1, drop = TRUE) %>% missingno("mean")
tab(nan1[loc = "fca8"])
```
-->
Hardy-Weinberg equilibrium
-----
Next, let's determine if our populations are in Hardy-Weinberg equilibrium. We
will again use the nancycats data to test for HWE using the function `hw.test()`
from the *pegas* package. This will compute the $\chi^2$ statistic over the
entire data set and compute two P-values, one analytical and one derived from
permutations:
```{r, hwe, cache = TRUE}
library("pegas")
(nanhwe.full <- hw.test(nancycats, B = 1000)) # performs 1000 permuatations
```
We can see here that both the analytical p-value and permuted p-value show that
we have confidence that all loci are not under the null expectation of HWE. This
makes sense given that these data represent 17 populations of cats. If we wanted
to check what the HWE statistic for each population is, we should first separate
the populations with the function `seppop()`. For this exercise, we will only
focus on the analytical p-value by setting `B = 0`.
```{r, hwe_pop, cache = TRUE}
(nanhwe.pop <- seppop(nancycats) %>% lapply(hw.test, B = 0))
```
### Advanced: Visualization of population-wise p-values
Now we have one matrix per sample, but all we care about are the p-values, which
are in the third column. We can use the functions `sapply` and `[` to loop to
create a matrix that only contains populations in columns and loci in rows.
> `[` is used to subset data, and it's also a function! It provides an easy way
> to subset a list of data.
```{r, hwe_pop_mat, cache = TRUE}
(nanhwe.mat <- sapply(nanhwe.pop, "[", i = TRUE, j = 3)) # Take the third column with all rows
```
This output is still hard to sift through. An easy way to analyze this is by
visualizing this as a heatmap. Since we only care whether or not a given locus
is in or out of HWE, we will thus define an $\alpha$ value and set everything
above that value to 1 so that we can visually detect candidate loci where we
might reject the $H_o$ of HWE.
```{r}
alpha <- 0.05
newmat <- nanhwe.mat
newmat[newmat > alpha] <- 1
```
Now we can create a simple heatmap with `levelplot`.
```{r, fig.height=6, fig.width = 10, fig.cap = "Heatmap showing significant departures from HWE"}
library("lattice")
levelplot(t(newmat))
```
This simple plot shows us loci in rows and populations in columns. Note, that
all loci shown in pink are loci suspected of not being in HWE with $p \leq
0.05$.
References
----------