-
Notifications
You must be signed in to change notification settings - Fork 17
/
Copy pathmaximum-likelihood.Rmd
311 lines (227 loc) · 14 KB
/
maximum-likelihood.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
# (PART\*) Estimation {-}
# Maximum Likelihood
This is a brief refresher on <span class="emph">maximum likelihood estimation</span> using a standard regression approach as an example, and more or less assumes one hasn't tried to roll their own such function in a programming environment before. Given the likelihood's role in Bayesian estimation and statistics in general, and the ties between specific Bayesian results and maximum likelihood estimates one typically comes across, one should be conceptually comfortable with some basic likelihood estimation. The following is taken directly from [my document](https://m-clark.github.io/bayesian-basics/appendix.html#maximum-likelihood-review) with mostly just cleaned up code and visualization. The TLDR version can be viewed in the [Linear Regression][Linear Regression] chapter.
In the standard model setting we attempt to find parameters $\theta$ that will maximize the probability of the data we actually observe. We'll start with an observed random target vector $y$ with $i...N$ independent and identically distributed observations and some data-generating process underlying it $f(\cdot|\theta)$. We are interested in estimating the model parameter(s), $\theta$, that would make the data most likely to have occurred. The probability density function for $y$ given some particular estimate for the parameters can be noted as $f(y_i|\theta)$. The joint probability distribution of the (independent) observations given those parameters, $f(y_i|\theta)$, is the product of the individual densities, and is our *likelihood function*. We can write it out generally as:
$$\mathcal{L}(\theta) = \prod_{i=1}^N f(y_i|\theta)$$
Thus, the *likelihood* for one set of parameter estimates given a fixed set of data y, is equal to the probability of the data given those (fixed) estimates. Furthermore, we can compare one set, $\mathcal{L}(\theta_A)$, to that of another, $\mathcal{L}(\theta_B)$, and whichever produces the greater likelihood would be the preferred set of estimates. We can get a sense of this with the following visualization, based on a single parameter. The data is drawn from Poisson distributed variable with mean $\theta=5$. We note the calculated likelihood increases as we estimate values for $\theta$ closer to $5$, or more precisely, whatever the mean observed value is for the data. However, with more and more data, the final ML estimate will converge on the true value.
```{r ml-illustration, echo=FALSE, out.width='500px', message=T}
set.seed(1234)
y = rpois(100000, lambda = 5)
mus = seq(3, 8, l = 100)
L = map_dbl(mus, function(mu) sum(dpois(y, lambda = mu, log = T)))
message(glue::glue('Final estimate = ', round(mus[L == max(L)], 2)))
ggplot(data.frame(mus, L)) +
geom_vline(aes(xintercept = 5), alpha = .5, lty = 2) +
geom_hline(aes(yintercept = max(L)), alpha = .5, lty = 2) +
geom_path(aes(x = mus, y = L), color = '#ff5500', lwd = 2) +
labs(x = expression(theta), y = 'Likelihood') +
visibly::theme_clean(center_axis_labels = TRUE) +
theme(
axis.ticks.y = element_blank(),
axis.text.y = element_blank(),
axis.text.x = element_text(color = 'gray10', size = 14),
axis.title.x = element_text(color = 'gray10', size = 24)
)
# ggsave('img/maxLikeNormalCompareLLforDifferentMeans.svg', bg='transparent')
```
For computational reasons, we instead work with the sum of the natural log probabilities, and so deal with the *log likelihood*:
$$\ln\mathcal{L}(\theta) = \sum_{i=1}^N \ln[f(y_i|\theta)]$$
Concretely, we calculate a log likelihood for each observation and then sum them for the total likelihood for parameter(s) $\theta$.
The likelihood function incorporates our assumption about the sampling distribution of the data given some estimate for the parameters. It can take on many forms and be notably complex depending on the model in question, but once specified, we can use any number of optimization approaches to find the estimates of the parameter that make the data most likely. As an example, for a normally distributed variable of interest we can write the log likelihood as follows:
$$\ln\mathcal{L}(\theta) = \sum_{i=1}^N \ln[\frac{1}{\sqrt{2\pi\sigma^2}}\exp(-\frac{(y-\mu)^2}{2\sigma^2})]$$
### Example
In the following we will demonstrate the maximum likelihood approach to estimation for a simple setting incorporating a normal distribution, where we estimate the mean and variance/sd for a set of values $y$. First the data is created, and then we create the function that will compute the log likelihood. Using the built in R distributions makes it fairly straightforward to create our own likelihood function and feed it into an optimization function to find the best parameters. We will set things up to work with the <span class="pack">bbmle</span> package, which has some nice summary functionality and other features. However, one should take a glance at <span class="func">optim</span> and the other underlying functions that do the work.
```{r ml-bblmDemo1}
# for replication
set.seed(1234)
# create the data
y = rnorm(1000, mean = 5, sd = 2)
starting_values = c(0, 1)
# the log likelihood function
simple_ll <- function(mu, sigma, verbose = TRUE) {
ll = sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
if (verbose)
message(paste(mu, sigma, ll))
-ll
}
```
The <span class="func">simple_ll</span> function takes starting points for the parameters as arguments, in this case we call them $\mu$ and $\sigma$, which will be set to `r starting_values[1]` and `r starting_values[2]` respectively. Only the first line (ll = -sum...) is actually necessary, and we use <span class="func">dnorm</span> to get the density for each point. Since this optimizer is by default minimization, we reverse the sign of the sum so as to minimize the negative log likelihood, which is the same as maximizing the likelihood. Note that the bit of other code just allows you to see the estimates as the optimization procedure searches for the best values. I do not show that here but you'll see it in your console if `trace = TRUE`.
We are now ready to obtain maximum likelihood estimates for the parameters. For comparison we will use <span class="pack" style = "">bbmle</span> due to its nice summary result, but you can use optim as in the other demonstrations. For the <span class="func">mle2</span> function we will need the function we've created, plus other inputs related to that function or the underlying optimizing function used (by default <span class="func">optim</span>). In this case we will use an optimization procedure that will allow us to set a lower bound for $\sigma$. This isn't strictly necessary, but otherwise you would get warnings and possibly lack of convergence if negative estimates for $\sigma$ were allowed.
```{r ml-bblmDemo2}
# using optim, and L-BFGS-B so as to constrain sigma to be positive by setting
# the lower bound at zero
mlnorm = bbmle::mle2(
simple_ll,
start = list(mu = 2, sigma = 1),
method = "L-BFGS-B",
lower = c(sigma = 0),
trace = TRUE
)
mlnorm
# compare to an intercept only regression model
summary(lm(y~1))
```
We can see that the ML estimates are the same as the `lm` model estimates based on least squares, and which given the sample size are close to the true values.
In terms of the parameters we estimate, instead of the curve presented previously, in the typical case of two or more parameters we can think of a <span class="emph">likelihood surface</span> that represents the possible likelihood values given any particular set of estimates. Given some starting point, the optimization procedure then travels along the surface looking for a minimum/maximum point. For simpler settings such as this, we can visualize the likelihood surface and its minimum point. The optimizer travels along this surface until it finds a minimum (the surface plot is interactive- feel free to adjust). I also plot the path of the optimizer from a top down view. The large dot noted represents the minimum negative log likelihood.
```{r ml-plotSurface, echo=FALSE}
mu = seq(4, 6, length = 50)
sigma = seq(1.5, 3, length = 50)
llsurf = matrix(NA, length(mu), length(sigma))
for (i in 1:length(mu)){
for (j in 1:length(sigma)){
llsurf[i,j] = -sum(dnorm(y, mean = mu[i], sd = sigma[j], log = TRUE))
}
}
rownames(llsurf) = mu
colnames(llsurf) = sigma
plotdat = read.csv('data/mle_est.csv')
pointdat = filter(plotdat, mu <=6 & mu >=4 & sigma<= 3 & sigma >=1.5) %>%
mutate(ll = -ll)
# because plotly legend doesn't work correctly with surface, color scale dropped, not needed anyway https://community.plot.ly/t/how-to-name-axis-and-show-legend-in-mesh3d-and-surface-3d-plots/1819
library(plotly)
llsurf_trans = t(llsurf)
palettes = visibly::palettes
plot_ly(
x = ~ mu,
y = ~ sigma,
z = ~ llsurf_trans,
colors = viridis::plasma(500),
type = 'surface',
showscale = FALSE,
width = 750
) %>%
add_trace(
x = ~ mu,
y = ~ sigma,
z = ~ ll,
mode = 'markers',
marker = list(color = palettes$orange, size = 3),
type = "scatter3d",
data = pointdat,
name = NULL,
showlegend = F
) %>%
add_trace(
x = coef(mlnorm)[1],
y = coef(mlnorm)[2],
z = mlnorm@min,
marker = list(color = palettes$orange$complementary, size = 10),
type = "scatter3d",
showlegend = F
) %>%
visibly::theme_plotly() %>%
layout(
scene = list(
xaxis = list(
title = 'mu',
titlefont = list(size = 12),
tickfont = list(size = 10),
dtick = .25
),
yaxis = list(
title = 'sigma',
titlefont = list(size = 12),
tickfont = list(size = 10),
dtick = .25
),
zaxis = list(
title = 'Neg. Log Lik ',
titlefont = list(size = 8),
tickfont = list(size = 10),
position = .1
)
),
paper_bgcolor = 'rgba(0,0,0,0)',
plot_bgcolor = 'rgba(0,0,0,0)'
)
detach(package:plotly)
```
```{r ml-plotNormMLEpath, echo=FALSE, out.width='500px'}
plotdat$nll = -plotdat$ll
ggplot(aes(x = mu, y = sigma), data = plotdat) +
geom_point(
aes(),
color = '#ff5500',
size = 4,
alpha = .25,
show.legend = F,
position = position_jitter(w = 0.2, h = 0.2)
) +
scale_size_continuous(range = c(.1, 5)) +
geom_point(
aes(),
col = palettes$orange$complementary[2],
data = data.frame(t(coef(mlnorm))),
size = 10,
alpha = .9
) +
labs(x = expression(mu),
y = expression(sigma),
caption = "A bit of jitter was added to the points to better see what's going on.") +
theme(
axis.ticks.y = element_blank(),
axis.text.y = element_blank(),
axis.text.x = element_text(color = 'gray10', size = 14),
axis.title.x = element_text(color = 'gray10', size = 24),
axis.title.y = element_text(color = 'gray10', size = 24)
)
# ggsave('img/mlnorm.svg', bg='transparent')
```
Please note that there are many other considerations in optimization completely ignored here, but for our purposes and the audience for which this is intended, we do not want to lose sight of the forest for the trees. We now move next to a slightly more complicated regression example.
## Linear Model
In the standard regression context, our expected value for the target variable comes from our linear predictor, i.e. the weighted combination of our explanatory variables, and we estimate the regression weights/coefficients and possibly other relevant parameters. We can expand our previous example to the standard linear model without too much change. In this case we estimate a mean for each observation, but otherwise assume the variance is constant across observations. Again, we first construct some data so that we know exactly what to expect, then write out the likelihood function with starting parameters. As we need to estimate our intercept and coefficient for the X predictor (collectively referred to as $\beta$), we can think of our likelihood explicitly as before:
$$\ln\mathcal{L}(\beta, \sigma^2) = \sum_{i=1}^N \ln[\frac{1}{\sqrt{2\pi\sigma^2}}\exp(-\frac{(y-X\beta)^2}{2\sigma^2})]$$
```{r ml-bblmeReg}
# for replication
set.seed(1234)
# predictor
X = rnorm(1000)
# coefficients for intercept and predictor
beta = c(5, 2)
# add intercept to X and create y with some noise
y = cbind(1, X) %*% beta + rnorm(1000, sd = 2.5)
regression_ll <- function(sigma = 1, Int = 0, b1 = 0) {
coefs = c(Int, b1)
mu = cbind(1,X)%*%coefs
ll = -sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
message(paste(sigma, Int, b1, ll))
ll
}
mlopt = bbmle::mle2(regression_ll, method = "L-BFGS-B", lower = c(sigma = 0))
summary(mlopt)
# plot(profile(mlopt), absVal=F)
modlm = lm(y ~ X)
summary(modlm)
- 2 * logLik(modlm)
```
As before, our estimates and final log likelihood value are about where they should be, and reflect the `lm` output, as the OLS estimates are the maximum likelihood estimates. The visualization becomes more difficult beyond two parameters, but we can examine slices similar to the previous plot.
```{r ml-misc, echo=FALSE}
plotdat = read.csv('..//bayesian-basics/data//mleEstimates.csv')
colnames(plotdat) = c('sigma', 'Intercept', 'β', 'll')
plotdat$nll = -plotdat$ll
plotdat %>%
select(-ll) %>%
pivot_longer(-c(sigma, nll), names_to = 'variable') %>%
ggplot() +
geom_point(
aes(x = sigma, y = value, size = nll),
color = palettes$orange$orange,
alpha = .15,
show.legend = F,
position = position_jitter(w = 0.2, h = 0.2)
) +
facet_wrap( ~ variable) +
scale_size_continuous(range = c(.5, 5)) +
labs(x = expression(sigma)) +
visibly::theme_clean(center_axis_labels = T) +
theme(
axis.ticks.y = element_blank(),
axis.text.x = element_text(color = 'gray10', size = 14),
axis.title.x = element_text(color = 'gray10', size = 24)
)
# ggsave('img/mlreg.svg', bg='transparent')
```
To move to generalized linear models, very little changes of the process outside of the distribution assumed and that we are typically modeling a function of the target variable (e.g. $\log(y)=X\beta; \mu = e^{X\beta}$).
## Source
Original code available at:
https://m-clark.github.io/bayesian-basics/appendix.html#maximum-likelihood-review