-
Notifications
You must be signed in to change notification settings - Fork 17
/
Copy pathbayesian-non-parametric.Rmd
204 lines (142 loc) · 5.44 KB
/
bayesian-non-parametric.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
# Bayesian Nonparametric Models
The following provides some conceptual code for the [Chinese restaurant](https://en.wikipedia.org/wiki/Chinese_restaurant_process) and [Indian buffet process](https://en.wikipedia.org/wiki/Chinese_restaurant_process#The_Indian_buffet_process) for categorical and continuous/combinations of categorical latent variables respectively. For more detail, see the Bayesian nonparametric section of my [structural equation modeling document](https://m-clark.github.io/sem/bayesian-nonparametric-models.html).
## Chinese Restaurant Process
To start, we have a couple functions demonstrating the Chinese restaurant process. The first is succinct and more conceptual, but notably slower.
```{r crp-func-1}
crp <- function(alpha, n) {
table_assignments = 1
for (i in 2:n){
table_counts = table(table_assignments) # counts of table assignments
nt = length(table_counts) # number of tables
table_prob = table_counts/(i - 1 + alpha) # probabilities of previous table assignments
# sample assignment based on probability of current tables and potential next table
current_table_assignment = sample(1:(nt+1), 1, prob = c(table_prob, 1 - sum(table_prob)))
# concatenate new to previous table assignments
table_assignments = c(table_assignments, current_table_assignment)
}
table_assignments
}
```
The following function is similar to the restaurant function here https://github.com/mcdickenson/shinyapps, and notably faster.
```{r crp-func-2}
crpF <- function(alpha, n) {
table_assignments = c(1, rep(NA, n-1))
table_counts = 1
for (i in 2:n){
init = c(table_counts, alpha)
table_prob = init/sum(init)
current_table_assignment = sample(seq_along(init), 1, prob = table_prob)
table_assignments[i] = current_table_assignment
if (current_table_assignment == length(init)) {
table_counts[current_table_assignment] = 1
} else {
table_counts[current_table_assignment] = table_counts[current_table_assignment] + 1
}
}
table_assignments
}
# library(microbenchmark)
# test = microbenchmark(crp(alpha = 1, n = 1000),
# crpF(alpha = 1, n = 1000), times = 100)
# test
# ggplot2::autoplot(test)
```
Visualize some examples at a given setting.
```{r vis-crp}
out = replicate(5 , crpF(alpha = 1, n = 500), simplify = FALSE)
library(tidyverse)
map_df(
out,
function(x) data.frame(table(x)),
.id = 'result'
) %>%
rename(cluster = x) %>%
ggplot(aes(cluster, Freq)) +
geom_col() +
facet_grid(~ result)
```
Visualize cluster membership. With smaller `alpha`, there is more tendency to stick to fewer clusters.
```{r vis-crp2}
set.seed(123)
n = 100
crp_1 = crp(alpha = 1, n = n)
crp_1_mat = matrix(0, nrow = n, ncol = n_distinct(crp_1))
for (i in 1:n_distinct(crp_1)) {
crp_1_mat[, i] = ifelse(crp_1 == i, 1, 0)
}
crp_4 = crp(alpha = 5, n = n)
crp_4_mat = matrix(0, nrow = n, ncol = n_distinct(crp_4))
for (i in 1:n_distinct(crp_4)) {
crp_4_mat[, i] = ifelse(crp_4 == i, 1, 0)
}
```
```{r crp-heatmaps}
heatmaply::heatmaply(
crp_1_mat,
Rowv = FALSE,
Colv = FALSE,
colors = scico::scico(n = 256, alpha = 1, begin = 0, end = 1),
width = 400
)
heatmaply::heatmaply(
crp_4_mat,
Rowv = FALSE,
Colv = FALSE,
colors = scico::scico(n = 256, alpha = 1, begin = 0, end = 1),
width = 400
)
```
## Indian Buffet Process
The following demonstrates the Indian buffet process for continuous latent variable settings.
```{r ibp-func}
ibp <- function(alpha, N){
# preallocate assignments with upper bound of N*alpha number of latent factors
assignments = matrix(NA, nrow = N, ncol = N*alpha)
# start with some dishes/assignments
dishes = rpois(1, alpha)
zeroes = ncol(assignments) - dishes # fill in the rest of potential dishes
assignments[1, ] = c(rep(1, dishes), rep(0, zeroes))
for(i in 2:N){
prev = i - 1
# esoteric line that gets the last dish sampled without a search for it
last_previously_sampled_dish = sum(colSums(assignments[1:prev, , drop = FALSE]) > 0)
# initialize
dishes_previously_sampled = matrix(0, nrow=1, ncol=last_previously_sampled_dish)
# calculate probability of sampling from previous dishes
dish_prob = colSums(assignments[1:prev, 1:last_previously_sampled_dish, drop = FALSE]) / i
dishes_previously_sampled[1, ] = rbinom(n = last_previously_sampled_dish,
size = 1,
prob = dish_prob)
# sample new dish and assign based on results
new_dishes = rpois(1, alpha/i)
zeroes = ncol(assignments) - (last_previously_sampled_dish + new_dishes)
assignments[i,] = c(dishes_previously_sampled, rep(1,new_dishes), rep(0, zeroes))
}
# return only the dimensions sampled
last_sampled_dish = sum(colSums(assignments[1:prev,]) > 0)
assignments[, 1:last_sampled_dish]
}
```
As before, we can compare different settings.
```{r vis-ibp}
set.seed(123)
ibp_1 = ibp(1, 100)
ibp_4 = ibp(5, 100)
heatmaply::heatmaply(
ibp_1,
Rowv = FALSE,
Colv = FALSE,
colors = scico::scico(n = 256, alpha = 1, begin = 0, end = 1),
width = 400
)
heatmaply::heatmaply(
ibp_4,
Rowv = FALSE,
Colv = FALSE,
colors = scico::scico(n = 256, alpha = 1, begin = 0, end = 1),
width = 400
)
```
## Source
Original code available at:
https://github.com/m-clark/Miscellaneous-R-Code/blob/master/ModelFitting/crp.R