---
title: "clustext"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
  md_document:
    toc: true
---
```{r, echo=FALSE}
library(knitr)
desc <- suppressWarnings(readLines("DESCRIPTION"))
regex <- "(^Version:\\s+)(\\d+\\.\\d+\\.\\d+)"
loc <- grep(regex, desc)
ver <- gsub(regex, "\\2", desc[loc])
verbadge <- sprintf('<a href="https://img.shields.io/badge/Version-%s-orange.svg"><img src="https://img.shields.io/badge/Version-%s-orange.svg" alt="Version"/></a></p>', ver, ver)
```
```{r, echo=FALSE}
knit_hooks$set(htmlcap = function(before, options, envir) {
  if (!before) {
    paste('<p class="caption"><b><em>', options$htmlcap, "</em></b></p>", sep = "")
  }
})
knitr::opts_knit$set(self.contained = TRUE, cache = FALSE)
knitr::opts_chunk$set(fig.path = "tools/figure/")
```
[Project Status: Active](http://www.repostatus.org/#active)
[Build Status](https://travis-ci.org/trinker/clustext)
[Coverage Status](https://coveralls.io/r/trinker/clustext?branch=master)
`r verbadge`

**clustext** is a collection of optimized tools for clustering text data via various text-appropriate clustering algorithms. There are many great R [clustering tools](https://cran.r-project.org/web/views/Cluster.html) to locate topics within documents. I have had success with hierarchical clustering for topic extraction. This initial success birthed the [**hclustext**](https://github.com/trinker/hclustext) package. Additional techniques such as kmeans and non-negative matrix factorization also proved useful. These algorithms began to be collected in a consistent manner of use in the **clustext** package. This package wraps many of the great R tools for clustering and working with sparse matrices to aid in the workflow associated with topic extraction.
The general idea is that we turn the documents into a matrix of words. After this we weight the terms by importance using [tf-idf](http://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html). This helps the more salient words rise to the top. Some clustering algorithms require a similarity matrix while others require just the tf-idf weighted `DocumentTermMatrix`. Likewise, some algorithms require `k` clusters to be specified before the model is fit while others allow `k` topics to be determined after the model has been fit.
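To make the weighting concrete, here is a hedged base-R sketch of one common tf-idf formulation applied to a tiny toy count matrix (the toy terms and the exact formula are illustrative assumptions, not **clustext**'s internals):

```r
## Toy term count matrix: rows = documents, columns = terms (illustrative only)
counts <- matrix(
    c(2, 0, 1,
      0, 3, 1),
    nrow = 2, byrow = TRUE,
    dimnames = list(c("doc1", "doc2"), c("economy", "taxes", "jobs"))
)
tf <- counts / rowSums(counts)                  ## term frequency
idf <- log(nrow(counts) / colSums(counts > 0))  ## inverse document frequency
sweep(tf, 2, idf, `*`)                          ## tf-idf weights
```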
With algorithms that require a similarity matrix (e.g., hierarchical clustering) we apply cosine distance measures to compare the terms (or features) of each document. I have found cosine distance to work well with sparse matrices to produce distance metrics between the documents. The clustering model is fit to separate the documents into clusters. In the case of some clustering techniques (e.g., hierarchical clustering) the user may then apply `k` clusters to the fit, clustering documents with similar important text features. Other techniques require that `k` be specified prior to fitting the model. The documents can then be grouped by clusters and their accompanying salient words extracted as well.
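A hedged sketch of the cosine distance computation between two tf-idf weighted document vectors (again illustrative, not the package's internal implementation):

```r
## Cosine distance = 1 - cosine similarity between two document vectors
cosine_dist <- function(a, b) {
    1 - sum(a * b) / (sqrt(sum(a ^ 2)) * sqrt(sum(b ^ 2)))
}
cosine_dist(c(1, 0, 2), c(1, 1, 0))  ## ~0.68; 0 = same direction, 1 = orthogonal
```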
# Functions
The main functions, task category, & descriptions are summarized in the table below (a minimal workflow sketch follows the table):
| Function | Category | Description |
|------------------------|----------------|-------------------------------------------------------------------------|
| `data_store` | data structure | **clustext**'s data structure (list of dtm + text) |
| `hierarchical_cluster` | cluster fit | Fits a hierarchical cluster model |
| `kmeans_cluster` | cluster fit | Fits a kmeans cluster model |
| `skmeans_cluster` | cluster fit | Fits an skmeans cluster model |
| `nmf_cluster`          | cluster fit    | Fits a non-negative matrix factorization cluster model                   |
| `assign_cluster` | assignment | Assigns cluster to document/text element |
| `get_text` | extraction | Get text from various **clustext** objects |
| `get_dtm` | extraction | Get `tm::DocumentTermMatrix` from various **clustext** objects |
| `get_removed` | extraction | Get removed text elements from various **clustext** objects |
| `get_documents` | extraction | Get clustered documents from an **assign_cluster** object |
| `get_terms` | extraction | Get clustered weighted important terms from an **assign_cluster** object|
| `as_topic` | categorization | View `get_terms` object as topics (pretty printed important words) |
| `write_cluster_text` | categorization | Write `get_text(assign_cluster(myfit))` to file for human coding |
| `read_cluster_text` | categorization | Read in a human coded `write_cluster_text` file |
| `categorize` | categorization | Assign human categories and matching clusters to original data |
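As a quick orientation, a minimal workflow sketch chaining the functions above (using the `presidential_debates_2012` data demonstrated later in this README):

```r
library(clustext)

ds <- with(presidential_debates_2012, data_store(dialogue))  ## data structure
fit <- hierarchical_cluster(ds)                              ## cluster fit
ca <- assign_cluster(fit, k = 6)                             ## assignment
get_terms(ca)                                                ## extraction
```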
# Installation
To download the development version of **clustext**:
Download the [zip ball](https://github.com/trinker/clustext/zipball/master) or [tar ball](https://github.com/trinker/clustext/tarball/master), decompress and run `R CMD INSTALL` on it, or use the **pacman** package to install the development version:
```r
if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh(
    "trinker/textshape",
    "trinker/gofastr",
    "trinker/termco",
    "trinker/clustext"
)
```
# Contact
You are welcome to:
* submit suggestions and bug-reports at: <https://github.com/trinker/clustext/issues>
* send a pull request on: <https://github.com/trinker/clustext/>
* compose a friendly e-mail to: <[email protected]>
# Demonstration
## Load Packages and Data
```{r}
if (!require("pacman")) install.packages("pacman")
pacman::p_load(clustext, dplyr, textshape, ggplot2, tidyr)
data(presidential_debates_2012)
```
## Data Structure
The data structure for **clustext** is very specific. The `data_store` function produces a `DocumentTermMatrix` which maps to the original text. The empty/removed documents are tracked within this data structure, making subsequent calls to cluster the original documents and produce weighted important terms more robust. Making the `data_store` object is the first step of the analysis.

We can give the `DocumentTermMatrix` rownames via the `doc.names` argument. If these names are not unique they will be combined into a single document as seen below. Also, if you want stemming, a minimum character length, stopword removal, or the like, this is when/where it's done.
```{r}
ds <- with(
    presidential_debates_2012,
    data_store(dialogue, doc.names = paste(person, time, sep = "_"))
)
ds
```
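If you do want stopword removal or a minimum character length, a variant of the call above might look like the following (a hedged sketch using only the `stopwords` and `min.char` arguments that appear in the experiment later in this README):

```r
ds_clean <- with(
    presidential_debates_2012,
    data_store(
        dialogue,
        doc.names = paste(person, time, sep = "_"),
        stopwords = tm::stopwords("english"),  ## remove English stopwords
        min.char = 3                           ## drop terms under 3 characters
    )
)
```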
## Fit the Model: Hierarchical Cluster
Next we can fit a hierarchical cluster model to the `data_store` object via `hierarchical_cluster`.
```{r}
myfit <- hierarchical_cluster(ds)
myfit
```
This object can be plotted with various `k` or `h` parameters specified to experiment with cutting the dendrogram. This cut will determine the number of clusters or topics that will be generated in the next step. The visual inspection allows for determining how to cluster the data as well as determining if a tf-idf, cosine, hierarchical cluster model is a right fit for the data and task. By default `plot` uses an approximation of `k` based on Can & Ozkarahan's (1990) formula $(m \times n)/t$ where $m$ and $n$ are the dimensions of the matrix and $t$ is the number of non-zero elements in the matrix $A$.
- Can, F., Ozkarahan, E. A. (1990). Concepts and effectiveness of the cover-coefficient-based clustering methodology for text databases. *ACM Transactions on Database Systems 15* (4): 483. doi:10.1145/99935.99938
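For concreteness, a hedged sketch (not the package's internal code) of that heuristic applied to the `DocumentTermMatrix` stored in `ds`, assuming `get_dtm` returns a sparse `tm::DocumentTermMatrix` (a `slam` triplet matrix whose `$v` slot holds the non-zero entries):

```r
dtm <- get_dtm(ds)
m <- nrow(dtm)      ## number of documents
n <- ncol(dtm)      ## number of terms
t <- length(dtm$v)  ## number of non-zero entries in the sparse matrix
round((m * n) / t)  ## approximate k
```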
Interestingly, in the plots below, where `k = 6` clusters are used, the model groups each of the candidates together at each of the debate times.
```{r}
plot(myfit)
plot(myfit, k=6)
plot(myfit, h = .75)
```
## Assigning Clusters
The `assign_cluster` function allows the user to dictate the number of clusters. Because the model has already been fit, the cluster assignment is merely selecting the branches from the dendrogram, and is thus very quick. Unlike in many clustering techniques, the number of clusters is chosen after the model is fit; this allows for speedy cluster assignment, meaning the user can experiment with the number of clusters.
```{r}
ca <- assign_cluster(myfit, k = 6)
ca
```
### Cluster Loading
To check the number of documents loading on a cluster there is a `summary` method for `assign_cluster` which provides a descending data frame of clusters and counts. Additionally, a horizontal bar plot shows the document loadings on each cluster.
```{r}
summary(ca)
```
### Cluster Text
The user can grab the texts from the original documents grouped by cluster using the `get_text` function. Here I demo a 40-character substring of the document texts.
```{r}
get_text(ca) %>%
    lapply(substring, 1, 40)
```
### Cluster Frequent Terms
As with many topic clustering techniques, it is useful to get the most salient terms from the model. The `get_terms` function uses the [min-max](https://en.wikipedia.org/wiki/Feature_scaling#Rescaling) scaled, [tf-idf weighted](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) `DocumentTermMatrix` to extract the most frequent salient terms. These terms can give a sense of the topic being discussed. Notice the absence of clusters 1 & 6. This is a result of only a single document being included in each of these clusters. The `term.cutoff` hyperparameter sets the lower bound on the min-max scaled tf-idf to accept. If you don't get any terms you may want to lower this or reduce `min.n`. Likewise, these two parameters can be raised to eliminate noise.
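For reference, min-max rescaling maps the tf-idf weights onto the [0, 1] interval; a hedged one-liner sketch:

```r
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
min_max(c(2, 5, 9))  ## 0.000 0.429 1.000
```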
```{r}
get_terms(ca)
```
Or pretty printed...
```{r}
get_terms(ca) %>%
    as_topic()
```
### Clusters, Terms, and Docs Plot
Here I plot the clusters, terms, and documents (grouping variables) together as a combined heatmap. This can be useful for viewing & comparing what documents are clustering together in the context of the cluster's salient terms. This example also shows how to use the cluster terms as a lookup key to extract probable salient terms for a given document.
```{r, fig.width=11}
key <- data_frame(
    cluster = 1:6,
    labs = get_terms(ca) %>%
        tidy_list("cluster") %>%
        select(-weight) %>%
        group_by(cluster) %>%
        summarize(term = paste(term, collapse = ", ")) %>%
        apply(1, paste, collapse = ": ")
)

ca %>%
    tidy_vector("id", "cluster") %>%
    separate(id, c("person", "time"), sep = "_") %>%
    tbl_df() %>%
    left_join(key, by = "cluster") %>%
    mutate(n = 1) %>%
    mutate(labs = factor(labs, levels = rev(key[["labs"]]))) %>%
    unite("time_person", time, person, sep = "\n") %>%
    select(-cluster) %>%
    complete(time_person, labs) %>%
    mutate(n = factor(ifelse(is.na(n), FALSE, TRUE))) %>%
    ggplot(aes(time_person, labs, fill = n)) +
        geom_tile() +
        scale_fill_manual(values = c("grey90", "red"), guide = FALSE) +
        labs(x = NULL, y = NULL)
```
### Cluster Documents
The `get_documents` function grabs the documents associated with a particular cluster. This is most useful in cases where the number of documents is small and they have been given names.
```{r}
get_documents(ca)
```
## Putting it Together
I like working in a chain. In the setup below we work within a **magrittr** pipeline to fit a model, select clusters, and examine the results. In this example I do not condense the 2012 Presidential Debates data by speaker and time, but rather leave every sentence as a separate document. On my machine the initial `data_store` and model fit take ~5-8 seconds to run. Note that I restrict the number of clusters (for texts and terms) to a random 5 clusters for the sake of space.
```{r, fig.height = 10}
.tic <- Sys.time()

myfit2 <- presidential_debates_2012 %>%
    with(data_store(dialogue)) %>%
    hierarchical_cluster()

difftime(Sys.time(), .tic)

## View Document Loadings
ca2 <- assign_cluster(myfit2, k = 100)
summary(ca2) %>%
    head(12)

## Split Text into Clusters
set.seed(5); inds <- sort(sample.int(100, 5))

get_text(ca2)[inds] %>%
    lapply(head, 10)

## Get Associated Terms
get_terms(ca2, .4)[inds]

## Pretty Printed Topics
get_terms(ca2, .4) %>%
    as_topic()
```
## An Experiment
It seems to me that if the hierarchical clustering is functioning as expected we'd see topics clustering together within a conversation, as the natural ebb and flow of a conversation is to talk around a topic for a while and then move on to the next related topic. A Gantt-style plot of topics across time seems like an excellent way to observe clustering across time. In the experiment I first ran the hierarchical clustering at the sentence level for all participants in the 2012 presidential debates data set. I then decided to use turn of talk as the unit of analysis. Finally, I pulled out the two candidates (President Obama and Romney) and faceted on their topic use over time.
```{r, fig.width=10, fig.height = 9}
if (!require("pacman")) install.packages("pacman")
pacman::p_load(dplyr, clustext, textshape, ggplot2, stringi)
myfit3 <- presidential_debates_2012 %>%
mutate(tot = gsub("\\..+$", "", tot)) %>%
with(data_store(dialogue)) %>%
hierarchical_cluster()
plot(myfit3, 75)
```
Can & Ozkarahan's (1990) formula indicated a `k = 259`. This umber seemed overly large. I used `k = 75` for the number of topics as it seemed unreasonable that there'd be more topics than this but with `k = 75` over half of the sentences loaded on one cluster. Note the use of the `attribute` `join` from `assign_cluster` to make joining back to the original data set easier.
```{r, fig.width=14, fig.height = 16}
k <- 75
ca3 <- assign_cluster(myfit3, k = k)

presidential_debates_2012 %>%
    mutate(tot = gsub("\\..+$", "", tot)) %>%
    tbl_df() %>%
    attributes(ca3)$join() %>%
    group_by(time) %>%
    mutate(
        word_count = stringi::stri_count_words(dialogue),
        start = starts(word_count),
        end = ends(word_count)
    ) %>%
    na.omit() %>%
    mutate(cluster = factor(cluster, levels = k:1)) %>%
    ggplot2::ggplot(ggplot2::aes(x = start - 2, y = cluster, xend = end + 2, yend = cluster)) +
    ggplot2::geom_segment(color = 'white', size = 3) +  ## `position` is not a segment aesthetic; dropped
    ggplot2::theme_bw() +
    ggplot2::theme(
        panel.background = ggplot2::element_rect(fill = 'grey20'),
        panel.grid.minor.x = ggplot2::element_blank(),
        panel.grid.major.x = ggplot2::element_blank(),
        panel.grid.minor.y = ggplot2::element_blank(),
        panel.grid.major.y = ggplot2::element_line(color = 'grey35'),
        strip.text.y = ggplot2::element_text(angle = 0, hjust = 0),
        strip.background = ggplot2::element_blank()
    ) +
    ggplot2::facet_wrap(~time, scales = 'free', ncol = 1) +
    ggplot2::labs(x = "Duration (words)", y = "Cluster")
```
Right away we notice that not all topics are used across all three times. This is encouraging: the clustering appears to be working as expected, as we'd expect some overlap in debate topics as well as some unique topics. However, so many sentences loaded on cluster 3 that I had to make some decisions. I could (a) ignore this mass and essentially throw out half the data that loaded on a single cluster, (b) increase `k` to split up the mass loading on cluster 3, or (c) change the unit of analysis. The first option seemed wasteful of data and could miss information. The second approach could lead to a model with so many topics it wouldn't be meaningful. The last approach seemed reasonable: inspecting the cluster text showed that many sentences were capturing functions of language rather than content. For example, people use *"Oh."* to indicate agreement. This isn't a topic, but the clustering would group sentences that use this convention together. Combining such sentences with the other sentences in the turn of talk is more likely to capture the content we're after.
Next I used the `textshape::combine` function to group turns of talk together.
```{r, fig.width=10, fig.height = 9}
myfit4 <- presidential_debates_2012 %>%
    mutate(tot = gsub("\\..+$", "", tot)) %>%
    textshape::combine() %>%
    with(data_store(dialogue, stopwords = tm::stopwords("english"), min.char = 3)) %>%
    hierarchical_cluster()

plot(myfit4, k = 80)
```
The distribution of turns of talk looked much more dispersed across clusters. I used `k = 80` for the number of topics.
```{r, fig.width=14, fig.height = 16}
k <- 80
ca4 <- assign_cluster(myfit4, k = k)

presidential_debates_2012 %>%
    mutate(tot = gsub("\\..+$", "", tot)) %>%
    textshape::combine() %>%
    tbl_df() %>%
    attributes(ca4)$join() %>%
    group_by(time) %>%
    mutate(
        word_count = stringi::stri_count_words(dialogue),
        start = starts(word_count),
        end = ends(word_count)
    ) %>%
    na.omit() %>%
    mutate(cluster = factor(cluster, levels = k:1)) %>%
    ggplot2::ggplot(ggplot2::aes(x = start - 2, y = cluster, xend = end + 2, yend = cluster)) +
    ggplot2::geom_segment(color = 'white', size = 3) +  ## `position` is not a segment aesthetic; dropped
    ggplot2::theme_bw() +
    ggplot2::theme(
        panel.background = ggplot2::element_rect(fill = 'grey20'),
        panel.grid.minor.x = ggplot2::element_blank(),
        panel.grid.major.x = ggplot2::element_blank(),
        panel.grid.minor.y = ggplot2::element_blank(),
        panel.grid.major.y = ggplot2::element_line(color = 'grey35'),
        strip.text.y = ggplot2::element_text(angle = 0, hjust = 0),
        strip.background = ggplot2::element_blank()
    ) +
    ggplot2::facet_wrap(~time, scales = 'free', ncol = 1) +
    ggplot2::labs(x = "Duration (words)", y = "Cluster")
```
The plots looked less messy and indeed topics do appear to be clustering around one another. I wanted to see how the primary participants, the candidates, compared to each other in topic use.
In this last bit of analysis I filter out all participants except Obama and Romney and facet by participant across time.
```{r, fig.width=10, fig.height = 9}
myfit5 <- presidential_debates_2012 %>%
    mutate(tot = gsub("\\..+$", "", tot)) %>%
    textshape::combine() %>%
    filter(person %in% c("ROMNEY", "OBAMA")) %>%
    with(data_store(dialogue, stopwords = tm::stopwords("english"), min.char = 3)) %>%
    hierarchical_cluster()

plot(myfit5, 50)
```
Based on the dendrogram, I used `k = 50` for the number of topics.
```{r, fig.width=14, fig.height = 12}
k <- 50
ca5 <- assign_cluster(myfit5, k = k)

presidential_debates_2012 %>%
    mutate(tot = gsub("\\..+$", "", tot)) %>%
    textshape::combine() %>%
    filter(person %in% c("ROMNEY", "OBAMA")) %>%
    tbl_df() %>%
    attributes(ca5)$join() %>%
    group_by(time) %>%
    mutate(
        word_count = stringi::stri_count_words(dialogue),
        start = starts(word_count),
        end = ends(word_count)
    ) %>%
    na.omit() %>%
    mutate(cluster = factor(cluster, levels = k:1)) %>%
    ggplot2::ggplot(ggplot2::aes(x = start - 10, y = cluster, xend = end + 10, yend = cluster)) +
    ggplot2::geom_segment(color = 'white', size = 3) +  ## `position` is not a segment aesthetic; dropped
    ggplot2::theme_bw() +
    ggplot2::theme(
        panel.background = ggplot2::element_rect(fill = 'grey20'),
        panel.grid.minor.x = ggplot2::element_blank(),
        panel.grid.major.x = ggplot2::element_blank(),
        panel.grid.minor.y = ggplot2::element_blank(),
        panel.grid.major.y = ggplot2::element_line(color = 'grey35'),
        strip.text.y = ggplot2::element_text(angle = 0, hjust = 0),
        strip.background = ggplot2::element_blank()
    ) +
    ggplot2::facet_grid(person ~ time, scales = 'free', space = 'free') +
    ggplot2::labs(x = "Duration (words)", y = "Cluster")
```
If you're curious about the heaviest weighted tf-idf terms in each cluster the next code chunk provides the top five weighted terms used in each cluster. Below this I provide a bar plot of the frequencies of clusters to help put the other information into perspective.
```{r}
invisible(Map(function(x, y) {
    if (is.null(x)) {
        cat(sprintf("Cluster %s: ...\n", y))
    } else {
        m <- dplyr::top_n(x, 5, n)
        o <- paste(paste0(m[[1]], " (", round(m[[2]], 1), ")"), collapse = "; ")
        cat(sprintf("Cluster %s: %s\n", y, o))
    }
}, get_terms(ca5, .4), names(get_terms(ca5, .4))))
```
```{r, fig.height = 9}
invisible(summary(ca5))
```
It appears that in fact the topics do cluster within segments of time as we'd expect. This is more apparent when turn of talk is used as the unit of analysis (document level) rather than each sentence.