---
title: "Topic Models"
author: "PLSC 31101"
date: "Fall 2020"
output: html_document
---
## Structural Topic Models
This unit gives a brief overview of the `stm` (structural topic model) package. Please read the vignette for more detail.
A structural topic model is a topic model that incorporates document-level metadata, allowing us to see how topical prevalence varies with that metadata.
```{r message=FALSE}
library(stm)
```
The data we will be using for this unit consists of all articles about women published in the New York Times and Washington Post, 1980-2014. You worked with a subset of this data in your last homework.
Load the dataset. Notice that we have the text of the articles along with some metadata.
```{r}
# Load data
women <- read.csv('data/women-full.csv')
names(women)
```
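If you want a quick feel for the metadata before modeling, here is an optional sketch (assuming the columns printed by `names(women)` include `TITLE`, `REGION`, and `YEAR`, which we use later in this unit):

```{r eval = FALSE}
# Optional: peek at the metadata columns used later in this unit
head(women[, c("TITLE", "REGION", "YEAR")])

# How many articles fall in each region?
table(women$REGION)
```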
### Preprocessing
The `stm` package has its own preprocessing functions and workflow, which I have coded below. Notice that we are going to use the `TEXT.NO.NOUN` column, which contains the full text of the articles with proper nouns removed (I removed them earlier).
```{r message=FALSE}
# Pre-process the text
temp <- textProcessor(documents = women$TEXT.NO.NOUN, metadata = women)
meta <- temp$meta
vocab <- temp$vocab
docs <- temp$documents

# Prep documents in the format stm expects
out <- prepDocuments(docs, vocab, meta)
docs <- out$documents
vocab <- out$vocab
meta <- out$meta
```
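Before moving on, it can help to check what survived preprocessing. Below is an optional sketch; `plotRemoved()` is the helper the `stm` vignette uses to explore how different vocabulary thresholds would affect the corpus (useful background for Challenge 1 below).

```{r eval = FALSE}
# How many documents and unique terms remain after preprocessing?
length(docs)
length(vocab)

# Visualize how many words, documents, and tokens would be dropped
# at a range of lower thresholds
plotRemoved(temp$documents, lower.thresh = seq(1, 200, by = 100))
```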
#### Challenge 1. {-}
Read the help file for the `prepDocuments` function. Alter the preprocessing code above to keep only words that appear in at least 10 documents.
```{r}
# YOUR CODE HERE
```
### Estimate Model
We are now going to estimate a topic model with 15 topics by regressing topical prevalence on region and year covariates.
Running the full model takes a **long** time to finish. For that reason, we are going to add the argument `max.em.its`, which caps the number of EM iterations. By keeping it low (15), we get a rough estimate of the topics quickly. You can always go back later and estimate the model to convergence.
```{r results="hide"}
model <- stm(docs, vocab, K = 15,
             prevalence = ~ REGION + s(YEAR),
             data = meta, seed = 15, max.em.its = 15)
```
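If you later want a fully converged model, one option is sketched below (not run here; the object name `model_full` is just illustrative). Dropping `max.em.its` lets the EM algorithm run up to its default limit, and `init.type = "Spectral"` is an `stm` option for a deterministic spectral initialization, so the result does not depend on the random seed.

```{r eval = FALSE}
# Estimate to convergence (slow)
model_full <- stm(docs, vocab, K = 15,
                  prevalence = ~ REGION + s(YEAR),
                  data = meta,
                  init.type = "Spectral")
```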
Let's see what our model came up with! The following tools can be used to evaluate the model:

- `labelTopics` gives the top words for each topic.
- `findThoughts` gives the top documents for each topic (the documents with the highest proportion of that topic).
```{r}
# Top Words
labelTopics(model)

# Example Docs
findThoughts(model, texts = meta$TITLE, n = 2, topics = 1:15)
```
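For a closer look at a single topic, the output of `findThoughts` can be passed to `plotQuote` to display the most representative documents. A minimal sketch (topic 1 is an arbitrary choice here):

```{r eval = FALSE}
# Plot the two most representative titles for topic 1
thoughts1 <- findThoughts(model, texts = meta$TITLE, n = 2, topics = 1)$docs[[1]]
plotQuote(thoughts1, width = 60, main = "Topic 1")
```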
#### Challenge 2. {-}
Estimate other models using 5 and 40 topics, respectively. Look at the top words for each topic. How do the topics vary when you change the number of topics?
Now look at your neighbor's model. Did you get the same results? Why or why not?
```{r}
# YOUR CODE HERE
```
### Interpret Model
Let's all load the fully estimated model that I ran before class.
```{r}
# Load the already-estimated model
load("data/stm.RData")
```
#### Challenge 3. {-}
Using the functions `labelTopics` and `findThoughts`, hand-label the 15 topics. Store these labels in a character vector called `labels`.
```{r eval = F}
# Store your hand labels below
labels = c()
```
Now look at your neighbor's labels. Did you get the same results? Why or why not?
### Analyze Topics
We are now going to see how the topics compare in terms of their prevalence across regions. What do you notice about the distribution of topic 9?
```{r}
# Corpus summary
plot.STM(model, type = "summary", custom.labels = labels, main = "")

# Estimate covariate effects
prep <- estimateEffect(1:15 ~ REGION + s(YEAR), model, meta = meta,
                       uncertainty = "Global", documents = docs)

# Plot topic 9 across regions
regions <- c("Asia", "EECA", "MENA", "Africa", "West", "LA")
plot.estimateEffect(prep, "REGION", method = "pointestimate", topics = 9,
                    printlegend = TRUE, labeltype = "custom", custom.labels = regions,
                    main = "Women's Rights", ci.level = .95, nsims = 100)
```
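Because the model also includes a smooth function of `YEAR`, we can ask how a topic's prevalence changes over time. Here is a sketch using the same `prep` object; `method = "continuous"` is the relevant option in `plot.estimateEffect`.

```{r eval = FALSE}
# Plot topic 9 prevalence over time
plot.estimateEffect(prep, covariate = "YEAR", method = "continuous", topics = 9,
                    printlegend = FALSE, main = "Women's Rights over Time")
```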