---
title: "Topic Models"
author: "PLSC 31101"
date: "Fall 2020"
output: html_document
---
## Structural Topic Models
This unit gives a brief overview of the `stm` (structural topic model) package. Please read the vignette for more detail.
A structural topic model is a topic model that incorporates document-level metadata, allowing us to see how topical prevalence varies with that metadata.
```{r message=FALSE}
library(stm)
```
The data we will be using for this unit consists of all articles about women published in the New York Times and Washington Post, 1980-2014. You worked with a subset of this data in your last homework.
Load the dataset. Notice that we have the text of the articles along with some metadata.
```{r}
# Load data
women <- read.csv('data/women-full.csv')
names(women)
```
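If you want a quick feel for the metadata before modeling, here is an optional sketch (assuming the columns printed by `names(women)` include `TITLE`, `REGION`, and `YEAR`, which we use later in this unit):

```{r eval = FALSE}
# Optional: peek at the metadata columns used later in this unit
head(women[, c("TITLE", "REGION", "YEAR")])

# How many articles fall in each region?
table(women$REGION)
```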
### Preprocessing
The `stm` package has its own preprocessing functions and workflow, which I have coded below. Notice that we are going to use the `TEXT.NO.NOUN` column, which contains the full text of the articles with proper nouns removed (I removed them earlier).
```{r message=FALSE}
# Pre-process the text
temp <- textProcessor(documents = women$TEXT.NO.NOUN, metadata = women)
meta <- temp$meta
vocab <- temp$vocab
docs <- temp$documents

# Prep documents in the format stm expects
out <- prepDocuments(docs, vocab, meta)
docs <- out$documents
vocab <- out$vocab
meta <- out$meta
```
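Before moving on, it can help to check what survived preprocessing. Below is an optional sketch; `plotRemoved()` is the helper the `stm` vignette uses to explore how different vocabulary thresholds would affect the corpus (useful background for Challenge 1 below).

```{r eval = FALSE}
# How many documents and unique terms remain after preprocessing?
length(docs)
length(vocab)

# Visualize how many words, documents, and tokens would be dropped
# at a range of lower thresholds
plotRemoved(temp$documents, lower.thresh = seq(1, 200, by = 100))
```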
#### Challenge 1. {-}
Read the help file for the `prepDocuments` function. Alter the preprocessing code above to keep only words that appear in at least 10 documents.
```{r}
# YOUR CODE HERE
```
### Estimate Model
We are now going to estimate a topic model with 15 topics by regressing topical prevalence on region and year covariates.
Running the full model takes a **long** time to finish. For that reason, we are going to add the argument `max.em.its`, which caps the number of EM iterations. By keeping it low (15), we get a rough estimate of the topics quickly. You can always go back later and estimate the model to convergence.
```{r results="hide"}
model <- stm(docs, vocab, K = 15,
             prevalence = ~ REGION + s(YEAR),
             data = meta, seed = 15, max.em.its = 15)
```
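If you later want a fully converged model, one option is sketched below (not run here; the object name `model_full` is just illustrative). Dropping `max.em.its` lets the EM algorithm run up to its default limit, and `init.type = "Spectral"` is an `stm` option for a deterministic spectral initialization, so the result does not depend on the random seed.

```{r eval = FALSE}
# Estimate to convergence (slow)
model_full <- stm(docs, vocab, K = 15,
                  prevalence = ~ REGION + s(YEAR),
                  data = meta,
                  init.type = "Spectral")
```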
Let's see what our model came up with! The following tools can be used to evaluate the model:

- `labelTopics` gives the top words for each topic.
- `findThoughts` gives the top documents for each topic (the documents with the highest proportion of that topic).
```{r}
# Top Words
labelTopics(model)

# Example Docs
findThoughts(model, texts = meta$TITLE, n = 2, topics = 1:15)
```
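For a closer look at a single topic, the output of `findThoughts` can be passed to `plotQuote` to display the most representative documents. A minimal sketch (topic 1 is an arbitrary choice here):

```{r eval = FALSE}
# Plot the two most representative titles for topic 1
thoughts1 <- findThoughts(model, texts = meta$TITLE, n = 2, topics = 1)$docs[[1]]
plotQuote(thoughts1, width = 60, main = "Topic 1")
```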
#### Challenge 2. {-}
Estimate other models using 5 and 40 topics, respectively. Look at the top words for each topic. How do the topics vary when you change the number of topics?
Now look at your neighbor's model. Did you get the same results? Why or why not?
```{r}
# YOUR CODE HERE
```
### Interpret Model
Let's all load the fully estimated model that I ran before class.
```{r}
# Load the already-estimated model
load("data/stm.RData")
```
#### Challenge 3. {-}
Using the functions `labelTopics` and `findThoughts`, hand-label the 15 topics. Store these labels in a character vector called `labels`.
```{r eval = F}
# Store your hand labels below
labels = c()
```
Now look at your neighbor's labels. Did you get the same results? Why or why not?
### Analyze Topics
We are now going to see how the topics compare in terms of their prevalence across regions. What do you notice about the distribution of topic 9?
```{r}
# Corpus summary
plot.STM(model, type = "summary", custom.labels = labels, main = "")

# Estimate covariate effects
prep <- estimateEffect(1:15 ~ REGION + s(YEAR), model, meta = meta,
                       uncertainty = "Global", documents = docs)

# Plot topic 9 across regions
regions <- c("Asia", "EECA", "MENA", "Africa", "West", "LA")
plot.estimateEffect(prep, "REGION", method = "pointestimate", topics = 9,
                    printlegend = TRUE, labeltype = "custom", custom.labels = regions,
                    main = "Women's Rights", ci.level = .95, nsims = 100)
```
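Because the model also includes a smooth function of `YEAR`, we can ask how a topic's prevalence changes over time. Here is a sketch using the same `prep` object; `method = "continuous"` is the relevant option in `plot.estimateEffect`.

```{r eval = FALSE}
# Plot topic 9 prevalence over time
plot.estimateEffect(prep, covariate = "YEAR", method = "continuous", topics = 9,
                    printlegend = FALSE, main = "Women's Rights over Time")
```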