---
title: "Preprocessing"
author: "PLSC 31101"
date: "Fall 2020"
output: html_document
---
# Text Analysis
This unit focuses on computational text analysis (or "text-as-data"). We will explore:
1. **Preprocessing** a corpus for common text analysis.
2. **Sentiment Analysis and Dictionary Methods**, a simple, supervised method for classification.
3. **Distinctive Words**, or word-separating techniques to compare corpora.
4. **Structural Topic Models**, a popular unsupervised method for text exploration and analysis.
These materials are based on a longer, week-long intensive workshop on computational text analysis. If you are interested in text-as-data, I encourage you to work through those materials on your own: https://github.com/rochelleterman/FSUtext
## Preprocessing
First, let's load our required packages:
```{r message=F}
library(tm) # Framework for text mining
library(tidyverse) # Data preparation and pipes %>%
library(ggplot2) # For plotting word frequencies
library(wordcloud) # Wordclouds!
```
A __corpus__ is a collection of texts, usually stored electronically, on which we perform our analysis. A corpus might be a collection of news articles from Reuters or the published works of Shakespeare.
Within each corpus we will have separate articles, stories, volumes, etc., each treated as a separate entity or record. Each unit is called a __document__.
For this unit, we will be using a section of Machiavelli's Prince as our corpus. Since The Prince is a monograph, we have already "chunked" the text so that each short paragraph or "chunk" is considered a "document."
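If you are curious how such chunking might be done, here is a minimal sketch that splits a long text into paragraph-level documents. It assumes a hypothetical character string `raw_text` holding the full text (this is only an illustration, not how the course data were actually prepared):
```{r eval = F}
# Hypothetical example: split one long text into paragraph-level "documents"
# `raw_text` is assumed to be a single character string containing the full text
chunks <- str_split(raw_text, "\n\n")[[1]]  # Split on blank lines
chunks <- str_trim(chunks)                  # Trim leading/trailing whitespace
chunks <- chunks[chunks != ""]              # Drop empty chunks
chunks.df <- data.frame(doc_id = seq_along(chunks), text = chunks)
```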
### From Words to Numbers
#### Corpus Readers {-}
The `tm` package supports a variety of sources and formats. Run the code below to see what it includes.
```{r}
getSources()
getReaders()
```
Here we will be reading documents from a CSV file in which each row is a document that includes columns for text and metadata (information about each document). This is the easiest option if you have metadata.
```{r}
docs.df <- read.csv("data/mach.csv", header = TRUE) # Read in CSV file
docs.df <- docs.df %>%
  mutate(text = str_conv(text, "UTF-8"))
docs <- Corpus(VectorSource(docs.df$text))
docs
```
Once we have the corpus, we can inspect the documents using `inspect()`.
```{r}
# See the 16th document
inspect(docs[16])
```
#### Preprocessing Functions {-}
Many text analysis applications follow a similar 'recipe' for preprocessing, involving the following steps (the order may vary by application):
1. Tokenizing the text into unigrams (or bigrams, or trigrams).
2. Converting all characters to lowercase.
3. Removing punctuation.
4. Removing numbers.
5. Removing stop words, including custom stop words.
6. "Stemming" words, or lemmatization. There are several stemming algorithms; Porter is the most popular.
7. Creating a Document-Term Matrix (DTM).
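To make the recipe concrete, here is a rough sketch of how these steps could be applied one at a time with `tm_map()`. The transformations are standard `tm` functions; `docs.clean` is just an illustrative name:
```{r eval = F}
# Apply the preprocessing steps one at a time (illustrative sketch)
docs.clean <- docs %>%
  tm_map(content_transformer(tolower)) %>%       # 2. Lowercase
  tm_map(removePunctuation) %>%                  # 3. Remove punctuation
  tm_map(removeNumbers) %>%                      # 4. Remove numbers
  tm_map(removeWords, stopwords("english")) %>%  # 5. Remove stop words
  tm_map(stemDocument) %>%                       # 6. Stem (Porter)
  tm_map(stripWhitespace)                        # Clean up extra whitespace
```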
In practice, `tm` lets us convert a corpus to a DTM and complete these preprocessing steps in a single call to `DocumentTermMatrix()`.
```{r}
dtm <- DocumentTermMatrix(docs,
                          control = list(stopwords = TRUE,
                                         tolower = TRUE,
                                         removeNumbers = TRUE,
                                         removePunctuation = TRUE,
                                         stemming = TRUE))
```
#### Weighting {-}
One common pre-processing step that some applications may call for is applying tf-idf weights. The [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf), or term frequency-inverse document frequency, is a weight that reflects how important a term is to a document within a corpus. The tf-idf value increases proportionally with the number of times a word appears in the document, but is offset by the frequency of the word across the corpus, which adjusts for the fact that some words appear frequently in general. In other words, it places importance on terms that are frequent in a document but rare in the corpus.
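As a sketch, one common formulation is (the exact log base and normalization used by `weightTfIdf()` may differ slightly):

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \log\left(\frac{N}{\text{df}(t)}\right)$$

where $\text{tf}(t, d)$ is the frequency of term $t$ in document $d$ (normalized by document length when `normalize = TRUE`), $N$ is the number of documents in the corpus, and $\text{df}(t)$ is the number of documents containing $t$.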
```{r}
dtm.weighted <- DocumentTermMatrix(docs,
                                   control = list(weighting = function(x) weightTfIdf(x, normalize = TRUE),
                                                  stopwords = TRUE,
                                                  tolower = TRUE,
                                                  removeNumbers = TRUE,
                                                  removePunctuation = TRUE,
                                                  stemming = TRUE))
```
Compare the first 5 rows and 5 columns of the `dtm` and `dtm.weighted`. What do you notice?
```{r}
inspect(dtm[1:5,1:5])
inspect(dtm.weighted[1:5,1:5])
```
### Exploring the DTM
#### Dimensions {-}
Let's look at the structure of our DTM. Print the dimensions of the DTM. How many documents do we have? How many terms?
```{r}
# How many documents? How many terms?
dim(dtm)
```
#### Frequencies {-}
We can obtain the term frequencies as a vector by converting the document term matrix into a matrix and using `colSums` to sum the column counts.
```{r}
# How many terms?
freq <- colSums(as.matrix(dtm))
freq[1:5]
length(freq)
```
By ordering the frequencies, we can list the most frequent terms and the least frequent terms.
```{r}
# Order
sorted <- sort(freq, decreasing = T)
# Most frequent terms
head(sorted)
# Least frequent
tail(sorted)
```
#### Plotting Frequencies {-}
Let's make a plot that shows the frequency of frequencies for the terms. (For example, how many words are used only once? 5 times? 10 times?)
```{r}
# Frequency of frequencies
head(table(freq),15)
tail(table(freq),15)
# Plot
plot(table(freq))
```
What does this tell us about the nature of language?
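Since we loaded `ggplot2` for plotting word frequencies, here is a small sketch of another view: a bar chart of the most frequent terms (the cutoff of 20 terms is arbitrary):
```{r}
# Plot the 20 most frequent terms
freq.df <- data.frame(term = names(sorted), freq = sorted)[1:20, ]
ggplot(freq.df, aes(x = reorder(term, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Frequency", title = "Most Frequent Terms")
```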
#### Exploring Common Words {-}
The `tm` package has lots of useful functions to help you explore common words and associations:
```{r}
# Have a look at common words
findFreqTerms(dtm, lowfreq=50) # Words that appear at least 50 times
# Which words correlate with "war"?
findAssocs(dtm, "war", 0.3)
```
We can even make wordclouds showing the most common terms:
```{r}
# Wordclouds!
set.seed(123)
wordcloud(names(sorted), sorted, max.words=100, colors=brewer.pal(6,"Dark2"))
```
#### Removing Sparse Terms {-}
Sometimes we want to remove sparse terms and, thus, increase efficiency. Look up the help file for the function `removeSparseTerms()`. Using this function, create an object called `dtm.s` that contains only terms with < .9 sparsity (meaning they appear in more than 10% of documents).
```{r}
dtm.s <- removeSparseTerms(dtm,.9)
dtm
dtm.s
```
### Exporting the DTM
We can convert a DTM to a matrix or dataframe in order to write it to a CSV, add metadata, etc.
First, create an object that converts the DTM to a dataframe (we first have to convert it to a matrix and then to a dataframe):
```{r}
# Coerce into dataframe
dtm <- as.data.frame(as.matrix(dtm))
names(dtm)[1:10] # Column names are terms, not documents
# Write CSV
write.csv(dtm, "dtm.csv", row.names = F)
```
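To add metadata back in, one approach is to bind the non-text columns of `docs.df` onto the document-term dataframe. This is a sketch assuming `docs.df` contains metadata columns other than `text` (we have not checked which columns `mach.csv` actually includes):
```{r eval = F}
# Bind metadata (all columns except the raw text) back onto the DTM dataframe
# Assumes rows of docs.df are in the same order as the documents in the corpus
dtm.meta <- cbind(docs.df %>% select(-text), dtm)
write.csv(dtm.meta, "dtm-meta.csv", row.names = F)
```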
#### Challenge {-}
Using one of the datasets in the `data` directory, create a document term matrix and a wordcloud of the most common terms.
```{r eval = F}
# YOUR CODE HERE
```