---
title: "Preprocessing"
author: "PLSC 31101"
date: "Fall 2020"
output: html_document
---
# Text Analysis
This unit focuses on computational text analysis (or "text-as-data"). We will explore:
1. **Preprocessing** a corpus for common text analysis.
2. **Sentiment Analysis and Dictionary Methods**, a simple, supervised method for classification.
3. **Distinctive Words**, or word-separating techniques to compare corpora.
4. **Structural Topic Models**, a popular unsupervised method for text exploration and analysis.
These materials are based on a longer, week-long intensive workshop on computational text analysis. If you are interested in text-as-data, I encourage you to work through those materials on your own: https://github.com/rochelleterman/FSUtext
## Preprocessing
First, let's load our required packages:
```{r message=F}
library(tm) # Framework for text mining
library(tidyverse) # Data preparation and pipes %>%
library(ggplot2) # For plotting word frequencies
library(wordcloud) # Wordclouds!
```
A __corpus__ is a collection of texts, usually stored electronically, on which we perform our analysis. A corpus might be a collection of news articles from Reuters or the published works of Shakespeare.
Within each corpus we will have separate articles, stories, volumes, etc., each treated as a separate entity or record. Each unit is called a __document__.
For this unit, we will be using a section of Machiavelli's Prince as our corpus. Since The Prince is a monograph, we have already "chunked" the text so that each short paragraph or "chunk" is considered a "document."
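If you are curious how such chunking might be done, here is a minimal sketch that splits a long text into paragraph-level documents. It assumes a hypothetical character string `raw_text` holding the full text (this is only an illustration, not how the course data were actually prepared):
```{r eval = F}
# Hypothetical example: split one long text into paragraph-level "documents"
# `raw_text` is assumed to be a single character string containing the full text
chunks <- str_split(raw_text, "\n\n")[[1]]  # Split on blank lines
chunks <- str_trim(chunks)                  # Trim leading/trailing whitespace
chunks <- chunks[chunks != ""]              # Drop empty chunks
chunks.df <- data.frame(doc_id = seq_along(chunks), text = chunks)
```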
### From Words to Numbers
#### Corpus Readers {-}
The `tm` package supports a variety of sources and formats. Run the code below to see what it includes.
```{r}
getSources()
getReaders()
```
Here we will be reading documents from a CSV file in which each row is a document that includes columns for text and metadata (information about each document). This is the easiest option if you have metadata.
```{r}
docs.df <- read.csv("data/mach.csv", header = TRUE) # Read in CSV file
docs.df <- docs.df %>%
  mutate(text = str_conv(text, "UTF-8"))
docs <- Corpus(VectorSource(docs.df$text))
docs
```
Once we have the corpus, we can inspect the documents using `inspect()`.
```{r}
# See the 16th document
inspect(docs[16])
```
#### Preprocessing Functions {-}
Many text analysis applications follow a similar 'recipe' for preprocessing, involving the following steps (the order may vary by application):
1. Tokenizing the text into unigrams (or bigrams, or trigrams).
2. Converting all characters to lowercase.
3. Removing punctuation.
4. Removing numbers.
5. Removing stop words, including custom stop words.
6. "Stemming" words, or lemmatization. There are several stemming algorithms; Porter is the most popular.
7. Creating a Document-Term Matrix (DTM).
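To make the recipe concrete, here is a rough sketch of how these steps could be applied one at a time with `tm_map()`. The transformations are standard `tm` functions; `docs.clean` is just an illustrative name:
```{r eval = F}
# Apply the preprocessing steps one at a time (illustrative sketch)
docs.clean <- docs %>%
  tm_map(content_transformer(tolower)) %>%       # 2. Lowercase
  tm_map(removePunctuation) %>%                  # 3. Remove punctuation
  tm_map(removeNumbers) %>%                      # 4. Remove numbers
  tm_map(removeWords, stopwords("english")) %>%  # 5. Remove stop words
  tm_map(stemDocument) %>%                       # 6. Stem (Porter)
  tm_map(stripWhitespace)                        # Clean up extra whitespace
```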
In practice, `tm` lets us convert a corpus to a DTM and complete these preprocessing steps in a single call to `DocumentTermMatrix()`.
```{r}
dtm <- DocumentTermMatrix(docs,
                          control = list(stopwords = TRUE,
                                         tolower = TRUE,
                                         removeNumbers = TRUE,
                                         removePunctuation = TRUE,
                                         stemming = TRUE))
```
#### Weighting {-}
One common pre-processing step that some applications may call for is applying tf-idf weights. The [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf), or term frequency-inverse document frequency, is a weight that reflects how important a term is to a document within a corpus. The tf-idf value increases proportionally with the number of times a word appears in the document, but is offset by the frequency of the word across the corpus, which adjusts for the fact that some words appear frequently in general. In other words, it places importance on terms that are frequent in a document but rare in the corpus.
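As a sketch, one common formulation is (the exact log base and normalization used by `weightTfIdf()` may differ slightly):

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \log\left(\frac{N}{\text{df}(t)}\right)$$

where $\text{tf}(t, d)$ is the frequency of term $t$ in document $d$ (normalized by document length when `normalize = TRUE`), $N$ is the number of documents in the corpus, and $\text{df}(t)$ is the number of documents containing $t$.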
```{r}
dtm.weighted <- DocumentTermMatrix(docs,
                                   control = list(weighting = function(x) weightTfIdf(x, normalize = TRUE),
                                                  stopwords = TRUE,
                                                  tolower = TRUE,
                                                  removeNumbers = TRUE,
                                                  removePunctuation = TRUE,
                                                  stemming = TRUE))
```
Compare the first 5 rows and 5 columns of the `dtm` and `dtm.weighted`. What do you notice?
```{r}
inspect(dtm[1:5,1:5])
inspect(dtm.weighted[1:5,1:5])
```
### Exploring the DTM
#### Dimensions {-}
Let's look at the structure of our DTM. Print the dimensions of the DTM. How many documents do we have? How many terms?
```{r}
# How many documents? How many terms?
dim(dtm)
```
#### Frequencies {-}
We can obtain the term frequencies as a vector by converting the document term matrix into a matrix and using `colSums` to sum the column counts.
```{r}
# How many terms?
freq <- colSums(as.matrix(dtm))
freq[1:5]
length(freq)
```
By ordering the frequencies, we can list the most frequent terms and the least frequent terms.
```{r}
# Order
sorted <- sort(freq, decreasing = T)
# Most frequent terms
head(sorted)
# Least frequent
tail(sorted)
```
#### Plotting Frequencies {-}
Let's make a plot that shows the frequency of frequencies for the terms. (For example, how many words are used only once? 5 times? 10 times?)
```{r}
# Frequency of frequencies
head(table(freq),15)
tail(table(freq),15)
# Plot
plot(table(freq))
```
What does this tell us about the nature of language?
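Since we loaded `ggplot2` for plotting word frequencies, here is a small sketch of another view: a bar chart of the most frequent terms (the cutoff of 20 terms is arbitrary):
```{r}
# Plot the 20 most frequent terms
freq.df <- data.frame(term = names(sorted), freq = sorted)[1:20, ]
ggplot(freq.df, aes(x = reorder(term, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Frequency", title = "Most Frequent Terms")
```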
#### Exploring Common Words {-}
The `tm` package has lots of useful functions to help you explore common words and associations:
```{r}
# Have a look at common words
findFreqTerms(dtm, lowfreq=50) # Words that appear at least 50 times
# Which words correlate with "war"?
findAssocs(dtm, "war", 0.3)
```
We can even make wordclouds showing the most common terms:
```{r}
# Wordclouds!
set.seed(123)
wordcloud(names(sorted), sorted, max.words=100, colors=brewer.pal(6,"Dark2"))
```
#### Removing Sparse Terms {-}
Sometimes we want to remove sparse terms and, thus, increase efficiency. Look up the help file for the function `removeSparseTerms()`. Using this function, create an object called `dtm.s` that contains only terms with < .9 sparsity (meaning they appear in more than 10% of documents).
```{r}
dtm.s <- removeSparseTerms(dtm,.9)
dtm
dtm.s
```
### Exporting the DTM
We can convert a DTM to a matrix or dataframe in order to write it to a CSV, add metadata, etc.
First, create an object that converts the DTM to a dataframe (we first have to convert it to a matrix and then to a dataframe):
```{r}
# Coerce into dataframe
dtm <- as.data.frame(as.matrix(dtm))
names(dtm)[1:10] # Column names are terms, not documents
# Write CSV
write.csv(dtm, "dtm.csv", row.names = F)
```
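To add metadata back in, one approach is to bind the non-text columns of `docs.df` onto the document-term dataframe. This is a sketch assuming `docs.df` contains metadata columns other than `text` (we have not checked which columns `mach.csv` actually includes):
```{r eval = F}
# Bind metadata (all columns except the raw text) back onto the DTM dataframe
# Assumes rows of docs.df are in the same order as the documents in the corpus
dtm.meta <- cbind(docs.df %>% select(-text), dtm)
write.csv(dtm.meta, "dtm-meta.csv", row.names = F)
```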
#### Challenge {-}
Using one of the datasets in the `data` directory, create a document term matrix and a wordcloud of the most common terms.
```{r eval = F}
# YOUR CODE HERE
```