fix #40 ---cleaned up language, package stuff (just a bit)
soodoku committed Nov 22, 2017
1 parent 1dfc9db commit 3dc4ac5
Showing 2 changed files with 105 additions and 72 deletions.
87 changes: 55 additions & 32 deletions vignettes/emoji_vignette.Rmd
@@ -1,5 +1,5 @@
---
title: "Identifying Emojis in YouTube Comments"
title: "Emojis in YouTube Comments"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Identifying Emojis in YouTube Comments}
@@ -12,28 +12,38 @@ knitr::opts_chunk$set(echo = TRUE)
```

```{r Setup 0, echo=FALSE, message=FALSE, warning=FALSE, results='hide'}
# Load the libs.
library(tuber)
library(emo)
packs = c("stringr","ggplot2","devtools","DataCombine","ggmap","tm","wordcloud","plyr","tuber","tidytext","dplyr","tidyr","readr","ggrepel","emoGG","lubridate","corpus", "purrr", "broom", "anytime")
lapply(packs, library, character.only=T)
library(emoGG)
library(anytime)
library(devtools)
library(emo) #devtools::install_github("hadley/emo")
library(DataCombine)
library(tidytext)
library(dplyr)
```

Depending on the video(s) you are exploring, it might be useful to account for and analyze the use of emojis in comments. As emojis become more and more popular and more complex in the meanings they are able to signify, the more important it is to at least account for emojis and include them in textual analyses. Here's how you can do this with YouTube data!
Depending on the video(s) you are exploring, it might be useful to account for and analyze the use of emojis in comments. As emojis become more popular and more complex in the meanings they signify, it becomes more and more important to analyze them in text. Here's how you can do this with YouTube data!

## Getting YouTube Data
I wanted to choose something that had both 1) a lot of comments and 2) a strong likelihood of comments containing emojis, so let's look at the comments from the ill-advised (and ill-fated) 'Emoji Movie' trailer. This also has a lot of varying sentiment (one of the comments is "The movie is a such disgrace to the animation film industry."`r emo::ji("joy_cat")``r emo::ji("joy_cat")``r emo::ji("joy_cat")`).

If you don't have the YouTube API set up, please see instructions on how to do so [here](https://developers.google.com/youtube/v3/).
I wanted to choose something that had lots of comments containing emojis. So let's look at the comments from the ill-advised (and ill-fated) 'Emoji Movie' trailer. The comments also vary a lot in sentiment. Sample this: "The movie is a such disgrace to the animation film industry." `r emo::ji("joy_cat")` `r emo::ji("joy_cat")` `r emo::ji("joy_cat")`

If you don't have access to the YouTube API yet, please see the setup instructions [here](https://developers.google.com/youtube/v3/).

```{r Set Up Youtube + Get Comments, echo=FALSE, results='hide', message=FALSE, warning=FALSE, error=FALSE}
yt_oauth("389417729099-ps9gvfjg0p43j0roloqrpkhbvpu4kb4n.apps.googleusercontent.com","9UMvY0_zEDzSXrWlVrAT52Tm", token='')
```{r Set Up YouTube + Get Comments, echo=FALSE, results='hide', message=FALSE, warning=FALSE, error=FALSE}
yt_oauth("389417729099-ps9gvfjg0p43j0roloqrpkhbvpu4kb4n.apps.googleusercontent.com", "9UMvY0_zEDzSXrWlVrAT52Tm", token = '')
```

```{r Get YouTube Comments, echo=TRUE, results='hide', message=FALSE, warning=FALSE, error=FALSE}
# Connect to YouTube API
# Leave token blank
# yt_oauth("app_id", "app_password", token='')
# Get comments. 'max_results = 101' ensures I get all of the comments on the video.
emojimovie <- get_comment_threads(c(video_id="o_nfdzMhmrA"), max_results = 101)
emojimovie <- get_comment_threads(c(video_id = "o_nfdzMhmrA"), max_results = 100)
# Save data (if you want)
# save(emojimovie,file=paste("sampletubedata.Rda"))
@@ -42,37 +52,41 @@ emojimovie <- get_comment_threads(c(video_id="o_nfdzMhmrA"), max_results = 101)
# load('sampletubedata.Rda')
```

Now we have some (~10,300) comments to play with -- let's identify the emojis in our data. To do so, we'll use the FindReplace function from the [DataCombine package](https://cran.r-project.org/web/packages/DataCombine/DataCombine.pdf) and an [emoji dictionary](https://lyons7.github.io/portfolio/2017-10-04-emoji-dictionary/) I put together that has each emoji's prose name, UTF-8 encoding and R encoding (the R encoding is specifically for emojis in Twitter data). There are a couple of steps to change the dictionary to be able to identify emojis in our YouTube data, but depending on your computer you might be able to just search by UTF-8 encoding.
Now we have some (~10,300) comments to play with. Let's identify the emojis in our data. To do that, we'll use the `FindReplace` function from the [DataCombine package](https://cran.r-project.org/web/packages/DataCombine/DataCombine.pdf) and an [emoji dictionary](https://lyons7.github.io/portfolio/2017-10-04-emoji-dictionary/) that I put together. The dictionary has each emoji's prose name, UTF-8 encoding, and R encoding---the R encoding is specifically for emojis in Twitter data. There are a couple of steps to change the dictionary to be able to identify emojis in YouTube data, but depending on your computer you might be able to just search by UTF-8 encoding.

Help figuring out the emoji encoding issue from [Patrick Perry](https://stackoverflow.com/questions/47243155/get-r-to-keep-utf-8-codepoint-representation/47243425#47243425) -- thanks Patrick! `r emo::ji("smiling_face_with_smiling_eyes")`
[Patrick Perry](https://stackoverflow.com/questions/47243155/get-r-to-keep-utf-8-codepoint-representation/47243425#47243425) helped figure out the emoji encoding issue --- thanks, Patrick! `r emo::ji("smiling_face_with_smiling_eyes")`

```{r Emojis in YouTube Comments, echo=TRUE, results='hide', message=FALSE, warning=FALSE, error=FALSE}
xemo <- getURL("https://raw.githubusercontent.com/lyons7/emojidictionary/master/Emoji%20Dictionary%205.0.csv")
emojis <- read.csv(text = xemo)# change U+1F469 U+200D U+1F467 to \U1F469\U200D\U1F467
# change U+1F469 U+200D U+1F467 to \U1F469\U200D\U1F467
emojis <- read.csv(url("https://raw.githubusercontent.com/lyons7/emojidictionary/master/Emoji%20Dictionary%205.0.csv"))
# Specific to YouTube data
emojis <- emojis[!emojis$Name == " SHRUGFACE ",]
emojis <- emojis[!emojis$Name == " SHRUGFACE ", ]
emojis$escapes <- gsub("[[:space:]]*\\U\\+", "\\\\U", emojis$Codepoint)
# convert to UTF-8 using the R parser
emojis$codes <- sapply(parse(text = paste0("'", emojis$escapes, "'"),
keep.source = FALSE), eval)
keep.source = FALSE), eval)
emojimovie$text <- as.character(emojimovie$textOriginal)
# Go through and identify emojis
emoemo <- FindReplace(data = emojimovie, Var = "text",
replaceData = emojis,
from = "codes", to = "Name",
exact = FALSE)
emoemo <- FindReplace(data = emojimovie,
Var = "text",
replaceData = emojis,
from = "codes",
to = "Name",
exact = FALSE)
# This might take some time, we have a big data set.
# Save if you want
# save(emoemo,file=paste("sampletubedataemojis.Rda"))
# save(emoemo, file = "sampletubedataemojis.Rda")
```
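
If your system handles the raw characters, you may be able to skip the `FindReplace()` pass and search by UTF-8 encoding directly, as mentioned earlier. The chunk below is a minimal, optional sketch of that idea (not part of the original workflow); it reuses the `emojis$codes` and `emojis$Name` columns and the raw comment text built above.

```{r, eval=FALSE}
# Sketch only (not evaluated): count how many comments contain each emoji by
# matching the native UTF-8 strings in emojis$codes against the raw text.
# Note this counts comments containing an emoji at least once, not total uses.
library(stringr)
direct_counts <- setNames(
  sapply(emojis$codes, function(e) sum(str_detect(emojimovie$text, fixed(e)))),
  emojis$Name
)
head(sort(direct_counts, decreasing = TRUE), 10)
```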

Now you have your comments with emojis identified. Let's look at the top emojis in our data set.
Now that you have identified comments with emojis, let's look at the top emojis in our data set.

```{r YouTube Emoji Comments, echo=TRUE, message=FALSE, warning=FALSE, error=FALSE}
# Have to keep the "to_lower" parameter FALSE so the emojis in our dictionary stay separate from words that happen to be the same as emoji names
emotidy_tube <- emoemo %>%
unnest_tokens(word, text, to_lower = FALSE)
@@ -91,17 +105,20 @@ tube_emojis_total <- tube_tidy_emojis %>%
tube_freqe <- tube_emojis_total %>%
count(word, sort = TRUE)
tube_freqe[1:10,]
tube_freqe[1:10, ]
```
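
Before reading the counts off the table, a quick bar chart can help. The sketch below is an optional addition (not in the original vignette) that uses the `tube_freqe` counts from the chunk above and the ggplot2 package loaded in setup.

```{r, eval=FALSE}
# Optional sketch: plot the ten most frequent emojis as a horizontal bar chart.
top_counts <- tube_freqe[1:10, ]
ggplot(top_counts, aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Emoji (dictionary name)", y = "Number of occurrences")
```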

So, our ten most frequent emojis in the comments of the Emoji Movie trailer are `r emo::ji("face_with_tears_of_joy")`, `r emo::ji("boy")`, `r emo::ji("mobile_phone")`, `r emo::ji("kissing_heart")`, `r emo::ji("man")`, `r emo::ji("skull_and_crossbones")`, `r emo::ji("atom_symbol")`, `r emo::ji("dancing_women")`, `r emo::ji("grimacing")` and `r emo::ji("kissing_smiling_eyes")`. Read into that what you will! `r emo::ji("face_with_tears_of_joy")`

What if we want to look at how the use of these emojis has changed over time? We can also look at WHEN the posts were generated. We can make a graph of comment frequency over time. Graphs constructed with help from [here](http://www.cyclismo.org/tutorial/R/time.html), [here](https://gist.github.com/stephenturner/3132596),
What if we want to look at how the use of these emojis has changed over time? We can look at WHEN comments were posted and graph the frequency of comments over time.

Graphs constructed with help from [here](http://www.cyclismo.org/tutorial/R/time.html), [here](https://gist.github.com/stephenturner/3132596),
[here](http://stackoverflow.com/questions/27626915/r-graph-frequency-of-observations-over-time-with-small-value-range), [here](http://michaelbommarito.com/2011/03/12/a-quick-look-at-march11-saudi-tweets/), [here](http://stackoverflow.com/questions/31796744/plot-count-frequency-of-tweets-for-word-by-month), [here](https://stat.ethz.ch/R-manual/R-devel/library/base/html/as.POSIXlt.html), [here](http://sape.inf.usi.ch/quick-reference/ggplot2/geom) and [here](http://stackoverflow.com/questions/3541713/how-to-plot-two-histograms-together-in-r).

We will also use the [anytime](https://cran.r-project.org/web/packages/anytime/index.html) package to format the time in a useable way.
We will also use the [anytime](https://cran.r-project.org/web/packages/anytime/index.html) package to format time.
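
As a small illustration of what `anytime()` does (this example is not from the vignette): it guesses the timestamp format and returns a `POSIXct` value, so no format string is needed.

```{r, eval=FALSE}
# anytime() parses a timestamp without an explicit format string.
anytime("2017-11-22 10:30:00")
```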

```{r, echo=TRUE, message=FALSE, warning=FALSE}
# Subset to just have posts that have our top ten emojis
top_ten <- subset(tube_emojis_total, word == "FACEWITHTEARSOFJOY" | word == "BOY"| word == "MOBILEPHONE" | word == "FACETHROWINGAKISS" | word == "MAN" | word == "SKULLANDCROSSBONES" | word == "ATOMSYMBOL" | word == "COLONEWOMANWITHBUNNYEARS"| word == "GRIMACINGFACE" | word == "KISSINGFACEWITHSMILINGEYES")
@@ -110,31 +127,37 @@ top_ten$created <- anytime(as.factor(top_ten$publishedAt))
Emoji <- top_ten$word
minutes <- 60
ggplot(top_ten, aes(created, color = Emoji)) +
geom_freqpoly(binwidth=10080*minutes)
geom_freqpoly(binwidth = 10080*minutes)
# We can also look at these one by one and use the emoGG package to display the actual emoji on each plot
# The code you pass to emoGG is the UTF-8 codepoint without the "U+" prefix, in lowercase
tearsofjoy <- top_ten[top_ten$word == "FACEWITHTEARSOFJOY",]
tearsofjoy <- top_ten[top_ten$word == "FACEWITHTEARSOFJOY", ]
ggplot(tearsofjoy, aes(created)) +
geom_freqpoly(binwidth=10080*minutes) + add_emoji(emoji="1f602")
geom_freqpoly(binwidth = 10080*minutes) +
add_emoji(emoji = "1f602")
boy <- top_ten[top_ten$word == "BOY",]
ggplot(boy, aes(created)) +
geom_freqpoly(binwidth=10080*minutes) + add_emoji(emoji="1f466")
geom_freqpoly(binwidth = 10080*minutes) +
add_emoji(emoji = "1f466")
# Sometimes emoGG doesn't have your emoji -- here we have to use skull, not skull and crossbones
skull <- top_ten[top_ten$word == "SKULLANDCROSSBONES",]
ggplot(skull, aes(created)) +
geom_freqpoly(binwidth=10080*minutes) + add_emoji(emoji="1f480")
geom_freqpoly(binwidth = 10080*minutes) + add_emoji(emoji = "1f480")
grimace <- top_ten[top_ten$word == "GRIMACINGFACE",]
ggplot(grimace, aes(created)) +
geom_freqpoly(binwidth=10080*minutes) + add_emoji(emoji="1f62c")
geom_freqpoly(binwidth = 10080*minutes) + add_emoji(emoji = "1f62c")
# ad infinitum!
```
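
Rather than repeating that block for every emoji, you could facet a single plot over all ten. The chunk below is a sketch (not part of the original vignette) that uses the `top_ten` data frame and the `minutes` variable defined earlier; it drops the emoGG glyphs in favor of the dictionary names on the facet strips.

```{r, eval=FALSE}
# Sketch: one panel per emoji instead of one plot per emoji.
ggplot(top_ten, aes(created)) +
  geom_freqpoly(binwidth = 10080 * minutes) +
  facet_wrap(~ word, scales = "free_y")
```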