r4ds_week36.Rmd

---
title: "R4DS Study Group - Week 37"
author: Pierrette Lo
date: 12/18/2020
output: 
  github_document:
    toc: true
    toc_depth: 2
    html_preview: true
editor_options: 
  chunk_output_type: inline
---

```{r setup, include=FALSE, cache=FALSE}
# allow code to show errors
knitr::opts_chunk$set(error = TRUE)
```

## NOTE:

I've uploaded the R Markdown document this week instead of the knitted HTML, because all of the slashes and special characters need even more escaping/special treatment when knitting, and I didn't have time to do all the mental gymnastics.

## This week's assignment

* Ch. 14.3

```{r, warning=FALSE, message=FALSE}
library(tidyverse)
```

## Ch 14\.3 Regular expressions

### Notes

Regex -- and especially the concept of escaping -- can be super opaque for those of us who didn't come from a programming background. Your brain will eventually wrap itself around this after some practice - don't be discouraged!

Here's an xkcd comic about how ridiculous the escaping is:

![](https://imgs.xkcd.com/comics/backslashes.png)

Tips:

* The {stringr} [cheat sheet](https://github.com/rstudio/cheatsheets/raw/master/strings.pdf) is super handy
* There are some handy websites for testing regexes before incorporating them into your R code (google "regex tester"). I like https://regex101.com/. You type in your regex, and your sample text, and it will highlight the sample text according to your regex, so you can see if it's working correctly and fix it if it's not.
* Searching with regexes is as much about excluding things you *don't* want to match as it is about including things you want to find. It is helpful to start by scrolling through your data to identify exceptions, weird cases, etc.

Note that if you're using regexes on the command line or in a regex tester, where the regex is directly part of the code, you only need to use one backslash. In R, because you have to wrap the regex in quotes and use it as a string, you need two backslashes.

Also, if you (or your kids!) are interested in some fun, language-agnostic coding puzzles:

https://regexcrossword.com/

https://adventofcode.com/2020

### Exercises

#### 14.3.1.1

>1. Explain why each of these strings don’t match a \: "\", "\\", "\\\".

"\" = R thinks you're trying to escape the "
"\\" = escaped literal slash - matches \ (works directly in regex tester, but not as a string in R)
"\\\" = only one of the slashes is escaped - need to escape both of them (again, R thinks that last slash is escaping the ")
"\\\\" = this string will match a \

>2. How would you match the sequence "'\ ?

First, note that there are special characters that need to be escaped even when they're just part of a string (not a regex) - see ?"'" for a list.

" and \ have special functions in strings, and thus need to be escaped with a slash (' does not need to be escaped when it's inside ""; " does not need to be escaped when it's inside '')

So saving this sequence as a string would look like this:

```{r}
test_string <- "\"'\\"      

writeLines(test_string)
```

Now to build a regex to match this string, we will need to double-escape each slash (\\).   

```{r}
str_view(test_string, "\\\"'\\\\")
```

>3. What patterns will the regular expression \..\..\.. match? How would you represent it as a string?

This regex will match a 6-character string with 3 periods each followed by any character (try it in a regex tester).

To represent it as a string in R, you have to double-escape the periods that represent literal periods, but you don't have to escape the periods that represent "any character".

```{r}
test_string <- "123.a.b.c.456"

writeLines(test_string)

str_view(test_string, "\\..\\..\\..")
```


#### 14.3.2.1

>1. How would you match the literal string "$^$"?

Try this in a regex tester as well.

```{r}
test_string <- "$^$"

writeLines(test_string)

str_view(test_string, "\\$\\^\\$")
```

>2. Given the corpus of common words in `stringr::words`, create regular expressions that find all words that: (Since this list is long, you might want to use the `match` argument to `str_view()` to show only the matching or non-matching words.)

>Start with “y”. [show where to find things on the cheat sheet]

```{r}
head(words)
tail(words)

#output looks better if you use the Console
str_view(words, "^y", match = TRUE)
```

>End with “x”

```{r}
str_view(words, "x$", match = TRUE)
```

>Are exactly three letters long. (Don’t cheat by using `str_length()`!)

```{r}
# using only what we've learned so far
str_view(words, "^...$", match = TRUE)

# alternate method that specifies only letters
str_view(words, "^[:alpha:]{3}$", match = TRUE)
```

>Have seven letters or more.

```{r}
str_view(words, ".......", match = TRUE)

# or
str_view(words, "[:alpha:]{7}", match = TRUE)
```

#### 14.3.3.1

>1. Create regular expressions to find all words that:

>Start with a vowel.

Here's a way to subset a vector to get only the words that meet the criteria, and take a random sample of 20:

```{r}
str_subset(words, "^[aeiou]") %>% 
  sample(20)
```

>That only contain consonants. (Hint: thinking about matching “not”-vowels.)

```{r}
str_subset(words, "[aeiou]", negate = TRUE)
```

>End with ed, but not with eed.

```{r}
str_subset(words, "[^e]ed$")
```

>End with ing or ise.

```{r}
str_subset(words, "i(ng|se)$")
```

>2. Empirically verify the rule “i before e except after c”.

There are a couple of exceptions to the rule!

```{r}
str_subset(words, "cie")
```

>3. Is “q” always followed by a “u”?

Yes, at least in this particular list of words.

```{r}
str_subset(words, "q[^u]")
```

>4. Write a regular expression that matches a word if it’s probably written in British English, not American English.

Also Canadian English! e.g. 

* ends with "our" ("colour" instead of "color")
* ends with "ise" ("synthesise" instead of "synthesize")
* ends with "tre" ("centre" instead of "center")

```{r}
str_subset(words, "our$|ise$|tre$")
```
You end up catching several words that are not just British (e.g. raise, hour), but you can at least narrow down the large dataset to something you can easily go through manually.

>5. Create a regular expression that will match telephone numbers as commonly written in your country.

Criteria: groups of 3, 3, and 4 digits; doesn't start with 0 or 1

```{r}
test <- c("503-555-1234", "(503) 987-6543", "5032468100", "5035-2334-87", "a23-456-7890", "123-456-7890", "(503)236-5555")

str_subset(test, ".*[2-9]\\d{2}.*\\d{3}.*\\d{4}")
```

#### 14.3.4.1

>1. Describe the equivalents of ?, +, * in {m,n} form.

? = {0,1}
+ = {1, }
* = {0, }

>2. Describe in words what these regular expressions match: (read carefully to see if I’m using a regular expression or a string that defines a regular expression.)

^.*$ = starts with any string with 0 or more of any character, then ends

"\\{.+\\}" = 1 or more of any character, surrounded by curly brackets

\d{4}-\d{2}-\d{2} = four numbers, hyphen, two numbers, hyphen, two numbers

"\\\\{4}" = four slashes

>3. Create regular expressions to find all words that:

>Start with three consonants.

```{r}
str_subset(words, "^[^aeiou]{3}")
```

>Have three or more vowels in a row.

```{r}
str_subset(words, "[aeiou]{3,}")
```

>Have two or more vowel-consonant pairs in a row.

```{r}
str_subset(words, "([aeiou][^aeiou]){2,}")
```

>4. Solve the beginner regexp crosswords at https://regexcrossword.com/challenges/beginner.

#### 14.3.5.1

>1. Describe, in words, what these expressions will match:

(.)\1\1 = any character, 3 times in a row (e.g. "aaa")

"(.)(.)\\2\\1" = character 1, character 2, character 2, character 1 (e.g. "abba")

(..)\1 = pair of characters, twice (e.g. "abab")

"(.).\\1.\\1" = character 1, character 2, char 1, char 3, char 1 (e.g. "abaca") 

"(.)(.)(.).*\\3\\2\\1" = char 1, char 2, char 3, char 4 (0 or more), char 3, char 2, char 1 (e.g. "abccba" or "abcdddcba")

>2. Construct regular expressions to match words that:

>Start and end with the same character.

Note - I used "." here since the `words` dataset only contains letters, but you would want to be more precise ("[A-Za-z]" or "[:alpha:]") in other datasets that also include numbers or characters. 

```{r}
str_subset(words, "^(.).*\\1$")
```

>Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

```{r}
str_subset(words, "(..).*\\1")
```

>Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

```{r}
str_subset(words, "(.).*\\1.*\\1")
```