-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathr4ds_week36.Rmd
297 lines (186 loc) · 8.34 KB
/
r4ds_week36.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
---
title: "R4DS Study Group - Week 37"
author: Pierrette Lo
date: 12/18/2020
output:
github_document:
toc: true
toc_depth: 2
html_preview: true
editor_options:
chunk_output_type: inline
---
```{r setup, include=FALSE, cache=FALSE}
# allow code to show errors
knitr::opts_chunk$set(error = TRUE)
```
## NOTE:
I've uploaded the R Markdown document this week instead of the knitted HTML, because all of the slashes and special characters need even more escaping/special treatment when knitting, and I didn't have time to do all the mental gymnastics.
## This week's assignment
* Ch. 14.3
```{r, warning=FALSE, message=FALSE}
library(tidyverse)
```
## Ch 14\.3 Regular expressions
### Notes
Regex -- and especially the concept of escaping -- can be super opaque for those of us who didn't come from a programming background. Your brain will eventually wrap itself around this after some practice - don't be discouraged!
Here's an xkcd comic about how ridiculous the escaping is:

Tips:
* The {stringr} [cheat sheet](https://github.com/rstudio/cheatsheets/raw/master/strings.pdf) is super handy
* There are some handy websites for testing regexes before incorporating them into your R code (google "regex tester"). I like https://regex101.com/. You type in your regex, and your sample text, and it will highlight the sample text according to your regex, so you can see if it's working correctly and fix it if it's not.
* Searching with regexes is as much about excluding things you *don't* want to match as it is about including things you want to find. It is helpful to start by scrolling through your data to identify exceptions, weird cases, etc.
Note that if you're using regexes on the command line or in a regex tester, where the regex is directly part of the code, you only need to use one backslash. In R, because you have to wrap the regex in quotes and use it as a string, you need two backslashes.
Also, if you (or your kids!) are interested in some fun, language-agnostic coding puzzles:
https://regexcrossword.com/
https://adventofcode.com/2020
### Exercises
#### 14.3.1.1
>1. Explain why each of these strings don’t match a \: "\", "\\", "\\\".
"\" = R thinks you're trying to escape the "
"\\" = escaped literal slash - matches \ (works directly in regex tester, but not as a string in R)
"\\\" = only one of the slashes is escaped - need to escape both of them (again, R thinks that last slash is escaping the ")
"\\\\" = this string will match a \
>2. How would you match the sequence "'\ ?
First, note that there are special characters that need to be escaped even when they're just part of a string (not a regex) - see ?"'" for a list.
" and \ have special functions in strings, and thus need to be escaped with a slash (' does not need to be escaped when it's inside ""; " does not need to be escaped when it's inside '')
So saving this sequence as a string would look like this:
```{r}
test_string <- "\"'\\"
writeLines(test_string)
```
Now to build a regex to match this string, we will need to double-escape each slash (\\).
```{r}
str_view(test_string, "\\\"'\\\\")
```
>3. What patterns will the regular expression \..\..\.. match? How would you represent it as a string?
This regex will match a 6-character string with 3 periods each followed by any character (try it in a regex tester).
To represent it as a string in R, you have to double-escape the periods that represent literal periods, but you don't have to escape the periods that represent "any character".
```{r}
test_string <- "123.a.b.c.456"
writeLines(test_string)
str_view(test_string, "\\..\\..\\..")
```
#### 14.3.2.1
>1. How would you match the literal string "$^$"?
Try this in a regex tester as well.
```{r}
test_string <- "$^$"
writeLines(test_string)
str_view(test_string, "\\$\\^\\$")
```
>2. Given the corpus of common words in `stringr::words`, create regular expressions that find all words that: (Since this list is long, you might want to use the `match` argument to `str_view()` to show only the matching or non-matching words.)
>Start with “y”. [show where to find things on the cheat sheet]
```{r}
head(words)
tail(words)
#output looks better if you use the Console
str_view(words, "^y", match = TRUE)
```
>End with “x”
```{r}
str_view(words, "x$", match = TRUE)
```
>Are exactly three letters long. (Don’t cheat by using `str_length()`!)
```{r}
# using only what we've learned so far
str_view(words, "^...$", match = TRUE)
# alternate method that specifies only letters
str_view(words, "^[:alpha:]{3}$", match = TRUE)
```
>Have seven letters or more.
```{r}
str_view(words, ".......", match = TRUE)
# or
str_view(words, "[:alpha:]{7}", match = TRUE)
```
#### 14.3.3.1
>1. Create regular expressions to find all words that:
>Start with a vowel.
Here's a way to subset a vector to get only the words that meet the criteria, and take a random sample of 20:
```{r}
str_subset(words, "^[aeiou]") %>%
sample(20)
```
>That only contain consonants. (Hint: thinking about matching “not”-vowels.)
```{r}
str_subset(words, "[aeiou]", negate = TRUE)
```
>End with ed, but not with eed.
```{r}
str_subset(words, "[^e]ed$")
```
>End with ing or ise.
```{r}
str_subset(words, "i(ng|se)$")
```
>2. Empirically verify the rule “i before e except after c”.
There are a couple of exceptions to the rule!
```{r}
str_subset(words, "cie")
```
>3. Is “q” always followed by a “u”?
Yes, at least in this particular list of words.
```{r}
str_subset(words, "q[^u]")
```
>4. Write a regular expression that matches a word if it’s probably written in British English, not American English.
Also Canadian English! e.g.
* ends with "our" ("colour" instead of "color")
* ends with "ise" ("synthesise" instead of "synthesize")
* ends with "tre" ("centre" instead of "center")
```{r}
str_subset(words, "our$|ise$|tre$")
```
You end up catching several words that are not just British (e.g. raise, hour), but you can at least narrow down the large dataset to something you can easily go through manually.
>5. Create a regular expression that will match telephone numbers as commonly written in your country.
Criteria: groups of 3, 3, and 4 digits; doesn't start with 0 or 1
```{r}
test <- c("503-555-1234", "(503) 987-6543", "5032468100", "5035-2334-87", "a23-456-7890", "123-456-7890", "(503)236-5555")
str_subset(test, ".*[2-9]\\d{2}.*\\d{3}.*\\d{4}")
```
#### 14.3.4.1
>1. Describe the equivalents of ?, +, * in {m,n} form.
? = {0,1}
+ = {1, }
* = {0, }
>2. Describe in words what these regular expressions match: (read carefully to see if I’m using a regular expression or a string that defines a regular expression.)
^.*$ = starts with any string with 0 or more of any character, then ends
"\\{.+\\}" = 1 or more of any character, surrounded by curly brackets
\d{4}-\d{2}-\d{2} = four numbers, hyphen, two numbers, hyphen, two numbers
"\\\\{4}" = four slashes
>3. Create regular expressions to find all words that:
>Start with three consonants.
```{r}
str_subset(words, "^[^aeiou]{3}")
```
>Have three or more vowels in a row.
```{r}
str_subset(words, "[aeiou]{3,}")
```
>Have two or more vowel-consonant pairs in a row.
```{r}
str_subset(words, "([aeiou][^aeiou]){2,}")
```
>4. Solve the beginner regexp crosswords at https://regexcrossword.com/challenges/beginner.
#### 14.3.5.1
>1. Describe, in words, what these expressions will match:
(.)\1\1 = any character, 3 times in a row (e.g. "aaa")
"(.)(.)\\2\\1" = character 1, character 2, character 2, character 1 (e.g. "abba")
(..)\1 = pair of characters, twice (e.g. "abab")
"(.).\\1.\\1" = character 1, character 2, char 1, char 3, char 1 (e.g. "abaca")
"(.)(.)(.).*\\3\\2\\1" = char 1, char 2, char 3, char 4 (0 or more), char 3, char 2, char 1 (e.g. "abccba" or "abcdddcba")
>2. Construct regular expressions to match words that:
>Start and end with the same character.
Note - I used "." here since the `words` dataset only contains letters, but you would want to be more precise ("[A-Za-z]" or "[:alpha:]") in other datasets that also include numbers or characters.
```{r}
str_subset(words, "^(.).*\\1$")
```
>Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
```{r}
str_subset(words, "(..).*\\1")
```
>Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)
```{r}
str_subset(words, "(.).*\\1.*\\1")
```