# Basic Manipulations with `"base"` Functions {#basics1}
## Introduction
In this chapter you will learn about the different functions to do what I call
_basic manipulations_. By "basic" I mean transforming and processing strings
in ways that do not require the use of [regular expressions](#intro). More
advanced manipulations involve defining patterns of text and matching such
patterns. This is the essential idea behind regular expressions, which are
covered in part 3 of this book.
## Basic String Manipulations
Besides creating and printing strings, there are a number of very handy
functions in R for doing some basic manipulation of strings. In this section
we will review the following functions:
| Function | Description |
|:---------------|:---------------------------------|
| `nchar()` | number of characters |
| `tolower()` | convert to lower case |
| `toupper()` | convert to upper case |
| `casefold()` | case folding |
| `chartr()` | character translation |
| `abbreviate()` | abbreviation |
| `substring()` | substrings of a character vector |
| `substr()` | substrings of a character vector |
### Count number of characters with `nchar()`
One of the main functions for manipulating character strings is `nchar()`
which counts the number of characters in a string. In other words, `nchar()`
provides the length of a string:
```{r nchar_ex1}
# how many characters?
nchar(c("How", "many", "characters?"))
# how many characters?
nchar("How many characters?")
```
Notice that the white spaces between words in the second example are also
counted as characters.
It is important not to confuse `nchar()` with `length()`. While the former
gives us the __number of characters__, the latter only gives the number of
elements in a vector.
```{r length_ex1}
# how many elements?
length(c("How", "many", "characters?"))
# how many elements?
length("How many characters?")
```
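Because `nchar()` is vectorized, it combines naturally with indexing. As a small sketch (the vector `words` is only an illustrative name), the following picks the longest element of a character vector:

```{r}
# illustrative vector (same strings as in the examples above)
words <- c("How", "many", "characters?")
# number of characters of each element
word_lengths <- nchar(words)
word_lengths
# element with the largest number of characters
longest_word <- words[which.max(word_lengths)]
longest_word
```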
### Convert to lower case with `tolower()`
R comes with three functions for text casefolding.
1. `tolower()`
2. `toupper()`
3. `casefold()`
The first function we'll discuss is `tolower()` which converts any upper case characters into lower case:
```{r tolower_ex}
# to lower case
tolower(c("aLL ChaRacterS in LoweR caSe", "ABCDE"))
```
### Convert to upper case with `toupper()`
The opposite function of `tolower()` is `toupper()`. As you may guess, this
function converts any lower case characters into upper case:
```{r toupper_ex}
# to upper case
toupper(c("All ChaRacterS in Upper Case", "abcde"))
```
### Upper or lower case conversion with `casefold()`
The third function for case-folding is `casefold()` which is a wrapper for both `tolower()` and `toupper()`. Its usage has the following form:
```
casefold(x, upper = FALSE)
```
By default, `casefold()` converts all characters to lower case, but you can
use the argument `upper = TRUE` to indicate the opposite (characters in upper
case):
```{r casefold_ex}
# lower case folding
casefold("aLL ChaRacterS in LoweR caSe")
# upper case folding
casefold("All ChaRacterS in Upper Case", upper = TRUE)
```
I've found the case-folding functions to be very helpful when I write functions
that take a character input which may be specified in lower or upper case, or
perhaps as a mix of both cases. For instance, consider the function
`temp_convert()` that takes a temperature value in degrees Fahrenheit, and
a character string indicating the name of the scale to convert to.
```{r}
temp_convert <- function(deg = 1, to = "celsius") {
  switch(to,
         "celsius" = (deg - 32) * (5/9),
         "kelvin" = (deg + 459.67) * (5/9),
         "reaumur" = (deg - 32) * (4/9),
         "rankine" = deg + 459.67)
}
```
Here is how you call `temp_convert()` to convert 10 degrees Fahrenheit into
degrees Celsius:
```{r}
temp_convert(deg = 10, to = "celsius")
```
`temp_convert()` works fine when the argument `to = 'celsius'`. But what
happens if you try `temp_convert(30, 'Celsius')` or
`temp_convert(30, 'CELSIUS')`?
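A quick way to answer that question is to try it. `switch()` returns `NULL` (invisibly) when none of its alternatives match and no default is supplied, so nothing is printed. The sketch below uses a copy of the function under the hypothetical name `temp_convert_strict()` so that it can be run on its own:

```{r}
# copy of the case-sensitive version, under a hypothetical name
temp_convert_strict <- function(deg = 1, to = "celsius") {
  switch(to,
         "celsius" = (deg - 32) * (5/9),
         "kelvin" = (deg + 459.67) * (5/9),
         "reaumur" = (deg - 32) * (4/9),
         "rankine" = deg + 459.67)
}
# no alternative matches "Celsius", so the result is NULL
res <- temp_convert_strict(30, "Celsius")
is.null(res)
```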
To make `temp_convert()` more flexible you can apply `tolower()` to the
argument `to`, and in this way guarantee that the string provided by the
user is always in lower case:
```{r}
temp_convert <- function(deg = 1, to = "celsius") {
  switch(tolower(to),
         "celsius" = (deg - 32) * (5/9),
         "kelvin" = (deg + 459.67) * (5/9),
         "reaumur" = (deg - 32) * (4/9),
         "rankine" = deg + 459.67)
}
```
Now all the following three calls are equivalent:
```{r, eval=FALSE}
temp_convert(30, 'celsius')
temp_convert(30, 'Celsius')
temp_convert(30, 'CELSIUS')
```
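Lower-casing the input does not help when the user asks for a scale that is not supported at all; in that case `switch()` still returns `NULL` silently. One possible refinement, shown here as a sketch under the hypothetical name `temp_convert_checked()`, is to add an unnamed last alternative to `switch()`, which acts as the default:

```{r}
# hypothetical variant with a default branch for unknown scales
temp_convert_checked <- function(deg = 1, to = "celsius") {
  switch(tolower(to),
         "celsius" = (deg - 32) * (5/9),
         "kelvin" = (deg + 459.67) * (5/9),
         "reaumur" = (deg - 32) * (4/9),
         "rankine" = deg + 459.67,
         # default: reached only when no name matches
         stop("unknown scale: ", to))
}
# mixed case still works
temp_convert_checked(212, "Celsius")
# an unsupported scale now produces an error instead of NULL
try(temp_convert_checked(212, "newton"))
```

Whether an error, a warning, or a fallback scale is the right reaction depends on how the function will be used; the error is just the most conservative choice.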
### Character translation with `chartr()`
There's also the function `chartr()` which stands for _character translation_. `chartr()` takes three arguments: an `old` string, a `new` string, and a
character vector `x`:
```
chartr(old, new, x)
```
The way `chartr()` works is by replacing the characters in `old` that appear
in `x` by those indicated in `new`. For example, suppose we want to translate
the letter `"a"` (lower case) with `"A"` (upper case) in the sentence
`"This is a boring string"`:
```{r chartr_ex1}
# replace 'a' by 'A'
chartr("a", "A", "This is a boring string")
```
It is important to note that `old` and `new` must have the same number of
characters, otherwise you will get a nasty error message like this one:
```{r chartr_ex2, error = TRUE}
# incorrect use
chartr("ai", "X", "This is a bad example")
```
Here's a more interesting example with `old = "aei"` and `new = "#!?"`.
This implies that any `'a'` in `'x'` will be replaced by `'#'`, any `'e'` in
`'x'` will be replaced by `'!'`, and any `'i'` in `'x'` will be replaced by
`'?'`:
```{r chartr_ex3}
# multiple replacements
crazy <- c("Here's to the crazy ones", "The misfits", "The rebels")
chartr("aei", "#!?", crazy)
```
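Since `chartr()` maps characters position by position, `old` and `new` can be as long as needed. As an illustration (the helper `rot13()` is a hypothetical name, not a base R function), the following sketch builds a ROT13 cipher by translating every letter to the one 13 places further along the alphabet:

```{r}
# hypothetical helper: ROT13 cipher built on chartr()
rot13 <- function(x) {
  # all lower and upper case letters, in alphabetical order
  old <- paste(c(letters, LETTERS), collapse = "")
  # the same letters, rotated by 13 positions
  new <- paste(c(letters[c(14:26, 1:13)], LETTERS[c(14:26, 1:13)]),
               collapse = "")
  chartr(old, new, x)
}
rot13("Hello")
# applying the cipher twice recovers the original string
rot13(rot13("Hello"))
```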
### Abbreviate strings with `abbreviate()`
Another useful function for basic manipulation of character strings is
`abbreviate()`. Its usage has the following structure:
```
abbreviate(names.arg, minlength = 4, dot = FALSE, strict = FALSE,
           method = c("left.kept", "both.sides"))
```
Although there are several arguments, the main parameter is the character
vector (`names.arg`) which will contain the names that we want to abbreviate:
```{r abbreviate_ex}
# some color names
some_colors <- colors()[1:4]
some_colors
# abbreviate (default usage)
colors1 <- abbreviate(some_colors)
colors1
# abbreviate with 'minlength'
colors2 <- abbreviate(some_colors, minlength = 5)
colors2
# abbreviate with 'minlength' and 'method'
colors3 <- abbreviate(some_colors, minlength = 3, method = "both.sides")
colors3
```
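Note that `abbreviate()` returns a named character vector: by default the names are the original strings, which makes it easy to trace each abbreviation back to its source. A minimal sketch (the vector `long_names` is just an illustrative input):

```{r}
# illustrative input
long_names <- c("international", "business", "machines")
abbr <- abbreviate(long_names, minlength = 5)
# the names of the result are the original strings
names(abbr)
# drop the names to get a plain character vector
unname(abbr)
```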
A common use for `abbreviate()` is when plotting names of objects or variables
in a graphic. I will use the built-in data set `mtcars` to show you a simple
example with a scatterplot between variables `mpg` and `disp`:
```{r}
plot(mtcars$mpg, mtcars$disp, type = "n")
text(mtcars$mpg, mtcars$disp, rownames(mtcars))
```
The names of the cars are all over the plot. In this situation you may want to
consider using `abbreviate()` to shrink the names of the cars and produce a
less "crowded" plot:
```{r}
plot(mtcars$mpg, mtcars$disp, type = "n")
text(mtcars$mpg, mtcars$disp, abbreviate(rownames(mtcars)))
```
### Replace substrings with `substr()`
One common operation when working with strings is the extraction and
replacement of some characters. There are various ways in which characters can
be replaced. If the replacement is based on the positions that characters
occupy in the string, you can use the functions `substr()` and `substring()`.
`substr()` extracts or replaces substrings in a character vector. Its usage has
the following form:
```
substr(x, start, stop)
```
`x` is a character vector, `start` indicates the position of the first
character to be extracted or replaced, and `stop` indicates the position of
the last one:
```{r substr_ex}
# extract 'bcd'
substr("abcdef", 2, 4)
# replace 2nd letter with hash symbol
x <- c("may", "the", "force", "be", "with", "you")
substr(x, 2, 2) <- "#"
x
# replace 2nd and 3rd letters with happy face
y <- c("may", "the", "force", "be", "with", "you")
substr(y, 2, 3) <- ":)"
y
# replacement with recycling
z <- c("may", "the", "force", "be", "with", "you")
substr(z, 2, 3) <- c("#", "```")
z
```
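The replacement form of `substr()` pairs well with the case-folding functions. As a small sketch (the vector `w` is an illustrative name), here is one way to capitalize the first letter of each element:

```{r}
# illustrative vector
w <- c("may", "the", "force")
# overwrite the first character with its upper case version
substr(w, 1, 1) <- toupper(substr(w, 1, 1))
w
```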
### Replace substrings with `substring()`
Closely related to `substr()` is the function `substring()` which extracts or
replaces substrings in a character vector. Its usage has the following form:
```
substring(text, first, last = 1000000L)
```
`text` is a character vector, `first` indicates the position of the first
character to be extracted or replaced, and `last` indicates the position of
the last one:
```{r substring_ex}
# same as 'substr'
substring("ABCDEF", 2, 4)
substr("ABCDEF", 2, 4)
# extract each letter
substring("ABCDEF", 1:6, 1:6)
# multiple replacement with recycling
text6 <- c("more", "emotions", "are", "better", "than", "less")
substring(text6, 1:3) <- c(" ", "zzz")
text6
```
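Because `last` defaults to a very large number, omitting it extracts everything from `first` to the end of the string. Combined with vectorized positions, this gives a compact way to list all suffixes or prefixes of a string (a small sketch with an illustrative input `s`):

```{r}
s <- "ABCDEF"
n <- nchar(s)
# all suffixes: from each starting position to the end
suffixes <- substring(s, 1:n)
suffixes
# all prefixes: from the first character to each ending position
prefixes <- substring(s, 1, 1:n)
prefixes
```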
## Set Operations
R has dedicated functions for performing set operations on two given vectors.
This implies that we can apply functions such as set union, intersection,
difference, equality and membership, on `"character"` vectors.
| Function | Description |
|:---------------|:---------------|
| `union()` | set union |
| `intersect()` | intersection |
| `setdiff()` | set difference |
| `setequal()` | equal sets |
| `identical()` | exact equality |
| `is.element()` | is element |
| `%in%`         | contains       |
| `sort()` | sorting |
| `paste(rep())` | repetition |
### Set union with `union()`
Let's start our review of set functions with `union()`. As its name
indicates, you can use `union()` when you want to obtain the elements of
the union between two character vectors:
```{r union_ex}
# two character vectors
set1 <- c("some", "random", "words", "some")
set2 <- c("some", "many", "none", "few")
# union of set1 and set2
union(set1, set2)
```
Notice that `union()` discards any duplicated values in the provided vectors.
In the previous example the word `"some"` appears twice inside `set1` but it
appears only once in the union. In fact all the set operation functions will
discard any duplicated values.
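For plain vectors, the result of `union()` matches what you would get by concatenating both vectors and removing duplicates with `unique()`. The following sketch checks this with two illustrative vectors `a` and `b`:

```{r}
a <- c("some", "random", "words", "some")
b <- c("some", "many", "none", "few")
u <- union(a, b)
u
# same elements, in the same order, as unique(c(a, b))
identical(u, unique(c(a, b)))
```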
### Set intersection with `intersect()`
Set intersection is performed with the function `intersect()`. You can use this
function when you wish to get those elements that are common to both vectors:
```{r intersect_ex}
# two character vectors
set3 <- c("some", "random", "few", "words")
set4 <- c("some", "many", "none", "few")
# intersect of set3 and set4
intersect(set3, set4)
```
### Set difference with `setdiff()`
Related to the intersection, you might be interested in getting the difference
of the elements between two character vectors. This can be done with `setdiff()`:
```{r setdiff_ex}
# two character vectors
set5 <- c("some", "random", "few", "words")
set6 <- c("some", "many", "none", "few")
# difference between set5 and set6
setdiff(set5, set6)
```
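Keep in mind that `setdiff()` is not symmetric: it returns the elements of the first vector that are not in the second, so the order of the arguments matters. The sketch below (with illustrative vectors `a` and `b`) shows both directions, and combines them to get the elements that belong to exactly one of the two vectors:

```{r}
a <- c("some", "random", "few", "words")
b <- c("some", "many", "none", "few")
# elements of a that are not in b
setdiff(a, b)
# elements of b that are not in a
setdiff(b, a)
# symmetric difference: elements in exactly one of the two vectors
sym_diff <- union(setdiff(a, b), setdiff(b, a))
sym_diff
```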
### Set equality with `setequal()`
The function `setequal()` allows you to test the equality of two character
vectors. If the vectors contain the same elements, `setequal()` returns `TRUE` (`FALSE` otherwise):
```{r setequal_ex}
# three character vectors
set7 <- c("some", "random", "strings")
set8 <- c("some", "many", "none", "few")
set9 <- c("strings", "random", "some")
# set7 == set8?
setequal(set7, set8)
# set7 == set9?
setequal(set7, set9)
```
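Like the other set functions, `setequal()` ignores both the order of the elements and any duplicates. A short sketch with made-up vectors:

```{r}
# duplicates and order are ignored
eq1 <- setequal(c("a", "b", "a"), c("b", "a"))
eq1
# a missing element makes the sets different
eq2 <- setequal(c("a", "b"), c("a", "b", "c"))
eq2
```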
### Exact equality with `identical()`
Sometimes `setequal()` is not what we want to use. It might be the case
that you want to test whether two vectors are _exactly equal_ (element by
element). For instance, testing if `set7` is exactly equal to `set9`. Although
both vectors contain the same set of elements, they are not exactly the same
vector. Such a test can be performed with the function `identical()`:
```{r setidentical_ex}
# set7 identical to set7?
identical(set7, set7)
# set7 identical to set9?
identical(set7, set9)
```
If you consult the help documentation of `identical()`, you will see that this
function is the "safe and reliable way to test two objects for being exactly
equal".
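It may help to contrast `identical()` with the `==` operator, which compares element by element and returns a logical vector rather than a single value. A sketch with two illustrative vectors `v1` and `v2`:

```{r}
v1 <- c("some", "random", "strings")
v2 <- c("strings", "random", "some")
# element-by-element comparison returns a logical vector
v1 == v2
# identical() returns a single TRUE or FALSE
identical(v1, v2)
# sorting both vectors first makes the comparison order-insensitive
identical(sort(v1), sort(v2))
```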
### Element contained with `is.element()`
If you wish to test if an element is contained in a given set of character
strings you can do so with `is.element()`:
```{r iselement_ex}
# three vectors
set10 <- c("some", "stuff", "to", "play", "with")
elem1 <- "play"
elem2 <- "crazy"
# elem1 in set10?
is.element(elem1, set10)
# elem2 in set10?
is.element(elem2, set10)
```
Alternatively, you can use the binary operator `%in%` to test if an element
is contained in a given set. The function `%in%` returns `TRUE` if the first
operand is contained in the second, and it returns `FALSE` otherwise:
```{r in_ex}
# elem1 in set10?
elem1 %in% set10
# elem2 in set10?
elem2 %in% set10
```
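The operator `%in%` is vectorized over its left operand, so it can test several elements at once and be used for subsetting. A small sketch (the names `pool` and `candidates` are illustrative):

```{r}
pool <- c("some", "stuff", "to", "play", "with")
candidates <- c("play", "crazy", "stuff")
# one logical value per element of the left operand
found <- candidates %in% pool
found
# keep only the candidates that appear in pool
candidates[found]
```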
### Sorting with `sort()`
The function `sort()` allows you to sort the elements of a vector, either in
increasing order (by default) or in decreasing order using the argument `decreasing`:
```{r sort_ex1}
set11 <- c("today", "produced", "example", "beautiful", "a", "nicely")
# sort (increasing order)
sort(set11)
# sort (decreasing order)
sort(set11, decreasing = TRUE)
```
If you have alpha-numeric strings, `sort()` will put the numbers first when
sorting in increasing order:
```{r sort_ex2}
set12 <- c("today", "produced", "example", "beautiful", "1", "nicely")
# sort (increasing order)
sort(set12)
# sort (decreasing order)
sort(set12, decreasing = TRUE)
```
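Remember that numbers stored as strings are sorted as text, character by character, not by their numeric value. If numeric order is what you want, one option is to order the strings by their converted values, as in this sketch (the vector `num_strings` is illustrative):

```{r}
num_strings <- c("10", "9", "2")
# sorted as text: "10" comes before "2"
sort(num_strings)
# sorted by numeric value
num_strings[order(as.numeric(num_strings))]
```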
### Repetition with `paste(rep())`
A very common operation with strings is replication, that is, given a string we want to replicate it several times. One way to do this in base R is to combine `paste()` and `rep()` like so:
```{r}
# repeat "x" 4 times
paste(rep("x", 4), collapse = '')
```
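Recent versions of R (3.3.0 and later) also include `strrep()` in the base package, which repeats strings directly and is vectorized over both of its arguments. The sketch below compares it with the `paste(rep())` idiom:

```{r}
# the paste(rep()) idiom
rep_paste <- paste(rep("x", 4), collapse = '')
# the same result with strrep()
rep_strrep <- strrep("x", 4)
identical(rep_paste, rep_strrep)
# strrep() is vectorized over the strings and the number of times
strrep(c("ab", "-"), c(2, 5))
```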