-
Notifications
You must be signed in to change notification settings - Fork 193
/
Copy pathstringr.Rmd
302 lines (221 loc) · 9.36 KB
/
stringr.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
---
title: "Introduction to stringr"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Introduction to stringr}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r}
#| include = FALSE
library(stringr)
knitr::opts_chunk$set(
comment = "#>",
collapse = TRUE
)
```
There are four main families of functions in stringr:
1. Character manipulation: these functions allow you to manipulate
individual characters within the strings in character vectors.
1. Whitespace tools to add, remove, and manipulate whitespace.
1. Locale sensitive operations whose operations will vary from locale
to locale.
1. Pattern matching functions. These recognise four engines of
pattern description. The most common is regular expressions, but there
are three other tools.
## Getting and setting individual characters
You can get the length of the string with `str_length()`:
```{r}
str_length("abc")
```
This is now equivalent to the base R function `nchar()`. Previously it was needed to work around issues with `nchar()` such as the fact that it returned 2 for `nchar(NA)`. This has been fixed as of R 3.3.0, so it is no longer so important.
You can access individual character using `str_sub()`. It takes three arguments: a character vector, a `start` position and an `end` position. Either position can either be a positive integer, which counts from the left, or a negative integer which counts from the right. The positions are inclusive, and if longer than the string, will be silently truncated.
```{r}
x <- c("abcdef", "ghifjk")
# The 3rd letter
str_sub(x, 3, 3)
# The 2nd to 2nd-to-last character
str_sub(x, 2, -2)
```
You can also use `str_sub()` to modify strings:
```{r}
str_sub(x, 3, 3) <- "X"
x
```
To duplicate individual strings, you can use `str_dup()`:
```{r}
str_dup(x, c(2, 3))
```
## Whitespace
Three functions add, remove, or modify whitespace:
1. `str_pad()` pads a string to a fixed length by adding extra whitespace on
the left, right, or both sides.
```{r}
x <- c("abc", "defghi")
str_pad(x, 10) # default pads on left
str_pad(x, 10, "both")
```
(You can pad with other characters by using the `pad` argument.)
`str_pad()` will never make a string shorter:
```{r}
str_pad(x, 4)
```
So if you want to ensure that all strings are the same length (often useful
for print methods), combine `str_pad()` and `str_trunc()`:
```{r}
x <- c("Short", "This is a long string")
x %>%
str_trunc(10) %>%
str_pad(10, "right")
```
1. The opposite of `str_pad()` is `str_trim()`, which removes leading and
trailing whitespace:
```{r}
x <- c(" a ", "b ", " c")
str_trim(x)
str_trim(x, "left")
```
1. You can use `str_wrap()` to modify existing whitespace in order to wrap
a paragraph of text, such that the length of each line is as similar as
possible.
```{r}
jabberwocky <- str_c(
"`Twas brillig, and the slithy toves ",
"did gyre and gimble in the wabe: ",
"All mimsy were the borogoves, ",
"and the mome raths outgrabe. "
)
cat(str_wrap(jabberwocky, width = 40))
```
## Locale sensitive
A handful of stringr functions are locale-sensitive: they will perform differently in different regions of the world. These functions are case transformation functions:
```{r}
x <- "I like horses."
str_to_upper(x)
str_to_title(x)
str_to_snake(x) # case transformation for programming
str_to_lower(x)
# Turkish has two sorts of i: with and without the dot
str_to_lower(x, "tr")
```
String ordering and sorting:
```{r}
x <- c("y", "i", "k")
str_order(x)
str_sort(x)
# In Lithuanian, y comes between i and k
str_sort(x, locale = "lt")
```
The locale always defaults to English to ensure that the default behaviour is identical across systems. Locales always include a two letter ISO-639-1 language code (like "en" for English or "zh" for Chinese), and optionally a ISO-3166 country code (like "en_UK" vs "en_US"). You can see a complete list of available locales by running `stringi::stri_locale_list()`.
## Pattern matching
The vast majority of stringr functions work with patterns. These are parameterised by the task they perform and the types of patterns they match.
### Tasks
Each pattern matching function has the same first two arguments, a character vector of `string`s to process and a single `pattern` to match. stringr provides pattern matching functions to **detect**, **locate**, **extract**, **match**, **replace**, and **split** strings. I'll illustrate how they work with some strings and a regular expression designed to match (US) phone numbers:
```{r}
strings <- c(
"apple",
"219 733 8965",
"329-293-8753",
"Work: 579-499-7527; Home: 543.355.3679"
)
phone <- "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"
```
- `str_detect()` detects the presence or absence of a pattern and returns a
logical vector (similar to `grepl()`). `str_subset()` returns the elements
of a character vector that match a regular expression (similar to `grep()`
with `value = TRUE`)`.
```{r}
# Which strings contain phone numbers?
str_detect(strings, phone)
str_subset(strings, phone)
```
- `str_count()` counts the number of matches:
```{r}
# How many phone numbers in each string?
str_count(strings, phone)
```
- `str_locate()` locates the **first** position of a pattern and returns a numeric
matrix with columns start and end. `str_locate_all()` locates all matches,
returning a list of numeric matrices. Similar to `regexpr()` and `gregexpr()`.
```{r}
# Where in the string is the phone number located?
(loc <- str_locate(strings, phone))
str_locate_all(strings, phone)
```
- `str_extract()` extracts text corresponding to the **first** match, returning a
character vector. `str_extract_all()` extracts all matches and returns a
list of character vectors.
```{r}
# What are the phone numbers?
str_extract(strings, phone)
str_extract_all(strings, phone)
str_extract_all(strings, phone, simplify = TRUE)
```
- `str_match()` extracts capture groups formed by `()` from the **first** match.
It returns a character matrix with one column for the complete match and
one column for each group. `str_match_all()` extracts capture groups from
all matches and returns a list of character matrices. Similar to
`regmatches()`.
```{r}
# Pull out the three components of the match
str_match(strings, phone)
str_match_all(strings, phone)
```
- `str_replace()` replaces the **first** matched pattern and returns a character
vector. `str_replace_all()` replaces all matches. Similar to `sub()` and
`gsub()`.
```{r}
str_replace(strings, phone, "XXX-XXX-XXXX")
str_replace_all(strings, phone, "XXX-XXX-XXXX")
```
- `str_split_fixed()` splits a string into a **fixed** number of pieces based
on a pattern and returns a character matrix. `str_split()` splits a string
into a **variable** number of pieces and returns a list of character vectors.
```{r}
str_split("a-b-c", "-")
str_split_fixed("a-b-c", "-", n = 2)
```
### Engines
There are four main engines that stringr can use to describe patterns:
* Regular expressions, the default, as shown above, and described in
`vignette("regular-expressions")`.
* Fixed bytewise matching, with `fixed()`.
* Locale-sensitive character matching, with `coll()`
* Text boundary analysis with `boundary()`.
#### Fixed matches
`fixed(x)` only matches the exact sequence of bytes specified by `x`. This is a very limited "pattern", but the restriction can make matching much faster. Beware using `fixed()` with non-English data. It is problematic because there are often multiple ways of representing the same character. For example, there are two ways to define "á": either as a single character or as an "a" plus an accent:
```{r}
a1 <- "\u00e1"
a2 <- "a\u0301"
c(a1, a2)
a1 == a2
```
They render identically, but because they're defined differently,
`fixed()` doesn't find a match. Instead, you can use `coll()`, explained
below, to respect human character comparison rules:
```{r}
str_detect(a1, fixed(a2))
str_detect(a1, coll(a2))
```
#### Collation search
`coll(x)` looks for a match to `x` using human-language **coll**ation rules, and is particularly important if you want to do case insensitive matching. Collation rules differ around the world, so you'll also need to supply a `locale` parameter.
```{r}
i <- c("I", "İ", "i", "ı")
i
str_subset(i, coll("i", ignore_case = TRUE))
str_subset(i, coll("i", ignore_case = TRUE, locale = "tr"))
```
The downside of `coll()` is speed. Because the rules for recognising which characters are the same are complicated, `coll()` is relatively slow compared to `regex()` and `fixed()`. Note that when both `fixed()` and `regex()` have `ignore_case` arguments, they perform a much simpler comparison than `coll()`.
#### Boundary
`boundary()` matches boundaries between characters, lines, sentences or words. It's most useful with `str_split()`, but can be used with all pattern matching functions:
```{r}
x <- "This is a sentence."
str_split(x, boundary("word"))
str_count(x, boundary("word"))
str_extract_all(x, boundary("word"))
```
By convention, `""` is treated as `boundary("character")`:
```{r}
str_split(x, "")
str_count(x, "")
```