forked from gastonstat/r4strings
# Character Sets
In this chapter we will talk about character sets. You will learn about
a couple more metacharacters, the opening and closing brackets `[ ]`, that
will help you define a character set.

These square brackets indicate a character set, which will match any one of
the characters inside the set. Keep in mind that a character
set will match only one character. The order of the characters inside the set
does not matter; what matters is just the presence of the characters inside the
brackets. So, for example, if you have a set defined by `"[AEIOU]"`, that will
match any one upper case vowel.
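
As a quick check of the two claims above (one character per match, order irrelevant), here is a minimal sketch. It assumes the stringr package is installed; the `words` vector is a made-up example and not part of the chapter's data.

```r
library(stringr)

# Hypothetical example words (not from the chapter's own vectors)
words <- c("Apple", "sky", "ONE", "tree")

# TRUE when a word contains at least one upper case vowel
has_upper_vowel <- str_detect(words, "[AEIOU]")
has_upper_vowel
# TRUE FALSE TRUE FALSE

# Reordering the characters inside the set gives the same result
same_result <- identical(has_upper_vowel, str_detect(words, "[UOIEA]"))
same_result
# TRUE
```

`str_detect()` returns a logical vector, which is often easier to inspect in a script than the highlighted output of `str_view()`.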
## Defining character sets
Consider the following pattern that includes a character set: `"p[aeiou]n"`,
and a vector with the strings _pan_, _pen_, _pin_, _pon_, and _pun_:
```{r results='hide'}
library(stringr)  # provides str_view()
pns <- c('pan', 'pen', 'pin', 'pon', 'pun')
str_view(pns, "p[aeiou]n")
```
The pattern `"p[aeiou]n"` matches all elements in `pns`. Now let's use the same
pattern with another vector `pnx`:
```{r results='hide'}
pnx <- c('pan', 'pen', 'pin', 'p0n', 'p.n', 'p1n', 'paun')
str_view(pnx, "p[aeiou]n")
```
As you can tell, this time only the first three elements in `pnx` are matched.
Notice also that _paun_ is not matched. This is because the character set
matches only one character, either _a_ or _u_ but not _au_.
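
The same comparison can be sketched with `str_detect()` and `str_subset()`, assuming stringr is installed. The second pattern, with two consecutive sets, is an illustrative addition and shows one way to match _paun_.

```r
library(stringr)

pnx <- c('pan', 'pen', 'pin', 'p0n', 'p.n', 'p1n', 'paun')

# One set matches exactly one vowel between 'p' and 'n'
one_vowel <- str_detect(pnx, "p[aeiou]n")
one_vowel
# TRUE TRUE TRUE FALSE FALSE FALSE FALSE

# Two consecutive sets match exactly two vowels, so only 'paun' qualifies
two_vowels <- str_subset(pnx, "p[aeiou][aeiou]n")
two_vowels
# "paun"
```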
If you are interested in matching all capital letters in English,
you can define a set formed as:
```
[ABCDEFGHIJKLMNOPQRSTUVWXYZ]
```
Likewise, you can define a set with only lower case letters in English:
```
[abcdefghijklmnopqrstuvwxyz]
```
If you are interested in matching any digit, you can also specify a character
set like this:
```
[0123456789]
```
## Character ranges
The previous examples that show character sets containing all the capital
letters or all lower case letters are very convenient but require a lot of
typing. __Character ranges__ are going to help you solve that problem, by
giving you a convenient shortcut based on the dash metacharacter `"-"` to
indicate a range of characters. A character range consists of a character set
with two characters separated by a dash or minus `"-"` sign.
Let's see how you can reexpress the examples in the previous section as
character ranges. The set of all digits can be expressed as a character range
using the following pattern:
```
[0-9]
```
Likewise, the set of all lower case letters _abcd...xyz_ is compactly
represented with the character range:
```
[a-z]
```
And the character set of all upper case letters _ABCD...XYZ_ is formed by:
```
[A-Z]
```
Note that the dash is only a metacharacter when it is inside a character set;
outside the character set it is just a literal dash.
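
A small sketch of this difference, assuming stringr is installed. The vectors `x` and `digits` are hypothetical examples.

```r
library(stringr)

# Hypothetical example strings
x <- c("a-b", "ab", "xyz")

# Outside a set the dash is literal: only the exact text "a-b" matches
literal_dash <- str_detect(x, "a-b")
literal_dash
# TRUE FALSE FALSE

# Inside a set the dash defines a range: any one 'a' or 'b' matches
range_dash <- str_detect(x, "[a-b]")
range_dash
# TRUE TRUE FALSE

# A range and the equivalent explicit set behave the same way
digits <- c("7", "x", "42", "-")
same_digits <- identical(str_detect(digits, "[0-9]"),
                         str_detect(digits, "[0123456789]"))
same_digits
# TRUE
```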
So how do you use character ranges? To illustrate the concept,
let's create a `basic` vector with some simple strings, and see
what the different ranges match:
```{r}
basic <- c('1', 'a', 'A', '&', '-', '^')
```
```{r results='hide'}
# digits
str_view(basic, '[0-9]')
```
```{r results='hide'}
# lower case letters
str_view(basic, '[a-z]')
```
```{r results='hide'}
# upper case letters
str_view(basic, '[A-Z]')
```
Now consider the following vector `triplets`:
```{r}
triplets <- c('123', 'abc', 'ABC', ':-)')
```
You can use a series of character ranges to match various occurrences of
a certain type of character. For example, to match three consecutive digits
you can define a pattern `"[0-9][0-9][0-9]"`; to match three consecutive
lower case letters you can use the pattern `"[a-z][a-z][a-z]"`; and the
same idea applies to a pattern that matches three consecutive upper case
letters `"[A-Z][A-Z][A-Z]"`.
```{r results='hide'}
str_view(triplets, '[0-9][0-9][0-9]')
str_view(triplets, '[a-z][a-z][a-z]')
str_view(triplets, '[A-Z][A-Z][A-Z]')
```
Observe that the element `":-)"` is not matched by any of the character ranges
that we have seen so far.
Character ranges can be defined in multiple ways. For example, the range
`"[1-3]"` indicates any one digit 1, 2, or 3. Another range may be defined
as `"[5-9]"` comprising any one digit 5, 6, 7, 8 or 9. The same idea applies
to letters. You can define ranges shorter than `"[a-z]"`. One example
is `"[a-d]"` which consists of any one letter _a_, _b_, _c_, or _d_.
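
To see partial ranges in action, here is a minimal sketch that assumes stringr is installed. The string `code` is a made-up example; `str_extract_all()` returns every single-character match.

```r
library(stringr)

# Hypothetical string mixing letters and digits
code <- "a1b2c3d4e5f6"

# Each match is a single character that falls inside the range
low_digits <- str_extract_all(code, "[1-3]")[[1]]
low_digits
# "1" "2" "3"

high_digits <- str_extract_all(code, "[5-9]")[[1]]
high_digits
# "5" "6"

first_letters <- str_extract_all(code, "[a-d]")[[1]]
first_letters
# "a" "b" "c" "d"
```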
## Negative Character Sets
A common situation when working with regular expressions consists of matching
characters that are NOT part of a certain set. This type of matching can be
done using a negative character set, which matches any one character that is not
in the set. To define this type of set you are going to use the metacharacter
caret `"^"`. If you are using a QWERTY keyboard, the caret symbol should be
located on the key with the number 6.
The caret `"^"` is one of those metacharacters that have more than one meaning
depending on where it appears in a pattern. If you use a caret in the first
position inside a character set, e.g. `[^aeiou]`, it means _negation_. In
other words, `[^aeiou]` means "any one character that is not a lower case vowel."
Let's use the `basic` vector previously defined:
```{r}
basic <- c('1', 'a', 'A', '&', '-', '^')
```
To match those elements that are NOT upper case letters, you define a negative
character range `"[^A-Z]"`:
```{r results='hide'}
str_view(basic, '[^A-Z]')
```
It is important that the caret is the first character inside the character
set, otherwise the set is not a negative one:
```{r results='hide'}
str_view(basic, '[A-Z^]')
```
In the example above, the pattern `"[A-Z^]"` means "any one upper case letter
or the caret character," which is completely different from the negative set
`"[^A-Z]"` that matches any one character other than an upper case letter.
If you want to match any character except the caret, then you need to use a
character set with two carets: `"[^^]"`. The first caret works as a negative
operator, the second caret is the caret character itself:
```{r results='hide'}
str_view(basic, '[^^]')
```
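
The three patterns above can be compared side by side with `str_detect()`. This is a minimal sketch assuming stringr is installed; the last call uses two made-up strings to illustrate that a negative set still has to match one character.

```r
library(stringr)

basic <- c('1', 'a', 'A', '&', '-', '^')

# Caret first: any one character that is not an upper case letter
not_upper <- str_detect(basic, "[^A-Z]")
not_upper
# TRUE TRUE FALSE TRUE TRUE TRUE

# Caret not first: an upper case letter or a literal caret
upper_or_caret <- str_detect(basic, "[A-Z^]")
upper_or_caret
# FALSE FALSE TRUE FALSE FALSE TRUE

# Two carets: any one character except the caret
not_caret <- str_detect(basic, "[^^]")
not_caret
# TRUE TRUE TRUE TRUE TRUE FALSE

# A negative set must still consume a character:
# "ABC" has no non-upper-case character, "ABc" has one
needs_char <- str_detect(c("ABC", "ABc"), "[^A-Z]")
needs_char
# FALSE TRUE
```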
## Metacharacters inside character sets
Now that you know what character sets are, how to define character ranges,
and how to specify negative character sets, we need to talk about what
happens when including metacharacters inside character sets.
With a few exceptions that are discussed later in this section, a
metacharacter inside a character set loses its special meaning and is treated
as a literal character. This implies that you do not need to escape it using
backslashes.
To illustrate the use of metacharacters inside character sets, let's use
the `pnx` vector:
```{r results='hide'}
pnx <- c('pan', 'pen', 'pin', 'p0n', 'p.n', 'p1n', 'paun')
```
The character set formed by `"p[ae.io]n"` includes the dot character. Remember
that, in general, the period is the wildcard metacharacter and it matches
any type of character. However, the period in this example is inside a
character set, and because of that, it loses its wildcard behavior.
```{r, results='hide'}
str_view(pnx, "p[ae.io]n")
```
As you can tell, `"p[ae.io]n"` matches _pan_, _pen_, _pin_ and _p.n_,
but not _p0n_ or _p1n_ because the dot is the literal dot, not a wildcard
character anymore.
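
To make the contrast explicit, the sketch below compares the dot inside a set with the dot used as a wildcard. It assumes stringr is installed; the variable names are illustrative.

```r
library(stringr)

pnx <- c('pan', 'pen', 'pin', 'p0n', 'p.n', 'p1n', 'paun')

# Inside the set the dot is a literal period
dot_literal <- str_detect(pnx, "p[ae.io]n")
dot_literal
# TRUE TRUE TRUE FALSE TRUE FALSE FALSE

# Outside a set the dot is a wildcard for any one character
dot_wildcard <- str_detect(pnx, "p.n")
dot_wildcard
# TRUE TRUE TRUE TRUE TRUE TRUE FALSE
```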
Not all metacharacters become literal characters when they appear inside a
character set. The exceptions are the closing bracket `]`, the dash `-`,
the caret `^`, and the backslash `\`.
The closing bracket `]` is used to enclose the character set. Thus, if you
want to use a literal right bracket inside a character set you must escape it:
`[aei\\]ou]`. Remember that in R you use double backslash for escaping
purposes. This is also why the backslash `\`, or double backslash in R,
does not become a literal character.
Another interesting case has to do with the dash or hyphen `-` character. As
you know, the dash inside a character set is used to define a range of
characters: e.g. `[0-9]`, `[x-z]`, and `[K-P]`. As a general rule, if you
want to include a literal dash as part of a character set, you should escape it:
`"[a-z\\-]"`.
Let's modify the `basic` vector by adding opening and closing brackets:
```{r}
basic <- c('1', 'a', 'A', '&', '-', '^', '[', ']')
```
How do you match each of the characters that have a special meaning inside a
character set?
```{r, results='hide'}
# matching a literal caret
str_view(basic, "[a\\^]")
```
```{r, results='hide'}
# matching a literal dash
str_view(basic, "[a\\-]")
```
```{r, results='hide'}
# matching a literal opening bracket
str_view(basic, "[a\\[]")
```
```{r, results='hide'}
# matching a literal closing bracket
str_view(basic, "[a\\]]")
```
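
The four escaped sets can also be checked with `str_subset()`, which returns the matching elements rather than highlighting them. This is a sketch that assumes stringr is installed (its regex engine accepts backslash escapes inside a set); the result variable names are illustrative.

```r
library(stringr)

basic <- c('1', 'a', 'A', '&', '-', '^', '[', ']')

# Each set contains 'a' plus one escaped special character
caret_hits <- str_subset(basic, "[a\\^]")
caret_hits
# "a" "^"

dash_hits <- str_subset(basic, "[a\\-]")
dash_hits
# "a" "-"

open_hits <- str_subset(basic, "[a\\[]")
open_hits
# "a" "["

close_hits <- str_subset(basic, "[a\\]]")
close_hits
# "a" "]"
```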
## Character Classes
Closely related with character sets and character ranges, regular expressions
provide another useful construct called __character classes__ which, as their
name indicates, are used to match a certain class of characters. The most
common character classes in most regex engines are:

| Character | Matches                                     | Same as           |
|:----------|:--------------------------------------------|:------------------|
| `\\d`     | any digit                                   | `[0-9]`           |
| `\\D`     | any nondigit                                | `[^0-9]`          |
| `\\w`     | any character considered part of a word     | `[a-zA-Z0-9_]`    |
| `\\W`     | any character not considered part of a word | `[^a-zA-Z0-9_]`   |
| `\\s`     | any whitespace character                    | `[ \f\n\r\t\v]`   |
| `\\S`     | any nonwhitespace character                 | `[^ \f\n\r\t\v]`  |

You can think of character classes as another type of metacharacters, or as
shortcuts for special character sets.
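
Here is a minimal sketch of these shortcuts, assuming stringr is installed. The string `txt` is a made-up example restricted to plain ASCII characters, where the "Same as" column applies directly.

```r
library(stringr)

# Hypothetical ASCII-only example string (14 characters)
txt <- "snake_case 99!"

# Word characters: letters, digits, and the underscore
n_word <- str_count(txt, "\\w")
n_word
# 12

# Non-word characters: here the space and the exclamation mark
n_nonword <- str_count(txt, "\\W")
n_nonword
# 2

# Digits, one match per digit
digits <- str_extract_all(txt, "\\d")[[1]]
digits
# "9" "9"

# Replace every whitespace character with an underscore
no_space <- str_replace_all(txt, "\\s", "_")
no_space
# "snake_case_99!"
```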
The following table shows the characters that count as whitespace:

| Character | Description     |
|:----------|:----------------|
| ` `       | space           |
| `\f`      | form feed       |
| `\n`      | line feed       |
| `\r`      | carriage return |
| `\t`      | tab             |
| `\v`      | vertical tab    |

Sometimes you have to deal with nonprinting whitespace characters. In these
situations you probably will end up using the whitespace character class `\\s`.
A common example is when you have to match tab characters, or line breaks.
The operating system Windows uses `\r\n` as an end-of-line marker. In contrast,
Unix-like operating systems (including macOS) use `\n`.

Tab characters `\t` are commonly used as a field separator in data files, but
most text editors render them as blank space.
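
Both situations can be sketched as follows, assuming stringr is installed. The strings `win_text` and `record` are made-up examples, not data from this chapter.

```r
library(stringr)

# Hypothetical text with Windows-style line endings
win_text <- "first line\r\nsecond line\r\n"

# Normalize each \r\n pair to a single \n
unix_text <- str_replace_all(win_text, "\\r\\n", "\n")
unix_text
# "first line\nsecond line\n"

# Hypothetical tab-separated record
record <- "id\tname\tage"

# Split the record on tab characters
fields <- str_split(record, "\\t")[[1]]
fields
# "id" "name" "age"

# The whitespace class \\s also matches the tabs
n_ws <- str_count(record, "\\s")
n_ws
# 2
```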
## POSIX Character Classes
We finish this chapter with the introduction of another type of character
class known as __POSIX character classes__. These are yet another class
construct that is supported by the regex engine in R.

| Class        | Description                             | Same as          |
|:-------------|:----------------------------------------|:-----------------|
| `[:alnum:]`  | any letter or digit                     | `[a-zA-Z0-9]`    |
| `[:alpha:]`  | any letter                              | `[a-zA-Z]`       |
| `[:digit:]`  | any digit                               | `[0-9]`          |
| `[:lower:]`  | any lower case letter                   | `[a-z]`          |
| `[:upper:]`  | any upper case letter                   | `[A-Z]`          |
| `[:space:]`  | any whitespace including space          | `[ \f\n\r\t\v]`  |
| `[:punct:]`  | any punctuation symbol                  |                  |
| `[:print:]`  | any printable character                 |                  |
| `[:graph:]`  | any printable character excluding space |                  |
| `[:xdigit:]` | any hexadecimal digit                   | `[a-fA-F0-9]`    |
| `[:cntrl:]`  | ASCII control characters                |                  |

Notice that a POSIX character class is formed by an opening bracket `[`,
followed by a colon `:`, followed by some keyword, followed by another
colon `:`, and finally a closing bracket `]`.
In order to use them in R, you have to wrap a POSIX class inside a character
set. That is, you have to surround a POSIX class with an extra pair of
brackets, as in `[[:alpha:]]`.
Once again, refer to the `pnx` vector to illustrate the use of POSIX classes:
```{r results='hide'}
pnx <- c('pan', 'pen', 'pin', 'p0n', 'p.n', 'p1n', 'paun')
```
Let's start with the `[:alpha:]` class, and see what it matches in `pnx`:
```{r, results='hide'}
str_view(pnx, "[[:alpha:]]")
```
Now let's test it with `[:digit:]`:
```{r, results='hide'}
str_view(pnx, "[[:digit:]]")
```
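
The same checks can be expressed with `str_detect()` and `str_subset()`. This sketch assumes stringr is installed; the `[[:punct:]]` call and the combined pattern `"p[[:digit:]]n"` are illustrative additions.

```r
library(stringr)

pnx <- c('pan', 'pen', 'pin', 'p0n', 'p.n', 'p1n', 'paun')

# Every element contains at least one letter
has_alpha <- str_detect(pnx, "[[:alpha:]]")
has_alpha
# TRUE TRUE TRUE TRUE TRUE TRUE TRUE

# Only 'p0n' and 'p1n' contain a digit
has_digit <- str_detect(pnx, "[[:digit:]]")
has_digit
# FALSE FALSE FALSE TRUE FALSE TRUE FALSE

# The period in 'p.n' is the only punctuation symbol
has_punct <- str_subset(pnx, "[[:punct:]]")
has_punct
# "p.n"

# A wrapped POSIX class can sit inside a longer pattern
digit_middle <- str_subset(pnx, "p[[:digit:]]n")
digit_middle
# "p0n" "p1n"
```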