forked from gastonstat/r4strings
# Metacharacters {#metacharacters}
The next topic that you should learn about regular expressions has to do with
__metacharacters__. As you just learned, the most basic type of regular
expression is the literal character, that is, a character that matches
itself. However, not all characters match themselves. Any character that
is not a literal character is a metacharacter.
## About Metacharacters
Metacharacters are characters that have a special meaning, and they allow you to transform literal characters in very powerful ways.
Below is the list of metacharacters in _Extended Regular Expressions_ (EREs):
```
. \ | ( ) [ ] { } $ - ^ * + ?
```
- the dot `.`
- the backslash `\`
- the bar `|`
- left or opening parenthesis `(`
- right or closing parenthesis `)`
- left or opening bracket `[`
- right or closing bracket `]`
- left or opening brace `{`
- right or closing brace `}`
- the dollar sign `$`
- the dash, hyphen or minus sign `-`
- the caret or hat `^`
- the star or asterisk `*`
- the plus sign `+`
- the question mark `?`
We're going to be working with these characters throughout the rest of the book.
Simply put, everything else that you need to know about regular expressions
besides literal characters is how these metacharacters work. The good news is
that there are only a few metacharacters to learn. The bad news is
that some metacharacters can have more than one meaning, and learning those
meanings takes time and hours of practice. The meaning of a metacharacter
depends greatly on the context in which you use it: how you use it and where
in the pattern it appears. As if that weren't complicated enough,
metacharacters are also where the different regex engines vary the most.
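As a small illustration of one metacharacter having more than one meaning, the sketch below uses the caret `^` in two positions. It relies on base R's `grepl()` rather than the `str_view()` calls used elsewhere in this chapter, and the vector `x` is a made-up example.

```r
# Hypothetical example strings
x <- c("apple", "banana", "aaa")

# Outside brackets, a leading caret anchors the match to the start of the string
starts_with_a <- grepl("^a", x)

# Inside brackets, a leading caret negates the set: "any character that is not a"
has_non_a <- grepl("[^a]", x)

print(starts_with_a)  # TRUE FALSE TRUE
print(has_non_a)      # TRUE TRUE FALSE
```

The same symbol behaves as an anchor in the first pattern and as a negation in the second, which is why context matters so much when reading a regular expression.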
## The Wild Metacharacter
The first metacharacter you should learn about is the dot or period `"."`,
better known as the __wild__ metacharacter. This metacharacter is used to
match __ANY__ character except for a new line.
For example, consider the pattern `"p.n"`, that is, _p wildcard n_. This
pattern will match _pan_, _pen_, and _pin_, but it will not match _prun_
or _plan_. The dot only matches one single character.
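The claim above can be checked with a minimal sketch. It assumes base R's `grepl()`, which reports whether each string contains a match, and a made-up vector `words`; the entry `"pn"` is added to show that the dot must consume exactly one character.

```r
# Hypothetical test words
words <- c("pan", "pen", "pin", "pn", "prun", "plan")

# TRUE where the string contains p, then any single character, then n
matched <- grepl("p.n", words)

print(words[matched])  # "pan" "pen" "pin"
```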
Let's see another example using the vector `c("not", "note", "knot", "nut")`
and the pattern `"n.t"`:
```{r eval = FALSE}
library(stringr)  # provides str_view()
not <- c("not", "note", "knot", "nut")
str_view(not, "n.t")
```
The pattern `"n.t"` matches _not_ in the first three elements, and _nut_
in the last element.
If you specify a pattern `"no."`, then just the first three elements
in `not` will be matched.
```{r eval = FALSE}
str_view(not, "no.")
```
And if you define a pattern `"kn."`, then only the third element is matched.
```{r eval = FALSE}
str_view(not, "kn.")
```
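If you would rather see the matched text as plain character vectors, one option is the base R sketch below. It is an alternative to `str_view()`, not a replacement for it: `regexpr()` locates the first match in each string and `regmatches()` extracts it, silently dropping strings that have no match.

```r
not <- c("not", "note", "knot", "nut")

# Extract the first substring matched by each of the three patterns
m_nt <- regmatches(not, regexpr("n.t", not))
m_no <- regmatches(not, regexpr("no.", not))
m_kn <- regmatches(not, regexpr("kn.", not))

print(m_nt)  # "not" "not" "not" "nut"
print(m_no)  # "not" "not" "not"
print(m_kn)  # "kno"
```

Note that `"kn."` matches _kno_, not the whole word _knot_: the dot consumes exactly one character after _kn_.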
The wild metacharacter is probably the most used metacharacter, and it is
also the most abused one, being the source of many mistakes. Here is a basic
example with the regular expression formed by `"5.00"`. If you think that this
pattern will match only a five followed by two decimal places, you will be
surprised to find out that it matches not only _5.00_ but also _5100_ and _5-00_.
Why? Because `"."` is the metacharacter that matches any single character.
You will learn how to fix this mistake in the next section, but it illustrates
an important fact about regular expressions: the challenge is not just to match
what you want, but to match only what you want. You don't want to
specify a pattern that is overly permissive. You want to find the thing you're
looking for, but only that thing.
## Escaping Metacharacters
What if you just want to match the character dot? For example, say you
have the following vector:
```{r}
fives <- c("5.00", "5100", "5-00", "5 00")
```
If you try the pattern `"5.00"`, it will match all of the elements in `fives`.
```{r eval = FALSE}
str_view(fives, "5.00")
```
To actually match the dot character, what you need to do is __escape__ the
metacharacter. In most languages, the way to escape a metacharacter is by
adding a backslash character in front of the metacharacter: `"\."`.
When you put a backslash in front of a metacharacter you are "escaping" the
character. This means that the character no longer has a special meaning,
and it will match itself.
However, R is a bit different. Instead of using one backslash you have to type
two backslashes: `"5\\.00"`. This is because the backslash `"\"`, which is
another metacharacter, is also the escape character in R strings. R reads
`"\\"` as a single backslash, and that single backslash is what reaches the
regex engine to escape the dot. Therefore, to match just
the element _5.00_ in `fives` in R, you do it like so:
```{r eval = FALSE}
str_view(fives, "5\\.00")
```
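The two layers of escaping can be inspected directly. The sketch below uses base R only; `grepl()` stands in for `str_view()`, and the bracket and `fixed = TRUE` variants are alternatives that this chapter does not cover, shown here only for comparison.

```r
fives <- c("5.00", "5100", "5-00", "5 00")

pattern <- "5\\.00"
n_chars <- nchar(pattern)  # 5, not 6: the string holds a single backslash
writeLines(pattern)        # prints 5\.00, which is what the regex engine receives

# Escaped dot: only the literal "5.00" matches
escaped <- grepl(pattern, fives)

# Alternative 1: a dot inside brackets is treated as a literal dot
bracket <- grepl("5[.]00", fives)

# Alternative 2: fixed = TRUE turns off regex interpretation altogether
literal <- grepl("5.00", fives, fixed = TRUE)

print(escaped)  # TRUE FALSE FALSE FALSE
```

All three approaches select only the first element of `fives`. The double backslash is the most general one, because it works for any metacharacter inside a larger pattern.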