forked from gastonstat/r4strings
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathregex1.Rmd
115 lines (89 loc) · 5.8 KB
/
regex1.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
# (PART) Regex {-}
# Regular Expressions {#regex1}
## Introduction
So far you have learned some basic and intermediate functions for handling and
working with text in R. These are very useful functions and they allow you
to do many interesting things. However, if you truly want to unleash the power
of strings manipulation, you need to take things to the next level and learn
about _regular expressions_.
## What are Regular Expressions?
The name "Regular Expression" does not say much. However, regular
expressions are all about text. Think about how much text is all around
you in our modern digital world: email, text messages, news articles, blogs,
computer code, contacts in your address book---all these things are text.
Regular expressions are a tool that allows us to work with these text by
describing text patterns.
A __regular expression__ is a special text string for describing a certain
amount of text. This "certain amount of text" receives the formal name of
_pattern_. In other words, a regular expression is a set of symbols that
describes a text pattern. More formally we say that a regular expression is a
_pattern that describes a set of strings_. In addition to this first
meaning, the term regular expression can also be used in a slightly different
but related way: as the formal language of these symbols that needs
to be interpreted by a regular expression processor.
Because the term "regular expression" is rather long, most people
use the word __regex__ as a shortcut term. And you will even find the
plural _regexes_.
It is also worth noting what regular expressions are not. They're not a
programming language. They may look like some sort of programming language
because they are a formal language with a defined set of rules that gets a
computer to do what we want it to do. However, there are no variables in regex
and you can't do computations like adding 1 + 1.
### What are Regular Expressions used for?
We use regular expressions to work with text. Some of its common uses involve
testing if a phone number has the correct number of digits, if a date follows a
specifc format (e.g. mm/dd/yy), if an email address is in a valid format, or if
a password has numbers and special characters. You could also use regular
expressions to search a document for _gray_ spelt either as "gray" or
"grey". You could search a document and replace all occurrences of "Will", "Bill",
or "W." with William. Or you could count the number of times in a document
that the word "analysis" is immediately preceded by the words "data",
"computer" or "statistical" only in those cases. You could use it to
convert a comma-delimited file into a tab-delimited file or to find duplicate
words in a text.
In each of these cases, you are going to use a regular expression to write up a
description of what you are looking for using symbols. In the case of a phone
number, that pattern might be three digits followed by a dash, followed by three
digits and another dash, followed by four digits. Once you have defined a
pattern then the regex processor will use our description to return matching
results, or in the case of the test, to return true or false for whether or not
it matched.
### A word of caution about regex
If you have never used regular expressions before, their syntax may seem a bit
scary and cryptic. You will see strings formed by a bunch of letters, digits,
and other punctuation symbols combined in seemingly nonsensical ways. As with
any other topic that has to do with programming and data analysis, learning
the principles of regex and becoming fluent in defining regex patterns takes
time and requires a lot of practice. The more you use them, the better you
will become at defining more complex patterns and getting the most out of them.
Regular Expressions is a wide topic and there are books entirely dedicated to
this subject. The material offered in this book is not extensive and there are
many subtopics that I don't cover here. Despite the initial barriers that you
may encounter when entering the regex world, the pain and frustration of
learning this tool will payoff in your data science career.
### Regular Expressions in R
Tools for working with regular expressions can be found in virtually all
scripting languages (e.g. Perl, Python, Java, Ruby, etc). R has some
functions for working with regular expressions but it does not provide
the wide range of capabilities that other scripting languages do. Nevertheless,
they can take you very far with some workarounds (and a bit of patience).
One of the best tools you must have in your toolkit is the R package `"stringr"`
(by Hadley Wickham). It provides functions that have similar behavior to
those of the base distribution in R. But it also provides many more facilities
for working with regular expressions.
## Regex Basics
The main purpose of working with regular expressions is to describe patterns
that are used to match against text strings. Simply put, working with regular
expressions is nothing more than __pattern matching__. The result of a match
is either successful or not.
The simplest version of pattern matching is to search for one occurrence (or
all occurrences) of some specific characters in a string. For example, we might
want to search for the word `"programming"` in a large text document, or we
might want to search for all occurrences of the string `"apply"` in a series
of files containing R scripts.
Typically, regular expression patterns consist of a combination of alphanumeric
characters as well as special characters. A regex pattern can be as simple as a
single character, or it can be formed by several characters with a more complex
structure. In all cases we construct regular expressions much in the same form
in which we construct arithmetic expressions, by using various operators to
combine smaller expressions.