# Data Log File {#logs}
In this example, we'll be using the text file `logfile.txt` located in the
`data/` folder of the book's GitHub repository:
[https://raw.githubusercontent.com/gastonstat/r4strings/master/data/logfile.txt](https://raw.githubusercontent.com/gastonstat/r4strings/master/data/logfile.txt)
This file is a __server log file__ that contains the recorded events taking place in a web server. The content of the file is in a special format known as _common log format_. According to [Wikipedia](https://en.wikipedia.org/wiki/Common_Log_Format):
> "The Common Log Format is a standardized text file format used by web servers when generating server log files."
Here's an example of a log record; in the file the text is on a single line, but I've split it into two lines for readability purposes:
```
pd9049dac.dip.t-dialin.net - - [01/May/2001:01:51:25 -0700]
"GET /accesswatch/accesswatch-1.33/ HTTP/1.0" 200 1004
```
- A `"-"` in a field indicates missing data.
- `pd9049dac.dip.t-dialin.net` is the hostname of the client (remote host) which made the request to the server; this field can also hold an IP address.
- `[01/May/2001:01:51:25 -0700]` is the date, time, and time zone that the request was received, by default in strftime format `%d/%b/%Y:%H:%M:%S %z`.
- `"GET /accesswatch/accesswatch-1.33/ HTTP/1.0"` is the request line from the client.
- In the request line, `GET` is the method, `/accesswatch/accesswatch-1.33/` is the resource requested, and `HTTP/1.0` is the HTTP protocol.
- `200` is the HTTP status code returned to the client.
    + `2xx` is a successful response
    + `3xx` a redirection
    + `4xx` a client error, and
    + `5xx` a server error
- `1004` is the size of the object returned to the client, measured in bytes.
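
To make the field layout more concrete, here is a minimal sketch that splits the example record into its parts with a single regular expression and base R's `regexec()` and `regmatches()`. The pattern `clf_pattern` and the field names are my own choices for illustration; they assume that every record has exactly the seven fields described above.

```{r}
# the example record from above, joined back into a single line
record <- paste(
  'pd9049dac.dip.t-dialin.net - - [01/May/2001:01:51:25 -0700]',
  '"GET /accesswatch/accesswatch-1.33/ HTTP/1.0" 200 1004'
)

# one capture group per field (illustrative pattern, not an official one):
# host, identity, user, [timestamp], "request line", status code, size
clf_pattern <- '^(\\S+) (\\S+) (\\S+) \\[([^]]+)\\] "([^"]*)" (\\d{3}) (\\S+)$'

# regexec() locates the capture groups; the first element is the full match,
# so we drop it and keep only the seven captured fields
fields <- regmatches(record, regexec(clf_pattern, record))[[1]][-1]
names(fields) <- c("host", "ident", "user", "time", "request", "status", "size")
fields
```

The two `"-"` values end up in `ident` and `user`, which is how missing data shows up in this record.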
If you want to download a copy of the text file to your working directory (from within R) run the following code:
```{r eval = FALSE}
# download file
github <- "https://raw.githubusercontent.com/gastonstat/r4strings"
textfile <- "/master/data/logfile.txt"
download.file(url = paste0(github, textfile), destfile = "logfile.txt")
```
```{r may-logs, echo = FALSE}
# one option is to read in the content with 'readLines()'
logs <- readLines('data/logfile.txt')
```
## Reading the text file
The first step involves reading the data into R. How can you do this? One option is the `readLines()` function, which reads a text file into a character vector with one element per line:
```{r may-logs-fake, eval = FALSE}
# one option is to read in the content with 'readLines()'
logs <- readLines('logfile.txt')
```
Let's take a peek at the content of the vector `logs`:
```{r head-logs}
# take a peek at the contents in logs
head(logs)
```
Because the file contains `r length(logs)` lines (or elements), let's get a subset by taking a random sample of size 50:
```{r}
# subset a sample of lines
set.seed(98765)
s <- sample(seq_along(logs), size = 50)
sublogs <- logs[s]
```
### JPG File Requests
To begin our regex experiments, let's try to find out "how many requests involved a JPG file?".
One way to answer the previous question is by counting the number of lines containing the pattern `"jpg"`. We can use `grep()` to match or detect this pattern:
```{r}
# matching "jpg" (positions of the matching elements)
grep("jpg", sublogs)
# showing value of matches
grep("jpg", sublogs, value = TRUE)
```
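
To turn these matches into an actual count, we can take the length of the index vector returned by `grep()`, or sum the logical vector returned by `grepl()`. Here's a minimal sketch on a small made-up vector; the three lines in `demo_logs` are hypothetical and are not taken from `logfile.txt`.

```{r}
# three hypothetical log lines, for illustration only
demo_logs <- c(
  'host1.example.com - - [01/May/2001:01:00:00 -0700] "GET /images/photo.jpg HTTP/1.0" 200 2048',
  'host2.example.com - - [01/May/2001:01:00:05 -0700] "GET /index.html HTTP/1.0" 200 512',
  'host3.example.com - - [01/May/2001:01:00:09 -0700] "GET /images/logo.jpg HTTP/1.0" 304 -'
)

# positions of the elements that contain "jpg"
jpg_positions <- grep("jpg", demo_logs)
jpg_positions

# number of matching lines, computed in two equivalent ways
length(jpg_positions)
sum(grepl("jpg", demo_logs))
```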
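We can try to be more specific by defining the pattern `".jpg"`, in which the `.` corresponds to the _literal_ dot character. Because an unescaped dot is a metacharacter that matches any character, we need to escape it as `"\\."`. The pattern below also ends with a blank space, since in the log records the requested resource is followed by a space and then the HTTP protocol:
```{r}
# be more precise: match ".jpg" followed by a blank space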
grep("\\.jpg ", sublogs, value = TRUE)
```
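
The effect of escaping the dot is easier to see on strings where the two patterns disagree. The following sketch uses two made-up request paths (`demo_paths` is hypothetical); the second one contains `jpg` preceded by a hyphen rather than a dot.

```{r}
# hypothetical request paths, for illustration only
demo_paths <- c("/images/photo.jpg", "/docs/about-jpg-files.html")

# unescaped dot: "." matches any character, so "-jpg" matches as well
grepl(".jpg", demo_paths)

# escaped dot: only a literal ".jpg" matches
grepl("\\.jpg", demo_paths)

# alternative: fixed = TRUE treats the whole pattern as a literal string
grepl(".jpg", demo_paths, fixed = TRUE)
```

Only the first call flags both paths; the other two flag just the first one.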
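A similar output to that of `grep()` can be obtained with `str_detect()` from the package `"stringr"`. Instead of positions, `str_detect()` returns a logical vector indicating which elements contain a match to the specified pattern:
```{r}
# load stringr (if not already loaded)
library(stringr)
# detecting ".jpg" (TRUE for the elements that match)
str_detect(string = sublogs, pattern = "\\.jpg")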
```
We can do the same for PNG extensions (or for GIF or ICO):
```{r}
# detecting ".png" (TRUE for the elements that match)
str_detect(string = sublogs, pattern = "\\.png")
```
### Extracting file extensions
Another common task when working with regular expressions has to do with pattern extraction. For this purpose, we can use `str_extract()`:
```{r}
# extracting ".jpg" (NA for the elements without a match)
str_extract(string = sublogs, pattern = "\\.jpg")
```
`str_extract()` lets us confirm that we are matching the desired pattern. Notice that when there is no match, `str_extract()` returns a missing value `NA`.
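
For comparison, base R can reproduce this behavior with `regexpr()` and `regmatches()`. The code below is only a sketch of that idea on two made-up strings (`demo_requests` is hypothetical); it is not how `str_extract()` is implemented.

```{r}
# two hypothetical request lines, for illustration only
demo_requests <- c("GET /images/photo.jpg HTTP/1.0", "GET /index.html HTTP/1.0")

# regexpr() returns the position of the first match, or -1 if there is none
match_info <- regexpr("\\.jpg", demo_requests)

# start with NA everywhere, then fill in the elements that did match
# (regmatches() returns only the matched substrings)
extracted <- rep(NA_character_, length(demo_requests))
extracted[match_info != -1] <- regmatches(demo_requests, match_info)
extracted
```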
### Image files
Now let's try to detect all types of image files (JPG, PNG, GIF, and ICO), using one pattern per extension:
```{r}
# looking for image file extensions
jpgs <- str_detect(logs, pattern = "\\.jpg ")
sum(jpgs)
pngs <- str_detect(logs, pattern = "\\.png ")
sum(pngs)
gifs <- str_detect(logs, pattern = "\\.gif")
sum(gifs)
icos <- str_detect(logs, pattern = "\\.ico ")
sum(icos)
```
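
The four nearly identical calls can also be written as a loop over the extensions. Here's one possible sketch with base R's `sapply()` and `grepl()`, applied to a handful of made-up request lines (`demo_hits` is hypothetical). It assumes every extension is followed by a blank space, as in the first, second, and fourth patterns above.

```{r}
# five hypothetical request lines, for illustration only
demo_hits <- c(
  '"GET /images/photo.jpg HTTP/1.0" 200 2048',
  '"GET /images/logo.png HTTP/1.0" 200 1024',
  '"GET /images/banner.jpg HTTP/1.0" 200 4096',
  '"GET /favicon.ico HTTP/1.0" 200 318',
  '"GET /index.html HTTP/1.0" 200 512'
)

extensions <- c("jpg", "png", "gif", "ico")

# build one pattern per extension: literal dot, extension, trailing space
ext_patterns <- paste0("\\.", extensions, " ")

# count the matching lines for each pattern
ext_counts <- sapply(ext_patterns, function(p) sum(grepl(p, demo_hits)))
names(ext_counts) <- extensions
ext_counts
```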
### How to match image files with one regex pattern?
We can use character sets to define a more generic pattern. For instance, to
match `"jpg"` or `"png"`, we could join three character sets: `"[jp][pn][g]"`.
The first set `[jp]` looks for `j` or `p`, the second set `[pn]` looks for
`p` or `n`, and the third set simply looks for `g`.
```{r}
# matching "jpg" or "png"
jpg_png_lines <- str_detect(sublogs, "[jp][pn][g]")
sum(jpg_png_lines)
```
Including the dot, we can use: `"\\.[jp][pn][g]"`
```{r}
# matching ".jpg" or ".png"
jpg_png_lines <- str_detect(sublogs, "\\.[jp][pn][g]")
sum(jpg_png_lines)
```
We could generalize the pattern to include the GIF and ICO extensions:
```{r}
# trying to match "jpg" or "png" or "gif" or "ico"
image_lines1 <- str_detect(sublogs, "[jpgi][pnic][gfo]")
sum(image_lines1)
```
To check whether we are actually matching `jpg`, `png`, `gif` and `ico`, let's
use `str_extract()`:
```{r}
# are we correctly extracting image file extensions?
str_extract(sublogs, "[jpgi][pnic][gfo]")
```
The previous pattern does not really work as expected. Three consecutive
character sets match _any_ combination of one character from each set, not only
the four extensions we had in mind. In particular, we are also matching
`"ing"` and `"inf"`, which do not correspond to image file extensions.
An alternative way to detect JPG and PNG is by writing each extension as its
own pattern, and separating them with the metacharacter `"|"` which means _OR_:
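
One way to see the problem is to enumerate every string that the three character sets can produce. The sketch below builds all the combinations with `expand.grid()`; the object names are mine and only serve the illustration.

```{r}
# one vector per character set in "[jpgi][pnic][gfo]"
combos <- expand.grid(
  first  = c("j", "p", "g", "i"),
  second = c("p", "n", "i", "c"),
  third  = c("g", "f", "o"),
  stringsAsFactors = FALSE
)

# paste the three characters together: 4 x 4 x 3 = 48 candidate strings
candidates <- paste0(combos$first, combos$second, combos$third)
length(candidates)

# every candidate is matched by the pattern ...
all(grepl("^[jpgi][pnic][gfo]$", candidates))

# ... including the four extensions, but also "ing" and "inf"
c("jpg", "png", "gif", "ico", "ing", "inf") %in% candidates
```

Only 4 of the 48 combinations are the extensions we care about, so the pattern is far too permissive.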
```{r}
# detecting .jpg OR .png
jpg_png <- str_detect(sublogs, "\\.jpg|\\.png")
sum(jpg_png)
```
Here's how to detect all the extensions in one single pattern:
```{r}
# matching "jpg" or "png" or "gif" or "ico"
image_lines <- str_detect(sublogs, "\\.jpg|\\.png|\\.gif|\\.ico")
sum(image_lines)
```
To make sure our regex operation is successful, let's see the output of
`str_extract()`:
```{r}
images_output <- str_extract(sublogs, "\\.jpg|\\.png|\\.gif|\\.ico")
images_output
```
There's some repetition with the dot character; as a first attempt, we can try
to simplify the previous pattern by writing the dot `"\\."` only once, at the
beginning:
```{r}
images_output <- str_extract(sublogs, "\\.jpg|png|gif|ico")
images_output
```
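Notice that the dot only appears next to `".jpg"` but not with the other
types of extensions. This happens because `"|"` splits the whole pattern into
the alternatives `\\.jpg`, `png`, `gif`, and `ico`, so the dot belongs only to
the first one. What we need to do is group the file extensions by surrounding
them with parentheses: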
```{r}
images_output <- str_extract(sublogs, "\\.(jpg|png|gif|ico)")
images_output
```
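
The difference between the two patterns matters for detection too, not just for what gets extracted. In the following sketch, the made-up vector `demo_resources` (hypothetical, for illustration only) includes a path that contains `png` without a preceding dot.

```{r}
# hypothetical resources, for illustration only
demo_resources <- c("/images/photo.jpg", "/images/logo.png", "/docs/png-howto.html")

# without parentheses the alternatives are "\\.jpg" and "png",
# so the bare "png" in the third path also counts as a match
grepl("\\.jpg|png", demo_resources)

# with parentheses the dot is required before either extension
grepl("\\.(jpg|png)", demo_resources)
```

The ungrouped pattern flags all three paths, while the grouped pattern flags only the first two.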
Now let's apply the pattern on the entire log file, to count the number of
requests for each type of image file:
```{r}
# frequencies
img_extensions <- str_extract(logs, "\\.(jpg|png|gif|ico)")
table(img_extensions)
```
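
Keep in mind that `table()` drops missing values by default, so the lines without an image extension are left out of the frequencies. If you also want to see how many lines did not match, the argument `useNA = "ifany"` adds a count for `NA`. Here's a small sketch with a made-up vector of extracted values (`demo_ext` is hypothetical).

```{r}
# hypothetical output of an extraction step; NA marks a line with no match
demo_ext <- c(".jpg", ".png", ".jpg", NA, ".gif", NA)

# default behavior: NA values are not counted
table(demo_ext)

# include a separate count for the lines that did not match
ext_table <- table(demo_ext, useNA = "ifany")
ext_table
```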