---
title: "Dynamic Web Scraping with RSelenium"
author: "Sofia Lai, Marina Luna"
date: "4 November 2021"
output:
  html_document:
    toc: TRUE
    df_print: paged
    number_sections: FALSE
    highlight: tango
    theme: lumen
    toc_depth: 3
    toc_float: true
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE,
                      eval = TRUE,
                      error = FALSE,
                      message = FALSE,
                      warning = FALSE,
                      comment = NA,
                      results = "markup")
```
# Setting up RSelenium
1. Download a [Selenium Server](https://www.selenium.dev/downloads/).
2. Download the [current version](https://duckduckgo.com/?q=java+download&va=z&t=hk&ia=web) of the Java SE Runtime Environment.
3. Install and load the necessary packages:
```{r, include = T}
library(tidyverse)
library(RSelenium)
library(rvest)
library(binman)
```
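If any of these are not yet installed, a one-time install (package names as above) does the trick:
```{r, eval = FALSE}
install.packages(c("tidyverse", "RSelenium", "rvest", "binman"))
```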
Or, the cooler way (`pacman::p_load()` installs any missing packages before loading them):
```{r, include = T}
pacman::p_load(RSelenium, tidyverse, rvest, binman)
```
# Initiating a Selenium Server
1. Check which version of Google Chrome you are using (in our case, 95).
2. Check which version of ChromeDriver you should use, depending on the above...
```{r, echo = FALSE}
binman::list_versions("chromedriver")
```
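For instance, to narrow the cached driver versions down to the ones matching Chrome 95, a small sketch based on the output above:
```{r, eval = FALSE}
# keep only the cached ChromeDriver versions that match Chrome 95
drivers <- unlist(binman::list_versions("chromedriver"))
grep("^95\\.", drivers, value = TRUE)
```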
3. ...and use it as the value for `chromever`:
```{r, results = "hide"}
remDr <- rsDriver(browser = "chrome", chromever = "95.0.4638.17")
rD <- remDr[['client']]
```
Did it open a new Chrome window? Pretty cool, right?
If it didn't, you are probably not using the ChromeDriver version that matches your Chrome. Stop the server and try again, as in the sketch below.
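For reference, restarting with a different driver looks like this (the version string is just a placeholder; pick one that matches your Chrome from the `binman::list_versions()` output above):
```{r, eval = FALSE}
remDr$server$stop()                            # stop the running server
remDr <- rsDriver(browser = "chrome",
                  chromever = "95.0.4638.17",  # placeholder: use your matching version
                  port = 4445L)                # a different port avoids "port in use" errors
rD <- remDr[['client']]
```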
# Practical example
Now that we're all set, we can start navigating a web page of our choice. Let's go back to our previous example, the European Parliament [press room](https://www.europarl.europa.eu/news/en/press-room).
```{r, echo = FALSE}
rD$navigate("https://www.europarl.europa.eu/news/en/press-room")
Sys.sleep(5)
```
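If you want to double-check from R that the browser really is on the right page, the client object exposes a couple of simple queries:
```{r, eval = FALSE}
rD$getCurrentUrl()   # the URL the browser is currently on
rD$getTitle()        # the page title as reported by the browser
```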
Make sure to refuse or accept analytics cookies. You can do this manually, or through Selenium.
First, let's locate the "Refuse cookies" button.
```{r}
refuse_cookies <- rD$findElement(using = "xpath",
                                 "//*[@id='cookie-policy']/div/div[2]/button[1]/span")
```
Then, we use Selenium to interact with the web page.
```{r}
refuse_cookies$clickElement()
```
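One caveat: if you re-run these chunks after the banner has already been dismissed, `findElement()` will throw an error because the button no longer exists. A re-runnable variant (same selector as above) could look like this:
```{r, eval = FALSE}
# click the banner only if it can still be found
tryCatch({
  refuse_cookies <- rD$findElement(using = "xpath",
                                   "//*[@id='cookie-policy']/div/div[2]/button[1]/span")
  refuse_cookies$clickElement()
}, error = function(e) message("No cookie banner found - nothing to click."))
```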
Check the web page. Is the message gone?
Can you guess why this is a dynamic web page?
Hint: click on "load more"... See any difference?
Perfect! Now let's load more content.
First, let's locate the "load more" button.
```{r}
load_more <- rD$findElement(using = "css selector", "#continuesLoading_button")
```
Assuming we are interested in more than just the initial 15 headlines, we want to click on the "load more" button several times. Again, we could do this manually, running the code each time, or, more efficiently, with a loop:
```{r, results = 'hide'}
for (i in 1:5) {
  print(i)
  load_more$clickElement()
  Sys.sleep(10)
}
```
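If the page loads slowly or the button reference goes stale after the DOM changes, a slightly more defensive sketch re-locates the button on every pass and stops once it disappears:
```{r, eval = FALSE}
for (i in 1:5) {
  btn <- tryCatch(
    rD$findElement(using = "css selector", "#continuesLoading_button"),
    error = function(e) NULL
  )
  if (is.null(btn)) break   # button is gone, nothing more to load
  btn$clickElement()
  Sys.sleep(10)             # give the new headlines time to appear
}
```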
Check the Chrome window that RSelenium has opened. See how the content is loaded automatically?
Once we have loaded enough content, we can proceed to scrape the headlines.
One final step with Selenium...
```{r}
page_source <- rD$getPageSource()
```
...And now we can use our good old rvest tools to scrape the data!
```{r}
headlines <- read_html(page_source[[1]]) %>%
  rvest::html_nodes('body') %>%
  xml2::xml_find_all("//a[contains(@title, 'Read more')]") %>%
  html_text() %>%
  trimws() %>%                 # some cleaning...
  as_tibble() %>%
  rename(headline = value) %>%
  mutate(headline = gsub("\\ \n\n", "", headline))
headlines
```
Why RSelenium, then?
Here we have a nice list of all the headlines we loaded, and we could load much more content to get even more.
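If you want to keep the result for later, writing the tibble to disk is a one-liner (the file name is just an example):
```{r, eval = FALSE}
readr::write_csv(headlines, "ep_press_headlines.csv")
```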
# Scraping dynamic web pages with static tools?
What do you think would happen if we tried to scrape this page only using rvest?
```{r}
EP_press <- read_html("https://www.europarl.europa.eu/news/en/press-room")
headlines_rvest <- EP_press %>%
  rvest::html_nodes('body') %>%
  xml2::xml_find_all("//a[contains(@title, 'Read more')]") %>%
  html_text() %>%
  trimws() %>%
  as_tibble() %>%
  rename(headline = value)
headlines_rvest
```
See? We only get the initial 15! This happens *even if* we manually loaded more content: we would always get only the first 15 headlines. Can you explain why this happened? (Hint: rvest only downloads the HTML the server returns for the initial request; the extra headlines are added afterwards by JavaScript running in the browser, which rvest never executes.)
Close the server.
```{r}
remDr$server$stop()
```
# So long, and thanks for all the data!
Thank you for attending this session. Here are some further resources you might find useful.
- [RSelenium GitHub page](https://github.com/ropensci/RSelenium)
- [RSelenium documentation](https://cran.r-project.org/web/packages/RSelenium/RSelenium.pdf)
- [Tutorial by Joshua McCrain](http://joshuamccrain.com/tutorials/web_scraping_R_selenium.html)
- [Selenium ecosystem](https://www.selenium.dev/)