---
title: "Dynamic Web Scraping with RSelenium"
author: "Sofia Lai, Marina Luna"
date: "4 November 2021"
output:
  html_document:
    toc: TRUE
    df_print: paged
    number_sections: FALSE
    highlight: tango
    theme: lumen
    toc_depth: 3
    toc_float: true
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE,
                      eval = TRUE,
                      error = FALSE,
                      message = FALSE,
                      warning = FALSE,
                      comment = NA,
                      results = "markup")
```
# Setting up RSelenium
1. Download a [Selenium Server](https://www.selenium.dev/downloads/).
2. Download the [current version](https://duckduckgo.com/?q=java+download&va=z&t=hk&ia=web) of the Java SE Runtime Environment.
3. Install and load the necessary packages:
```{r, include = T}
library(tidyverse)
library(RSelenium)
library(rvest)
library(binman)
```
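If any of these are not yet installed, a one-time install (package names as above) does the trick:
```{r, eval = FALSE}
install.packages(c("tidyverse", "RSelenium", "rvest", "binman"))
```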
Or, the cooler way (`pacman::p_load()` installs any missing packages before loading them):
```{r, include = T}
pacman::p_load(RSelenium, tidyverse, rvest, binman)
```
# Initiating a Selenium Server
1. Check which version of Google Chrome you are using (in our case, 95).
2. Check which version of ChromeDriver you should use, depending on the above...
```{r, echo = FALSE}
binman::list_versions("chromedriver")
```
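For instance, to narrow the cached driver versions down to the ones matching Chrome 95, a small sketch based on the output above:
```{r, eval = FALSE}
# keep only the cached ChromeDriver versions that match Chrome 95
drivers <- unlist(binman::list_versions("chromedriver"))
grep("^95\\.", drivers, value = TRUE)
```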
3. ...and use it as the value for `chromever`:
```{r, results = "hide"}
remDr <- rsDriver(browser = "chrome", chromever = "95.0.4638.17")
rD <- remDr[['client']]
```
Did it open a new Chrome window? Pretty cool, right?
If it didn't, you are probably not using the ChromeDriver version that matches your Chrome. Stop the server and try again, as in the sketch below.
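For reference, restarting with a different driver looks like this (the version string is just a placeholder; pick one that matches your Chrome from the `binman::list_versions()` output above):
```{r, eval = FALSE}
remDr$server$stop()                            # stop the running server
remDr <- rsDriver(browser = "chrome",
                  chromever = "95.0.4638.17",  # placeholder: use your matching version
                  port = 4445L)                # a different port avoids "port in use" errors
rD <- remDr[['client']]
```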
# Practical example
Now that we're all set, we can start navigating a web page of our choice. Let's go back to our previous example, the European Parliament [press room](https://www.europarl.europa.eu/news/en/press-room).
```{r, echo = FALSE}
rD$navigate("https://www.europarl.europa.eu/news/en/press-room")
Sys.sleep(5)
```
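If you want to double-check from R that the browser really is on the right page, the client object exposes a couple of simple queries:
```{r, eval = FALSE}
rD$getCurrentUrl()   # the URL the browser is currently on
rD$getTitle()        # the page title as reported by the browser
```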
Make sure to refuse or accept analytics cookies. You can do this manually, or through Selenium.
First, let's locate the "Refuse cookies" button.
```{r}
refuse_cookies <- rD$findElement(using = "xpath",
                                 "//*[@id='cookie-policy']/div/div[2]/button[1]/span")
```
Then, we use Selenium to interact with the web page.
```{r}
refuse_cookies$clickElement()
```
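One caveat: if you re-run these chunks after the banner has already been dismissed, `findElement()` will throw an error because the button no longer exists. A re-runnable variant (same selector as above) could look like this:
```{r, eval = FALSE}
# click the banner only if it can still be found
tryCatch({
  refuse_cookies <- rD$findElement(using = "xpath",
                                   "//*[@id='cookie-policy']/div/div[2]/button[1]/span")
  refuse_cookies$clickElement()
}, error = function(e) message("No cookie banner found - nothing to click."))
```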
Check the web page. Is the message gone?
Can you guess why this is a dynamic web page?
Hint: click on "load more"... See any difference?
Perfect! Now let's load more content.
First, let's locate the "load more" button.
```{r}
load_more <- rD$findElement(using = "css selector", "#continuesLoading_button")
```
Assuming we are interested in more than just the initial 15 headlines, we want to click on the "load more" button several times. Again, we could do this manually, running the code each time, or, more efficiently, with a loop:
```{r, results = 'hide'}
for (i in 1:5) {
  print(i)
  load_more$clickElement()
  Sys.sleep(10)
}
```
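If the page loads slowly or the button reference goes stale after the DOM changes, a slightly more defensive sketch re-locates the button on every pass and stops once it disappears:
```{r, eval = FALSE}
for (i in 1:5) {
  btn <- tryCatch(
    rD$findElement(using = "css selector", "#continuesLoading_button"),
    error = function(e) NULL
  )
  if (is.null(btn)) break   # button is gone, nothing more to load
  btn$clickElement()
  Sys.sleep(10)             # give the new headlines time to appear
}
```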
Check the Chrome window that RSelenium has opened. See how the content is loaded automatically?
Once we have loaded enough content, we can proceed to scrape the headlines.
One final step with Selenium...
```{r}
page_source <- rD$getPageSource()
```
...And now we can use our good old rvest tools to scrape the data!
```{r}
headlines <- read_html(page_source[[1]]) %>%
  rvest::html_nodes('body') %>%
  xml2::xml_find_all("//a[contains(@title, 'Read more')]") %>%
  html_text() %>%
  trimws() %>%                 # some cleaning...
  as_tibble() %>%
  rename(headline = value) %>%
  mutate(headline = gsub("\\ \n\n", "", headline))
headlines
```
Why RSelenium, then?
Here we have a nice list of all the headlines we loaded, and we could load much more content to get even more.
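If you want to keep the result for later, writing the tibble to disk is a one-liner (the file name is just an example):
```{r, eval = FALSE}
readr::write_csv(headlines, "ep_press_headlines.csv")
```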
# Scraping dynamic web pages with static tools?
What do you think would happen if we tried to scrape this page only using rvest?
```{r}
EP_press <- read_html("https://www.europarl.europa.eu/news/en/press-room")
headlines_rvest <- EP_press %>%
  rvest::html_nodes('body') %>%
  xml2::xml_find_all("//a[contains(@title, 'Read more')]") %>%
  html_text() %>%
  trimws() %>%
  as_tibble() %>%
  rename(headline = value)
headlines_rvest
```
See? We only get the initial 15! This happens *even if* we manually loaded more content: we would always get only the first 15 headlines. Can you explain why this happened? (Hint: rvest only downloads the HTML the server returns for the initial request; the extra headlines are added afterwards by JavaScript running in the browser, which rvest never executes.)
Close the server.
```{r}
remDr$server$stop()
```
# So long, and thanks for all the data!
Thank you for attending this session. Here are some further resources you might find useful.
- [RSelenium GitHub page](https://github.com/ropensci/RSelenium)
- [RSelenium documentation](https://cran.r-project.org/web/packages/RSelenium/RSelenium.pdf)
- [Tutorial by Joshua McCrain](http://joshuamccrain.com/tutorials/web_scraping_R_selenium.html)
- [Selenium ecosystem](https://www.selenium.dev/)