```{r setup, echo=FALSE, warning=FALSE, purl=FALSE, message=FALSE}
options(repos = "http://cran.rstudio.com/")
pkgs <- c("dplyr", "knitr", "RSQLite")
x <- lapply(pkgs, library, character.only = TRUE)
opts_chunk$set(tidy = TRUE)
```
# More fun with `dplyr`
These materials will review some introductory material (i.e. from our [Intro R Workshop](https://github.com/rhodyrstats/intro_r_workshop)) and then expand into some of the package's other functionality. This presentation is heavily influenced by the [Introduction to `dplyr`](https://cran.r-project.org/web/packages/dplyr/vignettes/introduction.html), [Two-table Verbs](https://cran.r-project.org/web/packages/dplyr/vignettes/two-table.html), and [Databases](https://cran.r-project.org/web/packages/dplyr/vignettes/databases.html) vignettes. In particular we will cover:
- [A word about pipes](#a-word-about-pipes)
- [Manipulating data](#manipulating-data)
- [Manipulating grouped data](#manipulating-grouped-data)
- [Database functionality: joins](#database-functionality-joins)
- [Database functionality: external databases](#database-functionality-external-databases)
## A word about pipes
There are many different ways to go about scripting an analysis in R (or any language for that matter). These were discussed in the [Intro to R Workshop](https://github.com/rhodyrstats/intro_r_workshop/blob/master/lessons/03_wrangling.md#using-dplyr) (scroll down a bit), but to review, they are: using intermediate steps/objects, nesting functions, or using pipes. If you are developing new functions or packages it is probably best to avoid pipes, as they add a dependency and can make debugging a bit of a challenge. If you are scripting a data analysis, pipes (i.e. `%>%` from `magrittr`) are, in my opinion, the way to go. For this presentation we will use pipes for all the examples.
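To see the difference, here is the same small calculation written all three ways. This is just a quick sketch using the built-in `mtcars` data (the object names are arbitrary):

```{r}
#1. Intermediate objects
heavy <- filter(mtcars, wt > 3.5)
heavy_avg <- summarize(heavy, avg_mpg = mean(mpg))
#2. Nested functions
heavy_avg <- summarize(filter(mtcars, wt > 3.5), avg_mpg = mean(mpg))
#3. Pipes: read top to bottom, left to right
heavy_avg <- mtcars %>%
  filter(wt > 3.5) %>%
  summarize(avg_mpg = mean(mpg))
heavy_avg
```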
## Manipulating data
The `dplyr` package is first and foremost a package to help facilitate data manipulation. What it does can certainly be done with base R or with other packages, but it can be argued that `dplyr` makes these tasks more understandable through its use of a consistent interface. In particular, this is accomplished through the use of data manipulation verbs. These verbs are:
- `select()`: Selects columns from a data frame.
- `arrange()`: Arranges a data frame in ascending or descending order based on column(s).
- `filter()`: Selects observations from a data frame based on values in column(s).
- `slice()`: Selects observations based on specific rows.
- `rename()`: Renames columns in a data frame.
- `distinct()`: Gets unique rows (OK, not a verb...).
- `sample_n()`: Randomly selects 'n' rows.
- `sample_frac()`: Randomly selects a fraction of rows.
- `mutate()`: Adds new columns to a data frame and keeps all other columns.
- `transmute()`: Adds new columns to a data frame and drops all other columns.
- `summarise()`: Summarizes your data.
Before we move on, we need some data. Once again, I am going to rely on the 2007 National Lakes Assessment Data:
```{r}
sites <- read.csv("http://www.epa.gov/sites/production/files/2014-01/nla2007_sampledlakeinformation_20091113.csv")
```
Let's look at the columns in this data frame.
```{r}
names(sites)
```
Given the large number of fields, I may want to reduce the data to just the columns I am interested in.
```{r}
sites_sel <- sites %>%
  select(SITE_ID, LAKENAME, VISIT_NO, SITE_TYPE, WSA_ECO9, AREA_HA, DEPTHMAX)
head(sites_sel)
```
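As an aside, `select()` also understands helper functions such as `starts_with()` and `contains()`, which can save typing with wide data like this. A quick sketch (which columns get matched depends on the actual field names in the file):

```{r}
sites %>%
  select(SITE_ID, starts_with("LAKE"), contains("DEPTH")) %>%
  head()
```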
We can arrange the data.
```{r}
#Ascending is the default
sites_sel %>%
  arrange(DEPTHMAX) %>%
  head()
#Descending needs desc()
sites_sel %>%
  arrange(desc(DEPTHMAX)) %>%
  head()
#By more than one column
sites_sel %>%
  arrange(WSA_ECO9, desc(DEPTHMAX)) %>%
  head()
```
Let's filter down to just some of the deeper lakes
```{r}
sites_sel %>%
  filter(DEPTHMAX >= 50)
```
Or just the deep lakes in the Northern Appalachians ecoregion
```{r}
sites_sel %>%
  filter(WSA_ECO9 == "NAP", DEPTHMAX >= 50)
```
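Note that the comma in `filter()` acts as a logical AND; for an OR you use `|` explicitly. For example, deep lakes in either the Northern or Southern Appalachians (a sketch, assuming the code "SAP" appears in `WSA_ECO9`):

```{r}
sites_sel %>%
  filter(WSA_ECO9 == "NAP" | WSA_ECO9 == "SAP", DEPTHMAX >= 50)
```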
We can also grab observations by row
```{r}
sites_sel %>%
  slice(c(1, 2))
#or
sites_sel %>%
  slice(seq(1, nrow(sites_sel), 100))
```
Renaming columns is easy
```{r}
sites_sel %>%
  rename(Ecoregion = WSA_ECO9, MaxDepth = DEPTHMAX) %>%
  head()
```
We can identify distinct values and get those rows
```{r}
sites_sel %>%
  distinct(WSA_ECO9)
#Returns the first row with each distinct value, so order has an impact
sites_sel %>%
  arrange(desc(DEPTHMAX)) %>%
  distinct(WSA_ECO9)
```
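A version note: in `dplyr` 0.5 and later, `distinct()` returns only the column(s) you name; to get the full first row for each distinct value, as described above, add `.keep_all = TRUE`:

```{r}
sites_sel %>%
  arrange(desc(DEPTHMAX)) %>%
  distinct(WSA_ECO9, .keep_all = TRUE)
```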
Sampling by number or fraction, with or without replacement, is done like this:
```{r}
set.seed(72)
#By Number
sites_sel %>%
  sample_n(10)
#By Fraction
sites_sel %>%
  sample_frac(0.01)
```
To create new columns
```{r}
#Add it to the other columns
sites_sel %>%
  mutate(volume = ((10000 * AREA_HA) * DEPTHMAX) / 3) %>%
  head()
#Create only the new column
sites_sel %>%
  transmute(mean_depth = (((10000 * AREA_HA) * DEPTHMAX) / 3) / (AREA_HA * 10000)) %>%
  head()
```
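A handy detail about `mutate()`: it can refer to columns created earlier in the same call, so the two calculations above can be combined:

```{r}
sites_sel %>%
  mutate(volume = ((10000 * AREA_HA) * DEPTHMAX) / 3,
         mean_depth = volume / (AREA_HA * 10000)) %>%
  head()
```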
Lastly, we can get summaries of our data
```{r}
sites_sel %>%
  summarize(avg_depth = mean(DEPTHMAX, na.rm = TRUE),
            n = n()) %>%
  head()
```
## Manipulating grouped data
If `dplyr` stopped there it would still be a useful package. It, of course, does not stop there. One of the more powerful things you can do with `dplyr` is to run grouped operations. This is especially useful in the context of summarizing your data. The functions we will be looking at are:
- `group_by()`: Creates groups from column(s) in your data.
- `summarise()`: We saw this above, but it really shines when summarizing across groups.
- `n()`: A helper function, used within `summarise()`, that counts observations.
To start working with grouped data, you first create a grouped data frame:
```{r}
sites_sel %>%
  group_by(WSA_ECO9)
```
So, that looks a little different than we were expecting. What `group_by()` did was create a special `dplyr` object, a grouped data frame. We can see that in how it printed to the screen, which also lists the groups we are using.
Now to work with those groups using `summarise()`
```{r}
sites_sel %>%
  group_by(WSA_ECO9) %>%
  summarize(avg = mean(DEPTHMAX, na.rm = TRUE),
            std_dev = sd(DEPTHMAX, na.rm = TRUE),
            n = n())
```
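Because `summarize()` returns an ordinary data frame (one row per group), we can keep piping. For example, to rank the ecoregions by average depth:

```{r}
sites_sel %>%
  group_by(WSA_ECO9) %>%
  summarize(avg = mean(DEPTHMAX, na.rm = TRUE)) %>%
  arrange(desc(avg))
```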
Pretty cool!
## Database functionality: joins
So far we have only worked with a single table. That isn't always what you want to do, and `dplyr` provides functions for doing most types of database joins as well as selections based on set operations. We will be showing only the basic join types here:
- `left_join()`: Joins two data frames together based on a common ID. Keeps all observations from the first data frame (i.e. the one on the left)
- `right_join()`: Same as `left_join()` except it keeps observations from the second data frame
- `inner_join()`: Keeps only observations that are in both data frames.
- `full_join()`: Keeps all observations.
We will let you explore the others on your own, though a quick sketch of the two filtering joins follows the list.
- `semi_join()`
- `anti_join()`
- `intersect()`
- `union()`
- `setdiff()`
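As a hint for that exploration, `semi_join()` and `anti_join()` are filtering joins: they keep (or drop) rows of the first table that have a match in the second, without adding any columns. A minimal sketch, where the `eco` lookup table is made up purely for illustration:

```{r}
#A made-up lookup table with two ecoregion codes
eco <- data.frame(WSA_ECO9 = c("NAP", "SAP"), stringsAsFactors = FALSE)
#Keep sites whose ecoregion appears in eco
sites_sel %>%
  semi_join(eco, by = "WSA_ECO9") %>%
  head()
#Keep sites whose ecoregion does NOT appear in eco
sites_sel %>%
  anti_join(eco, by = "WSA_ECO9") %>%
  head()
```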
We will need to add another dataset and filter our sites some to show how the different joins work.
```{r}
wq <- read.csv("http://www.epa.gov/sites/production/files/2014-10/nla2007_chemical_conditionestimates_20091123.csv")
wq_sel <- wq %>%
  select(SITE_ID, VISIT_NO, CHLA, NTL, PTL, TURB)
head(wq_sel)
sites_sel <- sites_sel %>%
  filter(SITE_TYPE == "PROB_Lake")
```
We can take a look at the dimensions of each of these.
```{r}
dim(sites_sel)
dim(wq_sel)
```
Now let's join the water quality data to the site data and keep only those observations that are in the site data (i.e. only the PROB_Lake sites).
```{r}
sites_wq <- left_join(sites_sel, wq_sel)
dim(sites_wq)
head(sites_wq)
```
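A quick note: `left_join()` guessed the join keys by using all columns common to both data frames (here `SITE_ID` and `VISIT_NO`). Being explicit with the `by` argument is often safer and gives the same result:

```{r}
sites_wq <- left_join(sites_sel, wq_sel, by = c("SITE_ID", "VISIT_NO"))
dim(sites_wq)
```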
Or if we go the other way
```{r}
wq_sites <- right_join(sites_sel, wq_sel)
dim(wq_sites)
head(wq_sites)
```
To get just those observations that are common to both
```{r}
#First manufacture some differences
wq_samp <- wq_sel %>%
  sample_frac(0.75)
sites_samp <- sites_sel %>%
  sample_frac(0.75)
dim(wq_samp)
dim(sites_samp)
#Then the inner_join
sites_wq_in <- inner_join(sites_samp, wq_samp)
dim(sites_wq_in)
head(sites_wq_in)
```
Lastly, let's join and keep it all
```{r}
sites_wq_all <- full_join(sites_sel, wq_sel)
dim(sites_wq_all)
head(sites_wq_all)
```
## Database functionality: external databases
The last `dplyr` functionality we will talk about is connecting to an existing database. Several databases are supported: SQLite, PostgreSQL, MySQL, MariaDB, and BigQuery. We will just focus on SQLite, as it is by far the easiest to set up and use.
The main functions we will look at are:
- `src_sqlite()`: Creates a connection to a SQLite database (or creates a new one).
- `tbl()`: Using a connection, provides access to a specific table.
- `collect()`: Forces computation on the database and pulls the results into a data frame (by default `dplyr` only pulls data from the database into R when you ask).
- `translate_sql()`: Handy little tool to see how R expressions can be written as SQL (a quick sketch follows the list).
- `copy_to()`: Writes a data frame to an external database.
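As that quick sketch of `translate_sql()`, here is the SQL a simple R expression becomes (note that in recent `dplyr` releases this function has moved to the `dbplyr` package):

```{r}
translate_sql(DEPTHMAX >= 50)
```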
Let's see how this works.
First we need to create the connection to the database.
```{r}
#We need to load up the RSQLite package
library(RSQLite)
#Then connect
nla_sqlite <- src_sqlite("nla2007.sqlite3")
nla_sqlite
#List Tables
src_tbls(nla_sqlite)
```
Now we can access the tables
```{r}
#Get it all
sites_sqlite <- tbl(nla_sqlite, "sites")
wq_sqlite <- tbl(nla_sqlite, "wq")
#Use some SQL
sites_qry <- tbl(nla_sqlite, sql("SELECT * FROM sites WHERE VISIT_NO == 1"))
sites_qry
```
And run our other `dplyr` functions on the connection.
```{r}
sites_sel_sqlite <- sites_sqlite %>%
  select(SITE_ID, LAKENAME, VISIT_NO, SITE_TYPE, WSA_ECO9, AREA_HA, DEPTHMAX)
```
But notice the size compared to the original data frame.
```{r}
object.size(sites_sel)
object.size(sites_sel_sqlite)
```
Because the data stay in the database until we ask for them, we can work with databases that are larger than available memory: we can build up a fairly complicated analysis workflow and then pull back into R (and thus into memory) only the data we need.
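You can inspect the SQL that `dplyr` builds up for a lazy query with `explain()`, which prints the generated SQL along with the database's query plan (newer releases also offer `show_query()`); a quick sketch:

```{r}
sites_sel_sqlite %>%
  arrange(desc(AREA_HA)) %>%
  explain()
```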
Now let's see how to pull the table into a data frame with `collect()`.
```{r}
sites_sel_collect <- sites_sel_sqlite %>%
  arrange(desc(AREA_HA)) %>%
  collect()
```
Let's do some summaries of the data and then write a new table back to our database.
```{r}
#A bootstrapped sample
ecor_depth_stats <- sites_sel_collect %>%
  group_by(WSA_ECO9) %>%
  sample_n(1000, replace = TRUE) %>%
  summarize(avg = mean(DEPTHMAX, na.rm = TRUE),
            sd = sd(DEPTHMAX, na.rm = TRUE),
            boot_n = n())
#And write back to the database
src_tbls(nla_sqlite)
copy_to(nla_sqlite,ecor_depth_stats)
src_tbls(nla_sqlite)
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
db_drop_table(nla_sqlite$con, table = "ecor_depth_stats")
```
## Hands-on
The hands-on for this lesson will be ad hoc.
Make sure you can run the examples. Once you have done that, try some different functions with different columns or summary stats. A good place to start is with the joins. Try your own joins and then your own grouping column (or columns!) and get a feel for using `summarize()`.