```{r setup, echo=FALSE, warning=FALSE, purl=FALSE, message=FALSE}
options(repos = "http://cran.rstudio.com/")
pkgs <- c("dplyr", "knitr", "RSQLite")
x <- lapply(pkgs, library, character.only = TRUE)
opts_chunk$set(tidy = TRUE)
```
# More fun with `dplyr`
These materials will review some introductory material (i.e. from our [Intro R Workshop](https://github.com/rhodyrstats/intro_r_workshop)) and then expand into some of the package's other functionality. This presentation is heavily influenced by the [Introduction to `dplyr`](https://cran.r-project.org/web/packages/dplyr/vignettes/introduction.html), [Two-table Verbs](https://cran.r-project.org/web/packages/dplyr/vignettes/two-table.html), and [Databases](https://cran.r-project.org/web/packages/dplyr/vignettes/databases.html) vignettes. In particular we will cover:
- [A word about pipes](#a-word-about-pipes)
- [Manipulating data](#manipulating-data)
- [Manipulating grouped data](#manipulating-grouped-data)
- [Database functionality: joins](#database-functionality-joins)
- [Database functionality: external databases](#database-functionality-external-databases)
## A word about pipes
There are many different ways to go about scripting an analysis in R (or any language for that matter). These were discussed in the [Intro to R Workshop](https://github.com/rhodyrstats/intro_r_workshop/blob/master/lessons/03_wrangling.md#using-dplyr) (scroll down a bit), but to review, they are: using intermediate steps/objects, nesting functions, or using pipes. If you are developing new functions or packages it is probably best to avoid pipes, as they add a dependency and can make debugging a bit of a challenge. If you are scripting a data analysis, pipes (i.e. `%>%` from `magrittr`) are, in my opinion, the way to go. For this presentation we will use pipes for all the examples.
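To see the difference, here is the same small calculation written all three ways. This is just a quick sketch using the built-in `mtcars` data (the object names are arbitrary):

```{r}
#1. Intermediate objects
heavy <- filter(mtcars, wt > 3.5)
heavy_avg <- summarize(heavy, avg_mpg = mean(mpg))
#2. Nested functions
heavy_avg <- summarize(filter(mtcars, wt > 3.5), avg_mpg = mean(mpg))
#3. Pipes: read top to bottom, left to right
heavy_avg <- mtcars %>%
  filter(wt > 3.5) %>%
  summarize(avg_mpg = mean(mpg))
heavy_avg
```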
## Manipulating data
The `dplyr` package is first and foremost a package to help facilitate data manipulation. What it does can certainly be done with base R or with other packages, but it can be argued that `dplyr` makes these tasks more understandable through its use of a consistent interface. In particular, this is accomplished through the use of data manipulation verbs. These verbs are:
- `select()`: Selects columns from a data frame.
- `arrange()`: Arranges a data frame in ascending or descending order based on column(s).
- `filter()`: Selects observations from a data frame based on values in column(s).
- `slice()`: Selects observations based on specific rows.
- `rename()`: Renames columns in a data frame.
- `distinct()`: Gets unique rows (OK, not a verb...).
- `sample_n()`: Randomly selects 'n' rows.
- `sample_frac()`: Randomly selects a fraction of rows.
- `mutate()`: Adds new columns to a data frame and keeps all other columns.
- `transmute()`: Adds new columns to a data frame and drops all other columns.
- `summarise()`: Summarizes your data.
Before we move on, we need some data. Once again, I am going to rely on the 2007 National Lakes Assessment Data:
```{r}
sites <- read.csv("http://www.epa.gov/sites/production/files/2014-01/nla2007_sampledlakeinformation_20091113.csv")
```
Let's look at the columns in this data frame.
```{r}
names(sites)
```
Given the large number of fields, I may want to reduce the data to just the columns I am interested in.
```{r}
sites_sel <- sites %>%
  select(SITE_ID, LAKENAME, VISIT_NO, SITE_TYPE, WSA_ECO9, AREA_HA, DEPTHMAX)
head(sites_sel)
```
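As an aside, `select()` also understands helper functions such as `starts_with()` and `contains()`, which can save typing with wide data like this. A quick sketch (which columns get matched depends on the actual field names in the file):

```{r}
sites %>%
  select(SITE_ID, starts_with("LAKE"), contains("DEPTH")) %>%
  head()
```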
We can arrange the data.
```{r}
#Ascending is the default
sites_sel %>%
  arrange(DEPTHMAX) %>%
  head()
#Descending needs desc()
sites_sel %>%
  arrange(desc(DEPTHMAX)) %>%
  head()
#By more than one column
sites_sel %>%
  arrange(WSA_ECO9, desc(DEPTHMAX)) %>%
  head()
```
Let's filter down to just some of the deeper lakes
```{r}
sites_sel %>%
  filter(DEPTHMAX >= 50)
```
Or just the deep lakes in the Northern Appalachians ecoregion
```{r}
sites_sel %>%
  filter(WSA_ECO9 == "NAP", DEPTHMAX >= 50)
```
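Note that the comma in `filter()` acts as a logical AND; for an OR you use `|` explicitly. For example, deep lakes in either the Northern or Southern Appalachians (a sketch, assuming the code "SAP" appears in `WSA_ECO9`):

```{r}
sites_sel %>%
  filter(WSA_ECO9 == "NAP" | WSA_ECO9 == "SAP", DEPTHMAX >= 50)
```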
We can also grab observations by row
```{r}
sites_sel %>%
  slice(c(1, 2))
#or
sites_sel %>%
  slice(seq(1, nrow(sites_sel), 100))
```
Renaming columns is easy
```{r}
sites_sel %>%
  rename(Ecoregion = WSA_ECO9, MaxDepth = DEPTHMAX) %>%
  head()
```
We can identify distinct values and get those rows
```{r}
sites_sel %>%
  distinct(WSA_ECO9)
#Returns the first row with each distinct value, so order has an impact
sites_sel %>%
  arrange(desc(DEPTHMAX)) %>%
  distinct(WSA_ECO9)
```
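A version note: in `dplyr` 0.5 and later, `distinct()` returns only the column(s) you name; to get the full first row for each distinct value, as described above, add `.keep_all = TRUE`:

```{r}
sites_sel %>%
  arrange(desc(DEPTHMAX)) %>%
  distinct(WSA_ECO9, .keep_all = TRUE)
```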
Sampling by number or fraction, with or without replacement, is done like this:
```{r}
set.seed(72)
#By Number
sites_sel %>%
  sample_n(10)
#By Fraction
sites_sel %>%
  sample_frac(0.01)
```
To create new columns
```{r}
#Add it to the other columns
sites_sel %>%
  mutate(volume = ((10000 * AREA_HA) * DEPTHMAX) / 3) %>%
  head()
#Create only the new column
sites_sel %>%
  transmute(mean_depth = (((10000 * AREA_HA) * DEPTHMAX) / 3) / (AREA_HA * 10000)) %>%
  head()
```
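A handy detail about `mutate()`: it can refer to columns created earlier in the same call, so the two calculations above can be combined:

```{r}
sites_sel %>%
  mutate(volume = ((10000 * AREA_HA) * DEPTHMAX) / 3,
         mean_depth = volume / (AREA_HA * 10000)) %>%
  head()
```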
Lastly, we can get summaries of our data
```{r}
sites_sel %>%
  summarize(avg_depth = mean(DEPTHMAX, na.rm = TRUE),
            n = n()) %>%
  head()
```
## Manipulating grouped data
If `dplyr` stopped there it would still be a useful package. It, of course, does not stop there. One of the more powerful things you can do with `dplyr` is to run grouped operations. This is especially useful in the context of summarizing your data. The functions we will be looking at are:
- `group_by()`: Creates groups from column(s) in your data.
- `summarise()`: We saw this above, but it really shines when summarizing across groups.
- `n()`: A helper function, used within `summarise()`, that counts observations.
To start working with grouped data, you first create a grouped data frame:
```{r}
sites_sel %>%
  group_by(WSA_ECO9)
```
So, that looks a little different than we were expecting. What `group_by()` did was create a special `dplyr` object, a grouped data frame. We can see that in how it printed to the screen, which also lists the groups we are using.
Now to work with those groups using `summarise()`
```{r}
sites_sel %>%
  group_by(WSA_ECO9) %>%
  summarize(avg = mean(DEPTHMAX, na.rm = TRUE),
            std_dev = sd(DEPTHMAX, na.rm = TRUE),
            n = n())
```
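Because `summarize()` returns an ordinary data frame (one row per group), we can keep piping. For example, to rank the ecoregions by average depth:

```{r}
sites_sel %>%
  group_by(WSA_ECO9) %>%
  summarize(avg = mean(DEPTHMAX, na.rm = TRUE)) %>%
  arrange(desc(avg))
```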
Pretty cool!
## Database functionality: joins
So far we have only worked with a single table. That isn't always what you want to do, and `dplyr` provides functions for doing most types of database joins as well as selections based on set operations. We will be showing only the basic join types here:
- `left_join()`: Joins two data frames together based on a common ID. Keeps all observations from the first data frame (i.e. the one on the left)
- `right_join()`: Same as `left_join()` except it keeps observations from the second data frame
- `inner_join()`: Keeps only observations that are in both data frames.
- `full_join()`: Keeps all observations.
We will let you explore the others on your own, though a quick sketch of the two filtering joins follows the list.
- `semi_join()`
- `anti_join()`
- `intersect()`
- `union()`
- `setdiff()`
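As a hint for that exploration, `semi_join()` and `anti_join()` are filtering joins: they keep (or drop) rows of the first table that have a match in the second, without adding any columns. A minimal sketch, where the `eco` lookup table is made up purely for illustration:

```{r}
#A made-up lookup table with two ecoregion codes
eco <- data.frame(WSA_ECO9 = c("NAP", "SAP"), stringsAsFactors = FALSE)
#Keep sites whose ecoregion appears in eco
sites_sel %>%
  semi_join(eco, by = "WSA_ECO9") %>%
  head()
#Keep sites whose ecoregion does NOT appear in eco
sites_sel %>%
  anti_join(eco, by = "WSA_ECO9") %>%
  head()
```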
We will need to add another dataset and filter our sites some to show how the different joins work.
```{r}
wq <- read.csv("http://www.epa.gov/sites/production/files/2014-10/nla2007_chemical_conditionestimates_20091123.csv")
wq_sel <- wq %>%
  select(SITE_ID, VISIT_NO, CHLA, NTL, PTL, TURB)
head(wq_sel)
sites_sel <- sites_sel %>%
  filter(SITE_TYPE == "PROB_Lake")
```
We can take a look at the dimensions of each of these.
```{r}
dim(sites_sel)
dim(wq_sel)
```
Now let's join the water quality data to the site data and keep only those observations that are in the site data (i.e. only the PROB_Lake sites).
```{r}
sites_wq <- left_join(sites_sel, wq_sel)
dim(sites_wq)
head(sites_wq)
```
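A quick note: `left_join()` guessed the join keys by using all columns common to both data frames (here `SITE_ID` and `VISIT_NO`). Being explicit with the `by` argument is often safer and gives the same result:

```{r}
sites_wq <- left_join(sites_sel, wq_sel, by = c("SITE_ID", "VISIT_NO"))
dim(sites_wq)
```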
Or if we go the other way
```{r}
wq_sites <- right_join(sites_sel, wq_sel)
dim(wq_sites)
head(wq_sites)
```
To get just those observations that are common to both
```{r}
#First manufacture some differences
wq_samp <- wq_sel %>%
  sample_frac(0.75)
sites_samp <- sites_sel %>%
  sample_frac(0.75)
dim(wq_samp)
dim(sites_samp)
#Then the inner_join
sites_wq_in <- inner_join(sites_samp, wq_samp)
dim(sites_wq_in)
head(sites_wq_in)
```
Lastly, let's join and keep it all
```{r}
sites_wq_all <- full_join(sites_sel, wq_sel)
dim(sites_wq_all)
head(sites_wq_all)
```
## Database functionality: external databases
The last `dplyr` functionality we will talk about is connecting to an existing database. Several databases are supported: SQLite, PostgreSQL, MySQL, MariaDB, and BigQuery. We will just focus on SQLite, as it is by far the easiest to set up and use.
The main functions we will look at are:
- `src_sqlite()`: Creates a connection to a SQLite database (or creates a new one).
- `tbl()`: Using a connection, provides access to a specific table.
- `collect()`: Forces computation on the database and pulls the results into a data frame (by default `dplyr` only pulls data from the database into R when you ask).
- `translate_sql()`: Handy little tool to see how R expressions can be written as SQL (a quick sketch follows the list).
- `copy_to()`: Writes a data frame to an external database.
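As that quick sketch of `translate_sql()`, here is the SQL a simple R expression becomes (note that in recent `dplyr` releases this function has moved to the `dbplyr` package):

```{r}
translate_sql(DEPTHMAX >= 50)
```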
Let's see how this works.
First we need to create the connection to the database.
```{r}
#We need to load up the RSQLite package
library(RSQLite)
#Then connect
nla_sqlite <- src_sqlite("nla2007.sqlite3")
nla_sqlite
#List Tables
src_tbls(nla_sqlite)
```
Now we can access the tables
```{r}
#Get it all
sites_sqlite <- tbl(nla_sqlite, "sites")
wq_sqlite <- tbl(nla_sqlite, "wq")
#Use some SQL
sites_qry <- tbl(nla_sqlite, sql("SELECT * FROM sites WHERE VISIT_NO == 1"))
sites_qry
```
And run our other `dplyr` functions on the connection.
```{r}
sites_sel_sqlite <- sites_sqlite %>%
  select(SITE_ID, LAKENAME, VISIT_NO, SITE_TYPE, WSA_ECO9, AREA_HA, DEPTHMAX)
```
But notice the size compared to the original data frame.
```{r}
object.size(sites_sel)
object.size(sites_sel_sqlite)
```
Because the data stay in the database until we ask for them, we can work with databases that are larger than available memory: we can build up a fairly complicated analysis workflow and then pull back into R (and thus into memory) only the data we need.
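You can inspect the SQL that `dplyr` builds up for a lazy query with `explain()`, which prints the generated SQL along with the database's query plan (newer releases also offer `show_query()`); a quick sketch:

```{r}
sites_sel_sqlite %>%
  arrange(desc(AREA_HA)) %>%
  explain()
```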
Now let's see how to pull the table into a data frame with `collect()`.
```{r}
sites_sel_collect <- sites_sel_sqlite %>%
  arrange(desc(AREA_HA)) %>%
  collect()
```
Let's do some summaries of the data and then write a new table back to our database.
```{r}
#A bootstrapped sample
ecor_depth_stats <- sites_sel_collect %>%
  group_by(WSA_ECO9) %>%
  sample_n(1000, replace = TRUE) %>%
  summarize(avg = mean(DEPTHMAX, na.rm = TRUE),
            sd = sd(DEPTHMAX, na.rm = TRUE),
            boot_n = n())
#And write back to the database
src_tbls(nla_sqlite)
copy_to(nla_sqlite,ecor_depth_stats)
src_tbls(nla_sqlite)
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
db_drop_table(nla_sqlite$con, table = "ecor_depth_stats")
```
## Hands-on
The hands-on for this lesson will be ad hoc.
Make sure you can run the examples. Once you have done that, try some different functions with different columns or summary stats. A good place to start is with the joins. Try your own joins and then your own grouping column (or columns!) and get a feel for using `summarize()`.