<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="pandoc" />
<title>Basic care and feeding of data in R</title>
<script src="libs/jquery-1.11.3/jquery.min.js"></script>
<meta name="viewport" content="width=device-width, initial-scale=1" />
<link href="libs/bootstrap-3.3.5/css/bootstrap.min.css" rel="stylesheet" />
<script src="libs/bootstrap-3.3.5/js/bootstrap.min.js"></script>
<script src="libs/bootstrap-3.3.5/shim/html5shiv.min.js"></script>
<script src="libs/bootstrap-3.3.5/shim/respond.min.js"></script>
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-68219208-1', 'auto');
ga('send', 'pageview');
</script>
<style type="text/css">code{white-space: pre;}</style>
<link rel="stylesheet"
href="libs/highlight/default.css"
type="text/css" />
<script src="libs/highlight/highlight.js"></script>
<style type="text/css">
pre:not([class]) {
background-color: white;
}
</style>
<script type="text/javascript">
if (window.hljs && document.readyState && document.readyState === "complete") {
window.setTimeout(function() {
hljs.initHighlighting();
}, 0);
}
</script>
<style type="text/css">
h1 {
font-size: 34px;
}
h1.title {
font-size: 38px;
}
h2 {
font-size: 30px;
}
h3 {
font-size: 24px;
}
h4 {
font-size: 18px;
}
h5 {
font-size: 16px;
}
h6 {
font-size: 12px;
}
.table th:not([align]) {
text-align: left;
}
</style>
<link rel="stylesheet" href="libs/local/main.css" type="text/css" />
<link rel="stylesheet" href="libs/local/nav.css" type="text/css" />
<link rel="stylesheet" href="//netdna.bootstrapcdn.com/font-awesome/4.0.3/css/font-awesome.css" type="text/css" />
</head>
<body>
<style type = "text/css">
.main-container {
max-width: 940px;
margin-left: auto;
margin-right: auto;
}
code {
color: inherit;
background-color: rgba(0, 0, 0, 0.04);
}
img {
max-width:100%;
height: auto;
}
.tabbed-pane {
padding-top: 12px;
}
button.code-folding-btn:focus {
outline: none;
}
</style>
<div class="container-fluid main-container">
<!-- tabsets -->
<script src="libs/navigation-1.1/tabsets.js"></script>
<script>
$(document).ready(function () {
window.buildTabsets("TOC");
});
</script>
<!-- code folding -->
<header>
<div class="nav">
<a class="nav-logo" href="index.html">
<img src="static/img/stat545-logo-s.png" width="70px" height="70px"/>
</a>
<ul>
<li class="home"><a href="index.html">Home</a></li>
<li class="faq"><a href="faq.html">FAQ</a></li>
<li class="syllabus"><a href="syllabus.html">Syllabus</a></li>
<li class="topics"><a href="topics.html">Topics</a></li>
<li class="people"><a href="people.html">People</a></li>
</ul>
</div>
</header>
<div class="fluid-row" id="header">
<h1 class="title toc-ignore">Basic care and feeding of data in R</h1>
</div>
<div id="TOC">
<ul>
<li><a href="#buckle-your-seatbelt">Buckle your seatbelt</a></li>
<li><a href="#data-frames-are-awesome">Data frames are awesome</a></li>
<li><a href="#get-the-gapminder-data">Get the Gapminder data</a></li>
<li><a href="#meet-the-gapminder-data-frame-or-tibble">Meet the <code>gapminder</code> data frame or “tibble”</a></li>
<li><a href="#look-at-the-variables-inside-a-data-frame">Look at the variables inside a data frame</a></li>
<li><a href="#recap">Recap</a></li>
</ul>
</div>
<div id="buckle-your-seatbelt" class="section level3">
<h3>Buckle your seatbelt</h3>
<p><em>Ignore if you don’t need this bit of support.</em></p>
<p>Now is the time to make sure you are working in an appropriate directory on your computer, probably through the use of an <a href="block002_hello-r-workspace-wd-project.html">RStudio Project</a>. Enter <code>getwd()</code> in the Console to see the current working directory or, in RStudio, look at the bar at the top of the Console, where it is displayed.</p>
<p>You should clean out your workspace. In RStudio, click on the “Clear” broom icon from the Environment tab or use <em>Session > Clear Workspace</em>. You can also enter <code>rm(list = ls())</code> in the Console to accomplish the same thing.</p>
<p>Now restart R. This will ensure you don’t have any packages loaded from previous calls to <code>library()</code>. In RStudio, use <em>Session > Restart R</em>. Otherwise, quit R with <code>q()</code> and re-launch it.</p>
<p>Why do we do this? So that the code you write is complete and re-runnable. If you return to a clean slate often, you will root out hidden dependencies where one snippet of code only works because it relies on objects created by code saved elsewhere or, much worse, never saved at all. Similarly, an aggressive clean slate approach will expose any usage of packages that have not been explicitly loaded.</p>
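<p>A minimal sketch of what the clean-out does and does not reset. The object name <code>leftover</code> is hypothetical and simply stands in for something created by earlier, unsaved work.</p>
<pre class="r"><code>leftover <- 42       # stands in for an object left over from earlier work
ls()                 # lists "leftover" plus anything else in the workspace
rm(list = ls())      # removes every object in the workspace
remaining <- ls()    # the workspace was empty at the moment ls() ran
remaining
search()             # attached packages are untouched by rm(); restart R to reset them</code></pre>
<p>This is why the advice above pairs clearing the workspace with restarting R: <code>rm()</code> only removes objects, while a restart also drops packages attached by <code>library()</code>.</p>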
<p>Finally, open a new R script and develop and run your code from there. In RStudio, use <em>File > New File > R Script</em>. Save this script with a name that ends in <code>.r</code> or <code>.R</code>, contains no spaces or other funny stuff, and evokes whatever it is we’re doing today. Example: <code>cm004_data-care-feeding.r</code>.</p>
<p>Another great idea is to do this in an R Markdown document. See <a href="block007_first-use-rmarkdown.html">Test drive R Markdown</a> for a refresher.</p>
</div>
<div id="data-frames-are-awesome" class="section level3">
<h3>Data frames are awesome</h3>
<p>Whenever you have rectangular, spreadsheet-y data, your default data receptacle in R is a data frame. Do not depart from this without good reason. Data frames are awesome because…</p>
<ul>
<li>Data frames package related variables neatly together,
<ul>
<li>keeping them in sync vis-a-vis row order</li>
<li>applying any filtering of observations uniformly.</li>
</ul></li>
<li>Most functions for inference, modelling, and graphing are happy to be passed a data frame via a <code>data =</code> argument. This has been true in base R for a long time.</li>
<li>The set of packages known as the <a href="https://github.com/hadley/tidyverse"><code>tidyverse</code></a> takes this one step further and explicitly prioritizes the processing of data frames. This includes popular packages like <code>dplyr</code> and <code>ggplot2</code>. In fact the tidyverse prioritizes a special flavor of data frame, called a “tibble.”</li>
</ul>
<p>Data frames – unlike general arrays or, specifically, matrices in R – can hold variables of different flavors, such as character data (subject ID or name), quantitative data (white blood cell count), and categorical information (treated vs. untreated). If you use homogeneous structures, like matrices, for data analysis, you are likely to make the terrible mistake of spreading a dataset out over multiple, unlinked objects. Why? Because you can’t put character data, such as subject name, into the numeric matrix that holds white blood cell count. This fragmentation is a Bad Idea.</p>
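<p>As a small sketch of the difference, the toy vectors below (made-up values, not from any real dataset) are combined first as a matrix and then as a data frame. The names <code>subject</code>, <code>wbc</code>, and <code>treated</code> are chosen only for illustration.</p>
<pre class="r"><code>## toy data: names and values are made up for illustration
subject <- c("A01", "A02", "A03")
wbc     <- c(5.2, 7.9, 6.1)
treated <- factor(c("treated", "untreated", "treated"))

## a matrix holds one type only, so the numbers are coerced to character
m <- cbind(subject, wbc)
typeof(m)

## a data frame keeps each variable's own type
df <- data.frame(subject, wbc, treated, stringsAsFactors = FALSE)
sapply(df, class)

## filtering rows keeps all variables in sync
df[df$wbc > 6, ]</code></pre>
<p>In the matrix, the white blood cell counts become character strings and can no longer be used for arithmetic without converting them back. In the data frame, one row filter applies to the ID, the count, and the treatment status at once.</p>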
</div>
<div id="get-the-gapminder-data" class="section level3">
<h3>Get the Gapminder data</h3>
<p>We will work with some of the data from the <a href="http://www.gapminder.org">Gapminder project</a>. I’ve released this as an R package, so we can install it from CRAN like so:</p>
<pre class="r"><code>install.packages("gapminder")</code></pre>
<p>Now load the package:</p>
<pre class="r"><code>library(gapminder)</code></pre>
</div>
<div id="meet-the-gapminder-data-frame-or-tibble" class="section level3">
<h3>Meet the <code>gapminder</code> data frame or “tibble”</h3>
<p>By loading the <code>gapminder</code> package, we now have access to a data frame by the same name. Get an overview of this with <code>str()</code>, which displays the structure of an object.</p>
<pre class="r"><code>str(gapminder)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...</code></pre>
<p><code>str()</code> will provide a sensible description of almost anything and, worst case, nothing bad can actually happen. When in doubt, just <code>str()</code> some of the recently created objects to get some ideas about what to do next.</p>
<p>We could print the <code>gapminder</code> object itself to screen. However, if you’ve used R before, you might be reluctant to do this, because large datasets just fill up your Console and provide very little insight.</p>
<p>This is the first big win for <strong>tibbles</strong>. The <a href="https://github.com/hadley/tidyverse"><code>tidyverse</code></a> offers a special case of R’s default data frame: the “tibble”, which is a nod to the actual class of these objects, <code>tbl_df</code>.</p>
<p>If you have not already done so, install the <code>tidyverse</code> meta-package now:</p>
<pre class="r"><code>install.packages("tidyverse")</code></pre>
<p>Now load it:</p>
<pre class="r"><code>library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats</code></pre>
<p>Now we can boldly print <code>gapminder</code> to screen! It is a tibble (and also a regular data frame) and the <code>tidyverse</code> provides a nice print method that shows the most important stuff and doesn’t fill up your Console.</p>
<pre class="r"><code>## see? it's still a regular data frame, but also a tibble
class(gapminder)
## [1] "tbl_df" "tbl" "data.frame"
gapminder
## # A tibble: 1,704 × 6
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453
## 2 Afghanistan Asia 1957 30.332 9240934 820.8530
## 3 Afghanistan Asia 1962 31.997 10267083 853.1007
## 4 Afghanistan Asia 1967 34.020 11537966 836.1971
## 5 Afghanistan Asia 1972 36.088 13079460 739.9811
## 6 Afghanistan Asia 1977 38.438 14880372 786.1134
## 7 Afghanistan Asia 1982 39.854 12881816 978.0114
## 8 Afghanistan Asia 1987 40.822 13867957 852.3959
## 9 Afghanistan Asia 1992 41.674 16317921 649.3414
## 10 Afghanistan Asia 1997 41.763 22227415 635.3414
## # ... with 1,694 more rows</code></pre>
<p>If you are dealing with plain vanilla data frames, you can rein in data frame printing explicitly with <code>head()</code> and <code>tail()</code>. Or turn it into a tibble with <code>as_tibble()</code>!</p>
<pre class="r"><code>head(gapminder)
## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453
## 2 Afghanistan Asia 1957 30.332 9240934 820.8530
## 3 Afghanistan Asia 1962 31.997 10267083 853.1007
## 4 Afghanistan Asia 1967 34.020 11537966 836.1971
## 5 Afghanistan Asia 1972 36.088 13079460 739.9811
## 6 Afghanistan Asia 1977 38.438 14880372 786.1134
tail(gapminder)
## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 Zimbabwe Africa 1982 60.363 7636524 788.8550
## 2 Zimbabwe Africa 1987 62.351 9216418 706.1573
## 3 Zimbabwe Africa 1992 60.377 10704340 693.4208
## 4 Zimbabwe Africa 1997 46.809 11404948 792.4500
## 5 Zimbabwe Africa 2002 39.989 11926563 672.0386
## 6 Zimbabwe Africa 2007 43.487 12311143 469.7093
as_tibble(iris)
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fctr>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ... with 140 more rows</code></pre>
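<p>Both <code>head()</code> and <code>tail()</code> accept an <code>n</code> argument that controls how many rows are shown. A short sketch using the built-in <code>iris</code> data frame, so nothing needs to be installed:</p>
<pre class="r"><code>head(iris, n = 3)       # first 3 rows
tail(iris, n = 3)       # last 3 rows
head(iris, n = -147)    # a negative n drops that many rows from the end</code></pre>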
<p>More ways to query basic info on a data frame:</p>
<pre class="r"><code>names(gapminder)
## [1] "country" "continent" "year" "lifeExp" "pop" "gdpPercap"
ncol(gapminder)
## [1] 6
length(gapminder)
## [1] 6
dim(gapminder)
## [1] 1704 6
nrow(gapminder)
## [1] 1704</code></pre>
<p>A statistical overview can be obtained with <code>summary()</code>:</p>
<pre class="r"><code>summary(gapminder)
## country continent year lifeExp
## Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60
## Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20
## Algeria : 12 Asia :396 Median :1980 Median :60.71
## Angola : 12 Europe :360 Mean :1980 Mean :59.47
## Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85
## Australia : 12 Max. :2007 Max. :82.60
## (Other) :1632
## pop gdpPercap
## Min. :6.001e+04 Min. : 241.2
## 1st Qu.:2.794e+06 1st Qu.: 1202.1
## Median :7.024e+06 Median : 3531.8
## Mean :2.960e+07 Mean : 7215.3
## 3rd Qu.:1.959e+07 3rd Qu.: 9325.5
## Max. :1.319e+09 Max. :113523.1
## </code></pre>
<p>Although we haven’t begun our formal coverage of visualization yet, it’s so important for smell-testing a dataset that we will make a few figures anyway. Here we use only base R graphics, which are very basic.</p>
<pre class="r"><code>plot(lifeExp ~ year, gapminder)</code></pre>
<p><img src="block006_care-feeding-data_files/figure-html/first-plots-base-R-1.png" /><!-- --></p>
<pre class="r"><code>plot(lifeExp ~ gdpPercap, gapminder)</code></pre>
<p><img src="block006_care-feeding-data_files/figure-html/first-plots-base-R-2.png" /><!-- --></p>
<pre class="r"><code>plot(lifeExp ~ log(gdpPercap), gapminder)</code></pre>
<p><img src="block006_care-feeding-data_files/figure-html/first-plots-base-R-3.png" /><!-- --></p>
<!-- This is a non-sequitur here ... where came from originally?
> Sidebar on equals: A single equal sign `=` is most commonly used to specify values of arguments when calling functions in R, e.g. `group = continent`. It can be used for assignment but we advise against that, in favor of `<-`. A double equal sign `==` is a binary comparison operator, akin to less than `<` or greater than `>`, returning the logical value `TRUE` in the case of equality and `FALSE` otherwise. Although you may not yet understand exactly why, `subset = country == "Colombia"` restricts operation -- scatterplotting, in above examples -- to observations where the country is Colombia.
-->
<p>Let’s go back to the result of <code>str()</code> to talk about what a data frame is.</p>
<pre class="r"><code>str(gapminder)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...</code></pre>
<p>A data frame is a special case of a <em>list</em>, which is used in R to hold just about anything. Specifically, it is a list in which every component has the same length. Data frames are superior to matrices in R because they can hold vectors of different flavors, e.g. numeric, character, and categorical data can be stored together. This comes up a lot!</p>
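<p>The list nature of a data frame can be checked directly. The sketch below uses a tiny hand-built data frame whose values are copied from the first Afghanistan rows printed earlier; the object names <code>df</code> and <code>res</code> are only for illustration.</p>
<pre class="r"><code>df <- data.frame(year = c(1952L, 1957L, 1962L),
                 lifeExp = c(28.801, 30.332, 31.997))

is.list(df)           # a data frame is a list ...
length(df)            # ... so length() counts variables, not rows
sapply(df, length)    # every component has the same length, equal to nrow(df)

## components of unequal length are rejected with an error
res <- try(data.frame(a = 1:3, b = 1:2), silent = TRUE)
class(res)</code></pre>
<p>This also explains why <code>length(gapminder)</code> and <code>ncol(gapminder)</code> return the same number.</p>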
</div>
<div id="look-at-the-variables-inside-a-data-frame" class="section level3">
<h3>Look at the variables inside a data frame</h3>
<p>To specify a single variable from a data frame, use the dollar sign <code>$</code>. Let’s explore the numeric variable for life expectancy.</p>
<pre class="r"><code>head(gapminder$lifeExp)
## [1] 28.801 30.332 31.997 34.020 36.088 38.438
summary(gapminder$lifeExp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 23.60 48.20 60.71 59.47 70.85 82.60
hist(gapminder$lifeExp)</code></pre>
<p><img src="block006_care-feeding-data_files/figure-html/histogram-lifeExp-1.png" /><!-- --></p>
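<p>The dollar sign is not the only way to pull out a single variable. Double square brackets do the same job and are handy when the variable name is stored in another object. A minimal sketch on a hand-built excerpt (values copied from the rows printed earlier; <code>afg</code> and <code>v</code> are illustrative names):</p>
<pre class="r"><code>afg <- data.frame(year = c(1952L, 1957L, 1962L),
                  lifeExp = c(28.801, 30.332, 31.997))

afg$lifeExp           # a plain numeric vector
afg[["lifeExp"]]      # the same vector, selected by a quoted name

v <- "lifeExp"
afg[[v]]              # works when the name is held in a variable
identical(afg$lifeExp, afg[[v]])</code></pre>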
<p>The year variable is an integer variable, but since there are so few unique values it also functions a bit like a categorical variable.</p>
<pre class="r"><code>summary(gapminder$year)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1952 1966 1980 1980 1993 2007
table(gapminder$year)
##
## 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
## 142 142 142 142 142 142 142 142 142 142 142 142</code></pre>
<p>The variables for country and continent hold truly categorical information, which is stored as a <em>factor</em> in R.</p>
<pre class="r"><code>class(gapminder$continent)
## [1] "factor"
summary(gapminder$continent)
## Africa Americas Asia Europe Oceania
## 624 300 396 360 24
levels(gapminder$continent)
## [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
nlevels(gapminder$continent)
## [1] 5</code></pre>
<p>The <strong>levels</strong> of the factor <code>continent</code> are “Africa”, “Americas”, etc. and this is what’s usually presented to your eyeballs by R. In general, the levels are friendly human-readable character strings, like “male/female” and “control/treated”. But <em>never ever ever</em> forget that, under the hood, R is really storing integer codes 1, 2, 3, etc. Look at the result from <code>str(gapminder$continent)</code> if you are skeptical.</p>
<pre class="r"><code>str(gapminder$continent)
## Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...</code></pre>
<p>This <a href="http://en.wikipedia.org/wiki/Janus">Janus</a>-like nature of factors means they are rich with booby traps for the unsuspecting but they are a necessary evil. I recommend you resolve to learn how to <a href="block014_factors.html">properly care for and feed factors</a>. The pros far outweigh the cons. Specifically in modelling and figure-making, factors are anticipated and accommodated by the functions and packages you will want to exploit.</p>
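<p>One of the better-known booby traps, sketched on toy factors built by hand (not taken from <code>gapminder</code>): converting a factor straight to a number returns the integer codes, not the labels.</p>
<pre class="r"><code>continent <- factor(c("Asia", "Europe", "Africa", "Asia"))
levels(continent)          # levels are sorted: "Africa" "Asia" "Europe"
as.integer(continent)      # the underlying codes: 2 3 1 2
as.character(continent)    # the human-readable labels

## the trap: numbers that happen to be stored as a factor
f <- factor(c("10", "20", "10"))
as.numeric(f)                   # 1 2 1, the codes rather than the values
as.numeric(as.character(f))     # 10 20 10, the values</code></pre>
<p>Going through <code>as.character()</code> first is the usual way to recover the original values.</p>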
<p>Here we count how many observations are associated with each continent and, as usual, try to portray that info visually. This makes it much easier to quickly see that African countries are well represented in this dataset.</p>
<pre class="r"><code>table(gapminder$continent)
##
## Africa Americas Asia Europe Oceania
## 624 300 396 360 24
barplot(table(gapminder$continent))</code></pre>
<p><img src="block006_care-feeding-data_files/figure-html/tabulate-continent-1.png" /><!-- --></p>
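<p>To put a number on that impression, the counts can be turned into percentages. The sketch below retypes the counts from the output above by hand so that it runs without the package; <code>counts</code> and <code>shares</code> are illustrative names.</p>
<pre class="r"><code>## counts copied from the table(gapminder$continent) output above
counts <- c(Africa = 624, Americas = 300, Asia = 396, Europe = 360, Oceania = 24)
shares <- round(100 * prop.table(counts), 1)    # percent of all observations
shares</code></pre>
<p>With the package loaded, <code>prop.table(table(gapminder$continent))</code> should give the same proportions directly.</p>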
<p>The figures below show how factors can be put to work in plots. The <code>continent</code> factor is easily mapped into “facets” or colors and a legend by the <code>ggplot2</code> package. <em>Making figures with <code>ggplot2</code> is covered elsewhere so feel free to just sit back and enjoy these plots or blindly copy/paste.</em></p>
<pre class="r"><code>## we exploit the fact that ggplot2 was installed and loaded via the tidyverse
p <- ggplot(filter(gapminder, continent != "Oceania"),
aes(x = gdpPercap, y = lifeExp)) # just initializes
p <- p + scale_x_log10() # log the x axis the right way
p + geom_point() # scatterplot
p + geom_point(aes(color = continent)) # map continent to color
p + geom_point(alpha = (1/3), size = 3) + geom_smooth(lwd = 3, se = FALSE)
p + geom_point(alpha = (1/3), size = 3) + facet_wrap(~ continent) +
geom_smooth(lwd = 1.5, se = FALSE)</code></pre>
<p><img src="block006_care-feeding-data_files/figure-html/factors-nice-for-plots-1.png" width="49%" /><img src="block006_care-feeding-data_files/figure-html/factors-nice-for-plots-2.png" width="49%" /><img src="block006_care-feeding-data_files/figure-html/factors-nice-for-plots-3.png" width="49%" /><img src="block006_care-feeding-data_files/figure-html/factors-nice-for-plots-4.png" width="49%" /></p>
<!---
EXERCISES I have used and let languish.
Let's get the data for just 2007.
How many rows?
How many observations per continent?
Scatterplot life expectancy against GDP per capita.
Variants of that: indicate continent by color, do for just one continent, do for multiple continents at once but in separate plots
```r
hDat <- subset(gapminder, subset = year == 2007)
str(hDat)
## Classes 'tbl_df', 'tbl' and 'data.frame': 142 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 4 1 1 2 5 4 3 3 4 ...
## $ year : int 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
## $ lifeExp : num 43.8 76.4 72.3 42.7 75.3 ...
## $ pop : int 31889923 3600523 33333216 12420476 40301927 20434176 8199783 708573 150448339 10392226 ...
## $ gdpPercap: num 975 5937 6223 4797 12779 ...
table(hDat$continent)
##
## Africa Americas Asia Europe Oceania
## 52 25 33 30 2
#xyplot(lifeExp ~ gdpPercap, hDat)
#xyplot(lifeExp ~ gdpPercap, hDat, group = continent, auto.key = TRUE)
#xyplot(lifeExp ~ gdpPercap | continent, hDat)
```
## if you want just some rows and/or just some variables, for inspection or to
## assign as a new object, use subset()
subset(gapminder, subset = country == "Cambodia")
subset(gapminder, subset = country %in% c("Japan", "Belgium"))
subset(gapminder, subset = year == 1952)
subset(gapminder, subset = country == "Uruguay", select = c(country, year, lifeExp))
plot(lifeExp ~ year, gapminder, subset = country == "Zimbabwe")
plot(lifeExp ~ log(gdpPercap), gapminder, subset = year == 2007)
## exercise:
## get data for which life expectancy is less than 32 years
## assign to an object
## how many rows? how many observations per continent?
--->
</div>
<div id="recap" class="section level3">
<h3>Recap</h3>
<p>Use data frames!!!</p>
<p>Use the <a href="https://github.com/hadley/tidyverse"><code>tidyverse</code></a>!!! This will provide a special type of data frame called a “tibble” that has nice default printing behavior, among other benefits.</p>
<p>When in doubt, <code>str()</code> something or print something.</p>
<p>Always understand the basic extent of your data frames: number of rows and columns.</p>
<p>Understand what flavor the variables are.</p>
<p>Use factors!!! But with intention and care.</p>
<p>Do basic statistical and visual sanity checking of each variable.</p>
<p>Refer to variables by name, e.g., <code>gapminder$lifeExp</code>, not by column number. Your code will be more robust and readable.</p>
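<p>A small sketch of why names are more robust than positions, using a one-row data frame built by hand from the first row printed earlier (<code>afg</code> and <code>reordered</code> are illustrative names):</p>
<pre class="r"><code>afg <- data.frame(country = "Afghanistan", year = 1952L, lifeExp = 28.801)

afg[[3]]             # life expectancy, but only while it is the third column
afg$lifeExp          # life expectancy, wherever the column sits

## the same data with the columns in a different order
reordered <- afg[c("lifeExp", "country", "year")]
reordered[[3]]       # now silently returns the year instead
reordered$lifeExp    # still life expectancy</code></pre>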
</div>
<div class="footer">
This work is licensed under the <a href="http://creativecommons.org/licenses/by-nc/3.0/">CC BY-NC 3.0 Creative Commons License</a>.
</div>
</div>
<script>
// add bootstrap table styles to pandoc tables
$(document).ready(function () {
$('tr.header').parent('thead').parent('table').addClass('table table-condensed');
});
</script>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
var script = document.createElement("script");
script.type = "text/javascript";
script.src = "https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
document.getElementsByTagName("head")[0].appendChild(script);
})();
</script>
</body>
</html>