---
title: "Session A – What Are Summary Statistics? (SOLUTIONS)"
author: "<INSERT NAME>"
output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

In this session we will learn how to summarise data using measures of central tendency (mean, median, mode) and measures of spread (range, IQR, standard deviation). This helps reduce thousands of data points into a small set of meaningful numbers.

We will work with the palmerpenguins dataset — a fun and widely used dataset containing measurements from penguins in Antarctica.


```{r PackageInstall}
# install.packages("tidyverse")
```

You will notice that the chunk above appears in a grey area in your editor. This is because this is an R Markdown document, indicated by the file extension .Rmd. There are also R scripts, indicated by the .R extension. Code written for publication or for software is typically organised as a series of R scripts. However, R Markdown files are becoming more popular for exploring individual datasets because they give a neater overview: text, code and code output can all be rendered together into a PDF (or another document type).

Having installed tidyverse earlier, we must still load the package before we can use it. We do that with the 'library' function, which is built into R. Let's do that below, noting that we do not need quotes around the package name when loading it:

```{r PackageLoading}
library(tidyverse)
```

You may also have noticed that a chunk of code in an R Markdown document is delimited by two sets of three backticks, with a pair of curly braces containing 'r' directly after the first set. This tells the interpreter that the chunk contains R code.
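
As a minimal sketch of the syntax just described, the chunk below carries a label after the 'r' (ChunkExample, a name made up for this illustration) and evaluates a simple arithmetic expression. When the document is knitted, both the code and its result should appear in the output.

```{r ChunkExample}
# A chunk can hold any R code; the value on the last line is printed below the chunk.
chunk_result <- 2 + 3
chunk_result
```

Labels are optional, but each label may only be used once per document.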

### Exercise 1
Now that you know the basics of how to install and load packages, have a go at writing a code block below to install and load the package called “palmerpenguins”. This will be the dataset we will work with for the rest of this tutorial.

Your Answer:
```{r PenguinsLoading}
# install.packages("palmerpenguins")
library(palmerpenguins)
```

-------------


```{r PenguinsOverview}
str(penguins)
head(penguins)
tail(penguins)
summary(penguins)
```

### Exercise 2
Use the information from the code above to provide an overview of the penguins dataset. In particular, state how many observations and variables there are, and give the data type of each variable.

Your Answer:
There are 344 observations and 8 variables in the penguins dataset. Three variables are categorical and stored as factors (species, island, sex). The remaining five are numeric: bill_length_mm and bill_depth_mm are doubles, while flipper_length_mm, body_mass_g and year are integers. Several variables contain missing values (NA), as the summary output shows.
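
The figures in this answer can also be checked directly rather than read off the printed output. The sketch below, which assumes palmerpenguins is installed, uses base R functions to report the dimensions, the class of each column, and the number of missing values per column. The object names are chosen for illustration only.

```{r PenguinsStructureCheck}
library(palmerpenguins)

# Number of rows (observations) and columns (variables)
penguin_dims <- dim(penguins)
penguin_dims

# Class of each column: factor, numeric (double) or integer
penguin_classes <- sapply(penguins, class)
penguin_classes

# Count of missing values (NA) in each column
penguin_missing <- colSums(is.na(penguins))
penguin_missing
```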

-------------

Using the penguins data, calculate the mean and standard deviation of the variable flipper_length_mm.

```{r PenguinsStats}
# na.rm = TRUE drops missing values (NA) before each statistic is computed
PenguinFlipperSummary <- penguins %>%
    summarise(penguin_flipper_length_mean = mean(flipper_length_mm, na.rm = TRUE),
              penguin_flipper_length_sd = sd(flipper_length_mm, na.rm = TRUE))
PenguinFlipperSummary
```

Did you notice some unfamiliar syntax above? Let's address that. In R we use '<-' to assign a value to a variable. Above we have also started to use the tidyverse. '%>%' is known as a pipe: it passes the object on its left as the first argument to the function on its right, which lets us work with our data more intuitively by 'piping' it into a sequence of functions.
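
To make the pipe less mysterious, the sketch below writes the same summary twice, once with '%>%' and once as an ordinary function call. The object and column names (piped, nested, flipper_mean) are made up for this illustration, and the chunk assumes tidyverse and palmerpenguins are installed.

```{r PipeEquivalence}
library(tidyverse)
library(palmerpenguins)

# With the pipe: penguins is passed as the first argument of summarise()
piped <- penguins %>%
    summarise(flipper_mean = mean(flipper_length_mm, na.rm = TRUE))

# Without the pipe: the same call written out in full
nested <- summarise(penguins, flipper_mean = mean(flipper_length_mm, na.rm = TRUE))

# Check whether the two results agree
all.equal(piped, nested)
```

The piped form tends to pay off once several steps are chained, because the steps then read from top to bottom in the order they are applied.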

### Exercise 3
Following the example above, provide a summary (mean and standard deviation) of the bill_length_mm variable in the penguins dataset.

Your Answer:
```{r PenguinsStats2}
PenguinBillLengthSummary <- penguins %>%
    summarise(penguin_bill_length_mean = mean(bill_length_mm, na.rm = TRUE),
              penguin_bill_length_sd = sd(bill_length_mm, na.rm = TRUE))
PenguinBillLengthSummary
```

### Exercise 4
Compare different measures of spread for the flipper_length_mm variable.
Calculate the range, IQR, and standard deviation.

Your Answer:
```{r PenguinsSpread}
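# range() returns two values (min and max), which would make summarise() emit two rows,
# so the minimum, maximum and their difference are computed as separate columns instead.
PenguinFlipperSpread <- penguins %>%
    summarise(penguin_flipper_length_min = min(flipper_length_mm, na.rm = TRUE),
              penguin_flipper_length_max = max(flipper_length_mm, na.rm = TRUE),
              penguin_flipper_length_range = penguin_flipper_length_max - penguin_flipper_length_min,
              penguin_flipper_length_IQR = IQR(flipper_length_mm, na.rm = TRUE),
              penguin_flipper_length_sd = sd(flipper_length_mm, na.rm = TRUE))
PenguinFlipperSpread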
```
Here we calculated the range, IQR, and standard deviation for the flipper_length_mm variable. The range is described by the minimum and maximum values (their difference is its width), the IQR is the spread of the middle 50% of the data, and the standard deviation indicates how far values typically lie from the mean.
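
To connect these definitions to the built-in functions, the sketch below recomputes the IQR and the standard deviation by hand and compares them with IQR() and sd(). All object names are chosen for illustration, and the chunk assumes palmerpenguins is installed.

```{r SpreadByHand}
library(palmerpenguins)

# Flipper lengths with missing values removed
flipper <- penguins$flipper_length_mm[!is.na(penguins$flipper_length_mm)]

# Range: smallest and largest value, and the distance between them
flipper_min_max <- range(flipper)
flipper_width <- diff(flipper_min_max)

# IQR: third quartile minus first quartile
flipper_quartiles <- quantile(flipper, probs = c(0.25, 0.75), names = FALSE)
flipper_iqr_by_hand <- diff(flipper_quartiles)

# Standard deviation: square root of the sample variance (divisor n - 1)
flipper_sd_by_hand <- sqrt(sum((flipper - mean(flipper))^2) / (length(flipper) - 1))

# Compare the hand calculations with the built-in functions
all.equal(flipper_iqr_by_hand, IQR(flipper))
all.equal(flipper_sd_by_hand, sd(flipper))
```

Because the IQR depends only on the quartiles, it is much less affected by a few extreme values than the range or the standard deviation.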

### Exercise 5
Choose any numeric variable from the penguins dataset and calculate:
- Mean
- Median
- Mode
- Range
- IQR
- Standard deviation
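
One possible answer is sketched below, using body_mass_g as the chosen variable. Base R has no function for the statistical mode (the built-in mode() reports an object's storage mode instead), so the chunk defines a small helper, stat_mode, which is a hypothetical name introduced here and not part of any package. The chunk assumes tidyverse and palmerpenguins are installed.

```{r PenguinsExercise5}
library(tidyverse)
library(palmerpenguins)

# Hypothetical helper: returns the most frequent non-missing value.
# If several values are tied, the one that appears first in the data is returned.
stat_mode <- function(x) {
    x <- x[!is.na(x)]
    unique_values <- unique(x)
    unique_values[which.max(tabulate(match(x, unique_values)))]
}

PenguinBodyMassSummary <- penguins %>%
    summarise(penguin_body_mass_mean = mean(body_mass_g, na.rm = TRUE),
              penguin_body_mass_median = median(body_mass_g, na.rm = TRUE),
              penguin_body_mass_mode = stat_mode(body_mass_g),
              penguin_body_mass_min = min(body_mass_g, na.rm = TRUE),
              penguin_body_mass_max = max(body_mass_g, na.rm = TRUE),
              penguin_body_mass_IQR = IQR(body_mass_g, na.rm = TRUE),
              penguin_body_mass_sd = sd(body_mass_g, na.rm = TRUE))
PenguinBodyMassSummary
```

The range is reported as separate minimum and maximum columns so that the summary stays on a single row.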