@laneharrison
2/11/2017
Welcome to the R section of your "bootcamp"!
R is a programming language geared towards statistical computing.
What makes R exciting is its community. Research groups and statisticians regularly publish R packages that allow you to reproduce and extend their analyses. R is an excellent choice for exploratory and confirmatory data analysis.
In this class we have three weeks together. And while there is some flexibility in what we cover, most of our exercises will focus on a small set of learning goals. Specifically, upon successful completion of this section, you should be able to:
- Use R for routine analytical tasks: invoking statistical functions, loading data, producing documents / figures, installing packages, etc.
- Visualize any given dataset in multiple ways, through a rudimentary understanding of the "Grammar of Graphics" and ggplot2.
- Manipulate, arrange, and mutate datasets you encounter using dplyr.
This will provide you with a foundation for using R in your analysis workflow.
The book for this section is R for Data Science (R4D) by Hadley Wickham and Garrett Grolemund.
Each week's learning activities will center around an assigned reading and exercises from R4D.
R4D is up-to-date, good, and free. I recommend exploring this book well beyond the assigned exercises in this section.
To assess your progress, each week we'll assign a set of exercises from the R for Data Science book.
Guidelines: Some questions will require you to type out your reasoning or predictions. For example, question 5 in exercises 3.6.1 asks, "Will these two graphs look different? Why/why not?" For these types of questions, provide text along with any code you used to help answer them.
To turn these in, simply output an R Markdown document to PDF and upload both the PDF and your Rmd to Canvas.
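If you prefer rendering from the console rather than the Knit button in RStudio, a minimal sketch looks like the following. The file name `week1-exercises.Rmd` and the helper `render_homework` are hypothetical placeholders, and PDF output assumes a LaTeX installation is available on your machine.

```r
# Hypothetical helper: renders an R Markdown file to PDF.
# "week1-exercises.Rmd" is a placeholder; use your own file name.
render_homework <- function(rmd_path = "week1-exercises.Rmd") {
  # rmarkdown::render() knits the document and converts it with pandoc.
  # "pdf_document" requires a working LaTeX installation.
  rmarkdown::render(rmd_path, output_format = "pdf_document")
}

# Usage (uncomment once the file exists in your working directory):
# render_homework("week1-exercises.Rmd")
```

The PDF is written next to the Rmd file by default, so both files are ready to upload together.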
I'll introduce each topic, but most in-class time will be dedicated to completing the assigned exercises and getting help where you need it.
In our first week, we'll become acquainted with R and RStudio, and learn to plot the data we encounter with ggplot2.
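To give a flavor of the plotting we'll do, here is a minimal sketch that shows the same dataset in two ways. It assumes ggplot2 is installed and uses the `mpg` dataset that ships with it; the choice of variables is only for illustration.

```r
library(ggplot2)

# mpg is a small fuel-economy dataset that ships with ggplot2.
# View 1: engine displacement vs. highway mileage, coloured by car class.
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(aes(colour = class))

# View 2: same data and x/y mappings, but one panel per car class.
p_facets <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class)

print(p)
print(p_facets)
```

Both plots are built from the same pieces (data, aesthetic mappings, a geom), which is the core idea behind the "Grammar of Graphics".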
- How WPI VIEW uses R + R Markdown for data visualization research.
- How your tools shape you.
- Deliberate Practice (vs. Learning in The Trenches (vs. Shortcuts in the Trenches))
- Open Science: Challenges and Benefits
R for Data Science, Chapters 26-27.3, 1-3
- 27.2.1 (10m)
- 3.2.4 (10m)
- 3.3.1 (15m)
- 3.5.1 (25m)
This week we'll tackle data wrangling with dplyr. Wrangling will probably take up more of your time than actual statistics, modeling, and visualization!
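As a preview, here is a minimal sketch of a typical dplyr pipeline. It assumes dplyr is installed and uses R's built-in `mtcars` data as a stand-in (the book's chapter works with a different dataset); the column name `kpl` and the object `summary_tbl` are made up for this example.

```r
library(dplyr)

# mtcars is built into R: one row per car model.
summary_tbl <- mtcars %>%
  filter(cyl %in% c(4, 6)) %>%          # keep only 4- and 6-cylinder cars
  mutate(kpl = mpg * 0.425) %>%         # approximate miles/gallon -> km/litre
  group_by(cyl) %>%                     # one group per cylinder count
  summarise(mean_kpl = mean(kpl),       # average fuel economy per group
            n = n()) %>%                # number of cars per group
  arrange(desc(mean_kpl))               # most economical group first

print(summary_tbl)
```

Each verb does one small job, and the pipe (`%>%`) chains them into a readable sequence.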
- Why data wrangling is so pervasive.
- Tidy data concepts
- dplyr, the tidy data tool
R for Data Science, Chapter 5. (If you want to really dive in, do 9-12, too.)
- 5.5.2 question 1 (10m)
- 5.6.7 (30m)
- 5.7.1 (30m)
This week we're covering the last loop: modeling. You now know a bit about visualization, and a bit about transformation and wrangling.
Modeling is about capturing the essence of your data through tiny bits of math. Expect your modeling life to be iterative, with transformation and modeling definitely in the loop.
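A minimal sketch of that loop, using only base R and the built-in `mtcars` data as a stand-in for your own: fit a simple model, look at what it misses, transform, and fit again. The specific variables and the log transformation are illustrative choices, not a recommendation for your data.

```r
# First pass: a straight-line model of fuel economy against weight.
fit_linear <- lm(mpg ~ wt, data = mtcars)
print(summary(fit_linear))

# Inspect the residuals to see what the model fails to capture.
plot(mtcars$wt, resid(fit_linear),
     xlab = "Weight (1000 lbs)", ylab = "Residual (mpg)")
abline(h = 0, lty = 2)

# Second pass: transform the response and refit.
fit_log <- lm(log(mpg) ~ wt, data = mtcars)
print(summary(fit_log))

# Compare how much variance each model explains
# (each on the scale of its own response).
c(linear = summary(fit_linear)$r.squared,
  log    = summary(fit_log)$r.squared)
```

In practice you would go around this loop several times, alternating between plots, transformations, and refits.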
Produce an original document with the following features:
- Use data from your own research or your lab's research.
- (You may use data online, but only as a last resort.)
- (The Data) Describe the data in its original form. For example: How large is it? How many columns? How many rows? What types of columns? Is there missing data?
- (EDA) Conduct an exploratory analysis of the dataset. This should be your longest section. Include text, plots, tables, statistics, and possibly statistical tests.
- (Hypothesis) A brief section on some hypothesis you've generated about the data. Maybe you found something interesting in EDA that you would like to write up further, or you want to run statistical tests + charts to prove something. Sometimes you'll need to bring in additional data. This should be a result of your EDA.
- (Transformation) A brief section on the transformations you're making to the data to explore your hypothesis.
- (Results/Summary) Write up your results in a way that a newcomer can immediately grasp what was done, without any of the other sections as background. The visualizations you generate here should be well-designed and captioned such that they can exist on their own.
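For the (The Data) section, the questions about size, column types, and missing values can be answered with a few base R calls. The sketch below uses the built-in `airquality` dataset as a stand-in; replace `dat` with your own data frame.

```r
# Stand-in dataset; swap in your own data frame here.
dat <- airquality

# How large is it? Rows and columns.
dim(dat)

# What types of columns?
column_types <- sapply(dat, class)
print(column_types)

# Compact overview of structure and first few values.
str(dat)

# Is there missing data? Count NAs per column.
missing_per_column <- colSums(is.na(dat))
print(missing_per_column)
```

Reporting these numbers in prose, alongside the code, gives readers a quick sense of the dataset before the exploratory analysis begins.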