Data Analysis With R
This document is designed to provide essential resources and tutorials to help you become proficient in using R for data analysis. Whether you're just starting your journey or looking to enhance your skills, this guide offers a curated list of resources that are both practical and insightful, tailored to the needs of data scientists working with Hack For LA.
R is a powerful, open-source programming language and software environment specifically designed for statistical computing and graphics. It is widely used among statisticians, data analysts, and data scientists for developing statistical software and performing data analysis. One of the key strengths of R is its extensive library of packages, which provide a wide range of statistical and graphical techniques, including linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and more. R's syntax is user-friendly and highly expressive, making it an excellent tool for both beginners and experienced users. Additionally, R's active community continually contributes to its development, ensuring that it remains at the cutting edge of data science and statistical analysis.
R offers significant coding convenience, including vectorized operations, the ability to read and write data in many file formats, and the ability to call other command-line programs. It supports parallel execution on multicore processors through the built-in parallel package, and on GPUs and MPI clusters through add-on packages. R is also flexible: its behavior can be customized, and it supports object-oriented programming.
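To make "vectorized operations" concrete, here is a minimal sketch in base R. The temperature values and variable names are made up for illustration.

```r
# Vectorized arithmetic: the operation is applied to every element at once,
# with no explicit loop.
temps_c <- c(18.5, 21.0, 25.5, 30.0)   # hypothetical temperatures in Celsius
temps_f <- temps_c * 9 / 5 + 32        # converts all four values in one step
print(temps_f)

# Logical comparisons are vectorized too, and the result can be used to subset.
warm <- temps_c[temps_c > 20]
print(warm)
```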
Installing R and RStudio is a crucial first step for any new data scientist looking to leverage the power of R for data analysis. R, an open-source statistical programming language, provides a robust environment for data manipulation, statistical computing, and graphical representation. To maximize its potential, RStudio is recommended as an integrated development environment (IDE) that enhances the user experience with features like syntax highlighting, direct code execution, and a comprehensive workspace management system. By installing RStudio alongside R, users benefit from an organized and efficient setup that streamlines coding, debugging, and visualization tasks, making it easier to focus on data-driven insights and project outcomes.
1. Download R:
Go to the Comprehensive R Archive Network (CRAN) website (https://cran.r-project.org).
Download the R version that suits your operating system (Windows, macOS, or Linux).
2. Install R on Windows:
Click on the "base" link and download the installer. Run the downloaded installer and follow the on-screen instructions to complete the installation.
3. Install R on macOS:
Download the .pkg file for the latest R version.
Open the downloaded file and follow the installation instructions.
1. Download RStudio:
Go to the RStudio download page (https://posit.co/download/rstudio-desktop).
Click the "Download RStudio Desktop" button under Install RStudio, or download the RStudio installer for your operating system.
2. Install RStudio on Windows:
Download the installer and run it.
Follow the on-screen instructions to complete the installation.
3. Install RStudio on macOS:
Download the .dmg file and open it.
Drag the RStudio icon to the Applications folder.
Here is a simple video about installing R and RStudio on Windows:
https://youtu.be/YrEe2TLr3MI?si=LRXDA0G6FquejNdC
Here's a step-by-step guide to help you get started:
1. Open RStudio:
Launch RStudio from your applications or start menu.
2. Create a New Script:
Go to File > New File > R Script or use the shortcut Ctrl+Shift+N (Windows/Linux) or Cmd+Shift+N (macOS).
3. Write Your Script:
Enter the following basic R code into the script editor:
# My First R Script
# Print a message to the console
print("Hello, world!")
# Create a numeric variable
x <- 10
# Perform a simple arithmetic operation
y <- x * 2
# Print the result
print(y)
# Create a vector
numbers <- c(1, 2, 3, 4, 5)
# Calculate the mean of the vector
mean_value <- mean(numbers)
# Print the mean value
print(mean_value)
4. Save Your Script:
Save your script by going to File > Save or using the shortcut Ctrl+S (Windows/Linux) or Cmd+S (macOS).
Choose a location on your computer and name your script (e.g., first_script.R).
1. Run the Entire Script:
To run the entire script, you can either click the Source button in the top-right corner of the script editor or use the shortcut Ctrl+Shift+Enter (Windows/Linux) or Cmd+Shift+Enter (macOS).
The script will execute, and you will see the output in the console at the bottom of RStudio.
2. Run Selected Lines:
To run specific lines of code, highlight the lines you want to execute and press Ctrl+Enter (Windows/Linux) or Cmd+Enter (macOS).
The selected lines will execute, and the output will appear in the console.
3. Viewing the Output
The output of your script will be displayed in the console. You should see the printed messages and results from your script, such as:
[1] "Hello, world!"
[1] 20
[1] 3
By following these steps, you can write, save, and run your first R script in RStudio. This process allows you to automate repetitive tasks, analyze data, and generate reports efficiently. As you become more familiar with R, you'll be able to write more complex scripts to tackle various data analysis challenges.
Here is a simple video showing how to use RStudio:
https://youtu.be/FIrsOBy5k58?si=R7O3i1gI07X-0zWx
Tidyverse is a collection of R packages designed for data science, sharing an underlying design philosophy, grammar, and data structures that make working with data easier. Volunteers can analyze Hack for LA data using Tidyverse, which offers a powerful and cohesive set of tools for efficient data cleaning, transformation, and visualization. Tidyverse's consistent syntax and integrated workflows streamline the entire data analysis process, from importing data with readr to creating insightful visualizations with ggplot2. Productivity is further enhanced by the functional programming capabilities of purrr and the string and factor management provided by stringr and forcats. These features enable volunteers to effectively analyze and present data on critical issues such as homelessness, expungement, and food insecurity, ultimately supporting informed decision-making and impactful community interventions.
Here's an overview of the core packages in the Tidyverse and their primary functions:
- ggplot2: Used for data visualization, it implements the grammar of graphics, providing a powerful and flexible system for creating a wide range of visualizations.
- dplyr: Provides a set of functions for data manipulation, including filtering rows, selecting columns, rearranging rows, and summarizing data.
- tidyr: Helps tidy data, ensuring that data sets are consistent and easy to work with by transforming them into a tidy format where each variable is a column, each observation is a row, and each type of observational unit is a table.
- readr: Facilitates the reading of rectangular data, such as CSV files, into R. It is designed to be fast and to handle a wide range of data formats.
- purrr: Enhances R's functional programming tools, making it easier to apply functions to data and work with lists.
- tibble: Provides a modern take on data frames, offering a data structure that is simpler and more user-friendly than base R data frames.
- stringr: Simplifies string manipulation by providing a consistent set of functions designed to make working with strings easier and more intuitive.
- forcats: Aims to make working with categorical data (factors) easier, providing a suite of tools for creating, modifying, and analyzing factors.
To get started with Tidyverse, you can install it in R using the following command:
install.packages("tidyverse")
Once installed, you can load the Tidyverse packages with:
library(tidyverse)
This command will load ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats, along with any other packages they depend on.
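As a rough illustration of the consistent syntax described above, here is a minimal sketch of a dplyr pipeline. It assumes the tidyverse (or at least dplyr) is installed, and it uses the built-in mtcars data set as a stand-in because no specific Hack for LA data set is named here; hp, cyl, and mpg are columns of mtcars.

```r
library(dplyr)   # attached automatically by library(tidyverse)

# mtcars ships with R; it stands in for a real data set here.
mpg_by_cyl <- mtcars %>%
  filter(hp > 100) %>%                          # keep cars with more than 100 horsepower
  group_by(cyl) %>%                             # one group per cylinder count
  summarise(n = n(), mean_mpg = mean(mpg)) %>%  # count and average per group
  arrange(desc(mean_mpg))                       # highest average mpg first

print(mpg_by_cyl)
```

Each verb takes a data frame and returns a data frame, which is what lets the steps chain with the pipe.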
CRAN, which stands for the Comprehensive R Archive Network, is a repository for R, a programming language and environment for statistical computing and graphics. It is one of the main resources for R users to find and download R packages, which are collections of R functions, data, and compiled code that extend the functionality of the base R environment.
- Package Repository: CRAN hosts thousands of R packages covering a wide range of topics, from data manipulation and visualization to machine learning and bioinformatics. These packages are contributed by the R community and are regularly updated.
- Mirrors: CRAN is mirrored across the globe, meaning there are multiple servers around the world that host copies of CRAN to ensure fast access and reliability for users worldwide.
- Documentation: Each package on CRAN comes with a reference manual, and many also include vignettes and examples that help users understand how to use the package effectively.
- CRAN Task Views: These are curated lists of packages grouped by topic, providing an easy way to find relevant packages for specific tasks like Bayesian analysis, econometrics, or machine learning.
- Version Control: CRAN keeps an archive of older package versions, allowing users to install a specific version if needed.
- Quality Control: Packages on CRAN are subjected to rigorous checks and must pass several automated tests before they are accepted. This ensures that packages are reliable and compatible with the R environment.
- Installing Packages: R users can install packages from CRAN using the install.packages() function in R. For example, install.packages("ggplot2") will install the ggplot2 package from CRAN.
- Browsing Packages: Users can browse available packages on the CRAN website, where they can search for packages by name or topic.
- Updating Packages: Installed packages can be updated to their latest versions using the update.packages() function.
CRAN plays a central role in the R ecosystem, providing a robust and reliable platform for the distribution of R packages, which is crucial for the development of statistical methods and data analysis workflows.
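A common pattern is to install a package only when it is not already present. The sketch below is one way to do that in base R; missing_packages is a hypothetical helper name, not part of R, and the install call is left commented out because it needs network access.

```r
# Return the subset of `pkgs` that is not installed in the current library.
# `missing_packages` is a hypothetical helper name, not part of R.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("stats", "ggplot2"))
print(to_install)

# Install only what is missing (uncomment to run; needs network access):
# if (length(to_install) > 0) install.packages(to_install)
```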
Guide to R packages installation: https://www.datacamp.com/tutorial/r-packages-guide
- One of the most common ways to store data is to save it in files.
- Files can be in many formats, such as plain text, csv, Excel spreadsheet, or RData.
- Files make it easy to transfer data from one computer system to another.
- Files are also how data is usually loaded into the R environment for analysis.
Reading data from files
- Setup steps
- Download file cars.txt https://drive.google.com/file/d/1vFqrSz4v0StwmqIMdFZhFqO3sR7T78BD/view?usp=share_link
- Move it to folder Rtutorial
- Open RStudio, set the working directory to Rtutorial
setwd("your/path/to/Rtutorial")
- In the Console, type the following:
cars = read.table(file = "cars.txt", sep = "\t", header = TRUE, stringsAsFactors = FALSE)
file: name of the file to read, e.g. cars.txt
header: logical (TRUE/FALSE). TRUE means the first line contains the names of the variables, e.g. 'mpg', 'cyl', 'disp'
sep: the field separator. "\t" means fields are separated by tabs; other common choices are "," (comma) and "" (the default, any whitespace).
To see the other parameters, type ?read.table for further details.
cars <- read.table("cars.txt", header = TRUE, sep = "\t")
nrow(cars) # get the number of rows
ncol(cars) # get the number of columns
head(cars) # preview the data
read.table() documentation: https://www.rdocumentation.org/packages/utils/versions/3.6.2/topics/read.table
cars[1, ] # extract the first row
cars[1:3, ] # extract the first 3 rows
cars[, "Transmission"] # extract the column named Transmission
cars[ ,1] # extract the first column
cars[2,3] # extract the element at row 2 and column 3
cars[2:6, 2:4] # extract the elements at that range
write.table(x, file, row.names, col.names, quote, sep)
x: the data you want to write to a file
file: name of the file to write to
row.names / col.names: either logical (TRUE/FALSE), saying whether to write the row or column names, or a character vector of names to use
quote: logical (TRUE/FALSE), whether to surround character and factor values with double quotes
sep: the field separator, e.g. " " (space) or "\t" (tab)
write.table(x = cars, file = "modified_cars.txt", row.names = FALSE, sep = "\t")
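The reading, indexing, and writing steps can be tried end to end without downloading anything. This is a minimal sketch that swaps in the built-in mtcars data set and a temporary file in place of cars.txt; the variable names are made up for illustration.

```r
# Use a temporary file so the sketch does not touch your working directory.
path <- tempfile(fileext = ".txt")

# Write the first rows of the built-in mtcars data as tab-separated text.
small <- head(mtcars[, c("mpg", "cyl", "disp")], 5)
write.table(small, file = path, sep = "\t", row.names = FALSE, quote = FALSE)

# Read it back and inspect it the same way as cars above.
back <- read.table(path, header = TRUE, sep = "\t")
dim(back)        # 5 rows, 3 columns
back[1:2, ]      # first two rows
back[, "cyl"]    # one column by name

unlink(path)     # remove the temporary file
```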
A csv file can be read in the following two ways:
- data <- read.table("filename.csv", header = TRUE, sep = ",")
Note: compared with the earlier call, only the file extension (.txt to .csv) and sep (now ",") have changed.
- data <- read.csv("filename.csv", header = TRUE)
All the read.table() parameters are also applicable for read.csv().
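Data can be saved as a csv file in the following two ways:
- write.table(x = data, file = "filename.csv", sep = ",")
- write.csv(x = data, file = "filename.csv")
Most write.table() parameters are also applicable for write.csv(). The exceptions are append, col.names, sep, dec, and qmethod: write.csv() fixes these, and attempts to change them are ignored with a warning.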
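One detail worth seeing in practice is how row names are handled when writing a csv file. The sketch below uses a small made-up data frame and a temporary file; all names in it are illustrative.

```r
path <- tempfile(fileext = ".csv")
scores <- data.frame(name = c("Ana", "Ben"), score = c(91, 78))  # made-up data

# Default: row names are written as an extra, unnamed first column,
# which read.csv() then names "X".
write.csv(scores, file = path)
with_row_names <- read.csv(path)
names(with_row_names)     # "X" "name" "score"

# row.names = FALSE writes only the data columns.
write.csv(scores, file = path, row.names = FALSE)
round_trip <- read.csv(path)
names(round_trip)         # "name" "score"

unlink(path)              # remove the temporary file
```

Passing row.names = FALSE is usually what you want unless the row names carry information.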
R provides its own formats for saving and loading data or other R objects.
The two formats are RData and RDS.
Both formats compress the data by default, so the files use less disk space.
You can save one or more R objects or variables as RData.
# save(objects you want to save, file = "name of the file.RData")
head(cars)
x = 1:10
x
save(list = c("cars", "x"), file = "cars.RData")
# or
save(cars, x, file = "cars.RData")
You can save only one R object or variable in RDS format.
# saveRDS(object you want to save, file = "name of the file.RDS")
saveRDS(cars, file = "cars.RDS")
Note: You can pass compress = FALSE to save() or saveRDS() to skip compression.
# load(file = "filename to load.RData")
load(file = "cars.RData")
# readRDS(file = "filename to read.RDS")
newcar = readRDS(file = "cars.RDS")
Because an RDS file stores the object without its name, you can assign it to a new name when reading it.
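The difference between the two formats is easiest to see in a round trip. This is a minimal sketch using temporary files and made-up objects, so it does not depend on cars.txt.

```r
rdata_path <- tempfile(fileext = ".RData")
rds_path   <- tempfile(fileext = ".rds")

counts <- c(a = 1, b = 2)     # made-up objects to save
label  <- "demo"

# save() stores objects together with their names...
save(counts, label, file = rdata_path)
rm(counts, label)
restored <- load(rdata_path)  # ...and load() recreates them under those names
print(restored)               # names of the objects that were restored

# saveRDS() stores a single object without its name,
# so readRDS() lets you pick the name on reading.
saveRDS(counts, file = rds_path)
my_counts <- readRDS(rds_path)
identical(my_counts, counts)  # TRUE

unlink(c(rdata_path, rds_path))
```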
A package stores functions for a certain domain. For example, the stats package contains statistical functions.
Many functions are stored in the standard packages that are delivered with R. But when you work in a particular domain (DNA microarrays, for instance), you usually need to install additional packages.
Example: install a package named ggplot2. ggplot2 is an R package for producing statistical, or data, graphics. Unlike most other graphics packages, ggplot2 has an underlying grammar, based on the Grammar of Graphics, that allows you to compose graphs by combining independent components.
You can install the ggplot2 R package by running:
install.packages("ggplot2")
# then follow the download and install instructions.
Every time you run R/RStudio, it only loads standard packages. In order to use other installed packages, we need to load them.
library("ggplot2")
Setup steps: create a vector of random numbers:
# Load the required library
library(ggplot2)
# Generate two vectors 'x' and 'y' with 10 random numbers each, in the range [5, 6]
x <- runif(10, min = 5, max = 6)
y <- runif(10, min = 5, max = 6)
# Combine the vectors into a data frame for easier handling with ggplot2
data <- data.frame(x = x, y = y)
# Display the generated data
print(data)
# Create a scatter plot using ggplot2
ggplot(data, aes(x = x, y = y)) +
  geom_point()
The plot will display in R's default plotting window. In RStudio, the plot will appear in the Plots pane (typically in the lower-right corner). You can export it as an image or PDF using the "Export" button.
For further information about the ggplot2 R package, see: https://ggplot2-book.org
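Besides the Export button, a plot can also be saved from code. Here is a minimal sketch using ggsave() from ggplot2, assuming ggplot2 is installed; the data points and the file name scatter.pdf are made up for illustration.

```r
library(ggplot2)

# Rebuild a small scatter plot from made-up points.
pts <- data.frame(x = c(5.1, 5.4, 5.9), y = c(5.7, 5.2, 5.5))
p <- ggplot(pts, aes(x = x, y = y)) +
  geom_point()

# Write the plot to a PDF; ggsave() picks the format from the file extension.
out_file <- file.path(tempdir(), "scatter.pdf")    # hypothetical file name
ggsave(out_file, plot = p, width = 5, height = 4)  # size in inches by default
file.exists(out_file)
```

Saving from code keeps the figure reproducible: rerunning the script regenerates the same file.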
We use conditional statements when we want to execute some commands only when certain conditions are met.
Use an if statement to make a decision and execute different parts of the program based on a condition.
if (condition) {
  # execute this part if the condition holds true
}
All the commands you want to execute when the condition is true should be written within the curly braces.
Example:
n = 10
if ( n %% 2 == 0 ) {
  print(paste(n, "is an even number"))
}
%% is the modulo operator and calculates the remainder. The line if (n %% 2 == 0) in essence asks "Is the remainder of n divided by 2 equal to 0?" If n = 10, then yes, it is. If n = 11, or any other odd number, the remainder is not 0 and the condition is not met.
Examples of conditions:
x <- 20
x > 10
x == 10
x %% 10 == 2
Conditions like the ones above can be used in an if statement, for example:
x <- 10
if (x == 10) {
  y = c(50, 80, 20)
  print(y %/% x) # %/% is integer division: prints 5 8 2
}
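Since %% and %/% are easy to confuse, here is a small sketch that puts them side by side. The numbers are arbitrary.

```r
remainder <- 17 %% 5    # 2: what is left over after dividing 17 by 5
quotient  <- 17 %/% 5   # 3: how many whole times 5 fits into 17

# The two results always recombine into the original number.
quotient * 5 + remainder == 17   # TRUE

# Both operators are vectorized.
c(50, 80, 25) %/% 10   # 5 8 2
c(50, 80, 25) %% 10    # 0 0 5
```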
We saw examples where the condition is true. Suppose we also want to execute different commands when the condition is false.
For that we add an else statement.
if (condition) {
  # execute this part if the condition holds true
} else {
  # execute this part if the condition is false
}
Example:
n <- 101
if (n %% 2 == 0) {
  print(paste(n, "is an even number"))
} else {
  print(paste(n, "is an odd number"))
}
if (condition_1) { # 1-1
  if (condition_2) { # 2-1
    # executes this part when both condition_1 and condition_2 are
    # true
  } else { # 2-2, 3-1
    # executes this part when condition_1 is true but condition_2 is
    # false
  } # 3-2
} # 1-2
Curly braces are important here: they define the scope of each statement. For example, braces 1-1 and 1-2 open and close the first if statement, which means the inner if and else are both part of it and are executed only when the first condition is true.
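A concrete version of the nested pattern above might look like the following sketch. The divisibility checks and the variable names are chosen only for illustration.

```r
n <- 12

if (n %% 2 == 0) {
  # This inner if/else runs only because the outer condition is true.
  if (n %% 3 == 0) {
    msg <- paste(n, "is divisible by both 2 and 3")
  } else {
    msg <- paste(n, "is even but not divisible by 3")
  }
} else {
  msg <- paste(n, "is odd")
}

print(msg)
```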
Another way:
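# Note: at the top level of a script, else must be on the same line as the
# closing brace of the if block; otherwise R reports an unexpected 'else'.
if (condition_1) { # 1-1
  # executes this part when condition_1 is true
} else { # 1-2, 2-1
  if (condition_2) { # 3-1
    # executes this part when condition_1 is false and condition_2 is true
  } # 3-2
} # 2-2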
You can directly use 'else if (condition)':
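if (condition_1) {
  # executes this part when condition_1 is true
} else if (condition_2) {
  # executes this part when condition_1 is false and condition_2 is true
} else {
  # executes this part when neither condition is true
}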
Example:
The letter grades of a class are evaluated based on the numeric grades:
Numeric grade | Letter grade |
---|---|
90-100 | A |
80-89 | B |
70-79 | C |
60-69 | D |
< 60 | F |
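One way to turn the table above into code is a chain of else if branches, checked from the highest grade down. This is a sketch under the assumption that scores are single numbers between 0 and 100; letter_grade is a hypothetical function name.

```r
# Map a numeric grade (0-100) to a letter grade using else-if branches.
# `letter_grade` is a hypothetical helper name.
letter_grade <- function(score) {
  if (score >= 90) {
    "A"
  } else if (score >= 80) {
    "B"
  } else if (score >= 70) {
    "C"
  } else if (score >= 60) {
    "D"
  } else {
    "F"
  }
}

letter_grade(85)   # "B"
letter_grade(59)   # "F"
```

Because the branches are tested in order, each one only needs a lower bound: reaching the "B" branch already implies the score is below 90.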
Xuye Luo