---
title: "Introduction to Toy Datasets"
author:
  - name: Cengiz Zopluoglu
    affiliation: University of Oregon
date: 06/23/2022
output:
  distill::distill_article:
    self_contained: true
    toc: true
    toc_float: true
    theme: theme.css
---
```{r klippy, echo=FALSE, include=TRUE}
klippy::klippy(position=c('top','right'))
```
```{=html}
<style>
.list-group-item.active, .list-group-item.active:focus, .list-group-item.active:hover {
z-index: 2;
color: #fff;
background-color: #FC4445;
border-color: #97CAEF;
}
</style>
```
```{r setup, include=FALSE}
knitr::opts_chunk$set(comment = "",fig.align='center')
require(here)
require(ggplot2)
require(plot3D)
require(kableExtra)
require(knitr)
require(gifski)
require(magick)
options(scipen=99)
```
`r paste('[Updated:',format(Sys.time(),'%a, %b %d, %Y - %H:%M:%S'),']')`

We will use two datasets throughout this course. The first has a continuous outcome, and the second has a binary outcome. We will apply several methods and algorithms to both datasets, which gives us an opportunity to compare and contrast the predictions from different models on the same data.

This section provides background information and context for these two datasets.

# Readability

The readability dataset comes from a recent [Kaggle Competition (CommonLit Readability Prize)](https://www.kaggle.com/c/commonlitreadabilityprize/). You can download the training dataset directly from the competition website, or you can import it from the course website.

```{r, echo=TRUE,eval=knitr::is_html_output(),class.source='klippy',class.source = 'fold-show'}
readability <- read.csv(here('data/readability.csv'),header=TRUE)
str(readability)
```
There are 2834 observations in total, each representing a reading passage. The most important variables are the `excerpt` and `target` columns. The `excerpt` column contains plain text, and the `target` column contains the corresponding readability measure for each excerpt.
```{r, echo=TRUE,eval=knitr::is_html_output(),class.source='klippy',class.source = 'fold-show'}
readability[1,]$excerpt
readability[1,]$target
```
[According to the data owner](https://www.kaggle.com/c/commonlitreadabilityprize/discussion/240423), '*the target value is the result of a Bradley-Terry analysis of more than 111,000 pairwise comparisons between excerpts. Teachers spanning grades 3-12 served as the raters for these comparisons.*' A lower target value indicates a more challenging text to read. The highest target score is equivalent to the 3rd-grade level, while the lowest target score is equivalent to the 12th-grade level. The purpose is to develop a model that predicts a readability score for a given text to identify an appropriate reading level.
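
For readers unfamiliar with the Bradley-Terry model, its standard form is sketched below. The notation is an assumption made for illustration ($\theta_i$ denotes the latent score of excerpt $i$), and the exact specification used by the data owner is not described here. Assuming raters judge which of two excerpts is easier to read, the model expresses the probability that excerpt $i$ is chosen over excerpt $j$ as

$$
P(i \text{ chosen over } j) = \frac{\exp(\theta_i)}{\exp(\theta_i) + \exp(\theta_j)}
$$

Under this form, only the difference $\theta_i - \theta_j$ matters, so the scores are placed on a relative scale with no natural zero point.
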
In the following weeks, we will briefly discuss pre-trained language models (e.g., [RoBERTa](https://arxiv.org/abs/1907.11692)). Our coverage of this material will stay at the surface level. We will primarily cover how to obtain numerical vector representations (sentence embeddings) for a given text input from a pre-trained language model using Python through R. Then, we will use the sentence embeddings as features to predict the target score in this dataset using various modeling frameworks.

# Recidivism

The Recidivism dataset comes from The National Institute of Justice's (NIJ) [Recidivism Forecasting Challenge](https://nij.ojp.gov/funding/recidivism-forecasting-challenge). The challenge aims to increase public safety and improve the fair administration of justice across the United States. This challenge had three stages of prediction, and all three stages required modeling a binary outcome (recidivated vs. not recidivated in Year 1, Year 2, and Year 3). In this class, we will only work on the second stage and develop a model for predicting the probability of an individual's recidivism in the second year after initial release.

You can download the training dataset directly from [the competition website](https://data.ojp.usdoj.gov/Courts/NIJ-s-Recidivism-Challenge-Full-Dataset/ynf5-u8nk), or from the course website. Either way, please read the [Terms of Use at this link](https://data.ojp.usdoj.gov/stories/s/NIJ-s-Recidivism-Challenge-Overview-and-Term-of-Us/gyxv-98b2/) before working with this dataset.
```{r, echo=TRUE,eval=knitr::is_html_output(),class.source='klippy',class.source = 'fold-show'}
recidivism <- read.csv(here('data/recidivism_full.csv'),header=TRUE)
str(recidivism)
```
There are 25,835 observations in the training set and 54 variables, including a unique ID variable, four outcome variables (recidivism in Year 1, recidivism in Year 2, recidivism in Year 3, and recidivism within three years), and a filter variable that indicates whether an observation was included in the training dataset or the test dataset. The remaining 48 variables are potential predictive features. A complete list of these variables can be found at [this link](https://nij.ojp.gov/funding/recidivism-forecasting-challenge#recidivism-forecasting-challenge-database-fields-defined).

We will develop a model to predict the outcome variable `Recidivism_Arrest_Year2` using the 48 potential predictive variables. Before moving forward, we must remove the individuals who had already recidivated in Year 1. As shown below, about 29.9% of the individuals recidivated in Year 1, so I remove them from the dataset.

```{r, echo=TRUE,eval=knitr::is_html_output(),class.source='klippy',class.source = 'fold-show'}
table(recidivism$Recidivism_Arrest_Year1)
recidivism2 <- recidivism[recidivism$Recidivism_Arrest_Year1 == FALSE,]
```
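
The filtering step above can be checked on a small example. The following is a minimal sketch on invented toy data (the data frame `toy` and its values are hypothetical, not taken from the NIJ file); it shows how `prop.table()` turns the counts from `table()` into proportions and how to confirm that the filter kept the expected number of rows.

```r
# Toy stand-in for the Year 1 outcome column (values invented for illustration)
toy <- data.frame(
  ID = 1:10,
  Recidivism_Arrest_Year1 = c(TRUE, FALSE, FALSE, TRUE, FALSE,
                              FALSE, TRUE, FALSE, FALSE, FALSE)
)

# Counts and proportions of the Year 1 outcome
counts <- table(toy$Recidivism_Arrest_Year1)
props  <- prop.table(counts)
print(round(100 * props, 1))

# Keep only the individuals who did not recidivate in Year 1
toy2 <- toy[toy$Recidivism_Arrest_Year1 == FALSE, ]

# The number of retained rows should equal the number of FALSE values
stopifnot(nrow(toy2) == sum(!toy$Recidivism_Arrest_Year1))
```

If the outcome column contained missing values, subsetting with a logical condition would return rows filled with `NA`; wrapping the condition in `which()` is one way to avoid that.
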
I will also recode some variables before saving the new dataset for later use in class.
- First, some variables in the dataset are coded as TRUE and FALSE. When these variables are imported into R, R automatically recognizes them as logical variables. I will recode all these variables such that FALSE = 0 and TRUE = 1.
```{r, echo=TRUE,eval=knitr::is_html_output(),class.source='klippy',class.source = 'fold-show'}
# Find the columns recognized as logical
cols <- sapply(recidivism2, is.logical)
# Convert them to numeric 0s and 1s
recidivism2[,cols] <- lapply(recidivism2[,cols], as.numeric)
```
- Second, the highest value for some variables is coded as **3 or more**, **4 or more**, **10 or more**, etc. These variables can be treated as numeric, but R reads them as character vectors because of the phrase **or more** in the highest category. We will recode these variables so that 'X or more' is equal to X.
```{r, echo=TRUE,eval=knitr::is_html_output(),class.source='klippy',class.source = 'fold-show',warning=FALSE,message=FALSE}
require(dplyr)
# Dependents
recidivism2$Dependents <- recode(recidivism2$Dependents,
'0'=0,
'1'=1,
'2'=2,
'3 or more'=3)
# Prior Arrest Episodes Felony
recidivism2$Prior_Arrest_Episodes_Felony <- recode(recidivism2$Prior_Arrest_Episodes_Felony,
'0'=0,
'1'=1,
'2'=2,
'3'=3,
'4'=4,
'5'=5,
'6'=6,
'7'=7,
'8'=8,
'9'=9,
'10 or more'=10)
# Prior Arrest Episodes Misd
recidivism2$Prior_Arrest_Episodes_Misd <- recode(recidivism2$Prior_Arrest_Episodes_Misd,
'0'=0,
'1'=1,
'2'=2,
'3'=3,
'4'=4,
'5'=5,
'6 or more'=6)
# Prior Arrest Episodes Violent
recidivism2$Prior_Arrest_Episodes_Violent <- recode(recidivism2$Prior_Arrest_Episodes_Violent,
'0'=0,
'1'=1,
'2'=2,
'3 or more'=3)
# Prior Arrest Episodes Property
recidivism2$Prior_Arrest_Episodes_Property <- recode(recidivism2$Prior_Arrest_Episodes_Property,
'0'=0,
'1'=1,
'2'=2,
'3'=3,
'4'=4,
'5 or more'=5)
# Prior Arrest Episodes Drug
recidivism2$Prior_Arrest_Episodes_Drug <- recode(recidivism2$Prior_Arrest_Episodes_Drug,
'0'=0,
'1'=1,
'2'=2,
'3'=3,
'4'=4,
'5 or more'=5)
# Prior Arrest Episodes PPViolationCharges
recidivism2$Prior_Arrest_Episodes_PPViolationCharges <- recode(recidivism2$Prior_Arrest_Episodes_PPViolationCharges,
'0'=0,
'1'=1,
'2'=2,
'3'=3,
'4'=4,
'5 or more'=5)
# Prior Conviction Episodes Felony
recidivism2$Prior_Conviction_Episodes_Felony <- recode(recidivism2$Prior_Conviction_Episodes_Felony,
'0'=0,
'1'=1,
'2'=2,
'3 or more'=3)
# Prior Conviction Episodes Misd
recidivism2$Prior_Conviction_Episodes_Misd <- recode(recidivism2$Prior_Conviction_Episodes_Misd,
'0'=0,
'1'=1,
'2'=2,
'3'=3,
'4 or more'=4)
# Prior Conviction Episodes Prop
recidivism2$Prior_Conviction_Episodes_Prop <- recode(recidivism2$Prior_Conviction_Episodes_Prop,
'0'=0,
'1'=1,
'2'=2,
'3 or more'=3)
# Prior Conviction Episodes Drug
recidivism2$Prior_Conviction_Episodes_Drug <- recode(recidivism2$Prior_Conviction_Episodes_Drug,
'0'=0,
'1'=1,
'2 or more'=2)
# Delinquency Reports
recidivism2$Delinquency_Reports <- recode(recidivism2$Delinquency_Reports,
'0'=0,
'1'=1,
'2'=2,
'3'=3,
'4 or more'=4)
# Program Attendances
recidivism2$Program_Attendances <- recode(recidivism2$Program_Attendances,
'0'=0,
'1'=1,
'2'=2,
'3'=3,
'4'=4,
'5'=5,
'6'=6,
'7'=7,
'8'=8,
'9'=9,
'10 or more'=10)
# Program Unexcused Absences
recidivism2$Program_UnexcusedAbsences <- recode(recidivism2$Program_UnexcusedAbsences,
'0'=0,
'1'=1,
'2'=2,
'3 or more'=3)
# Residence Changes
recidivism2$Residence_Changes <- recode(recidivism2$Residence_Changes,
'0'=0,
'1'=1,
'2'=2,
'3 or more'=3)
#############################################################
str(recidivism2)
```
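
The two recoding steps above can also be tried on a small example. The following is a minimal sketch on invented toy data, not the code used to build the course dataset. The column `some_flag` and the helper `strip_or_more()` are hypothetical names introduced for illustration; the helper is a pattern-based alternative to listing every level in `recode()`.

```r
# Toy data frame mimicking the two coding patterns (values invented)
toy <- data.frame(
  some_flag  = c(TRUE, FALSE, NA, TRUE),
  Dependents = c('0', '3 or more', '1', '2'),
  stringsAsFactors = FALSE
)

# Step 1: convert every logical column to numeric 0/1 (NA stays NA)
is_lgl <- vapply(toy, is.logical, logical(1))
toy[is_lgl] <- lapply(toy[is_lgl], as.numeric)

# Step 2: hypothetical helper that drops a trailing ' or more'
# and converts the remaining text to a number
strip_or_more <- function(x) {
  as.numeric(sub(' or more$', '', x))
}
toy$Dependents <- strip_or_more(toy$Dependents)

str(toy)
```

A helper like this handles any 'X or more' label with one line per variable, but it silently turns any unexpected label into `NA` (with a warning), whereas the explicit `recode()` calls document every level that is expected to appear.
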
Now, we export the final version of the dataset.
```{r, echo=TRUE,eval=knitr::is_html_output(),class.source='klippy',class.source = 'fold-show'}
write.csv(recidivism2,
here('data/recidivism_y1 removed and recoded.csv'),
row.names = FALSE)
```
In future weeks, we will work with this version of the dataset.

# Installing Reticulate, Miniconda, and Sentence Transformers

You will need to install the `reticulate` package and the `sentence_transformers` module for the following weeks. You can run the code below on your computer to prepare. Note that you only have to run it once to install the necessary packages.

If you have trouble installing these packages on your computer, I highly recommend using a Kaggle R notebook, in which these packages are already installed (I will give more information about this in class).

```{r, echo=TRUE,eval=FALSE,class.source='klippy',class.source = 'fold-show'}
# Install the reticulate package
install.packages(pkgs = 'reticulate',
dependencies = TRUE)
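# Install Miniconda (the function belongs to reticulate, which is not loaded yet,
# so it is called with the namespace prefix)
reticulate::install_miniconda()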
```
Once you install the reticulate package, run the following code to list your conda environments and confirm that everything is properly installed.
```{r, echo=TRUE,eval=TRUE,class.source='klippy',class.source = 'fold-show'}
# Load the reticulate package
require(reticulate)
conda_list()
```
You should see `r-reticulate` under the name column as one of your virtual Python environments. Finally, you will also need to install the sentence transformers module. The following code installs it into the virtual Python environment `r-reticulate`.
```{r, echo=TRUE,eval=TRUE,class.source='klippy',class.source = 'fold-show'}
# Install the sentence transformer module
use_condaenv('r-reticulate')
conda_install(envname = 'r-reticulate',
packages = 'sentence_transformers',
pip = TRUE)
# try pip=FALSE, if it gives an error message
```
Once you have installed the Python packages using the code above, you can run the following code. If you see the same output as below, you should be all set to explore some very exciting NLP tools using the Readability dataset.
```{r, echo=TRUE,eval=TRUE,class.source='klippy',class.source = 'fold-show',results='hold'}
require(reticulate)
# Import the sentence transformer module
reticulate::import('sentence_transformers')
```
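
Once the import succeeds, obtaining sentence embeddings for a few texts might look like the sketch below. This is only an illustration under stated assumptions, not the workflow we will follow in class: the helper `embed_texts()` is a hypothetical name, and the model name `'all-MiniLM-L6-v2'` is an assumed example of a pre-trained model (it is downloaded on first use, so an internet connection is required). The code runs the embedding step only when both `reticulate` and the Python module are available.

```r
# Hypothetical helper: returns a matrix with one row per input text
embed_texts <- function(texts, model_name = 'all-MiniLM-L6-v2') {
  st    <- reticulate::import('sentence_transformers')
  model <- st$SentenceTransformer(model_name)
  # as.list() ensures a Python list is passed even for a single text
  model$encode(as.list(texts))
}

# Run only when reticulate and the Python module are both available
if (requireNamespace('reticulate', quietly = TRUE) &&
    reticulate::py_module_available('sentence_transformers')) {
  emb <- embed_texts(c('The cat sat on the mat.',
                       'Photosynthesis converts light into chemical energy.'))
  print(dim(emb))  # rows = texts, columns = embedding dimensions
}
```

Each row of the resulting matrix is the numerical representation of one text, which is the kind of feature vector we will later use to predict the `target` score.
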