---
title: "Module 2 Assignment on Linear Regression - 2 - V1"
author: "Yangxin Fan, Graduate Student"
date: "02/21/2020"
#output: pdf_document
output:
  pdf_document: default
  df_print: paged
#html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, tidy=TRUE, tidy.opts=list(width.cutoff=80))
```
***
**Read and Delete This Part When Typing**
- Give a name to this rmd file: `ModuleNumber_ModuleName_HWSubmission_FirstName_LastName` (for example, `Module0_Reviews_HWSubmission_Yusuf_Bilgic.rmd`).
- First, read the slides, review the notes, and run the lab codes. Then do the assignment, type the solution here. Knit (generate the pdf) the file. Check if it looks good.
- **Especially this Module 2, find a pair, work together, split parts among each of you, explain each other, make sure you understand all pair solutions, combine solutions, and submit separately. It is fine if your codes and results are the same. I expect comments will be your own.**
- You will then submit two files to Blackboard:
1) `ModuleNumber_ModuleName_HWSubmission_FirstName_LastName.pdf` and
2) `ModuleNumber_ModuleName_HWSubmission_FirstName_LastName.rmd`.
- Grading will be based on the pdf file uploaded (avoid uploading extra docs). Make it easy to read. The grader or I may take a look at the rmd file.
- Unless otherwise specified, use a 5% level for statistical significance.
- Always include your comments on results: don't just leave numbers without explanation. Use full sentences, structured paragraphs if needed, correct grammar, and proofread your work.
- Don't include irrelevant or uncommented outputs, and don't include all of your code: use `echo=FALSE, results='hide'` most of the time. You can include the code when it makes your solution easier to follow. Also, include useful results. Try to call the outputs from $'r~xyz'$.
- Show your knowledge with detailed work in `consistency` with course materials though tons of other ways may exist.
- Each part is 1 pt, so the total is 20 pts (4 pts is the baseline score). If the response is not complete or does not reflect the correct answer as expected, you may still earn 0.5 or 0.0 pts. Your TA will grade your work. For any questions, write directly to your TA and cc me. Visit my office hours on TWR. Thanks!
***
\newpage{}
***
## Module Assignment Questions
In this assignment, you will use the `Auto` data set with $7$ variables (one response, `mpg`, and six numerical predictors) and $n=392$ vehicles. For the sake of simplicity, categorical variables were excluded. Before each randomization, use `set.seed(99)` so that the test results are comparable.
## Q1) (*Forward and Backward Selection*)
In `Module 1 Assignment`, `Q2`, you fitted `Model 3` with `mpg` as the response and the six numerical variables as predictors. This question involves the use of `forward` and `backward` selection methods on the same data set.
a. Using `OLS`, fit the model with all predictors on `mpg`. Report the predictors' coefficient estimates, $R_{adj}$, and $MSE$. Note: The method in `lm()` is called ordinary least squares (OLS).
```{r eval=FALSE, echo=TRUE, results='hide'}
#This is setup to start
library(ISLR)
Model_3 = mpg ~ horsepower+year+cylinders+displacement+weight+acceleration
Model_3.fit = lm(Model_3, data=Auto)
summary(Model_3.fit)
# Or, prefer this restructuring way
# by excluding categorical variables:
# Make sure AutoNum is a data.frame
AutoNum = Auto[, !(colnames(Auto) %in% c("origin", "name"))]
Model_Full = mpg ~ . #you can write models in this way to call later
Model_Full.fit = lm(Model_Full, data=AutoNum)
summary(Model_Full.fit)
```
b. Using `forward selection method` from `regsubsets()` and `method="forward"`, fit MLR models and select the `best` subset of predictors. Report the best model obtained from the default setting by including the predictors' coefficient estimates, $R_{adj}$, and $MSE$.
```{r echo=TRUE, eval=FALSE}
# helpful code from the r lab: review it
Model_Full = mpg ~ .
regfit.m1=regsubsets(Model_Full, data=AutoNum, nbest=1,
nvmax=6, method="forward")
reg.summary=summary(regfit.m1)
reg.summary
names(reg.summary)
reg.summary$adjr2
coef(regfit.m1, 1:6) #coefficients of all models built along the forward path
```
c. What criterion was employed to find the best subset? What other criteria exist? Explain.
d. Using `backward selection method` from `regsubsets()` and `method="backward"`, fit MLR models and select the `best` subset of predictors. Report the best model obtained from the default setting by including predictors, their coefficient estimates, $R_{adj}$, and $MSE$.
e. Compare the results obtained from `OLS`, `forward` and `backward` selection methods (parts a, b and d): What changed? Which one(s) is better? Comment and justify.
## Q2) (*Cross-Validated with k-Fold*)
What changes in the model selection results and the coefficient estimates when a cross-validation approach is employed? Specifically, we will use $k$-fold cross-validation (`k-fold CV`) here.
a. Using the $5$-fold CV approach, fit the OLS MLR model on `mpg` including all the predictors. Report all the predictors' coefficient estimates, $MSE_{train}$, and $MSE_{test}$.
b. Using the $5$-fold CV approach and `forward selection method`, fit MLR models on `mpg` and select the `best` subset of predictors. Report the best model obtained from the default setting by including the predictors' coefficient estimates, the averaged $MSE_{train}$, and the averaged $MSE_{test}$.
c. Compare the $MSE_{test}$'s. Explain.
d. Using the $5$-fold CV approach and `backward selection method`, fit MLR models on `mpg` and select the `best` subset of predictors. Report the best model obtained from the default setting by including the predictors' coefficient estimates, the averaged $MSE_{train}$, and the averaged $MSE_{test}$.
e. Did you come up with a different model in parts b and d? Are the predictors and their coefficient estimates the same? Compare and explain.
f. Which fitted model is better among parts a, b, and d? Why? Justify.
## Q3) (*Shrinkage Methods*)
Results for the `OLS`, `lasso`, and `ridge` regression methods can be comparable. Now, you are expected to observe that the lasso may reduce some coefficients exactly to zero (so those features are eliminated), while ridge regression shrinks the coefficients toward small values without eliminating them.
In this exercise, you will analyze these estimation and prediction methods (OLS, ridge, lasso) on `mpg` in the `Auto` data set using the $k$-fold cross-validation test approach.
a. Fit a ridge regression model on the entire data set (including all six predictors; don't use any validation approach yet), with the optimal $\lambda$ chosen by `cv.glmnet()`. Report $\hat \lambda$, the predictors' coefficient estimates, and $MSE$.
b. Fit a lasso regression model on the entire data set (including all six predictors; don't use any validation approach yet), with the optimal $\lambda$ chosen by `cv.glmnet()`. Report $\hat \lambda$, the predictors' coefficient estimates, and $MSE$.
c. Compare parts a and b of Q3 to part a of Q1. What changed? Comment.
d. How accurately can we predict `mpg`? Using the three methods (OLS, ridge and lasso) with all predictors, you will fit and test using $5$-fold cross-validation approach with the optimal $\lambda$ chosen by `cv.glmnet()`. For each, report the averaged train and test errors ($MSE_{train}$, $MSE_{test}$):
1) Fit an `OLS` model.
2) Fit a `ridge` regression model.
3) Fit a `lasso` regression model.
e. Write an overall report on part d by addressing the inquiry, `how accurately can we predict mpg?`. Is there much difference among the test errors resulting from these three approaches? Show your comprehension.
f. (BONUS) Propose a different model (or set of models) that seem to perform well on this data set, and justify your answer.
g. (BONUS) Include the categorical variables in the models you built in part d, Q3. Report.
h. (GOLDEN BONUS) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using $5$-fold cross-validation approach. You can transform the data, scale and try any methods. When $MSE_{test}$ is the lowest (under the setting of Q3, part d) in the class, your HW assignment score will be 100% (20 pts).
i. (BONUS) You can make a hybrid design in model selection using all the methods here in a way that yields better results. Show your work, justify and obtain better results in part d, Q3.
\newpage
***
## Your Solutions
## Q1)
Part a:
```{r echo=TRUE}
library(ISLR)
AutoNum = Auto[, !(colnames(Auto) %in% c("origin", "name"))]
Model_Full = mpg ~ . #you can write models in this way to call later
Model_Full.fit = lm(Model_Full, data=AutoNum)
SSE = sum(Model_Full.fit$residuals^2)  # residual sum of squares
MSE = SSE / 392                        # training MSE over n = 392 observations
adj.r = summary(Model_Full.fit)$adj.r.squared
MSE
adj.r
```
The MSE is 11.59017 and the adjusted R-squared is 0.8062826.
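Part a also asks for the coefficient estimates; they can be displayed from the fitted object above (not printed in the chunk) as follows:
```{r echo=TRUE}
# coefficient estimates of the full OLS fit
coef(Model_Full.fit)
```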
***
Part b:
```{r echo=TRUE}
library(leaps)
Model_Full = mpg ~ .
regfit.m1=regsubsets(Model_Full, data=AutoNum, nvmax=6, method="forward")
reg.summary=summary(regfit.m1)
reg.summary
which.max(reg.summary$adjr2)
coef(regfit.m1, 2)
reg.summary$adjr2[2]
MSE = reg.summary$rss[2]/392
MSE
```
We see that the model with 2 variables has the highest adjusted R-squared. Hence, the best subset of predictors is weight and year. The coefficient estimates of weight and year are -0.006632075 and 0.757318281, respectively. The MSE of this model is 11.65549 and its adjusted R-squared is 0.8071941.
***
Part c:
Adjusted R-squared was used to find the best subset. Other criteria include BIC, AIC, Mallows' Cp, the validation-set approach, and cross-validation. The first three penalize model size analytically, while the last two estimate the test error directly by holding out data.
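As an illustration, a minimal sketch using the `reg.summary` object from part b: each criterion can be extracted from the forward-selection path and the corresponding "best" model size identified.
```{r echo=TRUE}
# model size suggested by each criterion along the forward-selection path
which.max(reg.summary$adjr2)  # adjusted R-squared: larger is better
which.min(reg.summary$cp)     # Mallows' Cp: smaller is better
which.min(reg.summary$bic)    # BIC: smaller is better
```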
***
Part d:
```{r echo=TRUE}
library(leaps)
Model_Full = mpg ~ .
regfit.m2=regsubsets(Model_Full, data=AutoNum,nvmax=6, method="backward")
reg.summary=summary(regfit.m2)
reg.summary
which.max(reg.summary$adjr2)
coef(regfit.m2, 2)
MSE = reg.summary$rss[2]/392
MSE
```
We see that the model with 2 variables has the highest adjusted R-squared. Hence, the best subset of predictors is again weight and year, with coefficient estimates -0.006632075 and 0.757318281, respectively. The MSE of this model is 11.65549 and its adjusted R-squared is 0.8071941, matching the forward-selection result in part b.
***
Part e:
Parts b and d select the same model, and both are better than part a. The models in b and d include the same two variables, weight and year, and have a higher adjusted R-squared and a lower BIC than the full model in part a. The full model in part a most likely overfits the training data.
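To make the comparison concrete, a small sketch (assuming the `reg.summary` object from part d is still in scope) that contrasts the two-variable model with the full six-variable model on adjusted R-squared and BIC:
```{r echo=TRUE}
# adjusted R-squared and BIC for the 2-variable vs. the full 6-variable model
reg.summary$adjr2[c(2, 6)]
reg.summary$bic[c(2, 6)]
```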
***
\newpage
## Q2)
Part a:
```{r echo=TRUE}
k = 5
set.seed(1)
folds = sample(1:k,nrow(Auto),replace = TRUE)
test.errors = rep(0,5)
train.errors = rep(0,5)
for(j in 1:k){
Model_3 = mpg ~ horsepower+year+cylinders+displacement+weight+acceleration
train.fit = lm(Model_3, data=Auto[folds!=j,])          # fit on the 4 training folds
pred = predict(train.fit,Auto[folds==j,])              # predict the held-out fold
test.errors[j] = mean((Auto$mpg[folds==j]-pred)^2)     # test MSE for fold j
pred2 = predict(train.fit,Auto[folds!=j,])             # predictions on the training folds
train.errors[j] = mean((Auto$mpg[folds!=j]-pred2)^2)   # training MSE for fold j
}
mean(test.errors)
mean(train.errors)
Model_3 = mpg ~ horsepower+year+cylinders+displacement+weight+acceleration
train.fit = lm(Model_3, data=Auto)
coef(train.fit)
```
The coefficient estimates (from refitting the model on the full data set) of cylinders, displacement, horsepower, weight, acceleration, and year are -0.3298591, 7.678430e-03, -3.913556e-04, -6.794618e-03, 8.527325e-02, and 7.533672e-01, respectively.
The averaged training MSE (computed on the folds used for fitting) is 11.51806 and the averaged test MSE (computed on the held-out folds) is 12.35496.
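The coefficients above come from refitting the model on the full data set. If fold-specific estimates are wanted instead, a minimal sketch (my own addition, averaging the estimates over the five training folds) is:
```{r echo=TRUE}
# fit the full model on each training fold and average the coefficient estimates
fold.coefs = sapply(1:k, function(j) coef(lm(Model_3, data = Auto[folds != j, ])))
rowMeans(fold.coefs)
```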
***
Part b:
```{r echo=TRUE}
predict.regsubsets = function(object, newdata, id){
form=as.formula(object$call[[2]])
mat=model.matrix(form,newdata)
coefi=coef(object,id=id)
xvars=names(coefi)
mat[,xvars]%*%coefi
}
k = 5
set.seed(1)
folds = sample(1:k,nrow(Auto),replace = TRUE)
cv.errors_train = matrix(NA,k,6,dimnames=list(NULL,paste(1:6)))
cv.errors_test = matrix(NA,k,6,dimnames=list(NULL,paste(1:6)))
for(j in 1:k){
best.fit=regsubsets(mpg ~ horsepower+year+cylinders+displacement+weight+acceleration, data=Auto[folds!=j,],nvmax=6,method="forward")
for(i in 1:6){
pred = predict.regsubsets(best.fit,Auto[folds==j,],id=i)
pred2 = predict.regsubsets(best.fit,Auto[folds!=j,],id=i)
cv.errors_test[j,i]=mean((Auto$mpg[folds==j]-pred)^2)
cv.errors_train[j,i]=mean((Auto$mpg[folds!=j]-pred2)^2)
}
}
mean.MSE.train = apply(cv.errors_train,2,mean)
mean.MSE.test = apply(cv.errors_test,2,mean)
mean.MSE.train
mean.MSE.test
reg.best = regsubsets(mpg ~ horsepower+year+cylinders+displacement+weight+acceleration, data=Auto,nvmax=6,method="forward")
coef(reg.best,2)
```
The best prediction model (lowest averaged test MSE) is the model with the 2 predictors weight and year. The averaged training MSE is 11.62668 and the averaged test MSE is 12.03298. The coefficient estimates of year and weight (from refitting on the full data set) are 0.757318281 and -0.006632075, respectively.
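To confirm programmatically which model size minimizes the averaged test MSE (not shown above), one can use:
```{r echo=TRUE}
# model size with the lowest averaged test MSE, and its value
which.min(mean.MSE.test)
mean.MSE.test[which.min(mean.MSE.test)]
```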
***
Part c:
Comparing the test MSEs of parts a and b, part b's test MSE (12.03298) is lower than part a's (12.35496). Although the model in part a performs better on the training data (lower training MSE), it appears to overfit; the model in part b is the better choice.
***
Part d:
```{r echo=TRUE}
predict.regsubsets = function(object, newdata, id){
form=as.formula(object$call[[2]])
mat=model.matrix(form,newdata)
coefi=coef(object,id=id)
xvars=names(coefi)
mat[,xvars]%*%coefi
}
k = 5
set.seed(1)
folds = sample(1:k,nrow(Auto),replace = TRUE)
cv.errors_train = matrix(NA,k,6,dimnames=list(NULL,paste(1:6)))
cv.errors_test = matrix(NA,k,6,dimnames=list(NULL,paste(1:6)))
for(j in 1:k){
best.fit=regsubsets(mpg ~ horsepower+year+cylinders+displacement+weight+acceleration, data=Auto[folds!=j,],nvmax=6,method="backward")
for(i in 1:6){
pred = predict.regsubsets(best.fit,Auto[folds==j,],id=i)
pred2 = predict.regsubsets(best.fit,Auto[folds!=j,],id=i)
cv.errors_test[j,i]=mean((Auto$mpg[folds==j]-pred)^2)
cv.errors_train[j,i]=mean((Auto$mpg[folds!=j]-pred2)^2)
}
}
mean.MSE.train = apply(cv.errors_train,2,mean)
mean.MSE.test = apply(cv.errors_test,2,mean)
mean.MSE.train
mean.MSE.test
reg.best = regsubsets(mpg ~ horsepower+year+cylinders+displacement+weight+acceleration, data=Auto,nvmax=6,method="backward")
coef(reg.best,2)
```
As with forward selection, the best prediction model (lowest averaged test MSE) is the model with the 2 predictors weight and year. The averaged training MSE is 11.62668 and the averaged test MSE is 12.03298. The coefficient estimates of year and weight (from refitting on the full data set) are 0.757318281 and -0.006632075, respectively.
***
Part e:
The models in parts b and d are the same: they have the same test MSE, training MSE, and coefficient estimates.
***
Part f:
The models from parts b and d are identical, and both are better than the model from part a because they have a lower test MSE under five-fold cross-validation.
***
\newpage
## Q3)
Part a:
```{r echo=TRUE}
library(glmnet)
set.seed(1)
x=model.matrix(mpg ~ horsepower+year+cylinders+displacement+weight+acceleration,data=Auto)[,-1]
y=Auto$mpg
cv.out=cv.glmnet(x,y,alpha=0)
plot(cv.out)
bestlam=cv.out$lambda.min
bestlam
cv.out
ridge.mod=glmnet(x,y,alpha=0,lambda=bestlam)
coef(ridge.mod)
```
The optimal $\lambda$ is 0.6487382 and its corresponding cross-validated MSE is 12.46. The coefficient estimates of horsepower, year, cylinders, displacement, weight, and acceleration are -0.022405596, 0.660273457, -0.450035139, -0.006941634, -0.004033276, and -0.069233569, respectively.
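The cross-validated MSE of 12.46 was read off the printed `cv.out` object; it can also be extracted directly (a small sketch using fields that `cv.glmnet()` returns):
```{r echo=TRUE}
# cross-validated MSE at the lambda that minimizes the CV error
cv.out$cvm[which(cv.out$lambda == cv.out$lambda.min)]
```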
***
Part b:
```{r echo=TRUE}
library(glmnet)
set.seed(1)
x=model.matrix(mpg ~ horsepower+year+cylinders+displacement+weight+acceleration,data=Auto)[,-1]
y=Auto$mpg
cv.out=cv.glmnet(x,y,alpha=1)
plot(cv.out)
bestlam=cv.out$lambda.min
bestlam
cv.out
lasso.mod=glmnet(x,y,alpha=1,lambda=bestlam)
coef(lasso.mod)
```
The optimal $\lambda$ is 0.03889083 and its corresponding cross-validated MSE is 11.93. Two of the variables, horsepower and displacement, have coefficients shrunk exactly to zero. The coefficient estimates of year, cylinders, weight, and acceleration are 0.739440725, -0.090905726, -0.006378665, and 0.049887207, respectively.
***
Part c:
Comparing parts a and b, we find that:
1. The lasso (part b) has a lower cross-validated MSE than ridge (part a), which suggests that the lasso may be the better model here.
2. The lasso helps with feature selection by setting two of the coefficients exactly to zero, while in ridge regression all coefficients remain non-zero.
3. Interestingly, the coefficient of acceleration is positive in the lasso fit but negative in the ridge fit.
***
Part d:
```{r echo=TRUE}
# Ridge and Lasso with 5-fold CV; lambdas fixed at the cv.glmnet() optima from parts a and b (OLS results are taken from Q2)
k = 5
set.seed(1)
x=model.matrix(mpg ~ horsepower+year+cylinders+displacement+weight+acceleration,data=Auto)[,-1]
y=Auto$mpg
folds = sample(1:k,nrow(x),replace = TRUE)
Lasso.test.errors = rep(0,5)
Lasso.train.errors = rep(0,5)
Ridge.test.errors = rep(0,5)
Ridge.train.errors = rep(0,5)
for(j in 1:k){
lasso.fit = glmnet(as.matrix(x[folds!=j,]),y[folds!=j],alpha=1,lambda=0.03889083)
pred = predict(lasso.fit,x[folds==j,])
Lasso.test.errors[j] = mean((y[folds==j]-pred)^2)
pred2 = predict(lasso.fit,x[folds!=j,])
Lasso.train.errors[j] = mean((y[folds!=j]-pred2)^2)
ridge.fit = glmnet(x[folds!=j,],y[folds!=j],alpha=0,lambda=0.6487382)
pred = predict(ridge.fit,x[folds==j,])
Ridge.test.errors[j] = mean((y[folds==j]-pred)^2)
pred2 = predict(ridge.fit,x[folds!=j,])
Ridge.train.errors[j] = mean((y[folds!=j]-pred2)^2)
}
Lasso.test.errors
mean(Lasso.test.errors)
Lasso.train.errors
mean(Lasso.train.errors)
Ridge.test.errors
mean(Ridge.test.errors)
Ridge.train.errors
mean(Ridge.train.errors)
```
1. OLS model: the results can be taken from Q2. The model with the two variables weight and year gives a 5-fold CV averaged training MSE of 11.62668 and test MSE of 12.03298; the full six-variable model gives an averaged training MSE of 11.51806 and test MSE of 12.35496.
2. Ridge regression: the 5-fold CV averaged training MSE is 12.09008 and the averaged test MSE is 12.75366.
3. Lasso regression: the 5-fold CV averaged training MSE is 11.57704 and the averaged test MSE is 12.20715.
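For completeness, the OLS fold errors could also be computed directly within the same folds used above for ridge and lasso, instead of carrying the numbers over from Q2; a minimal sketch (numbers not verified here):
```{r echo=TRUE}
# OLS train/test MSE on the same 5 folds used for ridge and lasso above
OLS.test.errors = rep(0, k)
OLS.train.errors = rep(0, k)
for (j in 1:k) {
  ols.fit = lm(mpg ~ horsepower+year+cylinders+displacement+weight+acceleration,
               data = Auto[folds != j, ])
  OLS.test.errors[j] = mean((Auto$mpg[folds == j] - predict(ols.fit, Auto[folds == j, ]))^2)
  OLS.train.errors[j] = mean((Auto$mpg[folds != j] - predict(ols.fit, Auto[folds != j, ]))^2)
}
mean(OLS.train.errors)
mean(OLS.test.errors)
```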
***
Part e:
There is some difference among the test errors of these three approaches.
In terms of test MSE (lower is better), the ordering from most to least accurate is: OLS with the best subset (weight and year), then lasso regression, then ridge regression. If the full six-variable OLS model is used instead, the ordering becomes lasso regression, then OLS, then ridge regression.
The best-performing model is OLS with the two variables weight and year (test MSE 12.03298), followed by lasso regression (test MSE 12.20715). Overall, mpg can be predicted with a test MSE of roughly 12, i.e., a root-MSE of about 3.5 mpg.
Note that the optimal lambdas for both lasso and ridge were chosen with cv.glmnet(), whose default is 10-fold CV, whereas the models in part d were fit and tested with 5-fold CV.
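As noted, the $\lambda$ values used in part d were selected with `cv.glmnet()`'s default 10-fold CV on the full data. An alternative (a sketch, my own variation rather than what was done above) is to re-select $\lambda$ within each training fold using `nfolds = 5`:
```{r echo=TRUE, eval=FALSE}
# re-select lambda inside each training fold with 5-fold CV (lasso shown; use alpha=0 for ridge)
for (j in 1:k) {
  cv.in = cv.glmnet(x[folds != j, ], y[folds != j], alpha = 1, nfolds = 5)
  fit.in = glmnet(x[folds != j, ], y[folds != j], alpha = 1, lambda = cv.in$lambda.min)
  pred = predict(fit.in, x[folds == j, ])
  Lasso.test.errors[j] = mean((y[folds == j] - pred)^2)
}
mean(Lasso.test.errors)
```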
***
Part f (bonus):
```{r echo=TRUE}
library(e1071)
AutoNum = Auto[, !(colnames(Auto) %in% c("origin", "name"))]
tc = tune.control(sampling = c("cross"), cross = 5)
tune.out=tune(svm,mpg~.,data=AutoNum,kernel="linear", ranges=list(cost=c(0.5,0.75,1,1.25,1.5,1.75,2)),tunecontrol = tc)
summary(tune.out)
```
I propose SVM (support vector machine) regression with a linear kernel as another reasonable method; it achieves a test MSE of 12.80147 under five-fold cross-validation.
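The test MSE of 12.80147 was read from `summary(tune.out)`; if desired, the best cross-validated error and the corresponding cost can also be pulled directly from the `tune` object:
```{r echo=TRUE}
# best cross-validated error (MSE for regression) and the cost that achieved it
tune.out$best.performance
tune.out$best.parameters
```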
***
Part h (Golden Bonus):
```{r echo=TRUE}
# Ridge and Lasso using only the two predictors weight and year
k = 5
set.seed(1)
x=model.matrix(mpg ~ year+weight,data=Auto)[,-1]
y=Auto$mpg
folds = sample(1:k,nrow(x),replace = TRUE)
Lasso.test.errors = rep(0,5)
Lasso.train.errors = rep(0,5)
Ridge.test.errors = rep(0,5)
Ridge.train.errors = rep(0,5)
for(j in 1:k){
lasso.fit = glmnet(as.matrix(x[folds!=j,]),y[folds!=j],alpha=1,lambda=0.03889083)
pred = predict(lasso.fit,x[folds==j,])
Lasso.test.errors[j] = mean((y[folds==j]-pred)^2)
pred2 = predict(lasso.fit,x[folds!=j,])
Lasso.train.errors[j] = mean((y[folds!=j]-pred2)^2)
ridge.fit = glmnet(x[folds!=j,],y[folds!=j],alpha=0,lambda=0.6487382)
pred = predict(ridge.fit,x[folds==j,])
Ridge.test.errors[j] = mean((y[folds==j]-pred)^2)
pred2 = predict(ridge.fit,x[folds!=j,])
Ridge.train.errors[j] = mean((y[folds!=j]-pred2)^2)
}
Lasso.test.errors
mean(Lasso.test.errors)
Lasso.train.errors
mean(Lasso.train.errors)
Ridge.test.errors
mean(Ridge.test.errors)
Ridge.train.errors
mean(Ridge.train.errors)
```
The best result I found is lasso regression with mpg regressed on the two variables weight and year.
This model produces an averaged test MSE of 12.02588 under five-fold cross-validation, which is better than the test MSE of any of the previous models under the same number of CV folds.
\newpage
## Write comments, questions: ...
***
I hereby write and submit my solutions without violating the academic honesty and integrity. If not, I accept the consequences. --Yangxin Fan
### List the friends you worked with (name, last name): I completed the homework myself.
### Disclose the resources or persons if you got any help: ISLR book
### How long did the assignment take?: 6-8 hours
***
## References
...