---
title: "Midterm-2 Project Portion - Instruction"
author: "Yangxin Fan"
date: "Submission Date: 04/08/2021"
#output: pdf_document
output:
  pdf_document:
    df_print: paged
#output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, tidy=TRUE, tidy.opts=list(width.cutoff=80))
```
***
\newpage{}
## Midterm-2 Project Instruction
In `Midterm-1 Project`, you built predictive models using train and test data sets about college students' academic performances and retention status. You fitted four regression models on \textbf{Term.GPA} and four classification models on \textbf{Persistence.NextYear}. The lowest $MSE_{test}$ achieved on the regression problem was $0.991$, using a simple linear regression, and the highest `accuracy` and `F1` scores obtained were $91.15$% and $95.65$%, respectively, with a multiple logistic regression fit (LDA and QDA give similar performances). Let's call these scores the baseline test scores.
In `Midterm-2 Project`, you will use tree-based methods (trees, random forests, boosting) and artificial neural networks (Modules 5, 6, and 7) to improve the baseline results. There is no answer key for this midterm: your efforts and justifications will be graded. Pick one favorite optimal tree-based method and one optimal ANN architecture for each of the regression and classification problems (a total of two models for classification and two for regression), then fit and play with hyperparameters until you get satisfactory improvements on the test data set.
Keep in mind that `Persistence.NextYear` is not included as a predictor in the regression models, so use all the predictors except that one for the regression. For the classification models, use all the predictors, including the term GPA.
First of all, combine the train and test data sets and create dummies for all categorical variables (`Entry_Term`, `Gender`, and `Race_Ethc_Visa`) so that the data sets are ready to be separated again into train and test. (Expect help on this portion!) You will then be ready to fit models.
***
\section{A. Improving Regression Models - 15 pts}
- Explore tree-based methods, choose the one that is your favorite and yields optimal results, and then search for one optimal ANN architecture for the regression problem (so two models to report). Fit and make sophisticated decisions by justifying them and writing precisely. Report `the test MSE` results in a comparative table along with the methods so the grader can capture all your efforts on building various models in one table.
\section{B. Improving Classification Models - 20 pts}
- Explore tree-based methods, choose the one that is your favorite and yields optimal results, and then search for one optimal ANN architecture for the classification problem (so two models to report). Fit and make sophisticated decisions by justifying them and writing precisely. Report `the test accuracy` and `the test F1` results in a comparative table along with the methods so the grader can capture all your efforts in one table.
\section{C. Importance Analyses - 15 pts}
- Part a. Perform an importance analysis on the best regression model: which three predictors are most important or effective to explain the response variable? Find the relationship and dependence of these predictors with the response variable. Include graphs and comments.
- Part b. Perform an importance analysis on the best classification model: which three predictors are most important or effective to explain the response variable? Find the relationship and dependence of these predictors with the response variable. Include graphs and comments.
- Part c. Write a conclusion paragraph. Evaluate overall what you have achieved. Did the baselines get improved? Why do you think the best model worked well or the models didn't work well? How did you handle issues? What could be done more to get `better` and `interpretable` results? Explain with technical terms.
***
\newpage{}
\section{Project Evaluation}
The submitted project report will be evaluated according to the following criteria:
\begin{enumerate}
\item All models in the instruction used correctly
\item Completeness and novelty of the model fitting
\item Techniques and theorems of the methods used accurately
\item Reflection of in-class lectures and discussions
\item Achieved reasonable/high performances; insights obtained (patterns of variables)
\item Clear and minimalist write-ups
\end{enumerate}
If a response is incomplete or does not reflect the expected answer, you may still earn partial points. For each part or model, the `partial points` are formulated as follows:
- 20% of pts: little progress with some minor solutions;
- 40% of pts: major calculation mistake(s), but good work, ignored important pieces;
- 60-80% of pts: correct method used, but minor mistake(s).
Additionally, the student who gets the highest performances on both problems in the class (`minimum test MSE` on the regression model and `highest F1` on the classification model) will get a BONUS (up to +2 pts). Just follow up when you think you did a good job!
***
\newpage{}
\section{Tips}
- `Term.gpa` is a GPA aggregated up to, but not including, the current semester. In the modeling of `gpa`, include all predictors except `persistent`.
- The data reports `N.Ws`, `N.DFs`, and `N.As` as the numbers of courses withdrawn, graded D or F, and graded A, respectively, in the current semester.
- Some rows are synthetic and may not make sense; feel free to keep or remove them.
- Linear associations between gpa and the other predictors may be weak (don't include `persistent` in the `gpa` modeling).
- A scatterplot may mislead since it doesn't show density.
- You will use the test data set to assess the performance of the models fitted on the train data set.
- Implementing 5-fold cross-validation while fitting on the train data set is strongly suggested (a minimal sketch follows this list).
- You can use any packages (`caret`, `Superml`, `rpart`, `xgboost`, or [visit](https://cran.r-project.org/web/views/MachineLearning.html) to search for more) as long as you are sure what each does and it is clear to the grader.
- Include helpful and compact plots with titles.
- Keep at most 4 decimals when presenting numbers and performance scores.
- When issues come up, try to solve them and write up how you solved (or couldn't solve) them.
- Check this part for updates: the instructor puts clarifications here as asked.
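Below is a minimal sketch of the suggested 5-fold CV setup with `caret` (not evaluated here; the `train` data frame is loaded in the Solutions section, and `rpart` is only an illustrative method choice):
```{r eval=FALSE}
library(caret)
# 5-fold cross-validation resampling specification
ctrl = trainControl(method = "cv", number = 5)
# Illustrative fit: a single regression tree tuned by CV
fit = train(Term.GPA ~ . - Persistence.NextYear, data = train,
            method = "rpart", trControl = ctrl)
fit$results  # CV performance across the tuning grid
```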
***
\newpage
## Your Solutions
\subsection{Section A.}
Data preprocessing
```{r eval=TRUE}
library(xgboost)
library(caret)
library(formattable)
train = read.csv("StudentDataTrain.csv")
test = read.csv("StudentDataTest.csv")
# For this midterm, rows with any NA values may simply be omitted
train = na.omit(train)
test = na.omit(test)
# Dims
dim(train)
dim(test)
# Dummies for the categorical variables; record the train size first so the
# split back into train/test does not depend on hard-coded row counts
library(dummies)
n_train = nrow(train)
total = rbind(train, test)
total = dummy.data.frame(total, sep = ".")
train = total[1:n_train, ]
test = total[(n_train + 1):nrow(total), ]
dim(train)
dim(test)
names(train)
```
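Note: the `dummies` package has been archived on CRAN. If it is unavailable, a hedged alternative is `caret::dummyVars`, which produces an equivalent one-hot encoding (a sketch, assuming the same `total` data frame as above):
```{r eval=FALSE}
# fullRank = FALSE keeps a dummy for every level, matching dummy.data.frame()
dv = dummyVars(~ ., data = total, sep = ".", fullRank = FALSE)
total = as.data.frame(predict(dv, newdata = total))
```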
# Gradient Boosting Method
```{r eval=TRUE,results=FALSE}
library(gbm)
library(mlbench)
library(caret)
# Tuning grid: tree depth, number of trees, and learning rate
gbmGrid = expand.grid(interaction.depth = c(1,2,3,4),
                      n.trees = (1:25)*50,
                      shrinkage = c(0.05,0.01,0.005,0.001),
                      n.minobsinnode = 20)
# 5-fold cross-validation on the train set
fitControl = trainControl(method = "repeatedcv",
                          number = 5,
                          repeats = 1)
set.seed(99)
gbmFit_reg = train(Term.GPA~.-Persistence.NextYear, data=train,
                   method = "gbm",
                   trControl = fitControl,
                   verbose = FALSE,
                   tuneGrid = gbmGrid)
```
```{r eval=TRUE}
gbmFit_reg$bestTune
# test set prediction
pred = predict(gbmFit_reg, test)
GB_test_MSE = mean((test$Term.GPA-pred)^2)
GB_test_MSE
```
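To see how the CV error moves across the grid (a quick diagnostic using `caret`'s built-in plot method for `train` objects):
```{r eval=FALSE}
# CV RMSE profile: x-axis n.trees, grouped by shrinkage and interaction.depth
plot(gbmFit_reg)
# Row of the grid with the lowest CV RMSE
gbmFit_reg$results[which.min(gbmFit_reg$results$RMSE), ]
```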
# ANN
```{r eval=TRUE,results=FALSE}
library(neuralnet)
# Grid over the numbers of nodes in the first two hidden layers (third layer unused)
tune_grid = expand.grid(
  layer1 = c(1,2,3),
  layer2 = c(1,2,3),
  layer3 = 0)
tune_control = trainControl(method="repeatedcv", number=5, repeats=1, allowParallel = TRUE)
set.seed(99)
nn_tune = train(
  Term.GPA~.-Persistence.NextYear, data=train,
  tuneGrid = tune_grid,
  trControl = tune_control,
  method = 'neuralnet',
  metric = 'RMSE',
  threshold = 0.1,       # passed through to neuralnet: stopping criterion
  linear.output = TRUE)  # regression: no activation on the output node
```
```{r eval=TRUE}
nn_tune$bestTune
# test set prediction
pred = predict(nn_tune,test)
NN_test_MSE = mean((test$Term.GPA-pred)^2)
NN_test_MSE
```
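Neural networks are sensitive to predictor scale, so one variant worth trying (a sketch, not run here) is to let `caret` center and scale the predictors inside each CV fold via the `preProcess` argument; doing it inside `train()` avoids leaking fold information into the scaling:
```{r eval=FALSE}
# Same tuning as above, with per-fold standardization of the predictors
nn_scaled = train(Term.GPA ~ . - Persistence.NextYear, data = train,
                  method = "neuralnet",
                  tuneGrid = tune_grid,
                  trControl = tune_control,
                  preProcess = c("center", "scale"),
                  metric = "RMSE",
                  threshold = 0.1,
                  linear.output = TRUE)
```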
# Summary of two regression models
```{r eval=TRUE}
# MSE is not a percentage, so report the raw rounded values
tab = matrix(c(round(GB_test_MSE,4), round(NN_test_MSE,4)), ncol=2, byrow=TRUE)
colnames(tab) = c('Gradient Boosting', 'Neural Network')
rownames(tab) = c('Test MSE')
# Convert matrix to table
tab = as.table(tab)
tab
```
The above table shows the test MSE of the optimal gradient boosting and neural network models. Compared to the baseline MSE of 0.991, the test MSE of the optimal gradient boosting model is 0.9765, a noticeable improvement. For the neural network, the test MSE of the optimal model is 0.9946, which is no improvement over the baseline linear regression.
The optimal gradient boosting model's parameters are: n.trees = 100, interaction.depth = 2, shrinkage = 0.05, and n.minobsinnode = 20.
The optimal neural network has two hidden layers, with one node in the first hidden layer and three nodes in the second.
***
\newpage
\subsection{Section B.}
# Gradient Boosting Method
```{r eval=TRUE, results=FALSE}
library(gbm)
library(mlbench)
library(caret)
# Same tuning grid as in the regression problem
gbmGrid = expand.grid(interaction.depth = c(1,2,3,4),
                      n.trees = (1:25)*50,
                      shrinkage = c(0.05,0.01,0.005,0.001),
                      n.minobsinnode = 20)
fitControl = trainControl(method = "repeatedcv",
                          number = 5,
                          repeats = 1)
set.seed(99)
# Classification: the response is a factor; all predictors (incl. Term.GPA) are used
gbmFit_cls = train(as.factor(Persistence.NextYear)~., data=train,
                   method = "gbm",
                   trControl = fitControl,
                   verbose = FALSE,
                   tuneGrid = gbmGrid)
```
```{r eval=TRUE}
gbmFit_cls$bestTune
pred = predict(gbmFit_cls, test)
# Rows: actual, columns: predicted; with levels 0/1,
# ct[1] = TN, ct[2] = FN, ct[3] = FP, ct[4] = TP
ct = table(as.factor(test$Persistence.NextYear), as.factor(pred))
print(ct)
# Test set metrics
GB_Accuracy_test = (ct[1]+ct[4])/sum(ct)
Recall_test = ct[4]/(ct[2]+ct[4])
Precision_test = ct[4]/(ct[3]+ct[4])
GB_F1_test = 2/(1/Recall_test+1/Precision_test)  # harmonic mean of precision and recall
```
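As a sanity check on the hand-computed cells (a sketch, assuming level `1` codes persistence), `caret::confusionMatrix` reports accuracy, precision, recall, and F1 directly; the same call applies to the nnet predictions below:
```{r eval=FALSE}
# mode = "prec_recall" adds Precision, Recall, and F1 to the output
confusionMatrix(data = as.factor(pred),
                reference = as.factor(test$Persistence.NextYear),
                positive = "1", mode = "prec_recall")
```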
# ANN
```{r eval=TRUE, results=FALSE}
library(nnet)
set.seed(99)
fitControl = trainControl(method = "repeatedcv",
                          number = 5,
                          repeats = 1)
# Grid over hidden-layer size and weight decay (L2 regularization)
nnetGrid = expand.grid(size = c(4,5,6,7,8),
                       decay = c(0.5,0.7,0.9,0.95))
nnetFit = train(as.factor(Persistence.NextYear)~., data=train,
                method = "nnet",
                metric = "Accuracy",
                trControl = fitControl,
                tuneGrid = nnetGrid,
                trace = FALSE)  # nnet is silenced via trace, not verbose
```
```{r eval=TRUE}
nnetFit$bestTune
pred = predict(nnetFit, test)
# Same layout as above: ct[1] = TN, ct[2] = FN, ct[3] = FP, ct[4] = TP
ct = table(as.factor(test$Persistence.NextYear), as.factor(pred))
print(ct)
# Test set metrics
NN_Accuracy_test = (ct[1]+ct[4])/sum(ct)
Recall_test = ct[4]/(ct[2]+ct[4])
Precision_test = ct[4]/(ct[3]+ct[4])
NN_F1_test = 2/(1/Recall_test+1/Precision_test)
```
# Summary of two classification models
```{r eval=TRUE}
tab = matrix(c(percent(round(GB_Accuracy_test,4)), percent(round(NN_Accuracy_test,4)),
               percent(round(GB_F1_test,4)), percent(round(NN_F1_test,4))),
             ncol=2, byrow=TRUE)
colnames(tab) = c('Gradient Boosting', 'Neural Network')
rownames(tab) = c('Test accuracy', 'Test F1-score')
# Convert matrix to table
tab = as.table(tab)
tab
```
The above table shows the test accuracy and F1 score of the optimal gradient boosting and neural network models. Compared to the baseline accuracy of 91.15% and F1 score of 95.65%, the optimal gradient boosting model reaches a test accuracy of 95.50% and a test F1 score of 97.50%, noticeably better than the baseline. For the neural network, the optimal model's test accuracy and F1 score are 90.66% and 94.76%, which is no improvement over the baseline logistic regression.
The optimal gradient boosting model's parameters are: n.trees = 550, interaction.depth = 4, shrinkage = 0.01, and n.minobsinnode = 20.
The optimal neural network has size = 5 and decay = 0.7.
***
\subsection{Section C.}
- Part a.
```{r eval=TRUE}
# Relative variable importance from the tuned gbm regression model
vImpGbm = varImp(gbmFit_reg)
vImpGbm
plot(vImpGbm, top = 5)
```
I applied a variable importance analysis to the optimal gradient boosting regression model; the top three most important predictors of the response Term.GPA are HSGPA, SAT_Total, and N.RegisteredCourse. Notably, HSGPA and SAT_Total measure students' prior academic performance, so it is not surprising that these two are the most important predictors of Term.GPA.
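To show the direction of these relationships, partial dependence plots can be drawn from the tuned model; the `pdp` package accepts `caret` `train` objects (a sketch, assuming `pdp` is installed and that `HSGPA` and `SAT_Total` are the post-encoding column names):
```{r eval=FALSE}
library(pdp)
# Partial dependence of the predicted Term.GPA on the top two predictors
partial(gbmFit_reg, pred.var = "HSGPA", plot = TRUE)
partial(gbmFit_reg, pred.var = "SAT_Total", plot = TRUE)
```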
***
- Part b.
```{r eval=TRUE}
# Relative variable importance from the tuned gbm classification model
vImpGbm = varImp(gbmFit_cls)
vImpGbm
plot(vImpGbm, top = 5)
```
I applied a variable importance analysis to the optimal gradient boosting classification model; the top three most important predictors of Persistence.NextYear are Term.GPA, HSGPA, and Entry_Term. This tells us that whether a student persists next year depends largely on his/her academic performance, as captured by Term.GPA and HSGPA.
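The same kind of plot works for the classifier; with `prob = TRUE`, `pdp` draws the partial dependence on the probability scale for the positive class (assumed here to be persistence = 1):
```{r eval=FALSE}
library(pdp)
# Partial dependence of P(persist next year) on Term.GPA
partial(gbmFit_cls, pred.var = "Term.GPA", prob = TRUE, plot = TRUE)
```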
***
- Part c.
Note: I omit the tuning output that shows every parameter combination, since it would take too many pages.
In conclusion, the gradient boosting method performs well on both the regression and the classification problem.
Q1: Did the baselines get improved?
Yes, gradient boosting performed significantly better than the baselines in both regression and classification. In regression, the tuned gradient boosting model achieved a test MSE of 0.9765 vs. the 0.991 baseline. In classification, it achieved a test accuracy of 95.50% and a test F1 score of 97.50%, vs. the baseline accuracy of 91.15% and F1 score of 95.65%. However, the neural network models for regression and classification still did not outperform the baselines after tuning.
Q2: Why do you think the best model worked well or the models didn’t work well?
I think gradient boosting performs well because it is an ensemble method that reduces the variance of the predictive model and keeps learning from its mistakes. Besides, overfitting is easier to control in gradient boosting through grid-search-based hyperparameter tuning. The neural networks, by contrast, seemed to overfit consistently, which suggests we might need more data.
Q3: How did you handle issues?
I used repeated five-fold cross-validation during hyperparameter tuning to avoid overfitting.
Q4: What could be done more to get better and interpretable results?
For the neural network, we could try scaling the numeric predictors differently, which might give better results; the standard scale function I tried did not seem to improve them. To get more interpretable results, we could take the top 10 most important predictors from the gradient boosting model and use them to fit a decision tree, since a decision tree gives a clear, readable interpretation (a minimal sketch of this idea follows).
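A minimal sketch of that last idea (the three predictors are an illustrative subset of the gbm importance ranking; `rpart.plot` is an assumed extra dependency):
```{r eval=FALSE}
library(rpart)
library(rpart.plot)
# One readable classification tree on a few of the top gbm predictors
tree_fit = rpart(as.factor(Persistence.NextYear) ~ Term.GPA + HSGPA + N.DFs,
                 data = train, method = "class")
rpart.plot(tree_fit)  # human-readable splits
```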
***
I hereby write and submit my solutions without violating the academic honesty and integrity. If not, I accept the consequences -- Yangxin Fan
### Write the pair you worked with at the top of the page. If no pair, it is ok. List other friends you worked with (name, last name): No pair
### Disclose the resources or persons if you get any help: ISLR book and lecture notes
### How long did the assignment solutions take?: 6 hours
***
## References: NA
...