# Ensembles
## Load packages
```{r packages}
library(SuperLearner) # ensemble framework: CV.SuperLearner(), SuperLearner()
library(ck37r)        # helper functions used below: auc_table(), plot_roc(), cvsl_weights()
library(ggplot2)      # theme_minimal(), used when plotting the risk estimates
```
## Load data
Load the `train_x_class`, `train_y_class`, `test_x_class`, and `test_y_class` variables that we defined in 02-preprocessing.Rmd for this *classification* task.
```{r setup_data}
# Loads the objects saved in 02-preprocessing.Rmd, including the four listed above.
load("data/preprocessed.RData")
```
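As a quick sanity check (a minimal sketch; it assumes the four objects listed above were saved by 02-preprocessing.Rmd), confirm that the features and outcome line up:

```{r check_data}
# Confirm dimensions and class balance of the loaded training data.
dim(train_x_class)   # observations x features
table(train_y_class) # counts of each outcome class
stopifnot(nrow(train_x_class) == length(train_y_class))
```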
## Overview
In the preprocessing, lasso, decision tree, random forest, and boosted tree notebooks you have learned:

- Ways to set up your data so it can be plugged into different algorithms
- Some common moving parts shared across algorithms
- How to define control structures and grid searches, and why they are important
- How to configure hyperparameter settings to improve performance
- Why comparing more than one algorithm at a time is preferred

The ["SuperLearner" R package](https://cran.r-project.org/web/packages/SuperLearner/index.html) simplifies ensemble learning by allowing you to simultaneously evaluate the cross-validated performance of multiple algorithms, and/or of a single algorithm with differently tuned hyperparameters. This is generally a more advisable approach than fitting a single algorithm in isolation.

Let's see how six algorithms compare in terms of their cross-validated error: the four classification algorithms you learned in this workshop (lasso, decision tree, random forest, and gradient boosted trees), plus binary logistic regression (`glm`) and the mean of Y as a benchmark algorithm!
A "wrapper" is a short function that adapts an algorithm for the SuperLearner package. Check out the different algorithm wrappers offered by SuperLearner:
### Choose algorithms
```{r}
SuperLearner::listWrappers()
```
```{r choose_algorithms}
# Compile the algorithm wrappers to be used: mean of Y (benchmark), logistic
# regression, lasso, decision tree, random forest, and gradient boosted trees.
sl_lib = c("SL.mean", "SL.glm", "SL.glmnet", "SL.rpart", "SL.ranger", "SL.xgboost")
```
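If none of the built-in wrappers uses the hyperparameters you want, you can write your own. As a sketch (the wrapper name `SL.ranger.big` and the `num.trees` value are illustrative choices, not part of the package), a custom wrapper is just a function that calls an existing wrapper with modified arguments:

```{r custom_wrapper}
# Illustrative custom wrapper: a random forest with more trees than the
# SL.ranger default. SuperLearner looks the function up by name, so adding
# the string "SL.ranger.big" to sl_lib would include it in the ensemble.
SL.ranger.big = function(...) {
  SL.ranger(..., num.trees = 1000)
}
```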
## Fit Model
Fit the ensemble!
```{r cvsl_fit, cache = TRUE}
# This is a seed that is compatible with multicore parallel processing.
# See ?set.seed for more information.
set.seed(1, "L'Ecuyer-CMRG")
# This will take a few minutes to execute - take a look at the .html file to see the output!
cv_sl =
SuperLearner::CV.SuperLearner(Y = train_y_class, X = train_x_class,
verbose = FALSE,
SL.library = sl_lib, family = binomial(),
# For a publication we would do V = 10 or 20
cvControl = list(V = 5L, stratifyCV = TRUE))
summary(cv_sl)
```
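Because we set a L'Ecuyer-CMRG seed, this same fit can also be run reproducibly across multiple cores. A hedged sketch (the core count is an example; `parallel = "multicore"` relies on forking and is not available on Windows - see `?CV.SuperLearner`):

```{r cvsl_parallel, eval = FALSE}
# Sketch: run the cross-validation folds on 4 cores instead of sequentially.
options(mc.cores = 4) # example core count; adjust for your machine
set.seed(1, "L'Ecuyer-CMRG")
cv_sl_parallel =
  SuperLearner::CV.SuperLearner(Y = train_y_class, X = train_x_class,
                                SL.library = sl_lib, family = binomial(),
                                cvControl = list(V = 5L, stratifyCV = TRUE),
                                parallel = "multicore")
```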
### Risk
Risk is a performance estimate: the average loss across observations, where the loss measures how far off the prediction was for an individual observation. The lower the risk, the fewer errors the model makes in its predictions. SuperLearner's default loss is squared error $(y_{actual} - y_{predicted})^2$, so the risk is the mean squared error (just like in ordinary least _squares_ regression). View the summary, plot the results, and compute the AUC!
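To make that definition concrete, the ensemble's cross-validated risk can be recomputed by hand from the fitted object (a sketch; it assumes the `Y` and `SL.predict` elements that `CV.SuperLearner` objects store, holding the observed outcomes and the cross-validated ensemble predictions):

```{r manual_risk}
# Mean squared error between observed outcomes and the ensemble's
# cross-validated predicted probabilities; this should closely match
# the "Super Learner" row of summary(cv_sl).
mean((cv_sl$Y - cv_sl$SL.predict)^2)
```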
### Plot the risk
```{r cvsl_review}
# Plot the cross-validated risk estimate.
plot(cv_sl) + theme_minimal()
```
### Compute AUC for all estimators
```{r}
auc_table(cv_sl)
```
### Plot the ROC curve for the best estimator
```{r}
plot_roc(cv_sl)
```
### Review weight distribution for the SuperLearner
```{r}
print(cvsl_weights(cv_sl), row.names = FALSE)
```
"Discrete SL" is when the SuperLearner chooses the single algorithm with the lowest risk. "SuperLearner" is a weighted average of multiple algorithms, or an "ensemble". In theory the weighted-average should have a little better performance, although they often tie. In this case we only have a few algorithms so the difference is minor.
## Challenge 5
Open Challenge 5 in the "Challenges" folder.
A longer tutorial on SuperLearner is available here: https://github.com/ck37/superlearner-guide