127 changes: 109 additions & 18 deletions README.md
@@ -1,29 +1,120 @@
# Project 2
> ZIRUI OU A20516756

## Model Selection

This project implements generic k-fold cross-validation and bootstrapping model selection methods and compares them with AIC for linear and ridge regression.

### Program Design

![Program design](images/image.png)

### Results Overview
| Metric | **Linear Regression** | **Ridge Regression** |
|-----------------------|-----------------------------|-----------------------------|
| Coefficients | [0.2151, 0.5402] | [0.2411, 0.4849] |
| K-Fold MSE | 0.8418 | 0.8471 |
| Bootstrapping MSE | 0.8219 | 0.8214 |
| AIC | -17.495 | -17.461 |
| SSR | 80.6585 | 80.6853 |
| Verification (Coefficients) | True | True |
| Verification (AIC) | True | True |
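
For reference, the AIC reported above is computed as in `model_selection.py` (Gaussian model, up to an additive constant):

$$\mathrm{AIC} = n \ln\!\left(\frac{\mathrm{SSR}}{n}\right) + 2\,(p + 1), \qquad \mathrm{SSR} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,$$

where $n$ is the number of samples and $p$ is the number of features (the $+1$ accounts for the intercept).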


The figure below, produced by `visualize`, compares the K-Fold and bootstrapping MSE estimates and the AIC values for both models:

![MSE and AIC comparison](images/Figure_1.png)

### How to Run the Code

1. **Clone the Repository:**
```bash
git clone https://github.com/Jerry-zirui/Project2.git
cd Project2
```

2. **Install Required Packages:**
```bash
pip install numpy
pip install matplotlib  # optional: needed only for the visualization step
```

3. **Generate Synthetic Data:**
```python
import numpy as np

# Set random seed for reproducibility
np.random.seed(42)

# Generate synthetic data
n = 100  # Number of samples
p = 1    # Number of features
X = np.random.rand(n, p)                # Feature matrix
beta_true = np.array([1])               # True coefficient
y = X @ beta_true + np.random.randn(n)  # Target variable with noise
```

4. **Define the Models:**
```python
# linear_regression and ridge_regression are the fitting functions defined in model_selection.py
models = {
    "Linear Regression": {"fn": linear_regression},
    "Ridge Regression": {"fn": ridge_regression, "params": {"alpha": 1.0}},
}
```

5. **Evaluate the Models:**
```python
# Returns a dict keyed by model name with the coefficients, K-Fold MSE,
# bootstrapping MSE, AIC, and SSR (see the example after step 7).
results = evaluate_models(models, X, y, k=5, B=1000)
```

6. **Visualize the Results** (optional, requires matplotlib):
```python
# Plot the MSE and AIC comparison bar charts
results = visualize(evaluate_models, X, y, models, results)
```

7. **Run the Model Selection Script** (steps 3-6 are excerpts from this script, so running it performs the whole workflow):
```bash
python model_selection.py
```
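
After running the script (or executing steps 3-6 above), individual entries of `results` can be inspected directly, for example:

```python
# Keys match the dictionary returned by evaluate_models
print(results["Linear Regression"]["Coefficients"])
print(results["Ridge Regression"]["K-Fold MSE"])
```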
### Q&A

#### Do your cross-validation and bootstrapping model selectors agree with a simpler model selector like AIC in simple cases (like linear regression)?
- For the linear and ridge regression models tested here, the cross-validation and bootstrapping selectors generally agree with AIC.
- The K-Fold and bootstrapping MSE estimates are very close for the two models, and the AIC ranking is consistent with the cross-validation result (see the Results Overview table above and the comparison sketch below).
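
A minimal sketch (assuming the `results` dictionary returned by `evaluate_models` is in scope) of how the three criteria can be compared programmatically; lower is better for each:

```python
# Pick the best model under each criterion (all are "smaller is better").
best_cv   = min(results, key=lambda name: results[name]["K-Fold MSE"])
best_boot = min(results, key=lambda name: results[name]["Bootstrapping MSE"])
best_aic  = min(results, key=lambda name: results[name]["AIC"])
print(f"CV: {best_cv} | Bootstrap: {best_boot} | AIC: {best_aic}")
```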

#### In what cases might the methods you've written fail or give incorrect or undesirable results?
- **Multicollinearity:**
  - If features are highly correlated, the estimated coefficients may not reflect each feature's individual contribution, and the normal-equation solve used here (`np.linalg.inv(X_b.T @ X_b)`) can become numerically unstable or even singular.

- **Small Sample Sizes:**
  - With few observations, the selectors have little data to work with, which leads to high variance in both the fitted coefficients and the MSE estimates.

- **High-Variance Models:**
  - The bootstrap estimator implemented here evaluates each refitted model on the full original dataset, which overlaps the bootstrap training sample, so flexible, high-variance models can receive optimistically biased error estimates.

- **Poorly Specified Models:**
  - If a model is misspecified (e.g., a badly chosen regularization strength), the selectors can be misled. For instance, setting the ridge `alpha` far too high over-shrinks the coefficients and causes underfitting, as illustrated in the sketch below.
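
A minimal, self-contained sketch of the over-regularization case (the `ridge_regression` body below is copied from `model_selection.py` so nothing needs to be imported; importing `model_selection` directly would also run its demo code):

```python
import numpy as np

# Same closed-form ridge fit as in model_selection.py (intercept left unpenalized).
def ridge_regression(X, y, alpha=1.0):
    X_b = np.hstack((np.ones((X.shape[0], 1)), X))
    I = np.eye(X_b.shape[1])
    I[0, 0] = 0
    return np.linalg.inv(X_b.T @ X_b + alpha * I) @ X_b.T @ y

rng = np.random.default_rng(0)
X = rng.random((100, 1))
y = X @ np.array([1.0]) + rng.standard_normal(100)

print(ridge_regression(X, y, alpha=1.0))     # mild shrinkage of the slope
print(ridge_regression(X, y, alpha=1000.0))  # slope driven toward 0 -> underfitting
```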

#### What could you implement given more time to mitigate these cases or help users of your methods?
- **Feature Selection:**
  - Implement feature-selection methods to handle multicollinearity. Techniques like the Variance Inflation Factor (VIF) can identify and remove highly correlated features (see the sketch after this list).

- **Regularization Techniques:**
  - Implement additional regularization methods such as Lasso and Elastic Net alongside Ridge, to handle different data and model requirements.

- **Cross-Validation with Multiple Splits:**
  - Repeat cross-validation over several random splits and average the results to obtain a more stable estimate of model performance.

- **Bootstrap Aggregation:**
  - Implement bootstrap aggregation (bagging) to reduce the variance of the model estimates and stabilize the results.

- **Model Diagnostics:**
  - Provide diagnostics such as residual plots and coefficient plots to help users understand and interpret the model selection process.
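
A minimal sketch of the VIF check mentioned above; the `vif` helper is hypothetical (it is not part of `model_selection.py`):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (hypothetical helper)."""
    vifs = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        X_b = np.hstack((np.ones((X.shape[0], 1)), others))
        beta = np.linalg.lstsq(X_b, X[:, j], rcond=None)[0]
        resid = X[:, j] - X_b @ beta
        r2 = 1 - resid.var() / X[:, j].var()
        vifs.append(1.0 / max(1e-12, 1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(0)
x1 = rng.standard_normal(200)
X = np.column_stack([x1, x1 + 0.01 * rng.standard_normal(200)])  # nearly collinear pair
print(vif(X))  # values well above 10 flag multicollinearity
```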

#### What parameters have you exposed to your users in order to use your model selectors?
- **ridge_regression:**
  - `alpha`: controls the strength of the L2 regularization (default: 1.0).
- **K-Fold Cross-Validation (`k_fold_cross_validation`):**
  - `k`: number of folds (default: 5).
  - `model_fn`: model-fitting function (e.g., `linear_regression` or `ridge_regression`).
  - `model_params`: extra keyword arguments forwarded to `model_fn` (e.g., `alpha=1.0` for ridge regression; given as `{"alpha": 1.0}` in the `models` dict).

- **Bootstrapping (`bootstrapping`):**
  - `B`: number of bootstrap samples (default: 1000).
  - `model_fn`: model-fitting function (e.g., `linear_regression` or `ridge_regression`).
  - `model_params`: extra keyword arguments forwarded to `model_fn` (e.g., `alpha=1.0` for ridge regression).
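
For example, assuming the functions defined in `model_selection.py` are available in the current session (and `X`, `y` from step 3):

```python
# Compare a plain least-squares fit with a more strongly regularized ridge fit.
cv_mse_ols     = k_fold_cross_validation(X, y, k=10, model_fn=linear_regression)
cv_mse_ridge   = k_fold_cross_validation(X, y, k=10, model_fn=ridge_regression, alpha=0.5)
boot_mse_ridge = bootstrapping(X, y, B=500, model_fn=ridge_regression, alpha=0.5)
print(cv_mse_ols, cv_mse_ridge, boot_mse_ridge)
```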
Binary file added images/Figure_1.png
Binary file added images/image.png
175 changes: 175 additions & 0 deletions model_selection.py
@@ -0,0 +1,175 @@
import numpy as np
from typing import Callable, Dict, Any, Tuple

# matplotlib is optional; fall back to text-only output if it is unavailable.
try:
    import matplotlib.pyplot as plt
except Exception:
    plt = None

def linear_regression(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Ordinary least squares via the normal equations (intercept prepended)."""
    X_b = np.hstack((np.ones((X.shape[0], 1)), X))
    beta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
    return beta

def ridge_regression(X: np.ndarray, y: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Closed-form ridge regression; the intercept term is not penalized."""
    X_b = np.hstack((np.ones((X.shape[0], 1)), X))
    I = np.eye(X_b.shape[1])
    I[0, 0] = 0  # do not penalize the intercept
    beta = np.linalg.inv(X_b.T @ X_b + alpha * I) @ X_b.T @ y
    return beta

def k_fold_cross_validation(X: np.ndarray, y: np.ndarray, k: int = 5,
                            model_fn: Callable = linear_regression, **model_params) -> float:
    """Average test MSE over k folds (samples beyond k * fold_size are unused)."""
    n = X.shape[0]
    fold_size = n // k
    indices = np.arange(n)
    np.random.shuffle(indices)
    mse_scores = np.zeros(k)

    for i in range(k):
        test_indices = indices[i * fold_size : (i + 1) * fold_size]
        train_indices = np.setdiff1d(indices, test_indices)
        X_train, y_train = X[train_indices], y[train_indices]
        X_test, y_test = X[test_indices], y[test_indices]

        beta = model_fn(X_train, y_train, **model_params)
        X_test_b = np.hstack((np.ones((X_test.shape[0], 1)), X_test))
        predictions = X_test_b @ beta
        mse_scores[i] = np.mean((y_test - predictions) ** 2)

    return np.mean(mse_scores)

def bootstrapping(X: np.ndarray, y: np.ndarray, B: int = 1000,
                  model_fn: Callable = linear_regression, **model_params) -> float:
    """Average MSE over B bootstrap refits, each evaluated on the full original data."""
    n = X.shape[0]
    mse_scores = np.zeros(B)

    for b in range(B):
        indices = np.random.choice(n, n, replace=True)
        X_b = X[indices]
        y_b = y[indices]

        beta = model_fn(X_b, y_b, **model_params)
        X_test_b = np.hstack((np.ones((n, 1)), X))
        predictions = X_test_b @ beta
        mse_scores[b] = np.mean((y - predictions) ** 2)

    return np.mean(mse_scores)

def aic(X: np.ndarray, y: np.ndarray, beta: np.ndarray) -> Tuple[float, float]:
    """Gaussian AIC (up to an additive constant) and the residual sum of squares."""
    n = X.shape[0]
    X_b = np.hstack((np.ones((X.shape[0], 1)), X))
    y_pred = X_b @ beta
    ssr = np.sum((y - y_pred) ** 2)
    p = len(beta) - 1  # number of features (excluding the intercept)
    aic_value = n * np.log(ssr / n) + 2 * (p + 1)
    return aic_value, ssr

def evaluate_models(models: Dict[str, Any], X: np.ndarray, y: np.ndarray,
                    k: int = 5, B: int = 1000) -> Dict[str, Dict[str, Any]]:
    """Fit each model on the full data and score it with K-Fold CV, bootstrapping, and AIC."""
    results = {}
    for name, model_info in models.items():
        model_fn = model_info['fn']
        model_params = model_info.get('params', {})
        beta = model_fn(X, y, **model_params)
        cv_mse = k_fold_cross_validation(X, y, k, model_fn, **model_params)
        bootstrap_mse = bootstrapping(X, y, B, model_fn, **model_params)
        aic_val, ssr = aic(X, y, beta)
        results[name] = {
            "Coefficients": beta,
            "K-Fold MSE": cv_mse,
            "Bootstrapping MSE": bootstrap_mse,
            "AIC": aic_val,
            "SSR": ssr,
        }
    return results

def verify_coefficients(X: np.ndarray, y: np.ndarray, beta: np.ndarray,
                        model_fn: Callable, **model_params) -> bool:
    """Sanity check: refitting the model reproduces the reported coefficients."""
    beta_sanity = model_fn(X, y, **model_params)
    return np.allclose(beta, beta_sanity)

def verify_aic(X: np.ndarray, y: np.ndarray, beta: np.ndarray, aic_value: float) -> bool:
    """Sanity check: recomputing AIC by hand matches the reported value."""
    n = X.shape[0]
    X_b = np.hstack((np.ones((X.shape[0], 1)), X))
    y_pred = X_b @ beta
    ssr = np.sum((y - y_pred) ** 2)
    p = len(beta) - 1
    aic_manual = n * np.log(ssr / n) + 2 * (p + 1)
    return np.isclose(aic_value, aic_manual)


np.random.seed(42)
n = 100
p = 1
X = np.random.rand(n, p)
beta_true = np.array([1])
y = X @ beta_true + np.random.randn(n)

models = {
"Linear Regression": {"fn": linear_regression},
"Ridge Regression": {"fn": ridge_regression, "params": {"alpha": 1.0}},
}

results = evaluate_models(models, X, y, k=5, B=1000)

if plt is not None:
    def visualize(evaluate_models, X, y, models, results):
        """Bar charts comparing K-Fold/bootstrapping MSE and AIC across models."""
        # Summarize the precomputed results; the extra arguments are kept only
        # for compatibility with the original call signature.
        kfold_mse = [result['K-Fold MSE'] for result in results.values()]
        bootstrap_mse = [result['Bootstrapping MSE'] for result in results.values()]
        aic_values = [result['AIC'] for result in results.values()]
        models_names = list(results.keys())

        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
        bar_width = 0.35
        index = np.arange(len(models_names))
        bars1 = ax1.bar(index, kfold_mse, bar_width, label='K-Fold MSE', color='blue')
        bars2 = ax1.bar(index + bar_width, bootstrap_mse, bar_width, label='Bootstrapping MSE', color='green')

        ax1.set_xlabel('Models', fontsize=12)
        ax1.set_xticks(index + bar_width / 2)
        ax1.set_xticklabels(models_names, rotation=45)
        ax1.set_title('Model Evaluation: MSE Comparison', fontsize=14)

        def autolabel(bars, ax):
            # Write each bar's value above it.
            for bar in bars:
                height = bar.get_height()
                ax.annotate(f'{height:.4f}',
                            xy=(bar.get_x() + bar.get_width() / 2, height),
                            xytext=(0, 3),
                            textcoords="offset points",
                            ha='center', va='bottom')

        autolabel(bars1, ax1)
        autolabel(bars2, ax1)

        ax1.legend(loc='upper center', bbox_to_anchor=(0.5, 1.15), ncol=2, fontsize=10)
        ax1.set_yticks([])

        bars3 = ax2.bar(index, aic_values, bar_width * 2, label='AIC', color='red')
        ax2.set_xlabel('Models', fontsize=12)
        ax2.set_xticks(index)
        ax2.set_xticklabels(models_names, rotation=45)
        ax2.set_title('Model Evaluation: AIC Comparison', fontsize=14)

        autolabel(bars3, ax2)
        ax2.legend(loc='upper center', bbox_to_anchor=(0.5, 1.15), ncol=1, fontsize=10)
        ax2.set_yticks([])

        plt.tight_layout()
        plt.show()
        return results

    results = visualize(evaluate_models, X, y, models, results)


for name, result in results.items():
    print(f"Model: {name}")
    print(f"Coefficients: {result['Coefficients']}")
    print(f"K-Fold MSE: {result['K-Fold MSE']}")
    print(f"Bootstrapping MSE: {result['Bootstrapping MSE']}")
    print(f"AIC: {result['AIC']}")
    print(f"SSR: {result['SSR']}")
    print(f"Verification (Coefficients): {verify_coefficients(X, y, result['Coefficients'], models[name]['fn'], **models[name].get('params', {}))}")
    print(f"Verification (AIC): {verify_aic(X, y, result['Coefficients'], result['AIC'])}")
    print("\n")
    print("-" * 30)