127 changes: 109 additions & 18 deletions README.md
@@ -1,29 +1,120 @@
# Project 2
> ZIRUI OU A20516756

## Model Selection

This project implements generic k-fold cross-validation and bootstrapping model selection methods and compares them with AIC for linear and ridge regression.

### Program Design

![Program design](images/image.png)

### Results Overview
| Metric | **Linear Regression** | **Ridge Regression** |
|-----------------------|-----------------------------|-----------------------------|
| Coefficients | [0.2151, 0.5402] | [0.2411, 0.4849] |
| K-Fold MSE | 0.8418 | 0.8471 |
| Bootstrapping MSE | 0.8219 | 0.8214 |
| AIC | -17.495 | -17.461 |
| SSR | 80.6585 | 80.6853 |
| Verification (Coefficients) | True | True |
| Verification (AIC) | True | True |
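
For reference, the AIC reported above is computed as in `model_selection.py` (Gaussian model, up to an additive constant):

$$\mathrm{AIC} = n \ln\!\left(\frac{\mathrm{SSR}}{n}\right) + 2\,(p + 1), \qquad \mathrm{SSR} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,$$

where $n$ is the number of samples and $p$ is the number of features (the $+1$ accounts for the intercept).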


The figure below, produced by `visualize`, compares the K-Fold and bootstrapping MSE estimates and the AIC values for both models:

![MSE and AIC comparison](images/Figure_1.png)

### How to Run the Code

1. **Clone the Repository:**
```bash
git clone https://github.com/Jerry-zirui/Project2.git
cd Project2
```

2. **Install Required Packages:**
```bash
pip install numpy
pip install matplotlib  # optional: needed only for the visualization step
```

3. **Generate Synthetic Data:**
```python
import numpy as np

# Set random seed for reproducibility
np.random.seed(42)

# Generate synthetic data
n = 100  # Number of samples
p = 1    # Number of features
X = np.random.rand(n, p)                # Feature matrix
beta_true = np.array([1])               # True coefficient
y = X @ beta_true + np.random.randn(n)  # Target variable with noise
```

4. **Define the Models:**
```python
# linear_regression and ridge_regression are the fitting functions defined in model_selection.py
models = {
    "Linear Regression": {"fn": linear_regression},
    "Ridge Regression": {"fn": ridge_regression, "params": {"alpha": 1.0}},
}
```

5. **Evaluate the Models:**
```python
# Returns a dict keyed by model name with the coefficients, K-Fold MSE,
# bootstrapping MSE, AIC, and SSR (see the example after step 7).
results = evaluate_models(models, X, y, k=5, B=1000)
```

6. **Visualize the Results** (optional, requires matplotlib):
```python
# Plot the MSE and AIC comparison bar charts
results = visualize(evaluate_models, X, y, models, results)
```

7. **Run the Model Selection Script** (steps 3-6 are excerpts from this script, so running it performs the whole workflow):
```bash
python model_selection.py
```
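
After running the script (or executing steps 3-6 above), individual entries of `results` can be inspected directly, for example:

```python
# Keys match the dictionary returned by evaluate_models
print(results["Linear Regression"]["Coefficients"])
print(results["Ridge Regression"]["K-Fold MSE"])
```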
### Q&A

#### Do your cross-validation and bootstrapping model selectors agree with a simpler model selector like AIC in simple cases (like linear regression)?
- For the linear and ridge regression models tested here, the cross-validation and bootstrapping selectors generally agree with AIC.
- The K-Fold and bootstrapping MSE estimates are very close for the two models, and the AIC ranking is consistent with the cross-validation result (see the Results Overview table above and the comparison sketch below).
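
A minimal sketch (assuming the `results` dictionary returned by `evaluate_models` is in scope) of how the three criteria can be compared programmatically; lower is better for each:

```python
# Pick the best model under each criterion (all are "smaller is better").
best_cv   = min(results, key=lambda name: results[name]["K-Fold MSE"])
best_boot = min(results, key=lambda name: results[name]["Bootstrapping MSE"])
best_aic  = min(results, key=lambda name: results[name]["AIC"])
print(f"CV: {best_cv} | Bootstrap: {best_boot} | AIC: {best_aic}")
```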

#### In what cases might the methods you've written fail or give incorrect or undesirable results?
- **Multicollinearity:**
  - If features are highly correlated, the estimated coefficients may not reflect each feature's individual contribution, and the normal-equation solve used here (`np.linalg.inv(X_b.T @ X_b)`) can become numerically unstable or even singular.

- **Small Sample Sizes:**
  - With few observations, the selectors have little data to work with, which leads to high variance in both the fitted coefficients and the MSE estimates.

- **High-Variance Models:**
  - The bootstrap estimator implemented here evaluates each refitted model on the full original dataset, which overlaps the bootstrap training sample, so flexible, high-variance models can receive optimistically biased error estimates.

- **Poorly Specified Models:**
  - If a model is misspecified (e.g., a badly chosen regularization strength), the selectors can be misled. For instance, setting the ridge `alpha` far too high over-shrinks the coefficients and causes underfitting, as illustrated in the sketch below.
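
A minimal, self-contained sketch of the over-regularization case (the `ridge_regression` body below is copied from `model_selection.py` so nothing needs to be imported; importing `model_selection` directly would also run its demo code):

```python
import numpy as np

# Same closed-form ridge fit as in model_selection.py (intercept left unpenalized).
def ridge_regression(X, y, alpha=1.0):
    X_b = np.hstack((np.ones((X.shape[0], 1)), X))
    I = np.eye(X_b.shape[1])
    I[0, 0] = 0
    return np.linalg.inv(X_b.T @ X_b + alpha * I) @ X_b.T @ y

rng = np.random.default_rng(0)
X = rng.random((100, 1))
y = X @ np.array([1.0]) + rng.standard_normal(100)

print(ridge_regression(X, y, alpha=1.0))     # mild shrinkage of the slope
print(ridge_regression(X, y, alpha=1000.0))  # slope driven toward 0 -> underfitting
```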

#### What could you implement given more time to mitigate these cases or help users of your methods?
- **Feature Selection:**
  - Implement feature-selection methods to handle multicollinearity. Techniques like the Variance Inflation Factor (VIF) can identify and remove highly correlated features (see the sketch after this list).

- **Regularization Techniques:**
  - Implement additional regularization methods such as Lasso and Elastic Net alongside Ridge, to handle different data and model requirements.

- **Cross-Validation with Multiple Splits:**
  - Repeat cross-validation over several random splits and average the results to obtain a more stable estimate of model performance.

- **Bootstrap Aggregation:**
  - Implement bootstrap aggregation (bagging) to reduce the variance of the model estimates and stabilize the results.

- **Model Diagnostics:**
  - Provide diagnostics such as residual plots and coefficient plots to help users understand and interpret the model selection process.
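
A minimal sketch of the VIF check mentioned above; the `vif` helper is hypothetical (it is not part of `model_selection.py`):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (hypothetical helper)."""
    vifs = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        X_b = np.hstack((np.ones((X.shape[0], 1)), others))
        beta = np.linalg.lstsq(X_b, X[:, j], rcond=None)[0]
        resid = X[:, j] - X_b @ beta
        r2 = 1 - resid.var() / X[:, j].var()
        vifs.append(1.0 / max(1e-12, 1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(0)
x1 = rng.standard_normal(200)
X = np.column_stack([x1, x1 + 0.01 * rng.standard_normal(200)])  # nearly collinear pair
print(vif(X))  # values well above 10 flag multicollinearity
```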

#### What parameters have you exposed to your users in order to use your model selectors?
- **ridge_regression:**
  - `alpha`: controls the strength of the L2 regularization (default: 1.0).
- **K-Fold Cross-Validation (`k_fold_cross_validation`):**
  - `k`: number of folds (default: 5).
  - `model_fn`: model-fitting function (e.g., `linear_regression` or `ridge_regression`).
  - `model_params`: extra keyword arguments forwarded to `model_fn` (e.g., `alpha=1.0` for ridge regression; given as `{"alpha": 1.0}` in the `models` dict).

- **Bootstrapping (`bootstrapping`):**
  - `B`: number of bootstrap samples (default: 1000).
  - `model_fn`: model-fitting function (e.g., `linear_regression` or `ridge_regression`).
  - `model_params`: extra keyword arguments forwarded to `model_fn` (e.g., `alpha=1.0` for ridge regression).
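
For example, assuming the functions defined in `model_selection.py` are available in the current session (and `X`, `y` from step 3):

```python
# Compare a plain least-squares fit with a more strongly regularized ridge fit.
cv_mse_ols     = k_fold_cross_validation(X, y, k=10, model_fn=linear_regression)
cv_mse_ridge   = k_fold_cross_validation(X, y, k=10, model_fn=ridge_regression, alpha=0.5)
boot_mse_ridge = bootstrapping(X, y, B=500, model_fn=ridge_regression, alpha=0.5)
print(cv_mse_ols, cv_mse_ridge, boot_mse_ridge)
```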
Binary file added images/Figure_1.png
Binary file added images/image.png
175 changes: 175 additions & 0 deletions model_selection.py
@@ -0,0 +1,175 @@
import numpy as np
from typing import Callable, Dict, Any, Tuple

# matplotlib is optional; fall back to text-only output if it is unavailable.
try:
    import matplotlib.pyplot as plt
except Exception:
    plt = None

def linear_regression(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Ordinary least squares via the normal equations (intercept prepended)."""
    X_b = np.hstack((np.ones((X.shape[0], 1)), X))
    beta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
    return beta

def ridge_regression(X: np.ndarray, y: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Closed-form ridge regression; the intercept term is not penalized."""
    X_b = np.hstack((np.ones((X.shape[0], 1)), X))
    I = np.eye(X_b.shape[1])
    I[0, 0] = 0  # do not penalize the intercept
    beta = np.linalg.inv(X_b.T @ X_b + alpha * I) @ X_b.T @ y
    return beta

def k_fold_cross_validation(X: np.ndarray, y: np.ndarray, k: int = 5,
                            model_fn: Callable = linear_regression, **model_params) -> float:
    """Average test MSE over k folds (samples beyond k * fold_size are unused)."""
    n = X.shape[0]
    fold_size = n // k
    indices = np.arange(n)
    np.random.shuffle(indices)
    mse_scores = np.zeros(k)

    for i in range(k):
        test_indices = indices[i * fold_size : (i + 1) * fold_size]
        train_indices = np.setdiff1d(indices, test_indices)
        X_train, y_train = X[train_indices], y[train_indices]
        X_test, y_test = X[test_indices], y[test_indices]

        beta = model_fn(X_train, y_train, **model_params)
        X_test_b = np.hstack((np.ones((X_test.shape[0], 1)), X_test))
        predictions = X_test_b @ beta
        mse_scores[i] = np.mean((y_test - predictions) ** 2)

    return np.mean(mse_scores)

def bootstrapping(X: np.ndarray, y: np.ndarray, B: int = 1000,
                  model_fn: Callable = linear_regression, **model_params) -> float:
    """Average MSE over B bootstrap refits, each evaluated on the full original data."""
    n = X.shape[0]
    mse_scores = np.zeros(B)

    for b in range(B):
        indices = np.random.choice(n, n, replace=True)
        X_b = X[indices]
        y_b = y[indices]

        beta = model_fn(X_b, y_b, **model_params)
        X_test_b = np.hstack((np.ones((n, 1)), X))
        predictions = X_test_b @ beta
        mse_scores[b] = np.mean((y - predictions) ** 2)

    return np.mean(mse_scores)

def aic(X: np.ndarray, y: np.ndarray, beta: np.ndarray) -> Tuple[float, float]:
    """Gaussian AIC (up to an additive constant) and the residual sum of squares."""
    n = X.shape[0]
    X_b = np.hstack((np.ones((X.shape[0], 1)), X))
    y_pred = X_b @ beta
    ssr = np.sum((y - y_pred) ** 2)
    p = len(beta) - 1  # number of features (excluding the intercept)
    aic_value = n * np.log(ssr / n) + 2 * (p + 1)
    return aic_value, ssr

def evaluate_models(models: Dict[str, Any], X: np.ndarray, y: np.ndarray,
                    k: int = 5, B: int = 1000) -> Dict[str, Dict[str, Any]]:
    """Fit each model on the full data and score it with K-Fold CV, bootstrapping, and AIC."""
    results = {}
    for name, model_info in models.items():
        model_fn = model_info['fn']
        model_params = model_info.get('params', {})
        beta = model_fn(X, y, **model_params)
        cv_mse = k_fold_cross_validation(X, y, k, model_fn, **model_params)
        bootstrap_mse = bootstrapping(X, y, B, model_fn, **model_params)
        aic_val, ssr = aic(X, y, beta)
        results[name] = {
            "Coefficients": beta,
            "K-Fold MSE": cv_mse,
            "Bootstrapping MSE": bootstrap_mse,
            "AIC": aic_val,
            "SSR": ssr,
        }
    return results

def verify_coefficients(X: np.ndarray, y: np.ndarray, beta: np.ndarray,
                        model_fn: Callable, **model_params) -> bool:
    """Sanity check: refitting the model reproduces the reported coefficients."""
    beta_sanity = model_fn(X, y, **model_params)
    return np.allclose(beta, beta_sanity)

def verify_aic(X: np.ndarray, y: np.ndarray, beta: np.ndarray, aic_value: float) -> bool:
    """Sanity check: recomputing AIC by hand matches the reported value."""
    n = X.shape[0]
    X_b = np.hstack((np.ones((X.shape[0], 1)), X))
    y_pred = X_b @ beta
    ssr = np.sum((y - y_pred) ** 2)
    p = len(beta) - 1
    aic_manual = n * np.log(ssr / n) + 2 * (p + 1)
    return np.isclose(aic_value, aic_manual)


np.random.seed(42)
n = 100
p = 1
X = np.random.rand(n, p)
beta_true = np.array([1])
y = X @ beta_true + np.random.randn(n)

models = {
"Linear Regression": {"fn": linear_regression},
"Ridge Regression": {"fn": ridge_regression, "params": {"alpha": 1.0}},
}

results = evaluate_models(models, X, y, k=5, B=1000)

if plt is not None:
    def visualize(evaluate_models, X, y, models, results):
        """Bar charts comparing K-Fold/bootstrapping MSE and AIC across models."""
        # Summarize the precomputed results; the extra arguments are kept only
        # for compatibility with the original call signature.
        kfold_mse = [result['K-Fold MSE'] for result in results.values()]
        bootstrap_mse = [result['Bootstrapping MSE'] for result in results.values()]
        aic_values = [result['AIC'] for result in results.values()]
        models_names = list(results.keys())

        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
        bar_width = 0.35
        index = np.arange(len(models_names))
        bars1 = ax1.bar(index, kfold_mse, bar_width, label='K-Fold MSE', color='blue')
        bars2 = ax1.bar(index + bar_width, bootstrap_mse, bar_width, label='Bootstrapping MSE', color='green')

        ax1.set_xlabel('Models', fontsize=12)
        ax1.set_xticks(index + bar_width / 2)
        ax1.set_xticklabels(models_names, rotation=45)
        ax1.set_title('Model Evaluation: MSE Comparison', fontsize=14)

        def autolabel(bars, ax):
            # Write each bar's value above it.
            for bar in bars:
                height = bar.get_height()
                ax.annotate(f'{height:.4f}',
                            xy=(bar.get_x() + bar.get_width() / 2, height),
                            xytext=(0, 3),
                            textcoords="offset points",
                            ha='center', va='bottom')

        autolabel(bars1, ax1)
        autolabel(bars2, ax1)

        ax1.legend(loc='upper center', bbox_to_anchor=(0.5, 1.15), ncol=2, fontsize=10)
        ax1.set_yticks([])

        bars3 = ax2.bar(index, aic_values, bar_width * 2, label='AIC', color='red')
        ax2.set_xlabel('Models', fontsize=12)
        ax2.set_xticks(index)
        ax2.set_xticklabels(models_names, rotation=45)
        ax2.set_title('Model Evaluation: AIC Comparison', fontsize=14)

        autolabel(bars3, ax2)
        ax2.legend(loc='upper center', bbox_to_anchor=(0.5, 1.15), ncol=1, fontsize=10)
        ax2.set_yticks([])

        plt.tight_layout()
        plt.show()
        return results

    results = visualize(evaluate_models, X, y, models, results)


for name, result in results.items():
    print(f"Model: {name}")
    print(f"Coefficients: {result['Coefficients']}")
    print(f"K-Fold MSE: {result['K-Fold MSE']}")
    print(f"Bootstrapping MSE: {result['Bootstrapping MSE']}")
    print(f"AIC: {result['AIC']}")
    print(f"SSR: {result['SSR']}")
    print(f"Verification (Coefficients): {verify_coefficients(X, y, result['Coefficients'], models[name]['fn'], **models[name].get('params', {}))}")
    print(f"Verification (AIC): {verify_aic(X, y, result['Coefficients'], result['AIC'])}")
    print("\n")
    print("-" * 30)