145 changes: 145 additions & 0 deletions 01_materials/notebooks/Classification_cheatsheet.ipynb
@@ -0,0 +1,145 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "57e01e55",
"metadata": {},
"source": [
"## Steps for K-Nearest Neighbors algorithm for Classification problem\n",
"\n",
"### Manual\n",
"Step1: **Install** the packages and **import** the dataset \n",
"Step2: **Inspect** the dataset, clean if required \n",
"Step3: **Visualize** the data \n",
"Step4: Manually calculate the distance from new observation to all the data points for the predict variables \n",
"Step5: Import Scikitlearn package and **Initialise the model** \n",
"Step6: **Fit the model** using cancer dataset, the `X` argument is used to specify the data for the predictor variables, while the `y` argument is used to specify the data for the response variable \n",
"Stpe7: **Predict** using the new observation"
]
},
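The manual steps above can be sketched with a hypothetical toy dataset (two predictors, two classes) standing in for the cancer data; the new observation, the points, and k = 3 are all made up for illustration:

```python
import numpy as np

# Hypothetical toy data: two predictor columns and a class label per row.
X = np.array([[1.0, 2.0],
              [2.0, 3.0],
              [3.0, 3.0],
              [6.0, 7.0],
              [7.0, 8.0]])
y = np.array(["benign", "benign", "benign", "malignant", "malignant"])

new_obs = np.array([2.5, 3.0])

# Step 4: Euclidean distance from the new observation to every data point.
distances = np.sqrt(((X - new_obs) ** 2).sum(axis=1))

# The k nearest neighbors vote on the predicted class (k = 3 here).
k = 3
nearest = np.argsort(distances)[:k]
votes = y[nearest]
prediction = max(set(votes), key=list(votes).count)
print(prediction)  # → benign
```

The same majority-vote logic is what `KNeighborsClassifier` performs internally once the model is fitted.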
{
"cell_type": "markdown",
"id": "7b21a020",
"metadata": {},
"source": [
"### Using Scikitlearn package\n",
"Step1: **Install** the packages and **import** the dataset \n",
"Step2: **Inspect** the dataset, **clean** the data if required \n",
"Step3: **Visualize** the data \n",
"Step4: **Scale** the data by initialising the **StandardScaler()** and using **fit_transform()** method \n",
"Step5: **Split** the data into train and **train_test_split()** method and the random seed \n",
"Step6: Import Scikitlearn package and **Initialise the model** \n",
"Step7: **Fit the model** using train dataset \n",
"Stpe8: **Predict** using the test dataset \n",
"Step9: Measure the accuracy of the model using the **score()** method \n",
"Step10: View the **confusion matrix** output to breakdown the number of correct and incorrect predictions for each class using **cross_tab()** method \n",
"Step11: Measure the precision using **precision_score()** method \n",
"Step12: Measure the recall using **recall_score()** method \n",
"\n",
"### Confusion matrix: \n",
"<table>\n",
" <tr>\n",
" <th></th>\n",
" <th>Predicted Malignant</th>\n",
" <th>Predicted Benign</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Actually Malignant</th>\n",
" <td>True Positive</td>\n",
" <td>False Negative</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Actually Benign</th>\n",
" <td>False Positive</td>\n",
" <td>True Negative</td>\n",
" </tr>\n",
"</table>\n",
" \n",
"### Precision: \n",
"$$\n",
"\\text{Precision} = \\frac{\\text{True Positives}}{\\text{True Positives} + \\textbf{False Positives}}\n",
"$$\n",
"\n",
"### Recall: \n",
"$$\n",
"\\text{Recall} = \\frac{\\text{True Positives}}{\\text{True Positives} + \\textbf{False Negatives}}\n",
"$$\n",
"\n",
"\n",
"### Notes: \n",
"Precision measures how many of the predicted positives are actually positive. High precision means that when the classifier predicts a positive, it's likely to be correct. \n",
"Recall measures how many actual positive observations were correctly identified by the classifier. High recall means that if there is a positive instance in the test data, the classifier is likely to detect it. "
]
},
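The whole scikit-learn pipeline above can be sketched end to end. This uses scikit-learn's built-in `load_breast_cancer` dataset as a stand-in for the course data (there, target 1 is benign), and it follows the step order literally; note that fitting the scaler on the full dataset before splitting, as the steps imply, leaks a little information into the test set — fitting it on the training split only is stricter:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score, recall_score

# Steps 1-2: load the stand-in dataset (already clean).
X, y = load_breast_cancer(return_X_y=True)

# Step 4: scale the predictors.
X_scaled = StandardScaler().fit_transform(X)

# Step 5: split into train and test sets with a fixed random seed.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, random_state=42)

# Steps 6-8: initialise, fit, and predict.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Step 9: accuracy on the test set.
accuracy = knn.score(X_test, y_test)

# Step 10: confusion matrix via pandas crosstab.
conf = pd.crosstab(y_test, y_pred,
                   rownames=["actual"], colnames=["predicted"])

# Steps 11-12: precision and recall (pos_label defaults to 1).
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print(accuracy, precision, recall)
```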
{
"cell_type": "markdown",
"id": "f1d177ef",
"metadata": {},
"source": [
"### Tuning the Classifier \n",
"\n",
"Step1: **Initialize** the model with some k-value \n",
"Step2: **Fit** the model using the train dataset \n",
"Step3: Using **cross_validate()** method pass the parameters model, number of folds(cv), X and y values, which returns a dictionary with the validation scores for each fold. \n",
"step4: Compute the **mean** and **Standard Error of Mean(SEM)** of test score, to summarize the Classifier's performance \n",
"\n",
"\n",
"\n",
"#### Notes\n",
"\n",
"1. We use cross validation to ensure each observation is in the validation set only once. \n",
"2. The cross validation automatically handles the class stratification in each fold. "
]
},
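A minimal sketch of the cross-validation steps above, again using the built-in breast cancer data as a stand-in and k = 5 as an arbitrary starting value:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Steps 1-2: initialise the model (fitting happens inside cross_validate).
knn = KNeighborsClassifier(n_neighbors=5)

# Step 3: 5-fold cross-validation returns per-fold validation scores.
cv_results = cross_validate(knn, X, y, cv=5)

# Step 4: summarize with the mean and SEM of the test score.
scores = pd.Series(cv_results["test_score"])
mean_score = scores.mean()
sem = scores.sem()  # standard error of the mean
print(mean_score, sem)
```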
{
"cell_type": "markdown",
"id": "3b5d1a64",
"metadata": {},
"source": [
"### Hyper parameter(K) tuning overview \n",
"\n",
"1. **Split the Data**: Use `train_test_split` to divide the data into training and test sets. Set `stratify` to the class label column to maintain class distribution. Set the test set aside.\n",
"\n",
"2. **Define the Parameter Grid**: Specify the range of $k$ values to tune.\n",
"\n",
"3. **Perform Grid Search**: Use `GridSearchCV` with a parameter grid (passing estimator, parameter grid and CV values) to estimate accuracy for different $k$ values.\n",
"\n",
"4. **Execute Grid Search**: Fit the `GridSearchCV` instance on the training data to find the best $k$.\n",
"\n",
"5. **Select Optimal $k$**: Choose the $k$ value with high accuracy(using **best_params_** attribute) and stable performance across nearby values.\n",
"\n",
"6. **Retrain the Model**: Create a new model with the best $k$ and fit it to the training data.\n",
"\n",
"7. **Evaluate the Model**: Assess the model's accuracy on the test set using the `score` method.\n",
"\n"
]
},
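The seven tuning steps above can be sketched as follows; the dataset, the grid of odd $k$ values, and `cv=5` are illustrative choices, not prescriptions from the course:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Step 1: split, stratifying on the class label.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# Step 2: parameter grid of candidate k values.
param_grid = {"n_neighbors": range(1, 21, 2)}

# Steps 3-4: grid search with 5-fold cross-validation, fit on train data.
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)

# Step 5: best k found by cross-validation.
best_k = grid.best_params_["n_neighbors"]

# Steps 6-7: retrain with the best k and evaluate on the test set.
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print(best_k, final.score(X_test, y_test))
```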
{
"cell_type": "markdown",
"id": "4903284a",
"metadata": {},
"source": [
"From the `GridSearchCV` results, focus on:\n",
"\n",
"- **Number of neighbors** (`param_n_neighbors`)\n",
"- **Cross-validation accuracy estimate** (`mean_test_score`)\n",
"- **Standard error of the accuracy estimate**\n",
"\n",
"GridSearchCV does not directly provide the standard error, but you can compute it using the standard deviation (`std_test_score`) with the formula:\n",
"\n",
"$$\n",
" \\text{Standard Error} = \\frac{\\text{Standard Deviation}}{\\sqrt{\\text{Number of Folds}}} \n",
" $$\n",
"\n",
"This formula allows you to estimate the uncertainty around the accuracy estimate."
]
}
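A quick numeric check of the standard-error formula above, using made-up values for `std_test_score` and the fold count:

```python
import math

# Hypothetical cross-validation summary for one k: standard deviation of
# the per-fold accuracy (std_test_score) and the number of folds used.
std_test_score = 0.02
n_folds = 5

# Standard Error = Standard Deviation / sqrt(Number of Folds)
standard_error = std_test_score / math.sqrt(n_folds)
print(round(standard_error, 4))  # → 0.0089
```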
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
171 changes: 171 additions & 0 deletions 01_materials/notebooks/KNN_and_Linear_Reg_cheatsheet.ipynb
@@ -0,0 +1,171 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "a483c6c1",
"metadata": {},
"source": [
"### **KNN Regression** \n",
"\n",
"Step1: **Install** the packages and **import** the dataset \n",
"Step2: **Inspect** the dataset, **clean** the data if required \n",
"Step3: **Visualize** the data \n",
"Step4: **Calculate** the obsolute difference between new observation and the each data point \n",
"Step5: **Find** the k nearest rows with smallest difference (closest to the new observation) \n",
"Step6: **Predict** the response variable value by calculating the average of the k nearest values "
]
},
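Steps 4-6 above can be sketched with a hypothetical one-predictor dataset; the values and k = 3 are made up for illustration:

```python
import numpy as np

# Hypothetical 1-D data: predictor x and response y.
x = np.array([1.0, 2.0, 3.0, 5.0, 8.0])
y = np.array([10.0, 12.0, 15.0, 22.0, 35.0])
new_x = 2.4
k = 3

# Step 4: absolute difference between the new observation and each point.
diffs = np.abs(x - new_x)

# Step 5: indices of the k rows with the smallest difference.
nearest = np.argsort(diffs)[:k]

# Step 6: predict by averaging the k nearest response values.
prediction = y[nearest].mean()
print(prediction)  # mean of y at x = 2.0, 3.0, 1.0
```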
{
"cell_type": "markdown",
"id": "f839b0c6",
"metadata": {},
"source": [
"### Training, evaluating, and tuning the model"
]
},
{
"cell_type": "markdown",
"id": "5c210e7c",
"metadata": {},
"source": [
"Step1: **Split** the dataset into train and test "
]
},
{
"cell_type": "markdown",
"id": "3a4751ec",
"metadata": {},
"source": [
"Step2: **Cross-validation**\n",
"\n",
"In KNN regression, to evaluate how well the model predicts the response variable, we use root mean square prediction error (RMSPE). The formula for calculating RMSPE is:\n",
"\n",
"$$\n",
" \\text{RMSPE} = \\sqrt{\\frac{1}{n}\\sum\\limits_{i=1}^{n}(y_i - \\hat{y}_i)^2}\n",
"$$\n",
"\n",
"where:\n",
"- $ y_i $ is the true value of the response variable,\n",
"- $ \\hat{y}_i $ is the predicted value from the model,\n",
"- $ n $ is the number of observations.\n",
"\n",
"RMSPE measures how much our predictions deviate from the actual values. It gives us an idea of how close our predictions are to the real outcomes."
]
},
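As a worked example of the RMSPE formula above, with made-up true and predicted values:

```python
import math

# Toy true vs. predicted responses to illustrate the RMSPE formula.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.5, 5.5, 8.0]

n = len(y_true)
rmspe = math.sqrt(sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / n)
print(round(rmspe, 4))  # sqrt((0.25 + 0.25 + 1.0) / 3) ≈ 0.7071
```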
{
"cell_type": "markdown",
"id": "aa933e25",
"metadata": {},
"source": [
"### **KNN Regressor training and K value optimization** "
]
},
{
"cell_type": "markdown",
"id": "f7506b34",
"metadata": {},
"source": [
"Step1: **Split** the train data into X(predictor) and y(response) variables \n",
"Step2: **Initialise the model** KNN regressor \n",
"Step3: Define the **Parameter grid** \n",
"Step4: Initialize **GridsearchCV** \n",
"Step5: **Fit** the model using the train dataset \n",
"Step6: Measure the accuracy of the model using the **score()** method \n",
"Step7: **Retrieve and format** results( After fitting the model, we extract the cross-validation results using **`cv_results_`**. This output includes various metrics and parameters tested during the cross-validation process.) \n",
"Step8: Find the best K value from the mean test score using **best_params_** method on the model or smallest value of mean test score from the cv results \n",
"Step9: **Evaluate** the model or **Predict** the response variable using the test data \n",
"Step10: Calculate the **RMSPE(Root Mean Squared Prediction Error)** by passing the true and predicted values \n",
"Step11: Calculate and display R2 (coefficient of determination) "
]
},
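The tuning steps above, sketched on a synthetic dataset (`make_regression` stands in for the course data; the grid of odd k values and `cv=5` are illustrative choices):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic regression data standing in for the course dataset.
X, y = make_regression(n_samples=200, n_features=1, noise=10.0,
                       random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Steps 2-5: grid-search k, scoring with negated RMSE.
param_grid = {"n_neighbors": range(1, 31, 2)}
grid = GridSearchCV(KNeighborsRegressor(), param_grid, cv=5,
                    scoring="neg_root_mean_squared_error")
grid.fit(X_train, y_train)

# Steps 7-8: inspect cv_results_ and read off the best k.
results = pd.DataFrame(grid.cv_results_)[
    ["param_n_neighbors", "mean_test_score", "std_test_score"]]
best_k = grid.best_params_["n_neighbors"]

# Steps 9-11: predict on the test set, then compute RMSPE and R^2.
y_pred = grid.predict(X_test)
rmspe = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(best_k, round(rmspe, 2), round(r2, 3))
```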
{
"cell_type": "markdown",
"id": "c372d2e1",
"metadata": {},
"source": [
"#### **Benefits** \n",
"It is simplet to understand and can capture non-linear relationships in data. \n",
"It works very well when the data has patterns that at well explained by the nearby neighbors. \n",
"\n",
"#### **Limitations**\n",
"It struggles to make predictions for values outside the range of the training data, meaning it can’t effectively handle cases where the target variable extends beyond what’s been observed. \n",
"Additionally, as the dataset grows larger, KNN becomes computationally slower since it has to calculate distances for every new prediction."
]
},
{
"cell_type": "markdown",
"id": "33a3d36f",
"metadata": {},
"source": [
"### **Simple Linear Regression** \n",
"When there is a need to generalize beyond the training data or handle larger datasets more efficiently, we often turn to linear regression as an alternative. Linear regression offers a more scalable approach and provides a way to make predictions across a wider range of values."
]
},
{
"cell_type": "markdown",
"id": "17b087ff",
"metadata": {},
"source": [
"Step1: **Split** the dataset into train and test \n",
"Step2: **Initialize** the linear regression model \n",
"Step3: **Fit** the model using train data \n",
"Step4: Make a dataframe containing b_1 (slope) and b_0 (intercept) coefficients using the model and the model can be represented in the equation form \n",
"Step5: **Predict** on the test dataset \n",
"Step6: Calculate **RMSPE and R2 score** to evaluate the performance of the model by passing the true and predicted responses "
]
},
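A minimal sketch of the six steps above, on made-up data where the true relationship is y = 5 + 3x plus noise:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Toy data: a noisy linear relationship (stand-in for the course data).
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X.ravel() + 5.0 + rng.normal(0, 1.0, size=100)

# Steps 1-3: split, initialise, fit.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
lm = LinearRegression().fit(X_train, y_train)

# Step 4: coefficients, so the model reads y = b_0 + b_1 * x.
coefs = pd.DataFrame({"b_0 (intercept)": [lm.intercept_],
                      "b_1 (slope)": [lm.coef_[0]]})

# Steps 5-6: predict on the test set and evaluate.
y_pred = lm.predict(X_test)
rmspe = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(coefs, rmspe, r2)
```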
{
"cell_type": "markdown",
"id": "69724b44",
"metadata": {},
"source": [
"### **Cross Validation** for Linear Regression "
]
},
{
"cell_type": "markdown",
"id": "6592381f",
"metadata": {},
"source": [
"The cross-validation provides a more reliable and comprehensive evaluation of your model's performance compared to a single train-test split!\n",
"\n",
"To perform 5-fold cross-validation in Python using `scikit-learn`, we need to follow these steps:\n",
"1. Set **cv=5** for 5 folds.\n",
"2. Pass the **estimator, predictors and response as X and y along with scoring** equals to either \"neg_root_mean_squared_error\" or \"r2\".\n",
"3. Use the **`cross_validate`** function from scikit-learn.\n",
"4. Convert the results into a pandas DataFrame for better visualization of the **test score**.\n",
"5. Calculate the **Mean** and **Standard Error of Mean** of the **test score** from the results \n"
]
},
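The five steps above can be sketched on toy data (a noisy y = 2x relationship, made up for illustration); note that with `neg_root_mean_squared_error` the scores come back negated, so their sign is flipped for display:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Toy data with a known linear signal and unit-variance noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X.ravel() + rng.normal(0, 1.0, size=100)

# Steps 1-3: 5-fold cross-validation scored with negated RMSE.
cv_results = cross_validate(LinearRegression(), X, y, cv=5,
                            scoring="neg_root_mean_squared_error")

# Step 4: tabulate the per-fold results.
scores = pd.DataFrame(cv_results)

# Step 5: mean and standard error of the mean of the test score.
mean_score = scores["test_score"].mean()
sem = scores["test_score"].sem()
print(round(-mean_score, 2), round(sem, 3))
```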
{
"cell_type": "markdown",
"id": "de6f9a8b",
"metadata": {},
"source": [
"### **Multivariable Linear Regression**"
]
},
{
"cell_type": "markdown",
"id": "4e385d7d",
"metadata": {},
"source": [
"Step1: **Split** the dataset into train and test \n",
"Step2: **Initialize** the linear regression model \n",
"Step3: **Fit** the model using train data(passing **multiple predictor variables** into X and response variable as y) \n",
"Step4: Make a dataframe containing b_1 (slope) and b_0 (intercept) coefficients using the model and the model can be represented in the equation form \n",
"Step5: **Predict** on the test dataset \n",
"Step6: Calculate **RMSPE and R2 score** to evaluate the performance of the model by passing the true and predicted responses \n",
"\n",
"**Note:** Cross validation steps are identical to the Simple Linear Regression "
]
}
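The same pipeline with two predictors, on made-up data where the true model is y = 4 + 2x₁ - 1.5x₂ plus noise; the only change from the simple case is the shape of X and the number of slope coefficients:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Toy data with two predictor columns (hypothetical, for illustration).
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(150, 2))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 4.0 + rng.normal(0, 1.0, size=150)

# Steps 1-3: split and fit with multiple predictors in X.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
lm = LinearRegression().fit(X_train, y_train)

# Step 4: one slope per predictor plus the intercept.
coefs = pd.DataFrame({"term": ["b_0", "b_1", "b_2"],
                      "value": [lm.intercept_, *lm.coef_]})

# Steps 5-6: predict and evaluate on the test set.
y_pred = lm.predict(X_test)
rmspe = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(coefs, rmspe, r2)
```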
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 5
}