145 changes: 145 additions & 0 deletions 01_materials/notebooks/Classification_cheatsheet.ipynb
@@ -0,0 +1,145 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "57e01e55",
"metadata": {},
"source": [
"## Steps for K-Nearest Neighbors algorithm for Classification problem\n",
"\n",
"### Manual\n",
"Step1: **Install** the packages and **import** the dataset \n",
"Step2: **Inspect** the dataset, clean if required \n",
"Step3: **Visualize** the data \n",
"Step4: Manually calculate the distance from new observation to all the data points for the predict variables \n",
"Step5: Import Scikitlearn package and **Initialise the model** \n",
"Step6: **Fit the model** using cancer dataset, the `X` argument is used to specify the data for the predictor variables, while the `y` argument is used to specify the data for the response variable \n",
"Stpe7: **Predict** using the new observation"
]
},
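The manual steps above can be sketched with a hypothetical toy dataset (two predictors, two classes) standing in for the cancer data; the new observation, the points, and k = 3 are all made up for illustration:

```python
import numpy as np

# Hypothetical toy data: two predictor columns and a class label per row.
X = np.array([[1.0, 2.0],
              [2.0, 3.0],
              [3.0, 3.0],
              [6.0, 7.0],
              [7.0, 8.0]])
y = np.array(["benign", "benign", "benign", "malignant", "malignant"])

new_obs = np.array([2.5, 3.0])

# Step 4: Euclidean distance from the new observation to every data point.
distances = np.sqrt(((X - new_obs) ** 2).sum(axis=1))

# The k nearest neighbors vote on the predicted class (k = 3 here).
k = 3
nearest = np.argsort(distances)[:k]
votes = y[nearest]
prediction = max(set(votes), key=list(votes).count)
print(prediction)  # → benign
```

The same majority-vote logic is what `KNeighborsClassifier` performs internally once the model is fitted.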
{
"cell_type": "markdown",
"id": "7b21a020",
"metadata": {},
"source": [
"### Using Scikitlearn package\n",
"Step1: **Install** the packages and **import** the dataset \n",
"Step2: **Inspect** the dataset, **clean** the data if required \n",
"Step3: **Visualize** the data \n",
"Step4: **Scale** the data by initialising the **StandardScaler()** and using **fit_transform()** method \n",
"Step5: **Split** the data into train and **train_test_split()** method and the random seed \n",
"Step6: Import Scikitlearn package and **Initialise the model** \n",
"Step7: **Fit the model** using train dataset \n",
"Stpe8: **Predict** using the test dataset \n",
"Step9: Measure the accuracy of the model using the **score()** method \n",
"Step10: View the **confusion matrix** output to breakdown the number of correct and incorrect predictions for each class using **cross_tab()** method \n",
"Step11: Measure the precision using **precision_score()** method \n",
"Step12: Measure the recall using **recall_score()** method \n",
"\n",
"### Confusion matrix: \n",
"<table>\n",
" <tr>\n",
" <th></th>\n",
" <th>Predicted Malignant</th>\n",
" <th>Predicted Benign</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Actually Malignant</th>\n",
" <td>True Positive</td>\n",
" <td>False Negative</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Actually Benign</th>\n",
" <td>False Positive</td>\n",
" <td>True Negative</td>\n",
" </tr>\n",
"</table>\n",
" \n",
"### Precision: \n",
"$$\n",
"\\text{Precision} = \\frac{\\text{True Positives}}{\\text{True Positives} + \\textbf{False Positives}}\n",
"$$\n",
"\n",
"### Recall: \n",
"$$\n",
"\\text{Recall} = \\frac{\\text{True Positives}}{\\text{True Positives} + \\textbf{False Negatives}}\n",
"$$\n",
"\n",
"\n",
"### Notes: \n",
"Precision measures how many of the predicted positives are actually positive. High precision means that when the classifier predicts a positive, it's likely to be correct. \n",
"Recall measures how many actual positive observations were correctly identified by the classifier. High recall means that if there is a positive instance in the test data, the classifier is likely to detect it. "
]
},
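The whole scikit-learn pipeline above can be sketched end to end. This uses scikit-learn's built-in `load_breast_cancer` dataset as a stand-in for the course data (there, target 1 is benign), and it follows the step order literally; note that fitting the scaler on the full dataset before splitting, as the steps imply, leaks a little information into the test set — fitting it on the training split only is stricter:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score, recall_score

# Steps 1-2: load the stand-in dataset (already clean).
X, y = load_breast_cancer(return_X_y=True)

# Step 4: scale the predictors.
X_scaled = StandardScaler().fit_transform(X)

# Step 5: split into train and test sets with a fixed random seed.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, random_state=42)

# Steps 6-8: initialise, fit, and predict.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Step 9: accuracy on the test set.
accuracy = knn.score(X_test, y_test)

# Step 10: confusion matrix via pandas crosstab.
conf = pd.crosstab(y_test, y_pred,
                   rownames=["actual"], colnames=["predicted"])

# Steps 11-12: precision and recall (pos_label defaults to 1).
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print(accuracy, precision, recall)
```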
{
"cell_type": "markdown",
"id": "f1d177ef",
"metadata": {},
"source": [
"### Tuning the Classifier \n",
"\n",
"Step1: **Initialize** the model with some k-value \n",
"Step2: **Fit** the model using the train dataset \n",
"Step3: Using **cross_validate()** method pass the parameters model, number of folds(cv), X and y values, which returns a dictionary with the validation scores for each fold. \n",
"step4: Compute the **mean** and **Standard Error of Mean(SEM)** of test score, to summarize the Classifier's performance \n",
"\n",
"\n",
"\n",
"#### Notes\n",
"\n",
"1. We use cross validation to ensure each observation is in the validation set only once. \n",
"2. The cross validation automatically handles the class stratification in each fold. "
]
},
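A minimal sketch of the cross-validation steps above, again using the built-in breast cancer data as a stand-in and k = 5 as an arbitrary starting value:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Steps 1-2: initialise the model (fitting happens inside cross_validate).
knn = KNeighborsClassifier(n_neighbors=5)

# Step 3: 5-fold cross-validation returns per-fold validation scores.
cv_results = cross_validate(knn, X, y, cv=5)

# Step 4: summarize with the mean and SEM of the test score.
scores = pd.Series(cv_results["test_score"])
mean_score = scores.mean()
sem = scores.sem()  # standard error of the mean
print(mean_score, sem)
```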
{
"cell_type": "markdown",
"id": "3b5d1a64",
"metadata": {},
"source": [
"### Hyper parameter(K) tuning overview \n",
"\n",
"1. **Split the Data**: Use `train_test_split` to divide the data into training and test sets. Set `stratify` to the class label column to maintain class distribution. Set the test set aside.\n",
"\n",
"2. **Define the Parameter Grid**: Specify the range of $k$ values to tune.\n",
"\n",
"3. **Perform Grid Search**: Use `GridSearchCV` with a parameter grid (passing estimator, parameter grid and CV values) to estimate accuracy for different $k$ values.\n",
"\n",
"4. **Execute Grid Search**: Fit the `GridSearchCV` instance on the training data to find the best $k$.\n",
"\n",
"5. **Select Optimal $k$**: Choose the $k$ value with high accuracy(using **best_params_** attribute) and stable performance across nearby values.\n",
"\n",
"6. **Retrain the Model**: Create a new model with the best $k$ and fit it to the training data.\n",
"\n",
"7. **Evaluate the Model**: Assess the model's accuracy on the test set using the `score` method.\n",
"\n"
]
},
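The seven tuning steps above can be sketched as follows; the dataset, the grid of odd $k$ values, and `cv=5` are illustrative choices, not prescriptions from the course:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Step 1: split, stratifying on the class label.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# Step 2: parameter grid of candidate k values.
param_grid = {"n_neighbors": range(1, 21, 2)}

# Steps 3-4: grid search with 5-fold cross-validation, fit on train data.
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)

# Step 5: best k found by cross-validation.
best_k = grid.best_params_["n_neighbors"]

# Steps 6-7: retrain with the best k and evaluate on the test set.
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print(best_k, final.score(X_test, y_test))
```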
{
"cell_type": "markdown",
"id": "4903284a",
"metadata": {},
"source": [
"From the `GridSearchCV` results, focus on:\n",
"\n",
"- **Number of neighbors** (`param_n_neighbors`)\n",
"- **Cross-validation accuracy estimate** (`mean_test_score`)\n",
"- **Standard error of the accuracy estimate**\n",
"\n",
"GridSearchCV does not directly provide the standard error, but you can compute it using the standard deviation (`std_test_score`) with the formula:\n",
"\n",
"$$\n",
" \\text{Standard Error} = \\frac{\\text{Standard Deviation}}{\\sqrt{\\text{Number of Folds}}} \n",
" $$\n",
"\n",
"This formula allows you to estimate the uncertainty around the accuracy estimate."
]
}
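A quick numeric check of the standard-error formula above, using made-up values for `std_test_score` and the fold count:

```python
import math

# Hypothetical cross-validation summary for one k: standard deviation of
# the per-fold accuracy (std_test_score) and the number of folds used.
std_test_score = 0.02
n_folds = 5

# Standard Error = Standard Deviation / sqrt(Number of Folds)
standard_error = std_test_score / math.sqrt(n_folds)
print(round(standard_error, 4))  # → 0.0089
```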
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
171 changes: 171 additions & 0 deletions 01_materials/notebooks/KNN_and_Linear_Reg_cheatsheet.ipynb
@@ -0,0 +1,171 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "a483c6c1",
"metadata": {},
"source": [
"### **KNN Regression** \n",
"\n",
"Step1: **Install** the packages and **import** the dataset \n",
"Step2: **Inspect** the dataset, **clean** the data if required \n",
"Step3: **Visualize** the data \n",
"Step4: **Calculate** the obsolute difference between new observation and the each data point \n",
"Step5: **Find** the k nearest rows with smallest difference (closest to the new observation) \n",
"Step6: **Predict** the response variable value by calculating the average of the k nearest values "
]
},
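Steps 4-6 above can be sketched with a hypothetical one-predictor dataset; the values and k = 3 are made up for illustration:

```python
import numpy as np

# Hypothetical 1-D data: predictor x and response y.
x = np.array([1.0, 2.0, 3.0, 5.0, 8.0])
y = np.array([10.0, 12.0, 15.0, 22.0, 35.0])
new_x = 2.4
k = 3

# Step 4: absolute difference between the new observation and each point.
diffs = np.abs(x - new_x)

# Step 5: indices of the k rows with the smallest difference.
nearest = np.argsort(diffs)[:k]

# Step 6: predict by averaging the k nearest response values.
prediction = y[nearest].mean()
print(prediction)  # mean of y at x = 2.0, 3.0, 1.0
```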
{
"cell_type": "markdown",
"id": "f839b0c6",
"metadata": {},
"source": [
"### Training, evaluating, and tuning the model"
]
},
{
"cell_type": "markdown",
"id": "5c210e7c",
"metadata": {},
"source": [
"Step1: **Split** the dataset into train and test "
]
},
{
"cell_type": "markdown",
"id": "3a4751ec",
"metadata": {},
"source": [
"Step2: **Cross-validation**\n",
"\n",
"In KNN regression, to evaluate how well the model predicts the response variable, we use root mean square prediction error (RMSPE). The formula for calculating RMSPE is:\n",
"\n",
"$$\n",
" \\text{RMSPE} = \\sqrt{\\frac{1}{n}\\sum\\limits_{i=1}^{n}(y_i - \\hat{y}_i)^2}\n",
"$$\n",
"\n",
"where:\n",
"- $ y_i $ is the true value of the response variable,\n",
"- $ \\hat{y}_i $ is the predicted value from the model,\n",
"- $ n $ is the number of observations.\n",
"\n",
"RMSPE measures how much our predictions deviate from the actual values. It gives us an idea of how close our predictions are to the real outcomes."
]
},
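As a worked example of the RMSPE formula above, with made-up true and predicted values:

```python
import math

# Toy true vs. predicted responses to illustrate the RMSPE formula.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.5, 5.5, 8.0]

n = len(y_true)
rmspe = math.sqrt(sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / n)
print(round(rmspe, 4))  # sqrt((0.25 + 0.25 + 1.0) / 3) ≈ 0.7071
```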
{
"cell_type": "markdown",
"id": "aa933e25",
"metadata": {},
"source": [
"### **KNN Regressor training and K value optimization** "
]
},
{
"cell_type": "markdown",
"id": "f7506b34",
"metadata": {},
"source": [
"Step1: **Split** the train data into X(predictor) and y(response) variables \n",
"Step2: **Initialise the model** KNN regressor \n",
"Step3: Define the **Parameter grid** \n",
"Step4: Initialize **GridsearchCV** \n",
"Step5: **Fit** the model using the train dataset \n",
"Step6: Measure the accuracy of the model using the **score()** method \n",
"Step7: **Retrieve and format** results( After fitting the model, we extract the cross-validation results using **`cv_results_`**. This output includes various metrics and parameters tested during the cross-validation process.) \n",
"Step8: Find the best K value from the mean test score using **best_params_** method on the model or smallest value of mean test score from the cv results \n",
"Step9: **Evaluate** the model or **Predict** the response variable using the test data \n",
"Step10: Calculate the **RMSPE(Root Mean Squared Prediction Error)** by passing the true and predicted values \n",
"Step11: Calculate and display R2 (coefficient of determination) "
]
},
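The tuning steps above, sketched on a synthetic dataset (`make_regression` stands in for the course data; the grid of odd k values and `cv=5` are illustrative choices):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic regression data standing in for the course dataset.
X, y = make_regression(n_samples=200, n_features=1, noise=10.0,
                       random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Steps 2-5: grid-search k, scoring with negated RMSE.
param_grid = {"n_neighbors": range(1, 31, 2)}
grid = GridSearchCV(KNeighborsRegressor(), param_grid, cv=5,
                    scoring="neg_root_mean_squared_error")
grid.fit(X_train, y_train)

# Steps 7-8: inspect cv_results_ and read off the best k.
results = pd.DataFrame(grid.cv_results_)[
    ["param_n_neighbors", "mean_test_score", "std_test_score"]]
best_k = grid.best_params_["n_neighbors"]

# Steps 9-11: predict on the test set, then compute RMSPE and R^2.
y_pred = grid.predict(X_test)
rmspe = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(best_k, round(rmspe, 2), round(r2, 3))
```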
{
"cell_type": "markdown",
"id": "c372d2e1",
"metadata": {},
"source": [
"#### **Benefits** \n",
"It is simplet to understand and can capture non-linear relationships in data. \n",
"It works very well when the data has patterns that at well explained by the nearby neighbors. \n",
"\n",
"#### **Limitations**\n",
"It struggles to make predictions for values outside the range of the training data, meaning it can’t effectively handle cases where the target variable extends beyond what’s been observed. \n",
"Additionally, as the dataset grows larger, KNN becomes computationally slower since it has to calculate distances for every new prediction."
]
},
{
"cell_type": "markdown",
"id": "33a3d36f",
"metadata": {},
"source": [
"### **Simple Linear Regression** \n",
"When there is a need to generalize beyond the training data or handle larger datasets more efficiently, we often turn to linear regression as an alternative. Linear regression offers a more scalable approach and provides a way to make predictions across a wider range of values."
]
},
{
"cell_type": "markdown",
"id": "17b087ff",
"metadata": {},
"source": [
"Step1: **Split** the dataset into train and test \n",
"Step2: **Initialize** the linear regression model \n",
"Step3: **Fit** the model using train data \n",
"Step4: Make a dataframe containing b_1 (slope) and b_0 (intercept) coefficients using the model and the model can be represented in the equation form \n",
"Step5: **Predict** on the test dataset \n",
"Step6: Calculate **RMSPE and R2 score** to evaluate the performance of the model by passing the true and predicted responses "
]
},
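A minimal sketch of the six steps above, on made-up data where the true relationship is y = 5 + 3x plus noise:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Toy data: a noisy linear relationship (stand-in for the course data).
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X.ravel() + 5.0 + rng.normal(0, 1.0, size=100)

# Steps 1-3: split, initialise, fit.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
lm = LinearRegression().fit(X_train, y_train)

# Step 4: coefficients, so the model reads y = b_0 + b_1 * x.
coefs = pd.DataFrame({"b_0 (intercept)": [lm.intercept_],
                      "b_1 (slope)": [lm.coef_[0]]})

# Steps 5-6: predict on the test set and evaluate.
y_pred = lm.predict(X_test)
rmspe = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(coefs, rmspe, r2)
```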
{
"cell_type": "markdown",
"id": "69724b44",
"metadata": {},
"source": [
"### **Cross Validation** for Linear Regression "
]
},
{
"cell_type": "markdown",
"id": "6592381f",
"metadata": {},
"source": [
"The cross-validation provides a more reliable and comprehensive evaluation of your model's performance compared to a single train-test split!\n",
"\n",
"To perform 5-fold cross-validation in Python using `scikit-learn`, we need to follow these steps:\n",
"1. Set **cv=5** for 5 folds.\n",
"2. Pass the **estimator, predictors and response as X and y along with scoring** equals to either \"neg_root_mean_squared_error\" or \"r2\".\n",
"3. Use the **`cross_validate`** function from scikit-learn.\n",
"4. Convert the results into a pandas DataFrame for better visualization of the **test score**.\n",
"5. Calculate the **Mean** and **Standard Error of Mean** of the **test score** from the results \n"
]
},
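The five steps above can be sketched on toy data (a noisy y = 2x relationship, made up for illustration); note that with `neg_root_mean_squared_error` the scores come back negated, so their sign is flipped for display:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Toy data with a known linear signal and unit-variance noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X.ravel() + rng.normal(0, 1.0, size=100)

# Steps 1-3: 5-fold cross-validation scored with negated RMSE.
cv_results = cross_validate(LinearRegression(), X, y, cv=5,
                            scoring="neg_root_mean_squared_error")

# Step 4: tabulate the per-fold results.
scores = pd.DataFrame(cv_results)

# Step 5: mean and standard error of the mean of the test score.
mean_score = scores["test_score"].mean()
sem = scores["test_score"].sem()
print(round(-mean_score, 2), round(sem, 3))
```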
{
"cell_type": "markdown",
"id": "de6f9a8b",
"metadata": {},
"source": [
"### **Multivariable Linear Regression**"
]
},
{
"cell_type": "markdown",
"id": "4e385d7d",
"metadata": {},
"source": [
"Step1: **Split** the dataset into train and test \n",
"Step2: **Initialize** the linear regression model \n",
"Step3: **Fit** the model using train data(passing **multiple predictor variables** into X and response variable as y) \n",
"Step4: Make a dataframe containing b_1 (slope) and b_0 (intercept) coefficients using the model and the model can be represented in the equation form \n",
"Step5: **Predict** on the test dataset \n",
"Step6: Calculate **RMSPE and R2 score** to evaluate the performance of the model by passing the true and predicted responses \n",
"\n",
"**Note:** Cross validation steps are identical to the Simple Linear Regression "
]
}
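The same pipeline with two predictors, on made-up data where the true model is y = 4 + 2x₁ - 1.5x₂ plus noise; the only change from the simple case is the shape of X and the number of slope coefficients:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Toy data with two predictor columns (hypothetical, for illustration).
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(150, 2))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 4.0 + rng.normal(0, 1.0, size=150)

# Steps 1-3: split and fit with multiple predictors in X.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
lm = LinearRegression().fit(X_train, y_train)

# Step 4: one slope per predictor plus the intercept.
coefs = pd.DataFrame({"term": ["b_0", "b_1", "b_2"],
                      "value": [lm.intercept_, *lm.coef_]})

# Steps 5-6: predict and evaluate on the test set.
y_pred = lm.predict(X_test)
rmspe = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(coefs, rmspe, r2)
```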
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 5
}