Developed by kathrinmzl
In the banking sector, effective credit risk assessment is critical for maintaining financial stability and minimizing losses. Loan defaults can lead to significant financial setbacks and reduced liquidity for lending institutions. The Loan Guard project aims to help financial institutions proactively identify borrowers who are likely to default and understand the key factors driving default risk.
In addition, the project incorporates borrower segmentation through clustering, which groups borrowers with similar financial profiles and historical behavior. This combined approach allows lenders not only to predict default probabilities but also to tailor risk management strategies and business decisions based on the characteristics of different borrower segments.
The project was created for educational purposes only.
The dataset used is publicly available on Kaggle and contains information about individual borrowers and their loan characteristics. Each row represents a loan record, including both personal and financial attributes that may influence the likelihood of default. The dataset provides a comprehensive overview of borrower profiles, such as age, income, home ownership, and employment details, as well as loan-specific features like loan amount, interest rate, and purpose.
In total, the dataset includes 32,581 records and 12 variables. The target variable, loan_status, indicates whether a borrower has defaulted on their loan (1) or successfully repaid it (0). The target distribution is imbalanced toward non-default cases, reflecting real-world lending scenarios where most borrowers do not default. This dataset enables predictive modeling to identify patterns and risk factors associated with loan default.
| Variable | Description | Role | Data Type | Units / Possible Values |
|---|---|---|---|---|
| `person_age` | Age of the borrower | Feature | int64 | Years |
| `person_income` | Annual income of the borrower | Feature | float64 | USD |
| `person_home_ownership` | Type of home ownership | Feature | object | RENT, OWN, MORTGAGE, OTHER |
| `person_emp_length` | Length of employment | Feature | float64 | Years |
| `loan_intent` | Purpose of the loan | Feature | object | PERSONAL, EDUCATION, MEDICAL, VENTURE, HOMEIMPROVEMENT, DEBTCONSOLIDATION |
| `loan_grade` | Loan grade assigned by lender | Feature | object | A, B, C, D, E, F, G |
| `loan_amnt` | Loan amount requested | Feature | float64 | USD |
| `loan_int_rate` | Interest rate applied to the loan | Feature | float64 | Percentage |
| `loan_percent_income` | Loan amount as a percentage of annual income | Feature | float64 | Ratio |
| `cb_person_default_on_file` | Whether the person has previously defaulted | Feature | object | Y, N |
| `cb_person_cred_hist_length` | Length of credit history | Feature | int64 | Years |
| `loan_status` | Loan default flag | Target | int64 | 0 = No Default, 1 = Default |
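The class imbalance described above can be confirmed by computing the share of each `loan_status` value. The snippet below is only a minimal sketch: the helper name `target_balance` and the small in-memory frame are assumptions made for illustration and are not part of the project code. In practice the frame would be loaded from the Kaggle CSV.

```python
import pandas as pd


def target_balance(df: pd.DataFrame, target: str = "loan_status") -> pd.Series:
    """Return the share of each target class, ordered by class label."""
    return df[target].value_counts(normalize=True).sort_index()


# Toy data standing in for the real dataset (values are made up).
toy = pd.DataFrame({"loan_status": [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]})
balance = target_balance(toy)
print(balance)
```

A strongly skewed result here is one reason to evaluate the classifier with recall and F1 rather than plain accuracy.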
NOTE:
When I initially started working on this project, I used a different dataset from Kaggle. After attempting to build meaningful prediction and clustering models, I decided to switch to a new dataset.
The previous dataset was highly synthetic, with all variables being uniformly distributed and showing very little correlation, both between features and with the target variable. Uniform distributions are particularly challenging for predictive modeling and clustering because they lack natural variability and concentration of values. Consequently, there are few meaningful patterns, groupings, or relationships for the models to learn from.
As a result, it was very difficult to build a predictive model with good performance metrics and without overfitting. I experimented with several approaches to improve model performance and reduce overfitting, including hyperparameter tuning and binning numerical variables, but none led to satisfactory results. Furthermore, during the cluster analysis, the results did not correspond to any recognizable borrower groups or risk profiles, limiting the usefulness of the analysis.
Therefore, I decided to switch to the current dataset. Although it required more extensive data cleaning and transformation, it produced models with stronger performance and revealed meaningful, interpretable clusters. Overall, business interpretability and analytical insight were significantly improved.
- A borrower is a person who takes out a loan from a financial institution.
- A loan is an amount of money borrowed that is expected to be paid back with interest.
- A default occurs when a borrower fails to make scheduled loan payments or meet the agreed repayment terms.
- A defaulted borrower is a borrower who has failed to repay their loan as agreed and is classified as being in default.
From a business perspective, this project supports the strategic goals of a financial institution such as:
- Improving risk management by identifying high-risk applicants early.
- Enhancing profitability through optimized loan approval decisions.
- Increasing borrower trust and operational efficiency by offering fair, data-driven credit evaluations.
- Enabling personalized loan offerings and proactive interventions for at-risk borrowers (e.g., adjusted payment plans or counseling).
Ultimately, this project aligns predictive analytics with the bank’s long-term objective of balancing growth with financial stability.
To achieve the outlined objectives, the project will focus on the following key requirements.
Business Requirement 1: Data Insights (Conventional Analysis)
- Identify key borrower and loan attributes that are most correlated with loan default. Provide visual and statistical insights to help business analysts understand the primary drivers of credit risk.
Business Requirement 2: Classification Model (Machine Learning)
- Develop a machine learning model capable of predicting whether a loan applicant is likely to default. The system should output a probability of default to support the credit team in decision-making.
Business Requirement 3: Clustering Model (Machine Learning)
- Group borrowers into risk-based clusters to segment borrowers by credit behavior and improve tailored intervention strategies.
To better understand the factors influencing loan default risk, four key hypotheses were formulated based on domain knowledge and the available data. Each hypothesis focuses on a variable expected to impact default probability.
| Hypothesis | Rationale | Validation |
|---|---|---|
| H1: Higher `loan_amnt` is associated with higher default risk | Borrowers taking larger loans may face greater repayment burdens, increasing the likelihood of default | Visualize distribution of `loan_amnt` by `loan_status`, conduct statistical test to confirm difference |
| H2: Lower `person_income` is associated with higher default risk | Borrowers with lower income may have limited financial capacity to meet repayment obligations | Visualize distribution of `person_income` by `loan_status`, conduct statistical test to confirm difference |
| H3: Lower `loan_grade` (credit quality) is associated with higher default risk | A lower loan grade reflects weaker creditworthiness and higher assessed lending risk | Analyze frequency of defaults across `loan_grade` categories, perform Chi-square test for association |
| H4: Shorter `person_emp_length` (employment length) is associated with higher default risk | Borrowers with shorter employment histories may experience less income stability, increasing repayment risk | Visualize distribution of `person_emp_length` by `loan_status`, conduct statistical test to confirm difference |
These hypotheses will be tested through exploratory data analysis and statistical testing to identify whether the respective features are influential predictors of default risk.
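As a rough illustration of the statistical tests mentioned above, the sketch below runs one test per variable type with SciPy. The Chi-square test for `loan_grade` follows the validation plan. The choice of a Mann-Whitney U test for the numerical variables is an assumption, since the plan does not name a specific test. The helper names and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, mannwhitneyu


def numeric_p_value(df, feature, target="loan_status"):
    """Two-sided Mann-Whitney U test of a numeric feature between groups."""
    defaulted = df.loc[df[target] == 1, feature].dropna()
    repaid = df.loc[df[target] == 0, feature].dropna()
    _, p_value = mannwhitneyu(defaulted, repaid, alternative="two-sided")
    return p_value


def categorical_p_value(df, feature, target="loan_status"):
    """Chi-square test of independence between a category and the target."""
    contingency = pd.crosstab(df[feature], df[target])
    _, p_value, _, _ = chi2_contingency(contingency)
    return p_value


# Synthetic data in which defaults have larger loans and worse grades.
rng = np.random.default_rng(0)
n = 400
status = rng.integers(0, 2, n)
toy = pd.DataFrame({
    "loan_status": status,
    "loan_amnt": 8000 + 4000 * status + rng.normal(0, 2000, n),
    "loan_grade": np.where(rng.random(n) < 0.2 + 0.6 * status, "D", "A"),
})

p_amount = numeric_p_value(toy, "loan_amnt")
p_grade = categorical_p_value(toy, "loan_grade")
print(f"loan_amnt p-value: {p_amount:.3g}, loan_grade p-value: {p_grade:.3g}")
```

Both tests only indicate whether the groups differ. The direction of the effect (for example, whether defaulted loans are larger) still has to be read from the distribution plots.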
This section explains how each business requirement is addressed by specific analyses, visualizations and ML techniques. It ensures that insights and predictions directly support the business goals and can be interpreted by stakeholders.
Business Requirement 1: Data Insights (Conventional Analysis)
- Identify key borrower and loan attributes that are most correlated with loan default.
- Provide visual and statistical insights to help business analysts understand the primary drivers of credit risk.
- Visualize distributions and relationships between key features and the target variable.
Business Requirement 2: Classification Model (Machine Learning)
- Develop a binary classification model to predict whether a loan applicant is likely to default.
- Show the probability of default to support the credit team in decision-making.
- Evaluate model performance and feature importance for transparency and reliability.
Business Requirement 3: Clustering Model (Machine Learning)
- Group borrowers into risk-based clusters to segment borrowers by credit behavior and improve tailored intervention strategies.
- Analyze and visualize cluster characteristics to understand risk profiles.
- Visualize cluster assignments to facilitate understanding by stakeholders.
We aim to develop a supervised machine learning model that predicts whether a loan applicant will default or not.
The model should provide a probability of default to support the credit risk team in decision-making.
- Goal: Predict if a borrower will default on their loan (`loan_status`) and provide the associated probability of default.
- Model type: Supervised - Binary Classification.
- Input features: Borrower demographic and financial attributes
- Model choice: After experimentation, a Random Forest model was chosen as the best-performing and most interpretable model.
- Success metrics (on both training and test sets):
  - Recall for default ≥ 0.75, to minimize false negatives (high-risk borrowers predicted as safe)
  - F1 score for default ≥ 0.60, to ensure a balance between recall and precision
- Failure conditions:
  - Strong degradation of performance on test data vs. train data → indicates overfitting.
  - Large imbalance between precision and recall → predictions may not be reliable for business decisions.
- Output definition:
  - Binary prediction (`0` = no default, `1` = default).
  - Probability of default (e.g., 0.76 = 76% chance of default) to guide credit risk decisions.
- Heuristics: The model should be used to prioritize risk review and support decision-making (not to fully automate rejections). Thresholds for action should be determined in consultation with the credit risk team to balance loss prevention and borrower impact.
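The success metrics above can be checked with a few lines of code. The following is a minimal sketch in plain Python so that it stays self-contained. In a scikit-learn workflow, `recall_score` and `f1_score` from `sklearn.metrics` would be the usual tools. The function names and the toy labels are assumptions for illustration only.

```python
def recall_and_f1(y_true, y_pred, positive=1):
    """Compute recall and F1 for the positive (default) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return recall, f1


def meets_success_criteria(y_true, y_pred, min_recall=0.75, min_f1=0.60):
    """Apply the recall and F1 thresholds stated in this section."""
    recall, f1 = recall_and_f1(y_true, y_pred)
    return recall >= min_recall and f1 >= min_f1


# Toy labels: 4 actual defaults, 3 of them caught, 2 false alarms.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
recall, f1 = recall_and_f1(y_true, y_pred)
print(f"recall={recall:.2f}, f1={f1:.2f}")
print("criteria met:", meets_success_criteria(y_true, y_pred))
```

Because the criteria apply to both sets, the check would be run once on the training predictions and once on the test predictions. A large gap between the two results points to the overfitting failure condition.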
We implemented an unsupervised clustering model to group borrowers with similar credit and loan characteristics.
This segmentation helps the credit and retention teams tailor communication, product offerings, and risk mitigation strategies.
- Goal: Identify distinct borrower segments based on credit behavior and financial characteristics.
- Model type: Unsupervised - Clustering.
- Input features: Borrower demographic and financial attributes
- Model choice: K-Means
- Success metrics:
  - Average silhouette score ≥ 0.45
  - Clusters are interpretable and distinct in profile characteristics.
- Failure conditions:
  - Model suggests more than 15 clusters → difficult to interpret or apply in business context.
  - Clusters are not meaningfully distinct (overlapping feature distributions).
- Output definition:
  - Cluster assignments appended to the dataset as an additional categorical column (`Clusters`).
  - Each borrower belongs to one cluster.
  - Identified cluster characteristics:
    - Cluster 0: Borrowers with a history of previous defaults, mostly renters, moderate income, highest default rate (high-risk).
    - Cluster 1: Borrowers with no history of previous defaults, who mostly rent, lower to mid-range incomes, moderate default rates (middle-risk).
    - Cluster 2: Borrowers with no history of previous defaults, who primarily have mortgages, higher incomes, rarely default (low-risk).
- Heuristics: This clustering provides a systematic segmentation where previously none existed. The results can inform targeted risk interventions and product offerings.
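The silhouette criterion above can be evaluated as sketched below with scikit-learn. This is not the project pipeline: the synthetic blobs stand in for the preprocessed borrower features, and the cluster centers and all parameter values other than the 0.45 and 15-cluster thresholds are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic, well-separated points standing in for the borrower features.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [5, 5], [-5, 5]],
    cluster_std=0.6,
    random_state=0,
)

model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(X)

# Mean silhouette coefficient over all samples (ranges from -1 to 1).
avg_silhouette = silhouette_score(X, labels)
criteria_met = avg_silhouette >= 0.45 and model.n_clusters <= 15

print(f"clusters: {model.n_clusters}, silhouette: {avg_silhouette:.2f}")
print("success criteria met:", criteria_met)
```

On real borrower data the score is typically much lower than on clean synthetic blobs, which is why the interpretability of the resulting profiles is listed as a separate success metric.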
User stories were developed to clearly define the needs and goals of different stakeholders, ensuring that the project dashboard delivers actionable insights and functionality aligned with both business and technical objectives.
- As a non-technical stakeholder, I want to view a concise and structured overview of the project, including its goals, dataset, and business requirements, so that I can understand what the project aims to achieve and how to navigate the dashboard.
- As a data analyst, I want to explore correlations and key drivers of loan default through interactive data exploration and visualizations, so that I can identify which borrower and loan attributes most influence default risk and provide data-driven insights to the business.
- As a business analyst, I want to review the project’s main hypotheses about borrower behavior and validate them with visual and statistical evidence, so that I can understand which factors are meaningfully linked to default and ensure the findings are grounded in data.
- As a loan officer, I want to input borrower information and receive a predicted probability of default along with a borrower cluster assignment, so that I can make informed lending decisions and take appropriate risk mitigation actions based on the borrower’s profile.
- As a technical reviewer, I want to examine the predictive model’s structure, key features, and performance metrics, so that I can assess whether the model meets business requirements and delivers reliable and interpretable predictions.
- As a technical reviewer, I can view borrower clustering insights to evaluate the clustering model’s performance, understand cluster characteristics, and assess how effectively the clusters segment borrowers by default risk.
The dashboard will be developed in Streamlit and designed to guide the user from business understanding to actionable insights and model-based predictions.
It will consist of six main pages, each mapped to specific business requirements.
The goal of the dashboard is to provide both descriptive insights and predictive intelligence to support data-driven decisions in loan management.
It will serve two main user groups:
- Business analysts: who need to explore patterns and trends in borrower data.
- Credit officers: who need actionable information on loan risk and applicant default probability.
- Purpose: Provide a clear overview of the project and orient users.
- Sections:
  - Project introduction
  - Project terms & jargon
  - Dataset overview
    - Checkbox: Data inspection (number of rows, columns, and first 10 rows)
  - Business requirements
  - Navigation guide for subsequent pages
- Purpose: Address Business Requirement 1 (Data Insights). This page helps financial institutions understand what drives default risk. It focuses on identifying key borrower and loan attributes most correlated with default and provides visual and statistical insights for business analysts.
- Sections:
  - Correlation Analysis:
    - Checkbox: Display PPS Heatmap to detect both linear and non-linear relationships with the target variable
    - Table of most important features according to PPS score
  - Visualization of main drivers of default
    - Checkbox: Display distributions of selected key features
    - Summary insights highlighting trends
    - Checkbox: Display Parallel Plot to show interactions between multiple key features and their influence on default probability.
- Purpose: Present hypotheses and their validation process.
- Sections:
  - State each of the four project hypotheses.
  - Show validation result:
    - Short written conclusion summarizing whether the hypothesis was confirmed or not
    - Checkbox: Display corresponding distribution plot for each hypothesis and result of statistical test
- Purpose: Address Business Requirement 2 (Classification Model) and Business Requirement 3 (Clustering Model)
- Sections:
  - State Business Requirements 2 and 3
  - Widget input fields for necessary borrower data
  - “Run Predictive Analysis” button to send input data through the trained ML pipelines
  - Output:
    - Predicted default probability
    - Cluster assignment for additional context
    - Cluster profile summary
    - Combined business recommendation based on default probability and cluster
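One possible shape for the combined recommendation is sketched below. It is purely illustrative and not the logic used in the dashboard. The cluster risk labels follow the cluster profiles described in this README, while the function name, the `review_threshold` value of 0.5, and the message texts are assumptions. In practice the threshold should be agreed with the credit risk team.

```python
# Risk labels per cluster, taken from the cluster profiles in this README.
CLUSTER_RISK = {0: "high", 1: "middle", 2: "low"}


def recommend(prob_default, cluster, review_threshold=0.5):
    """Combine default probability and cluster risk into a suggestion.

    review_threshold is a placeholder assumption, not a project value.
    """
    if not 0.0 <= prob_default <= 1.0:
        raise ValueError("prob_default must be between 0 and 1")
    cluster_risk = CLUSTER_RISK[cluster]
    if prob_default >= review_threshold:
        return "Prioritize for manual risk review"
    if cluster_risk == "high":
        return "Low predicted risk, but high-risk segment: review recommended"
    return "No additional review suggested"


print(recommend(0.76, 1))
print(recommend(0.10, 0))
print(recommend(0.10, 2))
```

The function only suggests a review priority. It never returns a rejection, in line with the heuristic that the model supports decisions rather than automating them.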
- Purpose: Address Business Requirement 2 (Classification Model). Show predictive model performance and interpretation.
- Sections:
  - Describe model objective
  - Overview of used ML pipelines
  - Visualization of the top features contributing to the model’s predictions
  - Insights into model performance
    - Confusion matrix and classification report for both train and test sets
    - Performance metrics interpretation and conclusions for business relevance
- Purpose: Address Business Requirement 3 (Clustering Model). Show cluster analysis performance and interpretation.
- Sections:
  - Describe model objective
  - Overview of used ML pipeline
  - Insights into cluster analysis performance
    - Silhouette plot, average silhouette score and number of clusters chosen
    - Cluster distribution across default levels
  - Visualization of the top features defining the clusters
  - Description of cluster profiles and business use of segmentation
The technologies used throughout the development are listed below.
- Pandas - Open source library for data manipulation and analysis.
- NumPy - Adds support for large, multi-dimensional arrays and high-level mathematical functions.
- Matplotlib - Comprehensive library for creating static, animated, and interactive visualisations.
- Seaborn - Statistical data visualisation library for attractive and informative graphics.
- Plotly Express - High-level library for creating interactive plots easily.
- scikit-learn - Open source machine learning library featuring classification, regression, and clustering.
- Feature-engine - Library with multiple transformers to engineer and select features for ML models.
- ppscore - Library for detecting linear or non-linear relationships between two features.
- SciPy - Library for scientific computing, including statistical tests.
- XGBoost - High-performance gradient boosting library.
- CatBoost - Gradient boosting library optimized for categorical features.
- imbalanced-learn - Tools for handling imbalanced datasets in classification tasks.
- Yellowbrick - Visual analysis and diagnostic tools for machine learning models.
- Joblib - Utilities for pipelining, caching, and saving/loading models.
- YData Profiling - Automated exploratory data analysis and profiling reports.
- Warnings - Suppress or manage warnings in Python.
- Streamlit - Framework for building interactive dashboards and web applications.
- Glob - File pathname pattern matching.
- Zipfile - Work with ZIP archives in Python.
- Git - For version control
- GitHub - Code repository
- Heroku - For application deployment
- VSCode - IDE used for development
Each page of the dashboard corresponds to a specific user story, ensuring that the application meets the needs of both non-technical and technical users. The dashboard was manually tested using these user stories as a basis for determining success.
For the Jupyter notebooks, manual testing against user stories was not applicable, because each notebook only runs to completion when every preceding step executes successfully. Instead, correctness was ensured through code validation and sequential function execution.
As a non-technical stakeholder, I want to view a concise and structured overview of the project, including its goals, dataset, and business requirements, so that I can understand what the project aims to achieve and how to navigate the dashboard.
| Feature | Action | Expected Result | Test Result |
|---|---|---|---|
| Project Summary Page | Navigate to summary page | Page is displayed with all sections visible; user can read overview, terms, dataset, and business requirements. | Pass |
| Data Inspection | Tick checkbox to inspect dataset | Table shows number of rows, columns, and first 10 rows. | Pass |
As a data analyst, I want to explore correlations and key drivers of loan default through interactive data exploration and visualizations, so that I can identify which borrower and loan attributes most influence default risk and provide data-driven insights to the business.
| Feature | Action | Expected Result | Test Result |
|---|---|---|---|
| Loan Default Study Page | Navigate to page | Page loads correctly; correlation and feature distributions sections are displayed. | Pass |
| PPS Heatmap | Tick checkbox to show heatmap | PPS heatmap is displayed with relevant scores. | Pass |
| Feature Distributions | Tick checkbox to show distributions | Visualizations for key features are displayed by default level. | Pass |
| Parallel Plot | Tick checkbox to show parallel plot | Parallel plot displays interactions between features and default probability. | Pass |
As a business analyst, I want to review the project’s main hypotheses about borrower behavior and validate them with visual and statistical evidence, so that I can understand which factors are meaningfully linked to default and ensure the findings are grounded in data.
| Feature | Action | Expected Result | Test Result |
|---|---|---|---|
| Project Hypotheses Page | Navigate to page | Page displays all four hypotheses with introductory text. | Pass |
| Hypothesis Validation | Tick checkbox for each hypothesis | Corresponding distribution plot is displayed along with statistical test result. | Pass |
| Validation Summary | Read page | Written conclusion confirms whether each hypothesis is supported by data. | Pass |
As a loan officer, I want to input borrower information and receive a predicted probability of default along with a borrower cluster assignment, so that I can make informed lending decisions and take appropriate risk mitigation actions based on the borrower’s risk profile.
| Feature | Action | Expected Result | Test Result |
|---|---|---|---|
| Default Prediction Tool Page | Navigate to page | Page displays explanation of BR2 & BR3, and input widgets are visible. | Pass |
| Input Widgets | Enter unseen borrower data | Widgets respond to input correctly. | Pass |
| Run Predictive Analysis | Click “Run Predictive Analysis” | Predicted default probability, cluster assignment, cluster profile, and business recommendation are displayed. | Pass |
As a technical reviewer, I want to examine the predictive model’s structure, key features, and performance metrics, so that I can assess whether the model meets business requirements and delivers reliable and interpretable predictions.
| Feature | Action | Expected Result | Test Result |
|---|---|---|---|
| Classification Model Insights Page | Navigate to page | Page loads with model overview, ML pipelines, and feature importance visualizations. | Pass |
| Model Metrics | View confusion matrices and classification reports | Train and test set performance metrics are displayed; insights are interpretable by a technical reviewer. | Pass |
As a technical reviewer, I can view borrower clustering insights to evaluate the clustering model’s performance, understand cluster characteristics, and assess how effectively the clusters segment borrowers by default risk.
| Feature | Action | Expected Result | Test Result |
|---|---|---|---|
| Borrower Clustering Insights Page | Navigate to page | Page loads with cluster analysis overview, ML pipeline description, and top feature visualizations. | Pass |
| Cluster Performance Metrics | View silhouette plot and average score | Silhouette plot and score are displayed; number of clusters is indicated. | Pass |
| Cluster Distribution | Analyze default levels by cluster | Correct distribution of borrowers across clusters is shown. | Pass |
| Cluster Profiles | Examine cluster characteristics | Cluster descriptions and business relevance are displayed; technical reviewer can interpret segmentation. | Pass |
All Python code within the `app_pages` and `src` directories, as well as the `app.py` file, has been validated for PEP8 compliance using Code Institute’s PEP8 Linter. No issues remain.
All code within the `.ipynb` files in the `jupyter_notebooks` directory has been checked primarily for the 79-character line limit. Some whitespace issues remain.
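A line-length check of this kind can be automated. The sketch below is not the method used for this project, only one way it could be done with the standard library: it parses a notebook's JSON and reports code-cell lines longer than 79 characters. The function name and the in-memory sample notebook are assumptions.

```python
import json


def long_code_lines(notebook_json, limit=79):
    """Yield (cell_index, line_number, length) for over-long code lines."""
    notebook = json.loads(notebook_json)
    for cell_index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue  # markdown and raw cells are not subject to PEP8
        source = cell.get("source", [])
        # The notebook format stores source as a list of lines or one string.
        text = "".join(source) if isinstance(source, list) else source
        for line_number, line in enumerate(text.splitlines(), start=1):
            if len(line) > limit:
                yield cell_index, line_number, len(line)


# Minimal in-memory notebook used only for demonstration.
sample = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# " + "x" * 100]},
        {"cell_type": "code",
         "source": ["short = 1\n", "y = '" + "a" * 80 + "'\n"]},
    ]
})
violations = list(long_code_lines(sample))
print(violations)
```

For a real notebook, the file contents would be read from disk and passed to `long_code_lines` in place of `sample`.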
To date, there are no known unfixed bugs in the application, although even after thorough testing I cannot rule out that some remain.
The App live link is: https://loan-guard-c4aee35f5523.herokuapp.com/
The project was deployed to Heroku using the following steps:
- Within your working directory, ensure there is a `setup.sh` file containing the following:
mkdir -p ~/.streamlit/
echo "\
[server]\n\
headless = true\n\
port = $PORT\n\
enableCORS = false\n\
\n\
" > ~/.streamlit/config.toml
- Within your working directory, ensure there is a `.python-version` file containing a Python version supported by the Heroku-24 stack.
3.12
- Within your working directory, ensure there is a `Procfile` containing the following:
web: sh setup.sh && streamlit run app.py
- Ensure your `requirements.txt` file contains all the packages necessary to run the Streamlit dashboard.
- Ensure your `.slugignore` file contains all files/directories that are unnecessary for deployment.
*.ipynb
- Log in to Heroku or create an account if you do not already have one.
- Click the New button on the dashboard and from the dropdown menu select "Create new app".
- Enter a suitable app name and select your region, then click the Create app button.
- Once the app has been created, navigate to the Deploy tab.
- At the Deploy tab, in the Deployment method section select GitHub.
- Enter your repository name and click Search. Once it is found, click Connect.
- Navigate to the bottom of the Deploy page to the Manual deploy section and select main from the branch dropdown menu.
- Click the Deploy Branch button to begin deployment.
- The deployment process should happen smoothly if all deployment files are fully functional. Click the button Open App at the top of the page to access your App.
- If the build fails, check the build log carefully to troubleshoot what went wrong. If the slug size is too large, add large files not required for the app to the `.slugignore` file.
If you wish to fork or clone this repository, please follow the instructions below:
- In the top right of the main repository page, click the Fork button.
- Under Owner, select the desired owner from the dropdown menu.
- OPTIONAL: Change the default name of the repository in order to distinguish it.
- OPTIONAL: In the Description field, enter a description for the forked repository.
- Ensure the 'Copy the main branch only' checkbox is selected.
- Click the Create fork button.
- On the main repository page, click the Code button.
- Copy the HTTPS URL from the resulting dropdown menu.
- In your IDE terminal, navigate to the directory where you want the cloned repository to be created.
- In your IDE terminal, type `git clone` and paste the copied URL.
- Hit Enter to create the cloned repository.
This project has been based on the methodologies used in the Churnometer project from Code Institute. Some functions from that project have been used in their original form, while others have been customized for the purposes of this project. The main functions that have been used are:
- Functions for calculating correlations and displaying correlation heatmaps.
- Customized function for testing different numerical transformations to evaluate improvement in distribution shape.
- Function for performing hyperparameter optimization using grid search, adapted to also perform randomized search.
- Function to display confusion matrix and classification performance report.
- Cluster analysis utilities:
- Visualization of PCA results, elbow and silhouette plots.
- Cluster distribution plots.
- Function to describe cluster profiles.