Loan Guard

In the banking sector, effective credit risk assessment is critical for maintaining financial stability and minimizing losses. Loan defaults can lead to significant financial setbacks and reduced liquidity for lending institutions. The Loan Guard project aims to help financial institutions proactively identify borrowers who are likely to default and understand the key factors driving default risk.

In addition, the project incorporates borrower segmentation through clustering, which groups borrowers with similar financial profiles and historical behavior. This combined approach allows lenders not only to predict default probabilities but also to tailor risk management strategies and business decisions based on the characteristics of different borrower segments.

The project was created for educational purposes only.

Live page on Heroku: https://loan-guard-c4aee35f5523.herokuapp.com/

Dataset Content

The dataset used is publicly available on Kaggle and contains information about individual borrowers and their loan characteristics. Each row represents a loan record, including both personal and financial attributes that may influence the likelihood of default. The dataset covers borrower attributes such as age, income, home ownership and employment details, as well as loan-specific features like loan amount, interest rate and purpose.

In total, the dataset includes 32,581 records and 12 variables. The target variable, loan_status, indicates whether a borrower has defaulted on their loan (1) or successfully repaid it (0). The target distribution is imbalanced toward non-default cases, reflecting real-world lending scenarios where most borrowers do not default. This dataset enables predictive modeling to identify patterns and risk factors associated with loan default.

Variable	Description	Role	Data Type	Units / Possible Values
`person_age`	Age of the borrower	Feature	int64	Years
`person_income`	Annual income of the borrower	Feature	float64	USD
`person_home_ownership`	Type of home ownership	Feature	object	RENT, OWN, MORTGAGE, OTHER
`person_emp_length`	Length of employment	Feature	float64	Years
`loan_intent`	Purpose of the loan	Feature	object	PERSONAL, EDUCATION, MEDICAL, VENTURE, HOMEIMPROVEMENT, DEBTCONSOLIDATION
`loan_grade`	Loan grade assigned by lender	Feature	object	A, B, C, D, E, F, G
`loan_amnt`	Loan amount requested	Feature	float64	USD
`loan_int_rate`	Interest rate applied to the loan	Feature	float64	Percentage
`loan_percent_income`	Loan amount as a proportion of annual income	Feature	float64	Ratio
`cb_person_default_on_file`	Whether the person has previously defaulted	Feature	object	Y, N
`cb_person_cred_hist_length`	Length of credit history	Feature	int64	Years
`loan_status`	Loan default flag	Target	int64	0 = No Default, 1 = Default
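
The class imbalance described above can be checked in a few lines once the data is loaded. The following is a minimal sketch, not the project's own code: it uses a small hand-made DataFrame in place of the Kaggle file (all values are invented), and the helper name `target_distribution` is hypothetical. Only the column names come from the table above.

```python
import pandas as pd

# Illustrative stand-in for the Kaggle data; all values are invented.
df = pd.DataFrame(
    {
        "person_income": [40000.0, 25000.0, 82000.0, 51000.0, 18000.0],
        "loan_amnt": [12000.0, 3000.0, 20000.0, 8000.0, 6000.0],
        "loan_status": [1, 0, 0, 0, 1],
    }
)


def target_distribution(data, target="loan_status"):
    """Return the share of each target class, sorted by class label."""
    return data[target].value_counts(normalize=True).sort_index()


shares = target_distribution(df)
print(shares)
```

On the real dataset, the same call would show how strongly the non-default class (0) dominates, which is worth knowing before choosing evaluation metrics.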

NOTE:

When I initially started working on this project, I used a different dataset from Kaggle. After attempting to build meaningful prediction and clustering models, I decided to switch to a new dataset.

The previous dataset was highly synthetic, with all variables being uniformly distributed and showing very little correlation, both between features and with the target variable. Uniform distributions are particularly challenging for predictive modeling and clustering because they lack natural variability and concentration of values. Consequently, there are few meaningful patterns, groupings, or relationships for the models to learn from.

As a result, it was very difficult to build a predictive model with good performance metrics and without overfitting. I experimented with several approaches to improve model performance and reduce overfitting, including hyperparameter tuning and binning numerical variables, but none led to satisfactory results. Furthermore, during the cluster analysis, the results did not correspond to any recognizable borrower groups or risk profiles, limiting the usefulness of the analysis.

Therefore, I decided to switch to the current dataset. Although it required more extensive data cleaning and transformation, it produced models with stronger performance and revealed meaningful, interpretable clusters. Overall, business interpretability and analytical insight were significantly improved.

Project Terms & Jargon

A borrower is a person who takes out a loan from a financial institution.
A loan is an amount of money borrowed that is expected to be paid back with interest.
A default occurs when a borrower fails to make scheduled loan payments or meet the agreed repayment terms.
A defaulted borrower is a borrower who has failed to repay their loan as agreed and is classified as being in default.

Business Requirements

From a business perspective, this project supports the strategic goals of a financial institution such as:

Improving risk management by identifying high-risk applicants early.
Enhancing profitability through optimized loan approval decisions.
Increasing borrower trust and operational efficiency by offering fair, data-driven credit evaluations.
Enabling personalized loan offerings and proactive interventions for at-risk borrowers (e.g., adjusted payment plans or counseling).

Ultimately, this project aligns predictive analytics with the bank’s long-term objective of balancing growth with financial stability.

To achieve the outlined objectives, the project will focus on the following key requirements.

Business Requirement 1: Data Insights (Conventional Analysis)

Identify key borrower and loan attributes that are most correlated with loan default. Provide visual and statistical insights to help business analysts understand the primary drivers of credit risk.

Business Requirement 2: Classification Model (Machine Learning)

Develop a machine learning model capable of predicting whether a loan applicant is likely to default. The system should output a probability of default to support the credit team in decision-making.

Business Requirement 3: Clustering Model (Machine Learning)

Group borrowers into risk-based clusters to segment borrowers by credit behavior and improve tailored intervention strategies.

Hypotheses and how to validate them?

To better understand the factors influencing loan default risk, four key hypotheses were formulated based on domain knowledge and the available data. Each hypothesis focuses on a variable expected to impact default probability.

Hypothesis	Rationale	Validation
H1: Higher `loan_amnt` is associated with higher default risk	Borrowers taking larger loans may face greater repayment burdens, increasing the likelihood of default	Visualize distribution of `loan_amnt` by `loan_status`, conduct statistical test to confirm difference
H2: Lower `person_income` is associated with higher default risk	Borrowers with lower income may have limited financial capacity to meet repayment obligations	Visualize distribution of `person_income` by `loan_status`, conduct statistical test to confirm difference
H3: Lower `loan_grade` (credit quality) is associated with higher default risk	A lower loan grade reflects weaker creditworthiness and higher assessed lending risk	Analyze frequency of defaults across `loan_grade` categories, perform Chi-square test for association
H4: Shorter `person_emp_length` (employment length) is associated with higher default risk	Borrowers with shorter employment histories may experience less income stability, increasing repayment risk	Visualize distribution of `person_emp_length` by `loan_status`, conduct statistical test to confirm difference

These hypotheses will be tested through exploratory data analysis and statistical testing to identify whether the respective features are influential predictors of default risk.
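
The two kinds of validation in the table can be sketched as follows. This is an illustration under assumptions: the table does not name the test used for numerical features, so the Mann-Whitney U test is chosen here as one reasonable option for skewed financial data. The Chi-square test is the one named for `loan_grade`. All data below is synthetic.

```python
import numpy as np
from scipy.stats import chi2_contingency, mannwhitneyu

rng = np.random.default_rng(42)

# Synthetic loan amounts: the "default" group is shifted upwards on purpose.
amnt_no_default = rng.normal(loc=9000, scale=3000, size=300)
amnt_default = rng.normal(loc=11000, scale=3000, size=100)

# H1-style check for a numerical feature (two-sided Mann-Whitney U test).
u_stat, p_numeric = mannwhitneyu(
    amnt_default, amnt_no_default, alternative="two-sided"
)

# H3-style check for a categorical feature.
# Rows: three loan grades; columns: loan_status 0 and 1. Counts are invented.
contingency = np.array([[90, 10], [70, 30], [40, 60]])
chi2, p_categorical, dof, expected = chi2_contingency(contingency)

alpha = 0.05  # assumed significance level
print(f"Mann-Whitney U: p = {p_numeric:.4g}, reject H0: {p_numeric < alpha}")
print(f"Chi-square: p = {p_categorical:.4g}, dof = {dof}")
```

A small p-value only indicates that the groups differ; the direction of the difference (for example, larger loans among defaulters) still has to be read from the distribution plots.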

The rationale to map the business requirements to the Data Visualizations and ML tasks

This section explains how each business requirement is addressed by specific analyses, visualizations and ML techniques. It ensures that insights and predictions directly support the business goals and can be interpreted by stakeholders.

Business Requirement 1: Data Insights (Conventional Analysis)

Identify key borrower and loan attributes that are most correlated with loan default.
Provide visual and statistical insights to help business analysts understand the primary drivers of credit risk.
Visualize distributions and relationships between key features and the target variable.

Business Requirement 2: Classification Model (Machine Learning)

Develop a binary classification model to predict whether a loan applicant is likely to default.
Show the probability of default to support the credit team in decision-making.
Evaluate model performance and feature importance for transparency and reliability.

Business Requirement 3: Clustering Model (Machine Learning)

Group borrowers into risk-based clusters to segment borrowers by credit behavior and improve tailored intervention strategies.
Analyze and visualize cluster characteristics to understand risk profiles.
Visualize cluster assignments to facilitate understanding by stakeholders.

ML Business Case

Binary Classification Model — Loan Default Prediction

We aim to develop a supervised machine learning model that predicts whether a loan applicant will default or not.
The model should provide a probability of default to support the credit risk team in decision-making.

Goal: Predict if a borrower will default on their loan (loan_status) and provide the associated probability of default.
Model type: Supervised - Binary Classification.
Input features: Borrower demographic and financial attributes
Model choice: After experimentation, a Random Forest model was chosen as the best-performing and most interpretable model.
Success metrics (on both training and test sets):
- Recall for default ≥ 0.75 – to minimize false negatives (high-risk borrowers predicted as safe)
- F1 score for default ≥ 0.60 – ensures a balance between recall and precision
Failure conditions:
- Strong degradation of performance on test data vs. train data → indicates overfitting.
- Large imbalance between precision and recall → predictions may not be reliable for business decisions.
Output definition:
- Binary prediction (0 = no default, 1 = default).
- Probability of default (e.g., 0.76 = 76% chance of default) to guide credit risk decisions.
Heuristics: The model should be used to prioritize risk review and support decision-making (not to fully automate rejections). Thresholds for action should be determined in consultation with the credit risk team to balance loss prevention and borrower impact.
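
The success metrics and the role of the decision threshold can be illustrated with a short sketch. The labels and probabilities below are invented, the function names `classify` and `meets_success_metrics` are hypothetical, and the thresholds 0.5 and 0.7 are arbitrary examples rather than values used in the project.

```python
from sklearn.metrics import f1_score, recall_score

# Targets taken from the success metrics above.
MIN_RECALL = 0.75
MIN_F1 = 0.60


def classify(probabilities, threshold=0.5):
    """Turn default probabilities into 0/1 predictions."""
    return [int(p >= threshold) for p in probabilities]


def meets_success_metrics(y_true, y_pred):
    """Check recall and F1 for the default class (label 1)."""
    recall = recall_score(y_true, y_pred, pos_label=1)
    f1 = f1_score(y_true, y_pred, pos_label=1)
    passed = bool(recall >= MIN_RECALL and f1 >= MIN_F1)
    return passed, recall, f1


# Invented labels and predicted probabilities for ten borrowers.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
proba = [0.91, 0.76, 0.55, 0.30, 0.62, 0.40, 0.20, 0.15, 0.10, 0.05]

for threshold in (0.5, 0.7):
    y_pred = classify(proba, threshold)
    passed, recall, f1 = meets_success_metrics(y_true, y_pred)
    print(f"threshold={threshold}: recall={recall:.2f}, "
          f"f1={f1:.2f}, passed={passed}")
```

Raising the threshold in this toy example lowers recall for the default class, which is why the threshold should be agreed with the credit risk team rather than fixed in code.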

Clustering Model — Borrower Segmentation

We implemented an unsupervised clustering model to group borrowers with similar credit and loan characteristics.
This segmentation helps the credit and retention teams tailor communication, product offerings, and risk mitigation strategies.

Goal: Identify distinct borrower segments based on credit behavior and financial characteristics.
Model type: Unsupervised - Clustering.
Input features: Borrower demographic and financial attributes
Model choice: K-Means
Success metrics:
- Average silhouette score ≥ 0.45
- Clusters are interpretable and distinct in profile characteristics.
Failure conditions:
- Model suggests more than 15 clusters → difficult to interpret or apply in business context.
- Clusters are not meaningfully distinct (overlapping feature distributions).
Output definition:
- Cluster assignments appended to the dataset as an additional categorical column (Clusters).
- Each borrower belongs to one cluster.
- Identified cluster characteristics:
  - Cluster 0: Borrowers with a history of previous defaults, mostly renters, moderate income, highest default rate (high-risk).
  - Cluster 1: Borrowers with no history of previous defaults, who mostly rent, lower to mid-range incomes, moderate default rates (middle-risk).
  - Cluster 2: Borrowers with no history of previous defaults, who primarily have mortgages, higher incomes, rarely default (low-risk).
Heuristics: This clustering provides a systematic segmentation where previously none existed. The results can inform targeted risk interventions and product offerings.
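
The silhouette criterion above can be evaluated as in the following sketch. It is an illustration only: the data is synthetic and deliberately well separated, scaling with `StandardScaler` is an assumption, and picking the number of clusters purely by the highest silhouette score is a simplification of a fuller analysis.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic, well-separated data standing in for the borrower features.
X, _ = make_blobs(
    n_samples=300,
    centers=[(-5, -5), (0, 5), (5, -5)],
    cluster_std=0.8,
    random_state=0,
)
X_scaled = StandardScaler().fit_transform(X)

# Fit K-Means for several values of k and record the silhouette score.
scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = model.fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)
print(f"best k = {best_k}, silhouette = {scores[best_k]:.2f}")
print("meets target (>= 0.45):", scores[best_k] >= 0.45)
```

Real borrower data is rarely this clean, so the score should be read together with the cluster profiles: a segmentation that passes the threshold but cannot be described in business terms still fails the second success metric.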

User Stories

User stories were developed to clearly define the needs and goals of different stakeholders, ensuring that the project dashboard delivers actionable insights and functionality aligned with both business and technical objectives.

As a non-technical stakeholder, I want to view a concise and structured overview of the project, including its goals, dataset, and business requirements, so that I can understand what the project aims to achieve and how to navigate the dashboard.
As a data analyst, I want to explore correlations and key drivers of loan default through interactive data exploration and visualizations, so that I can identify which borrower and loan attributes most influence default risk and provide data-driven insights to the business.
As a business analyst, I want to review the project’s main hypotheses about borrower behavior and validate them with visual and statistical evidence, so that I can understand which factors are meaningfully linked to default and ensure the findings are grounded in data.
As a loan officer, I want to input borrower information and receive a predicted probability of default along with a borrower cluster assignment, so that I can make informed lending decisions and take appropriate risk mitigation actions based on the borrower’s profile.
As a technical reviewer, I want to examine the predictive model’s structure, key features, and performance metrics, so that I can assess whether the model meets business requirements and delivers reliable and interpretable predictions.
As a technical reviewer, I want to view borrower clustering insights, so that I can evaluate the clustering model’s performance, understand cluster characteristics, and assess how effectively the clusters segment borrowers by default risk.

Dashboard Design

The dashboard will be developed in Streamlit and designed to guide the user from business understanding to actionable insights and model-based predictions.
It will consist of six main pages, each mapped to specific business requirements.

The goal of the dashboard is to provide both descriptive insights and predictive intelligence to support data-driven decisions in loan management.
It will serve two main user groups:

Business analysts, who need to explore patterns and trends in borrower data.
Credit officers, who need actionable information on loan risk and applicant default probability.

Page 1: Project Summary

Purpose: Provide a clear overview of the project and orient users.
Sections:
- Project introduction
- Project terms & jargon
- Dataset overview
- Checkbox: Data inspection (number of rows, columns, and first 10 rows)
- Business requirements
- Navigation guide for subsequent pages

Page 2: Loan Default Study

Purpose: Address Business Requirement 1 (Data Insights). This page helps financial institutions understand what drives default risk. It focuses on identifying key borrower and loan attributes most correlated with default and provides visual and statistical insights for business analysts.
Sections:
- Correlation Analysis:
  - Checkbox: Display PPS Heatmap to detect both linear and non-linear relationships with the target variable
  - Table of most important features according to PPS score
- Visualization of main drivers of default
  - Checkbox: Display distributions of selected key features
  - Summary insights highlight trends
  - Checkbox: Display Parallel Plot to show interactions between multiple key features and their influence on default probability.

Page 3: Project Hypotheses and Validation

Purpose: Present hypotheses and their validation process.
Sections:
- State each of the four project hypotheses.
- Show validation result:
  - Short written conclusion summarizing whether the hypothesis was confirmed or not
  - Checkbox: Display corresponding distribution plot for each hypothesis and result of statistical test

Page 4: Default Prediction Tool

Purpose: Address Business Requirement 2 (Classification Model) and Business Requirement 3 (Clustering Model)
Sections:
- State Business Requirements 2 and 3
- Widget input fields for necessary borrower data
- “Run Predictive Analysis” button to send input data through the trained ML pipelines
- Output:
  - Predicted default probability
  - Cluster assignment for additional context
  - Cluster profile summary
  - Combined business recommendation based on default probability and cluster
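
How the two outputs might be combined into a single recommendation is sketched below. Everything here is hypothetical: the function name `recommend`, the 0.5 threshold and the wording of the messages are placeholders, not the dashboard's actual logic. Only the cluster risk labels follow the profiles described in the ML Business Case.

```python
# Cluster risk labels follow the profiles described in the ML Business Case.
CLUSTER_RISK = {0: "high", 1: "middle", 2: "low"}


def recommend(probability, cluster, threshold=0.5):
    """Combine default probability and cluster into a short recommendation.

    The threshold and the messages are hypothetical placeholders.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    risk = CLUSTER_RISK[cluster]
    likely_default = probability >= threshold
    if likely_default and risk == "high":
        return "Prioritize for manual risk review."
    if likely_default:
        return "Refer to manual risk review."
    if risk == "high":
        return "Low predicted risk, but high-risk segment: verify history."
    return "Proceed with standard processing."


print(recommend(0.76, 0))
print(recommend(0.12, 2))
```

None of the messages rejects an application outright, in line with the heuristic that the model supports decisions rather than automating them.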

Page 5: Classification Model Insights

Purpose: Address Business Requirement 2 (Classification Model). Show predictive model performance and interpretation.
Sections:
- Describe model objective
- Overview of the ML pipelines used
- Visualization of the top features contributing to the model’s predictions
- Insights into model performance
  - Confusion matrix and classification report for both train and test sets
  - Performance metrics interpretation and conclusions for business relevance

Page 6: Borrower Clustering Insights

Purpose: Address Business Requirement 3 (Clustering Model). Show cluster analysis performance and interpretation.
Sections:
- Describe model objective
- Overview of the ML pipeline used
- Insights into cluster analysis performance
  - Silhouette plot, average silhouette score and number of clusters chosen
- Cluster distribution across default levels
- Visualization of the top features defining the clusters
- Description of cluster profiles and business use of segmentation

Technologies Used

The technologies used throughout the development are listed below.

Languages

Python

Python Packages

Main Data Analysis & Machine Learning Libraries

Pandas - Open source library for data manipulation and analysis.
NumPy - Adds support for large, multi-dimensional arrays and high-level mathematical functions.
Matplotlib - Comprehensive library for creating static, animated, and interactive visualisations.
Seaborn - Statistical data visualisation library for attractive and informative graphics.
Plotly Express - High-level library for creating interactive plots easily.
scikit-learn - Open source machine learning library featuring classification, regression, and clustering.
Feature-engine - Library with multiple transformers to engineer and select features for ML models.
ppscore - Library for detecting linear or non-linear relationships between two features.
SciPy - Library for scientific computing, including statistical tests.
XGBoost - High-performance gradient boosting library.
CatBoost - Gradient boosting library optimized for categorical features.
imbalanced-learn - Tools for handling imbalanced datasets in classification tasks.
Yellowbrick - Visual analysis and diagnostic tools for machine learning models.
Joblib - Utilities for pipelining, caching, and saving/loading models.
YData Profiling - Automated exploratory data analysis and profiling reports.

Utilities

Warnings - Suppress or manage warnings in Python.
Streamlit - Framework for building interactive dashboards and web applications.
Glob - File pathname pattern matching.
Zipfile - Work with ZIP archives in Python.

Other Technologies

Git - For version control
GitHub - Code repository
Heroku - For application deployment
VSCode - IDE used for development

Testing

User Story Testing

Each page of the dashboard corresponds to a specific user story, ensuring that the application meets the needs of both non-technical and technical users. The dashboard was manually tested using these user stories as a basis for determining success.

For the Jupyter notebooks, manual testing against user stories was not considered applicable, because each notebook runs sequentially and every step depends on the previous ones completing successfully. Instead, correctness was ensured through code validation and by running the cells in sequence.

User Story 1: Project Summary Page

As a non-technical stakeholder, I want to view a concise and structured overview of the project, including its goals, dataset, and business requirements, so that I can understand what the project aims to achieve and how to navigate the dashboard.

Feature	Action	Expected Result	Test Result
Project Summary Page	Navigate to summary page	Page is displayed with all sections visible; user can read overview, terms, dataset, and business requirements.	Pass
Data Inspection	Tick checkbox to inspect dataset	Table shows number of rows, columns, and first 10 rows.	Pass

User Story 2: Loan Default Study Page

As a data analyst, I want to explore correlations and key drivers of loan default through interactive data exploration and visualizations, so that I can identify which borrower and loan attributes most influence default risk and provide data-driven insights to the business.

Feature	Action	Expected Result	Test Result
Loan Default Study Page	Navigate to page	Page loads correctly; correlation and feature distributions sections are displayed.	Pass
PPS Heatmap	Tick checkbox to show heatmap	PPS heatmap is displayed with relevant scores.	Pass
Feature Distributions	Tick checkbox to show distributions	Visualizations for key features are displayed by default level.	Pass
Parallel Plot	Tick checkbox to show parallel plot	Parallel plot displays interactions between features and default probability.	Pass

User Story 3: Project Hypotheses Page

As a business analyst, I want to review the project’s main hypotheses about borrower behavior and validate them with visual and statistical evidence, so that I can understand which factors are meaningfully linked to default and ensure the findings are grounded in data.

Feature	Action	Expected Result	Test Result
Project Hypotheses Page	Navigate to page	Page displays all four hypotheses with introductory text.	Pass
Hypothesis Validation	Tick checkbox for each hypothesis	Corresponding distribution plot is displayed along with statistical test result.	Pass
Validation Summary	Read page	Written conclusion confirms whether each hypothesis is supported by data.	Pass

User Story 4: Default Prediction Tool Page

As a loan officer, I want to input borrower information and receive a predicted probability of default along with a borrower cluster assignment, so that I can make informed lending decisions and take appropriate risk mitigation actions based on the borrower’s risk profile.

Feature	Action	Expected Result	Test Result
Default Prediction Tool Page	Navigate to page	Page displays explanation of BR2 & BR3, and input widgets are visible.	Pass
Input Widgets	Enter unseen borrower data	Widgets respond to input correctly.	Pass
Run Predictive Analysis	Click “Run Predictive Analysis”	Predicted default probability, cluster assignment, cluster profile, and business recommendation are displayed.	Pass

User Story 5: Classification Model Insights Page

As a technical reviewer, I want to examine the predictive model’s structure, key features, and performance metrics, so that I can assess whether the model meets business requirements and delivers reliable and interpretable predictions.

Feature	Action	Expected Result	Test Result
Classification Model Insights Page	Navigate to page	Page loads with model overview, ML pipelines, and feature importance visualizations.	Pass
Model Metrics	View confusion matrices and classification reports	Train and test set performance metrics are displayed; insights are interpretable by a technical reviewer.	Pass

User Story 6: Borrower Clustering Insights Page

As a technical reviewer, I want to view borrower clustering insights, so that I can evaluate the clustering model’s performance, understand cluster characteristics, and assess how effectively the clusters segment borrowers by default risk.

Feature	Action	Expected Result	Test Result
Borrower Clustering Insights Page	Navigate to page	Page loads with cluster analysis overview, ML pipeline description, and top feature visualizations.	Pass
Cluster Performance Metrics	View silhouette plot and average score	Silhouette plot and score are displayed; number of clusters is indicated.	Pass
Cluster Distribution	Analyze default levels by cluster	Correct distribution of borrowers across clusters is shown.	Pass
Cluster Profiles	Examine cluster characteristics	Cluster descriptions and business relevance are displayed; technical reviewer can interpret segmentation.	Pass

Code Validation

All Python code within the app_pages and src directories, as well as the app.py file, has been validated for PEP8 compliance using Code Institute’s PEP8 Linter. No issues remain.

All code within the .ipynb files in the jupyter_notebooks directory has primarily been checked to ensure that no line exceeds 79 characters. Some whitespace issues remain.
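
A line-length check of this kind can be automated. The sketch below is one possible approach and not the method used in this project; the function names `load_notebook` and `long_lines` are hypothetical. It relies only on the fact that .ipynb files are JSON documents with a list of cells.

```python
import json


def load_notebook(path):
    """Read a .ipynb file (JSON) into a dictionary."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def long_lines(notebook, limit=79):
    """Yield (cell_index, line_number, length) for over-long code lines."""
    for cell_index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", [])
        # The notebook format allows a list of lines or a single string.
        text = "".join(source) if isinstance(source, list) else source
        for line_number, line in enumerate(text.splitlines(), start=1):
            if len(line) > limit:
                yield cell_index, line_number, len(line)


# Minimal in-memory notebook for demonstration.
demo = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Title " + "x" * 100]},
        {
            "cell_type": "code",
            "source": ["import pandas as pd\n", "y = " + "1 + " * 20 + "1\n"],
        },
    ]
}

for cell_index, line_number, length in long_lines(demo):
    print(f"cell {cell_index}, line {line_number}: {length} characters")
```

For a real notebook, `long_lines(load_notebook(path))` would report the offending lines. Markdown cells are skipped because the 79-character limit applies to code.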

Unfixed Bugs

To date, there are no known unfixed bugs in the application, although even after thorough testing I cannot rule out that some remain.

Deployment

Heroku

The App live link is: https://loan-guard-c4aee35f5523.herokuapp.com/

The project was deployed to Heroku using the following steps:

Within your working directory, ensure there is a setup.sh file containing the following:

mkdir -p ~/.streamlit/
echo "\
[server]\n\
headless = true\n\
port = $PORT\n\
enableCORS = false\n\
\n\
" > ~/.streamlit/config.toml

Within your working directory, ensure there is a .python-version file containing a Heroku-24 stack supported version of Python.

3.12

Within your working directory, ensure there is a Procfile file containing the following:

web: sh setup.sh && streamlit run app.py

Ensure your requirements.txt file contains all the packages necessary to run the streamlit dashboard.
Ensure your .slugignore file contains all files/directories that are unnecessary for deployment.

*.ipynb

Log in to Heroku or create an account if you do not already have one.
Click the New button on the dashboard and from the dropdown menu select "Create new app".
Enter a suitable app name and select your region, then click the Create app button.
Once the app has been created, navigate to the Deploy tab.
At the Deploy tab, in the Deployment method section select GitHub.
Enter your repository name and click Search. Once it is found, click Connect.
Navigate to the bottom of the Deploy page to the Manual deploy section and select main from the branch dropdown menu.
Click the Deploy Branch button to begin deployment.
The deployment should complete without errors if all deployment files are set up correctly. Click the Open App button at the top of the page to access your App.
If the build fails, check the build log carefully to troubleshoot what went wrong. If the slug size is too large then add large files not required for the app to the .slugignore file.

Forking and Cloning

If you wish to fork or clone this repository, please follow the instructions below:

Forking

In the top right of the main repository page, click the Fork button.
Under Owner, select the desired owner from the dropdown menu.
OPTIONAL: Change the default name of the repository in order to distinguish it.
OPTIONAL: In the Description field, enter a description for the forked repository.
Ensure the 'Copy the main branch only' checkbox is selected.
Click the Create fork button.

Cloning

On the main repository page, click the Code button.
Copy the HTTPS URL from the resulting dropdown menu.
In your IDE terminal, navigate to the directory where you want the cloned repository to be created.
In your IDE terminal, type git clone and paste the copied URL.
Hit Enter to create the cloned repository.

Credits

This project has been based on the methodologies used in the Churnometer project from Code Institute. Some functions from that project have been used in their original form, while others have been customized for the purposes of this project. The main functions that have been used are:

Functions for calculating correlations and displaying correlation heatmaps.
Customized function for testing different numerical transformations to evaluate improvement in distribution shape.
Function for performing hyperparameter optimization using grid search, adapted to also perform randomized search.
Function to display confusion matrix and classification performance report.
Cluster analysis utilities:
- Visualization of PCA results, elbow and silhouette plots.
- Cluster distribution plots.
- Function to describe cluster profiles.
