Credit Risk Modeling

A machine learning pipeline for predicting loan defaults using LendingClub data. This project explores how financial institutions assess credit risk and which borrower characteristics drive default behavior.

Overview

Banks face significant losses from loan defaults each year. This project investigates:

What distinguishes borrowers who repay from those who default?
Can we predict defaults before they occur?
Which features carry the most predictive power?

Dataset

1.3M+ loans from LendingClub (2007-2018) with 150+ features:

Loan characteristics (amount, term, interest rate)
Borrower profile (income, employment, home ownership)
Credit history (utilization, delinquencies, inquiries)

Target Distribution

The dataset has a 20% default rate. This class imbalance is addressed by training with balanced class weights.
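
As a rough illustration of weighted training, the sketch below computes "balanced" class weights with the formula n_samples / (n_classes * class_count), which is the heuristic scikit-learn documents for class_weight="balanced". The toy labels and the helper name balanced_class_weights are assumptions made for illustration and are not taken from the project's notebooks.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} using n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy target with a 20% default rate (1 = default, 0 = repaid).
toy_labels = [1] * 20 + [0] * 80
weights = balanced_class_weights(toy_labels)
print(weights)
```

At a 20% default rate, each default carries four times the weight of a repaid loan (2.5 versus 0.625), so both classes contribute equally to the training loss in aggregate.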

Analysis

Interest Rate and Default Risk

Defaulted loans consistently show higher interest rates, indicating that risk-based pricing reflects genuine default probability.

Income Distribution

Lower income quintiles show elevated default rates, but the effect is weak: once other factors are controlled for, borrowers across income levels show similar repayment patterns.

Loan Grade Performance

Grade A loans default at 6%, while Grade G reaches nearly 50%. The internal grading system proves to be a reliable risk indicator.
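
Per-grade default rates like these can be computed by grouping loans on a categorical column and averaging the binary default flag. The following is a minimal sketch on made-up rows. The helper default_rate_by_group and the toy values are hypothetical and do not reproduce the project's figures.

```python
from collections import defaultdict


def default_rate_by_group(groups, defaulted):
    """Return {group: share of loans with defaulted == 1}, sorted by group."""
    totals = defaultdict(int)
    defaults = defaultdict(int)
    for group, flag in zip(groups, defaulted):
        totals[group] += 1
        defaults[group] += flag
    return {group: defaults[group] / totals[group] for group in sorted(totals)}


# Made-up rows purely for illustration (1 = default, 0 = repaid).
grades = ["A", "A", "A", "A", "G", "G"]
defaulted = [0, 0, 0, 1, 1, 0]
rates = default_rate_by_group(grades, defaulted)
print(rates)
```

The same grouping works for any categorical column, such as home ownership or loan purpose.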

Feature Correlations

Interest rate (0.26 correlation) emerges as the strongest individual predictor. Income shows a weak negative correlation (-0.04) with default.
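
A correlation between a numeric feature and a binary default flag can be computed as an ordinary Pearson coefficient (the point-biserial case). The sketch below assumes that is the measure behind the figures above, which the source does not state. The sample values are invented.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented interest rates (percent) and default flags, for illustration only.
int_rate = [7.5, 9.0, 11.5, 13.0, 16.5, 19.0, 22.0, 24.5]
defaulted = [0, 0, 0, 1, 0, 1, 1, 1]
r = pearson_correlation(int_rate, defaulted)
print(round(r, 3))
```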

Categorical Risk Factors

Renters default at higher rates than homeowners. Small business loans have the highest default rate, at 30%.

Model Results

Model	ROC-AUC
Logistic Regression	0.719
Random Forest	0.715
Gradient Boosting	0.726

ROC Curve Comparison

Gradient Boosting slightly outperforms the other two models (0.726 versus 0.719 for Logistic Regression and 0.715 for Random Forest). All three fall in the 0.71-0.73 range, which is realistic for credit risk prediction without data leakage.
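
For readers unfamiliar with the metric, ROC-AUC can be read as the probability that a randomly chosen defaulted loan receives a higher score than a randomly chosen repaid loan, with ties counted as half. The sketch below implements that pairwise definition directly. It is an illustration with toy scores, not the evaluation code used in the notebooks, and it is quadratic in the number of loans, so it is only suitable for small inputs.

```python
def roc_auc(y_true, y_score):
    """ROC-AUC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy labels and scores (1 = default).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```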

Feature Importance

Sub-grade and interest rate dominate feature importance, followed by loan term and loan-to-income ratio. Origination FICO scores contribute modestly, indicating that LendingClub's internal grading system captures most credit risk signal.
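
The loan-to-income ratio mentioned above is an engineered feature. A plausible definition, assumed here because the source does not spell it out, is the loan amount divided by annual income. The function name loan_to_income and the handling of zero or missing income are hypothetical.

```python
def loan_to_income(loan_amnt, annual_inc):
    """Loan amount as a fraction of annual income, or None if income is unusable."""
    if annual_inc is None or annual_inc <= 0:
        return None  # assumption: leave the feature missing rather than divide by zero
    return loan_amnt / annual_inc


# Using the loan_amnt and annual_inc values from the Scoring API example.
ratio = loan_to_income(20000, 80000)
print(ratio)  # 0.25
```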

Project Structure

risk_modeling/
├── images/                 # Visualizations and charts
├── notebooks/
│   ├── 01_data_acquisition.ipynb
│   ├── 02_data_cleaning.ipynb
│   ├── 03_feature_engineering.ipynb
│   ├── 04_eda.ipynb
│   └── 05_modeling.ipynb
├── src/
│   └── api/
│       └── risk_scorer.py  # Scoring API
├── data/
│   ├── raw/                # Raw LendingClub data
│   ├── interim/            # Cleaned data
│   └── processed/          # Feature-engineered data
├── models/
│   ├── trained/            # Saved models and feature names
│   └── scalers/            # StandardScaler for preprocessing
└── requirements.txt

Usage

Setup

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Run Notebooks

Run notebooks 01 through 05 in numerical order to reproduce the analysis.

Scoring API

from src.api.risk_scorer import CreditRiskScorer

scorer = CreditRiskScorer()
result = scorer.predict({
    # Required fields
    "loan_amnt": 20000,
    "annual_inc": 80000,
    "dti": 15.5,
    "installment": 665.0,
    # Recommended fields for better predictions
    "term": " 36 months",
    "int_rate": 12.5,
    "grade": "B",
    "home_ownership": "RENT",
    "purpose": "debt_consolidation",
    "revol_bal": 12000,
    "revol_util": 45.0,
})

# Output:
# {
#     "risk_score": 45,
#     "decision": "MANUAL_REVIEW",
#     "default_probability": 0.4588,
#     "confidence": "LOW"
# }

Required fields: loan_amnt, annual_inc, dti, installment
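
A caller may want to check for the required fields before invoking the scorer. The helper below is a hypothetical pre-check written for this README. It is not part of CreditRiskScorer, and how the scorer itself reacts to missing fields is not documented here.

```python
REQUIRED_FIELDS = ("loan_amnt", "annual_inc", "dti", "installment")


def missing_required_fields(application):
    """Return the required field names that are absent or None in the input dict."""
    return [name for name in REQUIRED_FIELDS if application.get(name) is None]


incomplete = {"loan_amnt": 20000, "dti": 15.5}
missing = missing_required_fields(incomplete)
print(missing)  # ['annual_inc', 'installment']
```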

Decision thresholds:

APPROVE: risk_score <= 30
MANUAL_REVIEW: 30 < risk_score <= 60
REJECT: risk_score > 60
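
The thresholds above translate into a simple mapping from risk score to decision. The function name decision_for is an assumption made for illustration. The actual logic lives in src/api/risk_scorer.py and may be structured differently.

```python
def decision_for(risk_score):
    """Map a risk score to a decision using the documented thresholds."""
    if risk_score <= 30:
        return "APPROVE"
    if risk_score <= 60:
        return "MANUAL_REVIEW"
    return "REJECT"


for score in (30, 45, 60, 61):
    print(score, decision_for(score))
```

A score of 45, as in the example output, falls in the MANUAL_REVIEW band.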

Key Takeaways

Data leakage prevention - Removed post-origination features (last FICO scores, payment history) that would not be available at loan decision time
Class imbalance handling - Applied balanced class weights to improve minority class detection
Sub-grade signal - LendingClub's internal grading system captures most predictive signal, outperforming raw FICO scores
Realistic performance - ROC-AUC of roughly 0.72 reflects what is achievable using only information available at loan decision time, without information leakage
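
The leakage-prevention step can be pictured as dropping every column that is only known after origination. The sketch below operates on a plain dict for brevity. The column names follow the public LendingClub data dictionary as commonly distributed, but the exact set of columns the project removes is an assumption, as is the helper name drop_post_origination.

```python
# Assumed examples of post-origination columns; the project's actual list may differ.
POST_ORIGINATION_COLUMNS = {
    "last_fico_range_high",
    "last_fico_range_low",
    "total_pymnt",
    "last_pymnt_amnt",
    "recoveries",
}


def drop_post_origination(record):
    """Return a copy of the record without columns unknown at decision time."""
    return {k: v for k, v in record.items() if k not in POST_ORIGINATION_COLUMNS}


raw = {
    "loan_amnt": 20000,
    "int_rate": 12.5,
    "last_fico_range_high": 700,  # observed after origination, so it leaks the outcome
    "total_pymnt": 5000.0,        # payment history, also unavailable at decision time
}
cleaned = drop_post_origination(raw)
print(cleaned)
```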

Stack

pandas, numpy, scikit-learn, matplotlib, seaborn, kagglehub
