A machine learning pipeline for predicting loan defaults using LendingClub data. This project explores how financial institutions assess credit risk and which borrower characteristics drive default behavior.
Banks face significant losses from loan defaults each year. This project investigates:
- What distinguishes borrowers who repay from those who default?
- Can we predict defaults before they occur?
- Which features carry the most predictive power?
1.3M+ loans from LendingClub (2007-2018) with 150+ features:
- Loan characteristics (amount, term, interest rate)
- Borrower profile (income, employment, home ownership)
- Credit history (utilization, delinquencies, inquiries)
The dataset exhibits a 20% default rate, presenting a class imbalance challenge addressed through weighted training.
Defaulted loans consistently show higher interest rates, indicating that risk-based pricing reflects genuine default probability.
Lower income quintiles show elevated default rates, though the relationship is more nuanced than expected. Borrowers across income levels demonstrate similar repayment patterns when other factors are controlled.
Grade A loans default at 6%, while Grade G reaches nearly 50%. The internal grading system proves to be a reliable risk indicator.
Interest rate (0.26 correlation) emerges as the strongest individual predictor. Income shows a weak negative correlation (-0.04) with default.
Renters default at higher rates than homeowners. Small business loans carry the highest risk at 30%.
| Model | ROC-AUC |
|---|---|
| Logistic Regression | 0.719 |
| Random Forest | 0.715 |
| Gradient Boosting | 0.726 |
Gradient Boosting slightly outperforms other models. All three achieve similar performance in the 0.71-0.73 range, which is realistic for credit risk prediction without data leakage.
Sub-grade and interest rate dominate feature importance, followed by loan term and loan-to-income ratio. Origination FICO scores contribute modestly, indicating that LendingClub's internal grading system captures most credit risk signal.
risk_modeling/
├── images/ # Visualizations and charts
├── notebooks/
│ ├── 01_data_acquisition.ipynb
│ ├── 02_data_cleaning.ipynb
│ ├── 03_feature_engineering.ipynb
│ ├── 04_eda.ipynb
│ └── 05_modeling.ipynb
├── src/
│ └── api/
│ └── risk_scorer.py # Scoring API
├── data/
│ ├── raw/ # Raw LendingClub data
│ ├── interim/ # Cleaned data
│ └── processed/ # Feature-engineered data
├── models/
│ ├── trained/ # Saved models and feature names
│ └── scalers/ # StandardScaler for preprocessing
└── requirements.txt
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txtExecute notebooks 01-05 in order to reproduce the analysis.
from src.api.risk_scorer import CreditRiskScorer
scorer = CreditRiskScorer()
result = scorer.predict({
# Required fields
"loan_amnt": 20000,
"annual_inc": 80000,
"dti": 15.5,
"installment": 665.0,
# Recommended fields for better predictions
"term": " 36 months",
"int_rate": 12.5,
"grade": "B",
"home_ownership": "RENT",
"purpose": "debt_consolidation",
"revol_bal": 12000,
"revol_util": 45.0,
})
# Output:
# {
# "risk_score": 45,
# "decision": "MANUAL_REVIEW",
# "default_probability": 0.4588,
# "confidence": "LOW"
# }Required fields: loan_amnt, annual_inc, dti, installment
Decision thresholds:
- APPROVE: risk_score <= 30
- MANUAL_REVIEW: 30 < risk_score <= 60
- REJECT: risk_score > 60
- Data leakage prevention - Removed post-origination features (last FICO scores, payment history) that would not be available at loan decision time
- Class imbalance handling - Applied balanced class weights to improve minority class detection
- Sub-grade signal - LendingClub's internal grading system captures most predictive signal, outperforming raw FICO scores
- Realistic performance - ROC-AUC of 0.72 reflects production-accurate predictions without information leakage
pandas, numpy, scikit-learn, matplotlib, seaborn, kagglehub







