Objective: to build an ML model that determines whether companies or individuals will be able to repay the money lent to them on time.
- Competition link : https://www.hackerearth.com/challenges/competitive/amexpert-code-lab/machine-learning/credit-card-default-risk-5-95cbc85f/
-
First encoding the categorical columns ['gender','owns_car','owns_house']
- binary encoded (0 or 1)
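A minimal sketch of this step, assuming the raw data is read from train.csv with pandas; the exact raw string values in these columns are not stated, so factorize is used to map whatever two values appear onto 0/1:

```python
import pandas as pd

train = pd.read_csv("train.csv")  # competition training file

# Map each two-valued column to 0/1; factorize assigns 0 to the first
# value encountered and 1 to the other, whatever the raw strings are
for col in ["gender", "owns_car", "owns_house"]:
    train[col], _ = pd.factorize(train[col])
```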
-
the column ['occupation_type'] contains categorical data (16 unique values)
- label-encoded with integers based on position, e.g. 'Laborers' to 2 and 'Drivers' to 4
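A sketch of the positional encoding for occupation_type, continuing from the snippet above; LabelEncoder is one way to do it, though the exact integer each occupation receives depends on the ordering used:

```python
from sklearn.preprocessing import LabelEncoder

# Assign one integer per occupation category; LabelEncoder orders categories
# alphabetically, so 'Laborers' / 'Drivers' may not land on the exact
# integers mentioned in the notes above
le = LabelEncoder()
train["occupation_type"] = le.fit_transform(train["occupation_type"].astype(str))
```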
-
columns ['migrant_worker','prev_defaults','default_in_last_6_month'] are already binary values (0,1)
-
Missing values filled with the mode of the column for categorical columns and the mean for continuous columns
- ['no_of_children','total_family_members','migrant_worker'] ==> mode() of the respective column
- ['no_of_days_employed','yearly_debt_payments','credit_score'] ==> mean() of the respective column
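A sketch of the imputation step, assuming a simple fillna on the same DataFrame:

```python
# Discrete columns: fill missing values with the most frequent value (mode)
for col in ["no_of_children", "total_family_members", "migrant_worker"]:
    train[col] = train[col].fillna(train[col].mode()[0])

# Continuous columns: fill missing values with the column mean
for col in ["no_of_days_employed", "yearly_debt_payments", "credit_score"]:
    train[col] = train[col].fillna(train[col].mean())
```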
-
handling the imbalanced dataset
- the dataset was highly imbalanced (the ratio of class 0 to class 1 was very large)
- used imblearn's oversampling (SMOTE) to create a balanced dataset
- original size ~41k rows, new size ~83k rows
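A sketch of the oversampling step with imblearn's SMOTE; the target column name 'credit_card_default' and the dropped identifier columns are assumptions:

```python
from imblearn.over_sampling import SMOTE

# Separate features and target ('credit_card_default' is the assumed target name);
# identifier columns such as customer_id / name are assumed to be dropped as well
x = train.drop(columns=["credit_card_default", "customer_id", "name"], errors="ignore")
y = train["credit_card_default"]

# SMOTE synthesizes new minority-class samples until the classes are balanced,
# roughly doubling ~41k rows to ~83k
x_train, y_train = SMOTE(random_state=42).fit_resample(x, y)
```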
-
dataset at this stage is kept as (x_train,y_train) --> phase1
-
Used standardization to scale each column (except the target column) so that it has mean 0 and standard deviation 1
- used StandardScaler from sklearn.preprocessing
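A sketch of the scaling step (phase1 → phase2); the scaler is fit on the oversampled training features and kept for reuse on the test data:

```python
from sklearn.preprocessing import StandardScaler

# Fit on the training features and transform them; the fitted scaler is
# reused later to transform the test data with the same statistics
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
```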
-
dataset at this stage is kept as (x_train_scaled,y_train) ---> phase2
-
Applied PCA to reduce the dimensions from 16 to 12
- used PCA from sklearn.decomposition
- used an iterative while loop to find how many components are needed to retain 95% of the variance, using the explained variance reported by PCA
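A sketch of the component-selection loop, assuming the per-component explained variance ratios from a full PCA fit are accumulated until 95% of the variance is retained:

```python
from sklearn.decomposition import PCA

# Fit PCA with all components to get each component's share of the variance
variance_ratios = PCA().fit(x_train_scaled).explained_variance_ratio_

# Accumulate components until 95% of the variance is retained
n_components, retained = 0, 0.0
while retained < 0.95 and n_components < len(variance_ratios):
    retained += variance_ratios[n_components]
    n_components += 1
# On this dataset the loop ends at 12 of the 16 dimensions (per the notes above)

# Refit with the chosen number of components; this fitted PCA is reused on the test set
pca = PCA(n_components=n_components)
x_train_scaled_pca = pca.fit_transform(x_train_scaled)
```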
-
dataset at this stage is kept as (x_train_scaled_pca,y_train) ---> phase3
-
y_train remains the same as it contains only the targets (dependent variable)
-
The test data (test.csv) is preprocessed with the same fitted instances used on the training dataset
- instances like StandardScaler and PCA
- created 3 different test dataframes
- test_x, test_x_scaled, test_x_scaled_pca (just like the training phase)
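A sketch of the test-side preprocessing; the important detail is that only transform() is called, so the test data is scaled and projected with the statistics learned from the training data (the encoding and imputation steps are assumed to mirror the training pipeline and are omitted here):

```python
test = pd.read_csv("test.csv")
# ... same encoding and imputation as on train (omitted) ...

test_x = test.drop(columns=["customer_id", "name"], errors="ignore")

# Reuse the fitted scaler and PCA -- transform only, never refit on test data
test_x_scaled = scaler.transform(test_x)
test_x_scaled_pca = pca.transform(test_x_scaled)
```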
- Used KFold cross-validation and ran the training dataset through 5 different classifiers; next to each is the average F1 score from cross_val_score() (a sketch of this loop follows the results below)
-
normal raw dataset (phase1 dataset)
- Logistic Regression : 0.9378
- KNeighborsClassifier : 0.8301
- LinearSVC : 0.7213
- Random forest : 0.9874
- XGBoost : 0.9807
-
scaled dataset (phase2 dataset)
- Logistic Regression : 0.9688
- KNeighborsClassifier : 0.9704
- LinearSVC : 0.9692
- Random forest : 0.9874
- XGBoost : 0.9810
-
(scaled + pca) dataset (phase3 dataset)
- Logistic Regression : 0.9689
- KNeighborsClassifier : 0.9703
- LinearSVC : 0.9692
- Random forest : 0.9810
- XGBoost : 0.9721
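A sketch of the cross-validation loop behind the scores above, shown for the phase2 dataset; the fold count and classifier settings are assumptions (library defaults), since the exact hyperparameters used are not stated:

```python
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "LinearSVC": LinearSVC(),
    "Random forest": RandomForestClassifier(),
    "XGBoost": XGBClassifier(),
}

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for name, clf in classifiers.items():
    # average F1 across the folds, e.g. on the phase2 (scaled) dataset
    scores = cross_val_score(clf, x_train_scaled, y_train, cv=kf, scoring="f1")
    print(f"{name}: {scores.mean():.4f}")
```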
-
- the scores below are on the test dataset (test.csv); a sketch of this evaluation follows the results
-
normal raw dataset (test_x)
- Logistic Regression : 0.8085
- KNeighborsClassifier : 0.4681
- LinearSVC : 0.8085
- Random forest : 0.91798
- XGBoost : 0.8954
-
scaled dataset (test_x_scaled)
- Logistic Regression : 0.8827
- KNeighborsClassifier : 0.8710
- LinearSVC : 0.8827
- Random forest : 0.91731
- XGBoost : 0.8962
-
(scaled + pca) dataset (test_x_scaled_pca)
- Logistic Regression : 0.8835
- KNeighborsClassifier : 0.8615
- LinearSVC : 0.8835
- Random forest : 0.89368
- XGBoost : 0.8816
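A sketch of the test-set evaluation behind the scores above, shown for the scaled (phase2) data; test_y, the true test labels, is an assumption, since the notes do not say whether these scores came from a local hold-out or from the platform's scoring:

```python
from sklearn.metrics import f1_score

# Fit each classifier on the scaled training data and score its test predictions;
# test_y (the true labels for test.csv) is assumed to be available
for name, clf in classifiers.items():
    clf.fit(x_train_scaled, y_train)
    preds = clf.predict(test_x_scaled)
    print(f"{name}: {f1_score(test_y, preds):.4f}")
```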
-
final standing on the competition leaderboard: Rank 93 ; link to leaderboard: https://www.hackerearth.com/challenges/competitive/amexpert-code-lab/leaderboard/credit-card-default-risk-5-95cbc85f/
-
possible next steps:
- more EDA
- hyperparameter tuning