Objective: to build an ML model that determines whether companies or individuals will be able to repay the money lent to them on time.
- Competition link : https://www.hackerearth.com/challenges/competitive/amexpert-code-lab/machine-learning/credit-card-default-risk-5-95cbc85f/
-
First encoding the categorical columns ['gender','owns_car','owns_house']
- binary encoded (0 or 1)
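A minimal sketch of this step, assuming the raw data is read from train.csv with pandas; the exact raw string values in these columns are not stated, so factorize is used to map whatever two values appear onto 0/1:

```python
import pandas as pd

train = pd.read_csv("train.csv")  # competition training file

# Map each two-valued column to 0/1; factorize assigns 0 to the first
# value encountered and 1 to the other, whatever the raw strings are
for col in ["gender", "owns_car", "owns_house"]:
    train[col], _ = pd.factorize(train[col])
```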
-
the column ['occupation_type'] contains categorical data (16 unique values)
- label-encoded with integers based on position, e.g. 'Laborers' to 2 and 'Drivers' to 4
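A sketch of the positional encoding for occupation_type, continuing from the snippet above; LabelEncoder is one way to do it, though the exact integer each occupation receives depends on the ordering used:

```python
from sklearn.preprocessing import LabelEncoder

# Assign one integer per occupation category; LabelEncoder orders categories
# alphabetically, so 'Laborers' / 'Drivers' may not land on the exact
# integers mentioned in the notes above
le = LabelEncoder()
train["occupation_type"] = le.fit_transform(train["occupation_type"].astype(str))
```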
-
columns ['migrant_worker','prev_defaults','default_in_last_6_month'] are already binary values (0,1)
-
Missing values filled with the mode of the column for categorical columns and the mean for continuous columns
- ['no_of_children','total_family_members','migrant_worker'] ==> mode() of the respective column
- ['no_of_days_employed','yearly_debt_payments','credit_score'] ==> mean() of the respective column
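A sketch of the imputation step, assuming a simple fillna on the same DataFrame:

```python
# Discrete columns: fill missing values with the most frequent value (mode)
for col in ["no_of_children", "total_family_members", "migrant_worker"]:
    train[col] = train[col].fillna(train[col].mode()[0])

# Continuous columns: fill missing values with the column mean
for col in ["no_of_days_employed", "yearly_debt_payments", "credit_score"]:
    train[col] = train[col].fillna(train[col].mean())
```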
-
handling the imbalanced dataset
- the dataset was highly imbalanced (the ratio of class 0 to class 1 was very large)
- used imblearn's oversampling (SMOTE) to create a balanced dataset
- original size ~41k rows, new size ~83k rows
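A sketch of the oversampling step with imblearn's SMOTE; the target column name 'credit_card_default' and the dropped identifier columns are assumptions:

```python
from imblearn.over_sampling import SMOTE

# Separate features and target ('credit_card_default' is the assumed target name);
# identifier columns such as customer_id / name are assumed to be dropped as well
x = train.drop(columns=["credit_card_default", "customer_id", "name"], errors="ignore")
y = train["credit_card_default"]

# SMOTE synthesizes new minority-class samples until the classes are balanced,
# roughly doubling ~41k rows to ~83k
x_train, y_train = SMOTE(random_state=42).fit_resample(x, y)
```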
-
dataset at this stage is kept as (x_train,y_train) --> phase1
-
Used standardization to scale each column (except the target column) so that it has mean 0 and standard deviation 1
- used StandardScaler from sklearn.preprocessing
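A sketch of the scaling step (phase1 → phase2); the scaler is fit on the oversampled training features and kept for reuse on the test data:

```python
from sklearn.preprocessing import StandardScaler

# Fit on the training features and transform them; the fitted scaler is
# reused later to transform the test data with the same statistics
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
```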
-
dataset at this stage is kept as (x_train_scaled,y_train) ---> phase2
-
Applied PCA to reduce the dimensions from 16 to 12
- used PCA from sklearn.decomposition
- used an iterative while loop to find how many components are needed to retain 95% of the variance, using the explained variance reported by PCA
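A sketch of the component-selection loop, assuming the per-component explained variance ratios from a full PCA fit are accumulated until 95% of the variance is retained:

```python
from sklearn.decomposition import PCA

# Fit PCA with all components to get each component's share of the variance
variance_ratios = PCA().fit(x_train_scaled).explained_variance_ratio_

# Accumulate components until 95% of the variance is retained
n_components, retained = 0, 0.0
while retained < 0.95 and n_components < len(variance_ratios):
    retained += variance_ratios[n_components]
    n_components += 1
# On this dataset the loop ends at 12 of the 16 dimensions (per the notes above)

# Refit with the chosen number of components; this fitted PCA is reused on the test set
pca = PCA(n_components=n_components)
x_train_scaled_pca = pca.fit_transform(x_train_scaled)
```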
-
dataset at this stage is kept as (x_train_scaled_pca,y_train) ---> phase3
-
y_train remains the same as it contains only the targets (dependent variable)
-
The test data (test.csv) is preprocessed with the same fitted instances used on the training dataset
- instances like StandardScaler and PCA
- created 3 different test dataframes
- test_x, test_x_scaled, test_x_scaled_pca (just like the training phase)
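A sketch of the test-side preprocessing; the important detail is that only transform() is called, so the test data is scaled and projected with the statistics learned from the training data (the encoding and imputation steps are assumed to mirror the training pipeline and are omitted here):

```python
test = pd.read_csv("test.csv")
# ... same encoding and imputation as on train (omitted) ...

test_x = test.drop(columns=["customer_id", "name"], errors="ignore")

# Reuse the fitted scaler and PCA -- transform only, never refit on test data
test_x_scaled = scaler.transform(test_x)
test_x_scaled_pca = pca.transform(test_x_scaled)
```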
- Used KFold cross-validation and ran the training dataset through 5 different classifiers; next to each is the average F1 score from cross_val_score() (a sketch of this loop follows the results below)
-
normal raw dataset (phase1 dataset)
- Logistic Regression : 0.9378
- KNeighborsClassifier : 0.8301
- LinearSVC : 0.7213
- Random forest : 0.9874
- XGBoost : 0.9807
-
scaled dataset (phase2 dataset)
- Logistic Regression : 0.9688
- KNeighborsClassifier : 0.9704
- LinearSVC : 0.9692
- Random forest : 0.9874
- XGBoost : 0.9810
-
(scaled + pca) dataset (phase3 dataset)
- Logistic Regression : 0.9689
- KNeighborsClassifier : 0.9703
- LinearSVC : 0.9692
- Random forest : 0.9810
- XGBoost : 0.9721
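A sketch of the cross-validation loop behind the scores above, shown for the phase2 dataset; the fold count and classifier settings are assumptions (library defaults), since the exact hyperparameters used are not stated:

```python
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "LinearSVC": LinearSVC(),
    "Random forest": RandomForestClassifier(),
    "XGBoost": XGBClassifier(),
}

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for name, clf in classifiers.items():
    # average F1 across the folds, e.g. on the phase2 (scaled) dataset
    scores = cross_val_score(clf, x_train_scaled, y_train, cv=kf, scoring="f1")
    print(f"{name}: {scores.mean():.4f}")
```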
-
- the scores below are on the test dataset (test.csv); a sketch of this evaluation follows the results
-
normal raw dataset (test_x)
- Logistic Regression : 0.8085
- KNeighborsClassifier : 0.4681
- LinearSVC : 0.8085
- Random forest : 0.91798
- XGBoost : 0.8954
-
scaled dataset (test_x_scaled)
- Logistic Regression : 0.8827
- KNeighborsClassifier : 0.8710
- LinearSVC : 0.8827
- Random forest : 0.91731
- XGBoost : 0.8962
-
(scaled + pca) dataset (test_x_scaled_pca)
- Logistic Regression : 0.8835
- KNeighborsClassifier : 0.8615
- LinearSVC : 0.8835
- Random forest : 0.89368
- XGBoost : 0.8816
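A sketch of the test-set evaluation behind the scores above, shown for the scaled (phase2) data; test_y, the true test labels, is an assumption, since the notes do not say whether these scores came from a local hold-out or from the platform's scoring:

```python
from sklearn.metrics import f1_score

# Fit each classifier on the scaled training data and score its test predictions;
# test_y (the true labels for test.csv) is assumed to be available
for name, clf in classifiers.items():
    clf.fit(x_train_scaled, y_train)
    preds = clf.predict(test_x_scaled)
    print(f"{name}: {f1_score(test_y, preds):.4f}")
```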
-
final standing on the competition leaderboard: Rank 93 ; link to leaderboard: https://www.hackerearth.com/challenges/competitive/amexpert-code-lab/leaderboard/credit-card-default-risk-5-95cbc85f/
-
possible next steps:
- more EDA
- hyperparameter tuning