My work/solution to the online ML competition "AmExpert 2021 CODELAB – Machine Learning Hackathon" on HackerEarth platform.


AmExpert CodeLab

Credit Card default risk

Objective: Build an ML model to determine whether companies or individuals will be able to repay borrowed money on time.

Approach

  • First, encoded the categorical columns ['gender','owns_car','owns_house']

    • binary encoded (0 or 1)
  • column ['occupation_type'] contains categorical data (16 unique values)

    • label-encoded with values 0 to 15 based on position, e.g. 'Laborers' to 2 and 'Drivers' to 4
  • columns ['migrant_worker','prev_defaults','default_in_last_6_month'] already contain binary values (0,1)
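A minimal sketch of the encoding step above, using a small hypothetical sample (the values and the position-based mapping are illustrative; the real data has 16 occupation types):

```python
import pandas as pd

# Hypothetical rows mirroring the competition schema.
df = pd.DataFrame({
    "gender": ["M", "F", "M"],
    "owns_car": ["Y", "N", "Y"],
    "owns_house": ["N", "Y", "Y"],
    "occupation_type": ["Laborers", "Drivers", "Laborers"],
})

# Binary columns -> 0/1.
df["gender"] = (df["gender"] == "M").astype(int)
for col in ["owns_car", "owns_house"]:
    df[col] = (df[col] == "Y").astype(int)

# Label-encode occupation_type by position among its sorted unique values.
mapping = {v: i for i, v in enumerate(sorted(df["occupation_type"].unique()))}
df["occupation_type"] = df["occupation_type"].map(mapping)
```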

  • Missing values filled with the mode (categorical columns) or the mean (continuous columns)

    • ['no_of_children','total_family_members','migrant_worker'] ==> mode() for missing values
    • ['no_of_days_employed','yearly_debt_payments','credit_score'] ==> mean() for missing values in each column
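The imputation above can be sketched as follows (toy values; only two of the listed columns are shown):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "no_of_children": [1.0, np.nan, 1.0, 2.0],   # discrete -> mode
    "credit_score": [700.0, 650.0, np.nan, 600.0],  # continuous -> mean
})

# Categorical/discrete columns: fill with the mode.
for col in ["no_of_children"]:
    df[col] = df[col].fillna(df[col].mode()[0])

# Continuous columns: fill with the column mean.
for col in ["credit_score"]:
    df[col] = df[col].fillna(df[col].mean())
```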
  • Handling the imbalanced dataset

    • the dataset was highly imbalanced (the ratio of class 0 to class 1 was very large)
    • used imblearn's oversampling (SMOTE) to create a balanced dataset
      • original size ~41k rows, new size ~83k rows
  • dataset at this stage is kept as (x_train,y_train) --> phase1

  • Used standardization to scale each column (except the target column) so its mean is 0 and its standard deviation is 1

    • used StandardScaler from sklearn.preprocessing
  • dataset at this stage is kept as (x_train_scaled,y_train) ---> phase2
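A minimal sketch of the scaling step, with a tiny made-up matrix in place of the real training features:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix standing in for x_train.
x_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

# Fit the scaler and transform: each column gets mean 0, std 1.
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
```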

  • Applied PCA to reduce dimensions from 16 to 12

    • used PCA from sklearn.decomposition
    • used an iterative while loop to see how many columns could be dropped while retaining 95% of the variance, using the explained variance reported by PCA
  • dataset at this stage is kept as (x_train_scaled_pca,y_train) ---> phase3
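The component-selection loop above can be sketched as follows (random data stands in for the 16-feature training matrix, so the number of components it selects will differ from the 12 found on the real data):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(500, 16)))

# Fit a full PCA, then grow the component count until 95% of the
# variance is retained.
pca_full = PCA().fit(X)
n_components = 1
while pca_full.explained_variance_ratio_[:n_components].sum() < 0.95:
    n_components += 1

# Refit with the chosen number of components.
X_pca = PCA(n_components=n_components).fit_transform(X)
```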

  • y_train remains the same, as it contains only the targets (dependent variable)

  • Used the same fitted instances from the training phase to preprocess the test data (test.csv)

    • instances such as StandardScaler and PCA
    • created 3 different test dataframes
      • test_x, test_x_scaled, test_x_scaled_pca (mirroring the training phases)
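Reusing the fitted instances on the test set can be sketched as below (random matrices stand in for the real train/test features):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
x_train = rng.normal(size=(200, 16))  # stand-in for training features
test_x = rng.normal(size=(50, 16))    # stand-in for test.csv features

# Fit scaler and PCA on the training data only...
scaler = StandardScaler().fit(x_train)
pca = PCA(n_components=12).fit(scaler.transform(x_train))

# ...then reuse the same fitted instances to transform the test set,
# so train and test share identical preprocessing.
test_x_scaled = scaler.transform(test_x)
test_x_scaled_pca = pca.transform(test_x_scaled)
```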

Testing different ML classifiers on each training dataset:

  • Used KFold cross-validation to run each training dataset through 5 different classifiers; listed next to each classifier is its average F1 score from cross_val_score()
    • normal raw dataset (phase1 dataset)

      • Logistic Regression : 0.9378
      • KNeighborsClassifier : 0.8301
      • LinearSVC : 0.7213
      • Random forest : 0.9874
      • XGBoost : 0.9807
    • scaled dataset (phase2 dataset)

      • Logistic Regression : 0.9688
      • KNeighborsClassifier : 0.9704
      • LinearSVC : 0.9692
      • Random forest : 0.9874
      • XGBoost : 0.9810
    • (scaled + pca) dataset (phase3 dataset)

      • Logistic Regression : 0.9689
      • KNeighborsClassifier : 0.9703
      • LinearSVC : 0.9692
      • Random forest : 0.9810
      • XGBoost : 0.9721
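The cross-validation comparison above can be sketched as follows; a synthetic dataset stands in for the phase datasets, only two of the five classifiers are shown, and the scores will differ from those reported above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for one of the phase datasets.
X, y = make_classification(n_samples=500, random_state=0)

# 5-fold cross-validation, averaging the F1 score per classifier.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
for name, clf in [
    ("Logistic Regression", LogisticRegression(max_iter=1000)),
    ("Random forest", RandomForestClassifier(random_state=0)),
]:
    scores = cross_val_score(clf, X, y, cv=cv, scoring="f1")
    print(f"{name}: {scores.mean():.4f}")
```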

Score / Result from online HackerEarth Judge

  • here, each dataset refers to the test dataset (test.csv)
    • normal raw dataset (test_x)

      • Logistic Regression : 0.8085
      • KNeighborsClassifier : 0.4681
      • LinearSVC : 0.8085
      • Random forest : 0.91798
      • XGBoost : 0.8954
    • scaled dataset (test_x_scaled)

      • Logistic Regression : 0.8827
      • KNeighborsClassifier : 0.8710
      • LinearSVC : 0.8827
      • Random forest : 0.91731
      • XGBoost : 0.8962
    • (scaled + pca) dataset (test_x_scaled_pca)

      • Logistic Regression : 0.8835
      • KNeighborsClassifier : 0.8615
      • LinearSVC : 0.8835
      • Random forest : 0.89368
      • XGBoost : 0.8816

Final standing on the competition leaderboard: Rank 93. Leaderboard: https://www.hackerearth.com/challenges/competitive/amexpert-code-lab/leaderboard/credit-card-default-risk-5-95cbc85f/

Extra steps to improve accuracy:

  • more EDA
  • hyperparameter tuning
