GitHub - recurze/BankMarketingDatasetEDA: Performing Exploratory Data Analysis (EDA)

Goals

Exploratory data analysis (EDA) of the Portuguese Bank Marketing dataset (UCI, Kaggle).
The classification goal is to predict whether a customer will subscribe to a term deposit.

Info

This was a group project for the course "DSA5101: Introduction to Big Data for Industry". I did the EDA and pre-processing parts.
I've also included some models to demo performance. The other members performed a more thorough study of different models and parameters.
The presentation also covered business implications, insights and findings, and future ideas. I won't be updating the slides here.
Please download the dataset from the links provided. I won't be uploading the CSV files here.

Summary

EDA

Get metadata about the data frame (like number of entries) and its columns (like names and type).
Distinguish between numerical and categorical columns and treat them separately.
Get the distribution of the output. Here, it's imbalanced classification: subscribers are the minority class (a setting related to, but distinct from, outlier/novelty detection).
Get the distributions of the individual columns and any relevant conditional distributions (e.g., column | output = 1).
Impute unknown/null values.
Check for (and remove) outliers.
Drop columns that seem logically irrelevant (carefully).
Create new features.
Find correlations using mutual information and correlation matrices.
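
A few of the EDA steps above can be sketched with pandas and scikit-learn. This is only a minimal illustration on a tiny made-up frame, not the notebook's code: the column names (`age`, `balance`, `job`, `y`), the values, and the choice of most-frequent imputation are all assumptions.

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Tiny made-up frame standing in for the bank marketing CSV (hypothetical values).
df = pd.DataFrame({
    "age": [30, 45, 52, 23, 61, 38, 47, 29],
    "balance": [1200.0, 300.0, 5400.0, 80.0, 2500.0, 0.0, 910.0, 150.0],
    "job": ["admin", "unknown", "retired", "student",
            "retired", "admin", "unknown", "student"],
    "y": [0, 0, 1, 0, 1, 0, 0, 0],
})

# Metadata: number of entries and column types.
n_rows, n_cols = df.shape
print(df.dtypes)

# Split feature columns by type so each group can be treated separately.
features = df.drop(columns="y")
numerical_cols = features.select_dtypes(include="number").columns.tolist()
categorical_cols = features.select_dtypes(exclude="number").columns.tolist()

# Distribution of the output: the positive class is the minority.
target_share = df["y"].value_counts(normalize=True)

# Conditional distribution: job | y = 1.
job_given_positive = df.loc[df["y"] == 1, "job"].value_counts(normalize=True)

# Treat the string "unknown" as missing, then impute with the most frequent value.
job = df["job"].mask(df["job"] == "unknown")
job_imputed = job.fillna(job.mode().iloc[0])

# Mutual information between the numerical features and the output.
mi = mutual_info_classif(df[numerical_cols], df["y"], random_state=0)
mi_by_column = dict(zip(numerical_cols, mi))

# Pearson correlation matrix of the numerical columns.
corr = df[numerical_cols].corr()
```

On the real dataset the same calls apply unchanged; only the column lists and the imputation strategy would need to be chosen per column.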

Preprocessing

Use pipelines to streamline processing.
Transform the data:
- Incorporate adding/dropping columns from EDA into pipelines and leave the original data untouched.
- Use column transformers to scale numerical columns and encode + scale categorical columns.
- If your model requires it, consider normalizing.
Select features:
- Remove noisy features to avoid overfitting.
- Remove interdependent features.
- Use dimensionality reduction, univariate feature selection or select from model.
Consider oversampling to tackle the imbalanced dataset.
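
The transform and select steps above can be sketched with scikit-learn alone. This is a minimal sketch under assumptions: the frame, the `customer_id` column, the `balance_per_year` feature, and the helper names `drop_irrelevant` and `add_features` are hypothetical and only illustrate the idea of keeping column changes inside the pipeline.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Hypothetical stand-in for the raw data.
raw = pd.DataFrame({
    "age": [30, 45, 52, 23, 61, 38, 47, 29],
    "balance": [1200.0, 300.0, 5400.0, 80.0, 2500.0, 0.0, 910.0, 150.0],
    "customer_id": list(range(8)),
    "job": ["admin", "technician", "retired", "student",
            "admin", "retired", "technician", "student"],
})
y = [0, 0, 1, 0, 1, 0, 0, 0]


def drop_irrelevant(frame):
    # DataFrame.drop returns a new frame, so the input is left untouched.
    return frame.drop(columns=["customer_id"])


def add_features(frame):
    # DataFrame.assign also returns a copy; the feature itself is illustrative.
    return frame.assign(balance_per_year=frame["balance"] / frame["age"])


column_transformer = ColumnTransformer(
    [
        # Scale numerical columns.
        ("num", StandardScaler(), ["age", "balance", "balance_per_year"]),
        # Encode, then scale, categorical columns.
        ("cat", Pipeline([
            ("encode", OneHotEncoder(handle_unknown="ignore")),
            ("scale", StandardScaler(with_mean=False)),  # works on sparse input
        ]), ["job"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

preprocess = Pipeline([
    ("cleaner", FunctionTransformer(drop_irrelevant)),
    ("feature_engineer", FunctionTransformer(add_features)),
    ("column_transformer", column_transformer),
    # Univariate feature selection: keep the 3 highest-scoring columns.
    ("feature_selector", SelectKBest(f_classif, k=3)),
])

X = preprocess.fit_transform(raw, y)
print(X.shape)
print(list(raw.columns))  # the original frame still has its original columns
```

Because every column change happens inside a transformer, the same steps are replayed identically on validation and test data.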

The pipeline looks like this:

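# Assumed imports: IMBPipeline is taken to be imbalanced-learn's Pipeline.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as IMBPipeline
from sklearn.ensemble import RandomForestClassifier

# data_cleaner, feature_engineer and column_transformer are assumed to be defined earlier.
model = IMBPipeline([
    ('cleaner', data_cleaner),  # drops columns and imputes missing values
    ('feature_engineer', feature_engineer),  # adds new columns
    ('column_transformer', column_transformer),  # scales and encodes columns
    ('feature_selector', 'passthrough'),  # placeholder; no selection by default
    ('oversampler', SMOTE(random_state=0)),  # rebalances classes during fit
    ('predictor', RandomForestClassifier(random_state=0)),
])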
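
In imbalanced-learn's pipeline, samplers are applied only during fit, so the test data is never resampled. To show what oversampling does to the class counts, here is a rough sketch that swaps SMOTE for plain random oversampling via `sklearn.utils.resample`. This is a simpler technique (it duplicates minority rows rather than synthesizing new ones), and the data is synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic data with 10% positives (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.array([1] * 20 + [0] * 180)

# Split first, so that only the training portion is ever resampled.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

minority = y_train == 1
n_majority = int((~minority).sum())

# Random oversampling: draw minority rows with replacement until classes match.
X_min_up, y_min_up = resample(
    X_train[minority], y_train[minority],
    replace=True, n_samples=n_majority, random_state=0,
)

X_balanced = np.vstack([X_train[~minority], X_min_up])
y_balanced = np.concatenate([y_train[~minority], y_min_up])

print(np.bincount(y_train))     # imbalanced training counts
print(np.bincount(y_balanced))  # balanced after oversampling
print(np.bincount(y_test))      # test set keeps the original imbalance
```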

Models

Trees:
- Boosting
- Bagging
- Balanced
- Ensemble
Linear models and neural networks (NNs)
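
As a rough guide, the model families above map onto scikit-learn estimators roughly as follows. The specific estimators and settings are my assumptions, not necessarily the ones used in the notebook. In particular, "balanced" is approximated here with `class_weight="balanced"`, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# Synthetic imbalanced data (about 10% positives).
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.9, 0.1], random_state=0
)

candidates = {
    "boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "bagging": BaggingClassifier(n_estimators=20, random_state=0),
    # Class weights inversely proportional to class frequencies.
    "balanced": RandomForestClassifier(
        n_estimators=50, class_weight="balanced", random_state=0
    ),
    "linear": LogisticRegression(max_iter=1000),
    "neural_net": MLPClassifier(
        hidden_layer_sizes=(16,), max_iter=500, random_state=0
    ),
}

# An ensemble that averages predicted probabilities of several candidates.
candidates["ensemble"] = VotingClassifier(
    estimators=[(name, candidates[name]) for name in ("boosting", "balanced", "linear")],
    voting="soft",
)

fitted = {name: model.fit(X, y) for name, model in candidates.items()}
```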

Performance Analysis

Scoring:
- Choose a scoring metric (preferably more than one).
- Cross-validate or use a simple train/test split.
- Generate a report (e.g., confusion_matrix).
Model Selection:
- Use grid search to find the best hyperparameters.
- Use a tuner if you are using deep learning.
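
The scoring and grid search steps can be sketched as below. It also shows one use for a `'passthrough'` feature selector step: the grid search can swap it for a real selector. The data is synthetic, and the metrics and parameter grid are illustrative assumptions rather than the values used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (about 15% positives).
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.85, 0.15], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("feature_selector", "passthrough"),
    ("predictor", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Cross-validate with more than one metric; accuracy alone misleads on imbalanced data.
scores = cross_validate(
    pipe, X_train, y_train, cv=5,
    scoring=["f1", "roc_auc", "balanced_accuracy"],
)
print({key: values.mean() for key, values in scores.items() if key.startswith("test_")})

# Grid search over hyperparameters, including whether to select features at all.
param_grid = {
    "feature_selector": ["passthrough", SelectKBest(f_classif, k=5)],
    "predictor__max_depth": [3, None],
}
search = GridSearchCV(pipe, param_grid, scoring="f1", cv=5)
search.fit(X_train, y_train)
print(search.best_params_)

# Report on the held-out split: rows are true classes, columns are predictions.
cm = confusion_matrix(y_test, search.predict(X_test))
print(cm)
```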
