Project Overview

layout

title

image

Project Overview

Context

Welcome to the Credit Card Fraud Detection Project!

Credit card fraud poses a major threat to both financial institutions and consumers. As online transactions become more prevalent, identifying fraudulent activities has become more complex.

Here, I created a fraud detection system using Python and widely-used machine learning libraries like scikit-learn. The primary goal was to build a robust fraud detection system using Random Forest, Logistic Regression, and Decision Tree classifiers. This project also addresses the class imbalance problem inherent in credit card fraud datasets.

Project Structure

creditcard.csv: The dataset containing transaction details and fraud labels (The dataset used in this project is available on Kaggle).
Credit Card Fraud Detection.py: The main script containing data preprocessing, model training, evaluation, and model saving.

credit_card_model.p: The serialized Random Forest model saved using pickle.

Actions

I first needed to compile the necessary data from tables in the database.

As I was predicting a binary output, I tested three classification modeling approaches, namely:

Logistic Regression
Decision Tree
Random Forest

I imported the data, defined models using a dictionary, trained & tested each model, and then measured this predictive performance based on several metrics to give a well-rounded overview of which was the best. I also used undersampling and oversampling techniques to address the class imbalance problem inherent in credit card fraud datasets.

Results

Here, I aimed to build a model that would accurately predict the fraud transaction. Based upon these, the chosen model is the Random Forest as it was the most consistently performant on the test set across classification accuracy, precision, recall, and f1-score.

**Metric 1: Classification Accuracy**

Random Forest = 0.99
Decision Tree = 0.99
Logistic Regression = 0.99

**Metric 2: Precision**

Random Forest = 0.90
Logistic Regression = 0.89
Decision Tree = 0.66

**Metric 3: Recall**

Random Forest = 0.76
Decision Tree = 0.75
Logistic Regression = 0.6

**Metric 4: F1 Score**

Random Forest = 0.82
Logistic Regression = 0.72
Decision Tree = 0.7

Handling Class Imbalance

The project addresses class imbalance using:

Undersampling: Reducing the majority class to balance the dataset.

Oversampling: Using SMOTE (Synthetic Minority Over-sampling Technique) to generate synthetic samples for the minority class.

I implemented both undersampling and oversampling techniques, but oversampling appears to be more favorable as it ensures no data is lost.

___

Modelling Overview

We built a model that looked to accurately predict fraud transaction.

As we were predicting a binary output, we tested three classification modeling approaches, namely:

Logistic Regression
Decision Tree
Random Forest

Import Required Packagess

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle


from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

Get the Data

** Read the 'Creditcard.csv file into a dataframe **

data = pd.read_csv("creditcard.csv")

Data Preprocessing

# Check data head
data.head()

# Let's see all the columns (instead of seeing ... for a couple of them)
pd.options.display.max_columns = None
data.head()

# Check data tail
data.tail()

# Check data shape
data.shape
print(f'Number of rows: {data.shape[0]}')
print(f'Number of columns: {data.shape[1]}')

# Check data type and missing values
data.info()
data.isna().sum()

# Standardisation

scale_standard = StandardScaler()

# Convert the Result from Array to DataFrame
data['Amount'] = scale_standard.fit_transform(pd.DataFrame(data['Amount']))         

# Drop Time column
data.drop('Time', axis = 1, inplace = True)

# Find duplicated data
data.duplicated().any()

# Drop duplicated data
data = data.drop_duplicates()
data.shape

I also investigated the class balance of the dependent variable - which is important when assessing classification accuracy.

# Class balance
data['Class'].value_counts()

From the last step in the above code, I saw that **0.2% of the data were in class 1 and the rest of them were in class 0**. This showed me a clear class imbalance. I made sure to not rely on classification accuracy alone when assessing results - also analyzing Precision, Recall, and F1-Score.

Create Input and Output Variables

X = data.drop(['Class'], axis = 1)
y = data['Class']

Train Test Split

Train_test_split was used to split our data into a training set and a testing set.

X_train, X_test, y_train, y_test = train_test_split (X, y, test_size = 0.2, random_state = 42)

Model Training and Assessment

Three classifiers were trained and evaluated:

Logistic Regression Decision Tree Classifier Random Forest Classifier

Each model's performance was evaluated using the following metrics:

Accuracy Precision Recall F1-Score

classifier = {
       'LogisticRegression': LogisticRegression(random_state = 42),
       'DecisionTreeClassifier': DecisionTreeClassifier(random_state = 42),
       'RandomForestClassifier': RandomForestClassifier(random_state = 42)}

for name, clf in classifier.items():
    
    print(f'\n==========={name}==========')
    
    clf.fit(X_train, y_train)
    y_pred_class = clf.predict(X_test)

    # Accuracy (the number of correct classification out of all attempted classifications)
    accuracy = accuracy_score(y_test, y_pred_class) 
    print(f'\n Accuracy: {accuracy}')
    
    # Precision (of all observations that were predicted as positive, how many were actually positive)
    precision = precision_score(y_test, y_pred_class)
    print(f'\n Precision: {precision}')

    # Recall (of all positive observations, how many did we predict as positive)
    recall = recall_score(y_test, y_pred_class) 
    print(f'\n Recall: {recall}')

    # F1-Score (the harmonic mean of precision and recall)
    f1 = f1_score(y_test, y_pred_class)
    print(f'\n F1_score: {f1}')

Handling Class Imbalance

The dataset includes two classes of data: class 1 represents fraudulent transactions, while class 0 represents normal transactions:

normal = data[data['Class']==0]
fraud = data[data['Class']==1]

normal.shape
fraud.shape

Undersampling

normal.sample = normal.sample(n=473)

# Create new undersampled dataset

new_data  = pd.concat([normal.sample, fraud], ignore_index=True)

new_data.head()

new_data['Class'].value_counts()

X = new_data.drop(['Class'], axis = 1)
y = new_data['Class']

X_train, X_test, y_train, y_test = train_test_split (X, y, test_size = 0.2, random_state = 42)

classifier = {
       'LogisticRegression': LogisticRegression(random_state = 42),
       'DecisionTreeClassifier': DecisionTreeClassifier(random_state = 42),
       'RandomForestClassifier': RandomForestClassifier(random_state = 42)}               

for name, clf in classifier.items():
    
    print(f'\n==========={name}==========')
    
    clf.fit(X_train, y_train)
    y_pred_class = clf.predict(X_test)

    # Accuracy (the number of correct classification out of all attempted classifications)
    accuracy = accuracy_score(y_test, y_pred_class) 
    print(f'\n Accuracy: {accuracy}')
    
    # Precision (of all observations that were predicted as positive, how many were actually positive)
    precision = precision_score(y_test, y_pred_class)
    print(f'\n Precision: {precision}')

    # Recall (of all positive observations, how many did we predict as positive)
    recall = recall_score(y_test, y_pred_class) 
    print(f'\n Recall: {recall}')

    # F1-Score (the harmonic mean of precision and recall)
    f1 = f1_score(y_test, y_pred_class)
    print(f'\n F1_score: {f1}')

Oversampling

X = data.drop(['Class'], axis = 1)
y = data['Class']

X.shape
y.shape

X_res, y_res = SMOTE().fit_resample(X,y)

y_res.value_counts()

X_train, X_test, y_train, y_test = train_test_split (X_res, y_res, test_size = 0.2, random_state = 42)

classifier = {
       'LogisticRegression': LogisticRegression(random_state = 42),
       'DecisionTreeClassifier': DecisionTreeClassifier(random_state = 42),
       'RandomForestClassifier': RandomForestClassifier(random_state = 42)}              

for name, clf in classifier.items():
    
    print(f'\n==========={name}==========')
    
    clf.fit(X_train, y_train)
    y_pred_class = clf.predict(X_test)

    # Accuracy (the number of correct classification out of all attempted classifications)
    accuracy = accuracy_score(y_test, y_pred_class) 
    print(f'\n Accuracy: {accuracy}')
    
    # Precision (of all observations that were predicted as positive, how many were actually positive)
    precision = precision_score(y_test, y_pred_class)
    print(f'\n Precision: {precision}')

    # Recall (of all positive observations, how many did we predict as positive)
    recall = recall_score(y_test, y_pred_class) 
    print(f'\n Recall: {recall}')

    # F1-Score (the harmonic mean of precision and recall)
    f1 = f1_score(y_test, y_pred_class)
    print(f'\n F1_score: {f1}')

Running this code resulted in:

**Metric 1: Classification Accuracy**

Random Forest = 0.99
Decision Tree = 0.99
Logistic Regression = 0.95

**Metric 2: Precision**

Random Forest = 0.99
Decision Tree = 0.99
Logistic Regression = 0.97

**Metric 3: Recall**

Random Forest = 1
Decision Tree = 0.99
Logistic Regression = 0.92

**Metric 4: F1 Score**

Random Forest = 0.99
Decision Tree = 0.99
Logistic Regression = 0.94

These are all higher than what we saw without resampling! Random forest and decision tree offers very smilar metrics.

Save Model

rfc = RandomForestClassifier()
rfc.fit(X_res, y_res)

pickle.dump(rfc, open('Credit Card Fraud Detection/credit_card_model.p', 'wb'))

Fraud Detection (Prediction)

An example prediction is provided using the saved model to classify a new transaction.

model = pickle.load(open('Credit Card Fraud Detection/credit_card_model.p', 'rb'))         
                 
prediction = model.predict([[-1.3598071336738,-0.0727811733098497,2.53634673796914,1.37815522427443,-0.338320769942518,0.462387777762292,0.239598554061257,0.0986979012610507,0.363786969611213,0.0907941719789316,-0.551599533260813,-0.617800855762348,-0.991389847235408,-0.311169353699879,1.46817697209427,-0.470400525259478,0.207971241929242,0.0257905801985591,0.403992960255733,0.251412098239705,-0.018306777944153,0.277837575558899,-0.110473910188767,0.0669280749146731,0.128539358273528,-0.189114843888824,0.133558376740387,-0.0210530534538215,149.62]])                                                                     
prediction[0]     

if prediction[0]==0:
    print ('Normal Transaction')
else:
    print ('Fraud Transaction')

Growth & Next Steps

We could look to tune the hyperparameters of the Random Forest, notably regularisation parameters such as tree depth, as well as potentially training on a higher number of Decision Trees in the Random Forest.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
Credit Card Fraud Detection.py		Credit Card Fraud Detection.py
README.md		README.md
credit_card_model.p		credit_card_model.p

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Project Overview

Context

Project Structure

Actions

Results

Handling Class Imbalance

Modelling Overview

Import Required Packagess

Get the Data

Data Preprocessing

Create Input and Output Variables

Train Test Split

Model Training and Assessment

Handling Class Imbalance

Undersampling

Oversampling

Save Model

Fraud Detection (Prediction)

Growth & Next Steps

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Arezookhalili/Credit-Card-Fraud-Detection

Folders and files

Latest commit

History

Repository files navigation

Project Overview

Context

Project Structure

Actions

Results

Handling Class Imbalance

Modelling Overview

Import Required Packagess

Get the Data

Data Preprocessing

Create Input and Output Variables

Train Test Split

Model Training and Assessment

Handling Class Imbalance

Undersampling

Oversampling

Save Model

Fraud Detection (Prediction)

Growth & Next Steps

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages