petedanielsmith/HealthcareInsuranceDataAnalyticsProject

Health Insurance Cost Analysis

Analysing health insurance cost data from Kaggle.


Project Board   -   Data Cleanup   -   Data Visualisation   -   Conclusions


How to use this repo

Make sure you have:

  • Python installed (this project used version 3.12)
  • The latest version of VS Code

Inside VS Code:

Open Extensions (Ctrl+Shift+X, or ⇧⌘X on macOS) and install these extensions if you don't have them:

  • Python extension (by Microsoft in the Extensions Marketplace)
  • Jupyter extension (also by Microsoft)

From the terminal:

Open a terminal in the folder where you want the project to be saved.

Run git clone:

git clone https://github.com/petedanielsmith/HealthcareInsuranceDataAnalyticsProject.git

Navigate into the new folder:

cd HealthcareInsuranceDataAnalyticsProject

Set up a virtual environment:

Create a virtual environment for the project.

Linux / Mac:

python3 -m venv .venv
source .venv/bin/activate

Windows CMD:

python -m venv .venv
.venv\Scripts\activate

Windows PowerShell:

python -m venv .venv
.\.venv\Scripts\Activate.ps1

Install the dependencies:

This installs all the dependencies needed for the project into the virtual environment, if one is active, rather than globally.

pip install -r requirements.txt

Select the kernel

There is a drop-down at the top of each notebook for selecting the kernel that runs the Python code. If you set up a virtual environment, make sure you pick the .venv one.
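
To confirm that the selected kernel really is the virtual environment, a check like the one below can be run in the first notebook cell. This is a minimal sketch using only the standard library; the helper name in_virtualenv is made up for this example and is not part of the project.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Running inside a virtual environment:", in_virtualenv())
```

If the second line prints False, the notebook is most likely using the global Python rather than .venv.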



Dataset Content

The dataset used in this project can be downloaded from Kaggle: Healthcare Insurance Dataset. It contains information on how personal attributes and geographic factors relate to medical insurance charges.

Columns include:

  • age - The insured person's age.
  • sex - Gender (male or female) of the insured.
  • bmi - Body Mass Index: a measure of body fat based on height and weight.
  • children - The number of dependents covered.
  • smoker - Whether the insured is a smoker (yes or no).
  • region - The geographic area of coverage.
  • charges - The medical insurance costs incurred by the insured person.
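
The column list above can be checked against a loaded DataFrame before any analysis starts. The sketch below is illustrative only: the two rows are made up to match the documented layout, and missing_columns is a hypothetical helper rather than code from the notebooks.

```python
import io

import pandas as pd

# Two made-up rows in the documented column layout (not real records).
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
30,female,24.5,0,no,southwest,4000.00
45,male,31.2,2,yes,northeast,39000.00
"""

EXPECTED_COLUMNS = ["age", "sex", "bmi", "children", "smoker", "region", "charges"]


def missing_columns(df: pd.DataFrame) -> list[str]:
    """Return the documented columns that are absent from df."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]


df = pd.read_csv(io.StringIO(SAMPLE_CSV))
print("Missing columns:", missing_columns(df))
print(df.dtypes)
```

With the real file, the io.StringIO(...) argument would be replaced by the path to the downloaded CSV.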

Business Requirements

Analyse healthcare insurance data to understand how personal attributes and geographic factors influence insurance costs.

Hypotheses

  • Smokers have higher insurance charges
  • People with higher BMI have higher insurance charges
  • Geographic region influences insurance charges
  • Sex of a person influences insurance charges
  • Including children on an insurance plan increases charges
  • Age of a person influences insurance charges

Project Plan

The project follows these steps:

  1. Extract - Extract the data from Kaggle.
  2. Load - Load the CSV via Pandas.
  3. Transform - Clean and process the data using Pandas, adding new columns and checking for missing or duplicated values.
  4. Visualise - Create charts with Matplotlib, Seaborn and Plotly to visualise trends.
  5. Analyse - Interpret what the visualisations displayed.
  6. Document - Record findings and conclusions.
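
The Transform step's checks for missing and duplicated values might look roughly like this in Pandas. The rows are made up for illustration, and dropping the affected rows is an assumption: the plan above only says the data is checked, not how problems are resolved.

```python
import io

import pandas as pd

# Made-up rows for illustration: one exact duplicate and one missing bmi.
RAW_CSV = """age,sex,bmi,children,smoker,region,charges
30,female,24.5,0,no,southwest,4000.00
30,female,24.5,0,no,southwest,4000.00
45,male,,2,yes,northeast,39000.00
52,male,28.0,1,no,southeast,11000.00
"""


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicate rows, then rows with any missing value."""
    deduplicated = df.drop_duplicates()
    return deduplicated.dropna().reset_index(drop=True)


raw = pd.read_csv(io.StringIO(RAW_CSV))
print("Missing values per column:")
print(raw.isna().sum())
print("Duplicated rows:", int(raw.duplicated().sum()))

cleaned = clean(raw)
print("Rows before:", len(raw), "after:", len(cleaned))
```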

Rationale for mapping the business requirements to the data visualisations

  • Smokers have higher insurance charges
    • Use histogram to show smokers vs overall charge distribution
    • Correlation matrix to show the correlation
    • Violin plots to show the distributions
    • Scatter and 3D scatter charts to show the distribution
    • Box plot to show the distribution
  • People with higher BMI have higher insurance charges
    • Correlation matrix to show correlation
    • Violin plot to show the distribution
    • Scatter and 3D scatter to show the distribution
  • Geographic region influences insurance charges
    • Violin plot to show the distribution
    • Scatter chart to show the distribution
  • Sex of a person influences insurance charges
    • Violin plot to show the distribution
    • Box plot to show the distribution
  • Including children on an insurance plan increases charges
    • Bar chart to show the average charges
  • Age of a person influences insurance charges
    • Correlation matrix to show correlation
    • Violin plot to show the distribution
    • 3D scatter to show the distribution
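
Several of the hypotheses above rely on a correlation matrix. A minimal sketch of how one can be computed with Pandas follows. The rows are made up for illustration, and encoding smoker as 1/0 is an assumption made so the column can take part in the calculation; the notebooks may handle it differently.

```python
import pandas as pd

# Made-up rows for illustration only; they are not taken from the Kaggle file.
df = pd.DataFrame(
    {
        "age": [20, 30, 40, 50, 60, 25],
        "bmi": [22.0, 27.0, 31.0, 35.0, 29.0, 33.0],
        "smoker": ["no", "no", "yes", "no", "yes", "yes"],
        "charges": [2000.0, 4000.0, 38000.0, 10000.0, 45000.0, 34000.0],
    }
)

# Assumption: encode the yes/no smoker column as 1/0 so it can join the matrix.
df["smoker_flag"] = df["smoker"].map({"yes": 1, "no": 0})

# DataFrame.corr() computes pairwise Pearson correlation by default.
corr = df[["age", "bmi", "smoker_flag", "charges"]].corr()
print(corr.round(2))
```

The resulting matrix can then be passed to a heatmap function such as seaborn.heatmap(corr, annot=True) for display.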

Analysis techniques used

  1. Methods Used:

    • Descriptive statistics (.describe(), .info() etc.)

    • Segmentation (used bins for age group and bmi group)

    • Visual analytics (Matplotlib, Seaborn, Plotly)

  2. Limitations & Alternatives:

    • Limited data points are available in the CSV; other factors, such as medical history, could also influence the charges.
  3. Structure Justification:

    • Data cleanup and transform notebook as the first part.

    • Visualisation notebook for the second part.

  4. Use of Generative AI:

    • AI supported: the GitHub Copilot extension was installed and sped up some repetitive tasks.
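
The segmentation method listed above (bins for age group and BMI group) can be sketched with pandas.cut. The BMI edges are assumed from the commonly published NHS adult ranges, and the age edges are hypothetical placeholders, since the exact groups used in the notebooks are not listed in this README.

```python
import pandas as pd

# Assumed BMI edges, based on the commonly published NHS adult ranges.
BMI_EDGES = [0, 18.5, 25, 30, 40, float("inf")]
BMI_LABELS = ["underweight", "healthy", "overweight", "obese", "severely obese"]

# Hypothetical age edges; the notebooks' actual groups may differ.
AGE_EDGES = [18, 30, 45, 60, 120]
AGE_LABELS = ["18-29", "30-44", "45-59", "60+"]


def add_groups(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with categorical bmi_group and age_group columns."""
    out = df.copy()
    # right=False makes each bin include its lower edge, e.g. [18.5, 25).
    out["bmi_group"] = pd.cut(out["bmi"], bins=BMI_EDGES, labels=BMI_LABELS, right=False)
    out["age_group"] = pd.cut(out["age"], bins=AGE_EDGES, labels=AGE_LABELS, right=False)
    return out


people = pd.DataFrame({"age": [19, 33, 47, 61], "bmi": [17.9, 24.9, 30.0, 41.3]})
grouped = add_groups(people)
print(grouped)
```

Because pandas.cut returns an ordered categorical, the groups sort in bin order rather than alphabetically when used on a chart axis.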

Ethical considerations

  • The data was already anonymised and contained no data that could be used to identify an individual so there were no ethical concerns.

Unfixed Bugs

  • No unfixed bugs remaining

Development Roadmap

Challenges faced:

  • Having separate notebooks for cleaning and visualising meant I had to repeat the categorisation steps after importing the cleaned CSV, because the CSV file format doesn't persist categorical data types. If I did this again, I would investigate which other file formats the data could be saved to.
  • Creating a shared legend for multi chart plots rather than a repeating legend required ChatGPT to help me.
  • Adding layout styles and moving the initial camera on Plotly 3D charts required ChatGPT to help me.
  • GitHub's static preview of notebooks does not display Plotly charts, so I added links to the exported chart images.
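
The CSV challenge above can be reproduced in a few lines, along with two possible ways around it. This is a sketch rather than the approach used in the notebooks: the category names and the file name cleaned.pkl are assumptions made for the example.

```python
import io
import os
import tempfile

import pandas as pd
from pandas.api.types import CategoricalDtype

# Assumed category names, in order from lowest to highest BMI.
BMI_DTYPE = CategoricalDtype(
    ["underweight", "healthy", "overweight", "obese", "severely obese"], ordered=True
)

df = pd.DataFrame({"bmi_group": pd.Series(["obese", "healthy", "obese"], dtype=BMI_DTYPE)})

# 1. A CSV round trip keeps the text but forgets the categorical dtype.
csv_text = df.to_csv(index=False)
from_csv = pd.read_csv(io.StringIO(csv_text))
print("After CSV round trip:", from_csv["bmi_group"].dtype)

# 2. Option A: declare the dtype again while reading the CSV.
restored = pd.read_csv(io.StringIO(csv_text), dtype={"bmi_group": BMI_DTYPE})
print("With dtype passed to read_csv:", restored["bmi_group"].dtype)

# 3. Option B: save to a format that stores dtypes, such as pickle.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cleaned.pkl")
    df.to_pickle(path)
    from_pickle = pd.read_pickle(path)
print("After pickle round trip:", from_pickle["bmi_group"].dtype)
```

Parquet also preserves categorical columns, but it needs an extra dependency such as pyarrow, so it is left out of this sketch.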

Next steps:

  • Create a feature engineering pipeline to normalise and transform the data.
  • Create a predictive model that can predict insurance costs from given parameters.
  • Create a fully interactive Plotly Dash app with filter options that apply across multiple charts at once.
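
As a starting point for the first next step, a feature engineering function could standardise the numeric columns and one-hot encode the categorical ones. Nothing like this exists in the project yet, so the function name, the choice of z-score scaling and the sample rows below are all assumptions.

```python
import pandas as pd

NUMERIC = ["age", "bmi", "children"]
CATEGORICAL = ["sex", "smoker", "region"]


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Standardise numeric columns and one-hot encode categorical ones."""
    numeric = df[NUMERIC].astype(float)
    # Z-score: subtract the column mean, divide by the population standard deviation.
    scaled = (numeric - numeric.mean()) / numeric.std(ddof=0)
    # One 0/1 indicator column per category value, e.g. smoker_yes and smoker_no.
    encoded = pd.get_dummies(df[CATEGORICAL], dtype=float)
    return pd.concat([scaled, encoded], axis=1)


# Made-up rows for illustration only.
sample = pd.DataFrame(
    {
        "age": [30, 45, 52],
        "bmi": [24.5, 31.2, 28.0],
        "children": [0, 2, 1],
        "sex": ["female", "male", "male"],
        "smoker": ["no", "yes", "no"],
        "region": ["southwest", "northeast", "southeast"],
    }
)

features = engineer_features(sample)
print(features.round(2))
```

For a predictive model, the mean and standard deviation would need to be computed on the training split only and reused on new data, which this sketch does not cover.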

Conclusions

  • Smokers have higher insurance charges
    • Smoking clearly has the biggest effect on charges of all the data points available.
  • People with higher BMI have higher insurance charges
    • BMI had less of an effect on the charges than I expected, but it did have an effect if the person was obese or severely obese and also a smoker.
  • Geographic region influences insurance charges
    • Geographic region had no notable effect on the charges.
  • Sex of a person influences insurance charges
    • Sex had no notable effect on the charges.
  • Including children on an insurance plan increases charges
    • Including children on the plan produced a very small increase. Plans with 4 and 5 children came in a bit lower, but because so few data points were recorded for these values, nothing could be read into this.
  • Age of a person influences insurance charges
    • Charges did increase slowly with age, but not by a huge amount.

Overall, smoking was by far the biggest influence on insurance charges, especially for obese and severely obese smokers. Charges also increased slowly with age. This is all displayed very clearly in this chart from the visualisation notebook:

Violin plot of all the distributions

Main Data Analysis Libraries

The libraries used for data analysis were:

  1. Pandas - For data loading, transforming and cleaning.
  2. NumPy - For transforming data into categories.
  3. Matplotlib - For overall multi chart layouts.
  4. Seaborn - For a lot of the individual charts.
  5. Plotly - For interactive charts.
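
For the multi-chart layouts built with Matplotlib, a single shared legend (one of the challenges mentioned in this README) can be created by collecting the handles from one subplot and attaching them to the figure. This is a generic sketch with made-up points and panel titles, not the code from the visualisation notebook.

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(8, 3), sharey=True)

# Made-up points purely to give each subplot something to draw.
for ax, title in zip(axes, ["Panel A", "Panel B"]):
    ax.scatter([20, 40, 60], [3, 6, 9], label="non-smoker")
    ax.scatter([25, 45, 65], [30, 36, 42], label="smoker")
    ax.set_title(title)

# Every subplot carries the same labels, so take the handles from the first one.
handles, labels = axes[0].get_legend_handles_labels()

# Attach one legend to the figure instead of calling ax.legend() on each subplot.
fig.legend(handles, labels, loc="upper center", ncol=2)

# Leave the top 10% of the figure free so the legend does not overlap the titles.
fig.tight_layout(rect=(0, 0, 1, 0.9))
```

When Seaborn draws the individual charts, passing legend=False to each plotting call has a similar effect of keeping per-chart legends out of the way.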

Credits

Content

  • Code Institute - The initial project structure.
  • Kaggle - Providing the data set used.
  • NHS website - Providing the BMI category definitions.
  • KFF & State Health Compare - Information on insurance age group definitions.
  • ChatGPT - Help getting handles and making a single legend on multi-chart plots, and adding layout changes to Plotly charts.
  • SimpleSteps.guide - My notes I recorded from the Code Institute course.

Media

About

Code Institute Course Work: Analyse healthcare insurance data to understand how personal attributes and geographic factors influence insurance costs
