Health Insurance Cost Analysis

Analysing health insurance cost data from Kaggle.

Project Board - Data Cleanup - Data Visualisation - Conclusions

Table of contents (Click to show)

Dataset Content
Business Requirements
Hypothesis
Project Plan
The rationale to map the business requirements to the Data Visualisations
Analysis techniques used
Ethical considerations
Unfixed Bugs
Development Roadmap
Conclusions
Main Data Analysis Libraries
Credits
- Content
- Media
Acknowledgements

How to use this repo (Click to show)

Make sure you have:

Python installed (this project used version 3.12)
The latest version of VS Code

Inside VS Code:

Open Extensions (Ctrl+Shift+X, or ⇧⌘X on macOS) and install these extensions if you don't have them:

Python extension (by Microsoft in the Extensions Marketplace)
Jupyter extension (also by Microsoft)

From the terminal:

Open a terminal in the folder where you want the project to be saved

Run git clone:

git clone https://github.com/petedanielsmith/HealthcareInsuranceDataAnalyticsProject.git

Navigate into the new folder:

cd HealthcareInsuranceDataAnalyticsProject

Set up a virtual environment:

Create a virtual environment for the project.

Linux / Mac:

python3 -m venv .venv
source .venv/bin/activate

Windows CMD:

python -m venv .venv
.venv\Scripts\activate

Windows PowerShell:

python -m venv .venv
.\.venv\Scripts\Activate.ps1

Install the dependencies:

This installs all the dependencies the project needs into the virtual environment, if one is set up, rather than globally.

pip install -r requirements.txt
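As an optional check after installing, the short sketch below reports whether Python is running inside a virtual environment and whether the main libraries can be imported. The function names and the package list are assumptions made for this example (the list is taken from the libraries named in this README); they are not part of the repo.

```python
import importlib.util
import sys


def in_virtual_env() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the venv while sys.base_prefix
    # still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def missing_packages(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed package list, based on the libraries named in this README.
    required = ["pandas", "numpy", "matplotlib", "seaborn", "plotly"]
    print("Virtual environment active:", in_virtual_env())
    print("Missing packages:", missing_packages(required))
```

If the first line prints False, activate the virtual environment and run the install command again.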

Select the Kernel

There is a drop-down at the top of each notebook for selecting the kernel that runs the Python code. If you set up a virtual environment, make sure you pick the .venv one.

Dataset Content

The dataset used in this project can be downloaded from Kaggle: Healthcare Insurance Dataset. It contains information on the relationship between personal attributes, geographic factors, and their impact on medical insurance charges.

Columns include:

age - The insured person's age.
sex - Gender (male or female) of the insured.
bmi - Body Mass Index: a measure of body fat based on height and weight.
children - The number of dependents covered.
smoker - Whether the insured is a smoker (yes or no).
region - The geographic area of coverage.
charges - The medical insurance costs incurred by the insured person.
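The column list above can be turned into a quick sanity check when loading the file. This is a minimal sketch, not the code from the notebooks: `load_insurance_data` and `EXPECTED_COLUMNS` are names invented for this example, and the sample rows are made up so the snippet runs without the Kaggle download.

```python
import io

import pandas as pd

EXPECTED_COLUMNS = ["age", "sex", "bmi", "children", "smoker", "region", "charges"]

# Made-up sample rows for illustration only; not taken from the Kaggle file.
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
25,female,22.5,0,no,southwest,3000.0
40,male,31.0,2,yes,northeast,25000.0
58,female,27.3,1,no,southeast,12000.0
"""


def load_insurance_data(source) -> pd.DataFrame:
    """Read the CSV and fail early if an expected column is missing."""
    df = pd.read_csv(source)
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return df


df = load_insurance_data(io.StringIO(SAMPLE_CSV))
print(df.dtypes)
```

To use it on the real data, pass the path of the downloaded CSV instead of the in-memory sample.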

Business Requirements

Analyse healthcare insurance data to understand how personal attributes and geographic factors influence insurance costs.

Hypothesis

Smokers have higher insurance charges
People with higher BMI have higher insurance charges
Geographic region influences insurance charges
Sex of a person influences insurance charges
Including children on an insurance plan increases charges
Age of a person influences insurance charges

Project Plan

The project follows these steps:

Extract - Extract the data from Kaggle.
Load - Load the CSV via Pandas.
Transform - Clean and process the data using Pandas, adding new columns and checking for missing or duplicated values.
Visualise - Create charts with Matplotlib, Seaborn and Plotly to visualise trends.
Analyse - Interpret what the visualisations show.
Document - Record findings and conclusions.
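The checks in the Transform step could look something like the sketch below. The rows are made up, the `clean` helper is a name invented for this example, and dropping rows with missing values is an assumption; the notebooks may handle missing data differently.

```python
import pandas as pd

# Made-up rows: one exact duplicate and one missing value.
raw = pd.DataFrame(
    {
        "age": [25, 25, 40, 58],
        "bmi": [22.5, 22.5, 31.0, None],
        "smoker": ["no", "no", "yes", "no"],
        "charges": [3000.0, 3000.0, 25000.0, 12000.0],
    }
)


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicate rows and rows with missing values."""
    return df.drop_duplicates().dropna().reset_index(drop=True)


# Report the problems before changing anything.
print("Missing values per column:")
print(raw.isna().sum())
print("Duplicated rows:", int(raw.duplicated().sum()))

cleaned = clean(raw)
print("Rows before:", len(raw), "after:", len(cleaned))
```

Reporting the counts first makes it easy to record in the notebook how many rows were affected before they are removed.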

The rationale to map the business requirements to the Data Visualisations

Smokers have higher insurance charges
- Use histogram to show smokers vs overall charge distribution
- Correlation matrix to show the correlation
- Violin plots to show the distributions
- Scatter and 3D scatter charts to show the distribution
- Box plot to show the distribution
People with higher BMI have higher insurance charges
- Correlation matrix to show correlation
- Violin plot to show the distribution
- Scatter and 3D scatter to show the distribution
Geographic region influences insurance charges
- Violin plot to show the distribution
- Scatter chart to show the distribution
Sex of a person influences insurance charges
- Violin plot to show the distribution
- Box plot to show the distribution
Including children on an insurance plan increases charges
- Bar chart to show the average charges
Age of a person influences insurance charges
- Correlation matrix to show correlation
- Violin plot to show the distribution
- 3D scatter to show the distribution
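Several of the mappings above rely on a correlation matrix. As a rough sketch of how one can be built with Pandas, the example below encodes the smoker column as 1/0 so it can be included alongside the numeric columns. The rows are made up for illustration, so the printed numbers say nothing about the real dataset.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame(
    {
        "age": [19, 28, 33, 45, 52, 61],
        "bmi": [27.9, 33.0, 22.7, 30.1, 36.5, 25.8],
        "smoker": ["yes", "no", "no", "yes", "no", "yes"],
        "charges": [16000.0, 4400.0, 5200.0, 39000.0, 11000.0, 28000.0],
    }
)

# Pearson correlation needs numbers, so encode smoker as 1 (yes) / 0 (no).
numeric = df.assign(smoker=(df["smoker"] == "yes").astype(int))

corr = numeric.corr()
print(corr.round(2))
```

The resulting `corr` DataFrame can be passed to a heatmap function, for example Seaborn's `heatmap`, to draw the matrix as a chart.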

Analysis techniques used

Methods Used:
- Descriptive statistics (.describe(), .info() etc.)
- Segmentation (used bins for age group and bmi group)
- Visual analytics (Matplotlib, Seaborn, Plotly)
Limitations & Alternatives:
- Limited data points available in the CSV; other factors, such as medical history, could also influence charges.
Structure Justification:
- Data cleanup and transform notebook as the first part.
- Visualisation notebook for the second part.
Use of Generative AI:
- AI supported: the GitHub Copilot extension was installed and sped up some repetitive tasks.
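The segmentation into age and BMI groups mentioned under Methods Used can be sketched with `pd.cut`. The bin edges and labels below are assumptions for illustration: the BMI bands follow commonly quoted adult ranges and the age bands are arbitrary, so the notebooks may use different values or a different technique.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({"age": [19, 34, 52, 64], "bmi": [17.9, 24.0, 31.5, 42.3]})

# Assumed BMI bands, following commonly quoted adult ranges; the
# notebooks may use different edges or labels.
bmi_edges = [0, 18.5, 25, 30, 40, np.inf]
bmi_labels = ["underweight", "healthy", "overweight", "obese", "severely obese"]

# Assumed age bands, chosen only for this example.
age_edges = [0, 25, 35, 45, 55, np.inf]
age_labels = ["under 25", "25-34", "35-44", "45-54", "55+"]

# right=False makes each bin include its lower edge: [18.5, 25), [25, 30), ...
df["bmi_group"] = pd.cut(df["bmi"], bins=bmi_edges, labels=bmi_labels, right=False)
df["age_group"] = pd.cut(df["age"], bins=age_edges, labels=age_labels, right=False)

print(df)
```

`pd.cut` returns ordered categorical columns, which keeps the groups in a sensible order on chart axes.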

Ethical considerations

The data was already anonymised and contained no data that could be used to identify an individual so there were no ethical concerns.

Unfixed Bugs

No unfixed bugs remaining

Development Roadmap

Challenges faced:

Having separate notebooks for cleaning and visualising meant I had to repeat the categorisation steps after importing the cleaned CSV, as the CSV file format doesn't persist this data. If doing this again, I would investigate what other file formats the data can be saved to.
Creating a shared legend for multi-chart plots, rather than a repeated legend on each chart, required help from ChatGPT.
Adding layout styles and moving the initial camera on Plotly 3D charts required help from ChatGPT.
GitHub's static preview of notebooks does not display Plotly charts, so I added a link to the exported chart images.
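On the first challenge, the sketch below shows why the categorisation is lost in a CSV round trip and two possible ways around it. This is one option under stated assumptions, not what the notebooks do: the `bmi_group` column and its categories are hypothetical stand-ins, and pickle is only one of several formats that keep dtypes (Parquet is another, but it needs an extra engine such as pyarrow).

```python
import io
import tempfile
from pathlib import Path

import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical ordered category, standing in for the notebook's BMI groups.
bmi_dtype = CategoricalDtype(
    ["underweight", "healthy", "overweight", "obese"], ordered=True
)
df = pd.DataFrame({"bmi_group": pd.Series(["obese", "healthy"], dtype=bmi_dtype)})

# Round trip through CSV text: the category information is not stored.
csv_text = df.to_csv(index=False)
plain = pd.read_csv(io.StringIO(csv_text))
print(plain["bmi_group"].dtype)  # no longer a categorical dtype

# Option 1: re-declare the dtype while reading the CSV.
restored = pd.read_csv(io.StringIO(csv_text), dtype={"bmi_group": bmi_dtype})
print(restored["bmi_group"].dtype)

# Option 2: pickle keeps dtypes as they are (only load pickles you trust).
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "cleaned.pkl"
    df.to_pickle(path)
    unpickled = pd.read_pickle(path)
print(unpickled["bmi_group"].dtype)
```

Option 1 keeps the CSV as the shared file between notebooks, at the cost of defining the category dtype in both places.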

Next steps:

Create a feature engineering pipeline to normalise and transform the data.
Create a predictive model that can predict insurance costs from given parameters.
Create a fully interactive Plotly Dash app with filter options that apply across multiple charts at once.

Conclusions

Smokers have higher insurance charges
- Smoking clearly has the biggest effect on charges of all the data points available.
People with higher BMI have higher insurance charges
- BMI had less of an effect on the charges than I expected, but did have an effect if the person was obese or severely obese and also a smoker.
Geographic region influences insurance charges
- Geographic region had no notable effect on the charges.
Sex of a person influences insurance charges
- Sex had no notable effect on the charges.
Including children on an insurance plan increases charges
- Including children on the plan gave a very small increase. Plans with 4 or 5 children came in slightly lower, but with so few data points recorded for these values, nothing can be read into this.
Age of a person influences insurance charges
- Charges increased slowly with age, but not by a large amount.

Overall, smoking was by far the biggest influence on insurance charges, especially for obese and severely obese smokers. Charges also increased slowly with age. This is all displayed very well in this chart from the visualisation notebook:

Main Data Analysis Libraries

The libraries used for data analysis were:

Pandas - For data loading, transforming and cleaning.
NumPy - For transforming data into categories.
Matplotlib - For overall multi chart layouts.
Seaborn - For a lot of the individual charts.
Plotly - For interactive charts.

Credits

Content

Code Institute - The initial project structure.
Kaggle - Providing the data set used.
NHS website - Providing the BMI category definitions.
KFF & State Health Compare - Information on insurance age group definitions.
ChatGPT - Help with getting legend handles, making a single legend on multi-chart plots, and adding layout changes to Plotly charts.
SimpleSteps.guide - My notes I recorded from the Code Institute course.

Media

Midjourney AI - AI Generated banner logo.
Code Institute - Code Institute logo.
Python - Python logo image.
Pandas - Pandas logo image.
Matplotlib - Matplotlib logo image.
Seaborn - Seaborn logo image.
Plotly - Plotly logo image.
Kaggle - Kaggle logo image.
