Analysing health insurance cost data from Kaggle.
Project Board - Data Cleanup - Data Visualisation - Conclusions
How to use this repo
Make sure you have:
- Python installed (this project used version 3.12)
- The latest version of VS Code
Inside VS Code:
Open Extensions (Ctrl+Shift+X or ⇧⌘X on macOS). Install these extensions if you don't have them:
- Python extension (by Microsoft in the Extensions Marketplace)
- Jupyter extension (also by Microsoft)
From the terminal:
Open a terminal in the folder where you want the project to be saved, then run:
git clone https://github.com/petedanielsmith/HealthcareInsuranceDataAnalyticsProject.git
cd HealthcareInsuranceDataAnalyticsProject
Create a virtual environment for the project.
Linux / Mac:
python3 -m venv .venv
source .venv/bin/activate
Windows CMD:
python3 -m venv .venv
.venv\Scripts\activate
Windows PowerShell:
python3 -m venv .venv
.\.venv\Scripts\Activate.ps1
Install the dependencies needed for the project. If the virtual environment is set up and active, they are installed into it rather than globally:
pip install -r requirements.txt
There is a drop-down at the top of each notebook to select the kernel that will run the Python code. If you set up a virtual environment, make sure you pick the venv one.
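To confirm that the selected kernel really is the virtual environment, a quick check can be run in a notebook cell. This is a minimal sketch using only the standard library; the helper name `in_virtual_env` is made up for illustration.

```python
import sys


def in_virtual_env() -> bool:
    """Return True when the running interpreter belongs to a venv."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Running inside a virtual environment:", in_virtual_env())
```

If the second line prints `False`, the notebook is probably using the global interpreter, and the kernel selection is worth checking again.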
The dataset used in this project can be downloaded from Kaggle: Healthcare Insurance Dataset. It contains information on the relationship between personal attributes, geographic factors, and their impact on medical insurance charges.
Columns include:
- age - The insured person's age.
- sex - Gender (male or female) of the insured.
- bmi - Body Mass Index: a measure of body fat based on height and weight.
- children - The number of dependents covered.
- smoker - Whether the insured is a smoker (yes or no).
- region - The geographic area of coverage.
- charges - The medical insurance costs incurred by the insured person.
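As a rough sketch, the column list above can be checked right after loading the data. The file name in the comment and the two sample rows are illustrative assumptions (the rows are made up, not taken from the dataset), and `EXPECTED_COLUMNS` is a hypothetical name.

```python
import io

import pandas as pd

EXPECTED_COLUMNS = ["age", "sex", "bmi", "children", "smoker", "region", "charges"]

# Made-up rows in the same layout, so the sketch runs without the download.
# With the real file this would be something like pd.read_csv("insurance.csv").
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "30,female,22.5,0,no,north,4000.0\n"
    "45,male,31.0,2,yes,south,30000.0\n"
)
df = pd.read_csv(sample_csv)

# Fail early if the file does not have the layout the notebooks expect.
missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
if missing:
    raise ValueError(f"Missing expected columns: {missing}")

print(df.dtypes)
```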
Analyse healthcare insurance data to understand how personal attributes and geographic factors influence insurance costs.
- Smokers have higher insurance charges
- People with higher BMI have higher insurance charges
- Geographic region influences insurance charges
- Sex of a person influences insurance charges
- Including children on an insurance plan increases charges
- Age of a person influences insurance charges
The project follows these steps:
- Extract - Extract the data from Kaggle.
- Load - Load the CSV via Pandas.
- Transform - Clean and process the data using Pandas, adding new columns and checking for missing or duplicated values.
- Visualise - Create charts with Matplotlib, Seaborn and Plotly to visualise trends.
- Analyse - Interpret what the visualisations display.
- Document - Record findings and conclusions.
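The Transform step could look roughly like the sketch below. It is not the notebook's actual code: the rows are made up, the column name `bmi_group` and its labels are assumptions, and the bin edges follow the commonly quoted NHS adult BMI bands (under 18.5, 18.5 to 24.9, 25 to 29.9, 30 to 39.9, 40 and over). It uses `pd.cut` for the binning; the notebooks may do this differently, for example with NumPy.

```python
import pandas as pd

# Made-up rows for illustration; the last row duplicates the first.
df = pd.DataFrame(
    {
        "age": [19, 33, 52, 19],
        "bmi": [17.9, 27.4, 41.2, 17.9],
        "smoker": ["no", "yes", "yes", "no"],
        "charges": [1500.0, 21000.0, 47000.0, 1500.0],
    }
)

# Check for missing and duplicated values before transforming.
print("Missing values per column:")
print(df.isna().sum())
print("Duplicated rows:", df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)

# Assumed bin edges and labels. right=False makes each bin include its
# lower edge, so a BMI of exactly 30.0 falls in "Obese".
bmi_edges = [0, 18.5, 25, 30, 40, float("inf")]
bmi_labels = ["Underweight", "Healthy", "Overweight", "Obese", "Severely obese"]
df["bmi_group"] = pd.cut(df["bmi"], bins=bmi_edges, labels=bmi_labels, right=False)

print(df[["bmi", "bmi_group"]])
```

An age group column could be added the same way with its own edges and labels.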
- Smokers have higher insurance charges
  - Histogram to show the smoker vs overall charge distribution
  - Correlation matrix to show the correlation
  - Violin plots to show the distributions
  - Scatter and 3D scatter charts to show the distribution
  - Box plot to show the distribution
- People with higher BMI have higher insurance charges
  - Correlation matrix to show the correlation
  - Violin plot to show the distribution
  - Scatter and 3D scatter charts to show the distribution
- Geographic region influences insurance charges
  - Violin plot to show the distribution
  - Scatter chart to show the distribution
- Sex of a person influences insurance charges
  - Violin plot to show the distribution
  - Box plot to show the distribution
- Including children on an insurance plan increases charges
  - Bar chart to show the average charges
- Age of a person influences insurance charges
  - Correlation matrix to show the correlation
  - Violin plot to show the distribution
  - 3D scatter chart to show the distribution
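Several of the charts planned above compare the same groups across more than one chart. As a hedged sketch (made-up rows, Matplotlib only, not the notebook's actual code), two scatter charts can share a single legend by collecting the handles from one axes and drawing the legend on the figure:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame(
    {
        "age": [19, 33, 52, 61, 28, 45],
        "bmi": [24.1, 27.4, 41.2, 30.5, 22.0, 35.3],
        "smoker": ["no", "yes", "yes", "no", "no", "yes"],
        "charges": [1800.0, 21000.0, 47000.0, 13000.0, 3200.0, 39000.0],
    }
)

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, column in zip(axes, ["age", "bmi"]):
    # One labelled scatter per smoker group, so each group gets a legend handle.
    for status, group in df.groupby("smoker"):
        ax.scatter(group[column], group["charges"], label=f"smoker: {status}", alpha=0.7)
    ax.set_xlabel(column)
axes[0].set_ylabel("charges")

# Take the handles from one axes and draw a single legend on the figure,
# instead of calling ax.legend() on every subplot.
handles, labels = axes[0].get_legend_handles_labels()
fig.legend(handles, labels, loc="upper center", ncol=len(labels))
```

In a notebook the figure is displayed when the cell finishes; in a script, `plt.show()` or `fig.savefig(...)` would be needed.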
- Methods Used:
  - Descriptive statistics (.describe(), .info(), etc.)
  - Segmentation (used bins for age group and BMI group)
  - Visual analytics (Matplotlib, Seaborn, Plotly)
- Limitations & Alternatives:
  - The CSV offers only a limited set of data points; other factors, such as medical history, could also influence the charges.
- Structure Justification:
  - Data cleanup and transform notebook as the first part.
  - Visualisation notebook for the second part.
- Use of Generative AI:
  - AI supported: The GitHub Copilot extension was installed and sped up some repetitive tasks.
- The data was already anonymised and contained no data that could be used to identify an individual so there were no ethical concerns.
- No unfixed bugs remaining
Challenges faced:
- Keeping the cleaning and visualisation steps in separate notebooks meant I had to repeat the categorisation steps after importing the cleaned CSV, because the CSV file format doesn't persist this data. If I did this again, I would investigate which other file formats the data can be saved to.
- Creating a single shared legend for multi-chart plots, rather than a legend repeated on each chart, required help from ChatGPT.
- Adding layout styles and moving the initial camera on Plotly 3D charts required help from ChatGPT.
- GitHub's static preview of notebooks does not display Plotly charts, so I added links to the exported chart images.
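The first challenge above can be reproduced in a few lines. This is a minimal sketch with hypothetical category labels, not the notebook's code: a categorical column is written to CSV, read back, and then has to be re-categorised.

```python
import io

import pandas as pd

# Hypothetical ordered category labels; the notebooks may use different ones.
bmi_order = ["Healthy", "Overweight", "Obese"]
df = pd.DataFrame({"bmi_group": ["Obese", "Healthy", "Overweight"]})
df["bmi_group"] = pd.Categorical(df["bmi_group"], categories=bmi_order, ordered=True)

# Round-trip through CSV text (an in-memory buffer stands in for the file).
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# CSV stores plain text only, so the categorical dtype and its order are lost.
dtype_after_csv = reloaded["bmi_group"].dtype
print("Before saving:", df["bmi_group"].dtype)
print("After reloading:", dtype_after_csv)

# Re-apply the categories, which is the repeated step described above.
reloaded["bmi_group"] = pd.Categorical(
    reloaded["bmi_group"], categories=bmi_order, ordered=True
)
print("After re-categorising:", reloaded["bmi_group"].dtype)
```

Formats that store dtypes alongside the data, such as pickle via `DataFrame.to_pickle` or Parquet via `DataFrame.to_parquet` (which needs an extra engine such as pyarrow), are candidates worth investigating as alternatives.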
Next steps:
- Create a feature engineering pipeline to normalise and transform the data.
- Create a predictive model that can predict insurance costs from given parameters.
- Create a fully interactive Plotly Dash app with filter options that apply across multiple charts at once.
- Smokers have higher insurance charges
  - Smoking clearly has the biggest effect on charges of all the data points available.
- People with higher BMI have higher insurance charges
  - BMI had less of an effect on charges than I expected, but it did have an effect when the person was obese or severely obese and also a smoker.
- Geographic region influences insurance charges
  - Geographic region had no notable effect on charges.
- Sex of a person influences insurance charges
  - Sex had no notable effect on charges.
- Including children on an insurance plan increases charges
  - Including children on the plan produced a very small increase. Plans with 4 and 5 children came in a bit lower, but because so few data points were recorded for these values, nothing could be read into this.
- Age of a person influences insurance charges
  - Charges increased slowly with age, but not by a huge amount.
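Findings like these can also be cross-checked numerically with grouped means. The sketch below shows the idea only: the rows are made up and are not results from the dataset, and the `bmi_group` column name is an assumption.

```python
import pandas as pd

# Made-up rows for illustration only; they are not results from the dataset.
df = pd.DataFrame(
    {
        "smoker": ["no", "no", "yes", "yes"],
        "bmi_group": ["Healthy", "Obese", "Healthy", "Obese"],
        "charges": [4000.0, 6000.0, 20000.0, 40000.0],
    }
)

# Mean charge per smoker status.
by_smoker = df.groupby("smoker")["charges"].mean()

# Mean charge per BMI group (rows) and smoker status (columns).
by_smoker_and_bmi = df.pivot_table(
    values="charges", index="bmi_group", columns="smoker", aggfunc="mean"
)

print(by_smoker)
print(by_smoker_and_bmi)
```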
Overall, smoking was by far the biggest influence on insurance charges, especially for smokers who were obese or severely obese. Charges also increased slowly with age. This is all displayed well in this chart from the visualisation notebook:
The libraries used for data analysis were:
- Pandas - For data loading, transforming and cleaning.
- NumPy - For transforming data into categories.
- Matplotlib - For overall multi-chart layouts.
- Seaborn - For many of the individual charts.
- Plotly - For interactive charts.
- Code Institute - The initial project structure.
- Kaggle - Providing the data set used.
- NHS website - Providing the BMI category definitions.
- KFF & State Health Compare - Information on insurance age group definitions.
- ChatGPT - Help getting handles to make a single legend on multi-chart plots, and adding layout changes to Plotly charts.
- SimpleSteps.guide - Notes I recorded from the Code Institute course.
- Midjourney AI - AI Generated banner logo.
- Code Institute - Code Institute logo.
- Python - Python logo image.
- Pandas - Pandas logo image.
- Matplotlib - Matplotlib logo image.
- Seaborn - Seaborn logo image.
- Plotly - Plotly logo image.
- Kaggle - Kaggle logo image.

