Project one - Healthcare

Project Overview

This project analyses healthcare insurance data to understand how various factors affect medical insurance costs. The primary method is ETL (Extract, Transform and Load), followed by visualisations that identify relationships and trends within the data and provide clear insights to the viewer.

Dataset Content

The dataset contains the following column headers:

Field	Description	Possible Values / Notes	Example
age	Age of the individual in years	Between 18 and 65	29
sex	Biological sex of the individual	Categories: "male", "female"	female
bmi	Body Mass Index	Continuous value; weight (kg) / height (m)^2	27.3
children	Number of children covered	Non-negative integer	2
smoker	Smoker status	Categories: "yes", "no"	no
region	Geographic region	Categories: "northeast", "northwest", etc.	southeast
charges	Medical insurance charges ($)	Continuous value, insurance cost billed	13456.78
bmi_bracket	BMI category bracket	Categories: "underweight", "normal", "overweight", "obese"	obese
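
The bmi_bracket field is derived from bmi. A minimal sketch of how such a column could be built with pandas is shown here. The cut-off values are an assumption (the commonly used WHO adult categories) and the sample rows are made up; the thresholds used in the project's notebook may differ.

```python
import pandas as pd

# Illustrative rows only; these values are made up, not taken from the dataset.
df = pd.DataFrame({"bmi": [17.9, 22.4, 27.3, 34.1]})

# Assumed cut-offs (WHO adult categories); the project's own thresholds may differ.
bins = [0, 18.5, 25, 30, float("inf")]
labels = ["underweight", "normal", "overweight", "obese"]

# right=False makes each interval closed on the left, e.g. [25, 30) is "overweight".
df["bmi_bracket"] = pd.cut(df["bmi"], bins=bins, labels=labels, right=False)

print(df)
```

Using pd.cut keeps the bracket as an ordered categorical, which makes later bar charts sort from underweight to obese without extra work.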

Business Requirements

Examine the dataset to see whether the factors listed above do directly impact the cost of healthcare.
The project specifically looks at how each factor affects the cost charged to the individual, as well as exploring whether there is a pattern of increase or decrease when multiple factors are present.

Hypothesis and how to validate?

Format:

Hypothesis.
This will be tested by...

Result:

There will be a considerable difference between insurance charges based on smoker status.
This will be tested by studying the correlation between smoker status and charges, and by plotting the distribution of charges for smokers and non-smokers.

Result: The average charge for a non-smoker is under 10,000, whereas the average charge for a smoker is over 30,000.

Medical charges across regions will be considerably different.
This will be tested by analysing how the amount charged per individual varies by region.

Result: The bar chart showed that charges do not differ a great deal between regions; as a standalone factor, region isn't significant.

Individuals with higher BMI have higher medical insurance charges.
This will be tested by plotting average charge v BMI bracket and by using a correlation heatmap.

Result: Obese individuals were paying a much higher charge than normal-weight and underweight individuals. Surprisingly, being overweight was on average no more expensive than being of normal weight.
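
Each of the three hypotheses comes down to comparing the average charge across the categories of one column. The sketch below shows one way this could be done with a pandas groupby. The helper name mean_charge_by is hypothetical and the rows are made up for illustration, so the numbers it prints are not the project's results.

```python
import pandas as pd

# Made-up rows for illustration; not real records from the dataset.
df = pd.DataFrame({
    "smoker": ["yes", "no", "yes", "no", "no", "yes"],
    "region": ["southeast", "northwest", "northeast",
               "southeast", "northeast", "northwest"],
    "bmi_bracket": ["obese", "normal", "overweight",
                    "obese", "underweight", "normal"],
    "charges": [38000.0, 4500.0, 21000.0, 9800.0, 3200.0, 18500.0],
})


def mean_charge_by(frame: pd.DataFrame, column: str) -> pd.Series:
    """Average charge for each category in `column`, highest first."""
    return frame.groupby(column)["charges"].mean().sort_values(ascending=False)


# One summary per hypothesis: smoker status, region, BMI bracket.
for factor in ["smoker", "region", "bmi_bracket"]:
    print(mean_charge_by(df, factor), end="\n\n")
```

The same Series can be passed straight to a bar chart, which is how a table like this would typically become one of the plots described later.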

Project Plan

The data was collected from Kaggle, a third-party data supplier. The data will be imported and cleaned, ready for analysis. This will consist of:

Getting an understanding of the data with .info()
Checking for missing values
Dealing with outliers
Changing dtypes
Creating new columns
Changing current columns
Checking statistical summary of data (mean, median, std, etc)
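
The steps above can be sketched in pandas roughly as follows. This is an illustration under assumptions, not the notebook's actual code: the rows are made up, the 1.5 * IQR rule is one common way to flag outliers and may not be the method used in the project, and the has_children column name is hypothetical.

```python
import pandas as pd

# Made-up rows for illustration; the real data would be loaded from the
# Kaggle CSV with pd.read_csv, using whatever path the project uses.
df = pd.DataFrame({
    "age": [19, 33, 45, 52, 61, 28],
    "sex": ["female", "male", "male", "female", "male", "female"],
    "children": [0, 2, 1, 3, 0, 1],
    "smoker": ["yes", "no", "no", "no", "yes", "no"],
    "charges": [16884.92, 4449.46, 7281.51, 11090.72, 47055.53, 3756.62],
})

# Getting an understanding of the data: column names, dtypes, non-null counts.
df.info()

# Checking for missing values per column.
missing = df.isna().sum()
print(missing)

# Dealing with outliers: flag charges outside 1.5 * IQR (an assumed rule).
q1, q3 = df["charges"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["charges"] < q1 - 1.5 * iqr) | (df["charges"] > q3 + 1.5 * iqr)
print(f"Outlier rows: {is_outlier.sum()}")

# Changing dtypes: low-cardinality text columns become categoricals.
for col in ["sex", "smoker"]:
    df[col] = df[col].astype("category")

# Creating a new column: whether the individual has any children.
df["has_children"] = df["children"] > 0

# Checking the statistical summary (count, mean, std, quartiles).
print(df.describe())
```

Flagging outliers in a boolean mask, rather than dropping them immediately, leaves the decision on whether to remove them open until the analysis stage.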

As per the project plan, the data will be cleaned to maximise its integrity. Some trends were visible from the initial exploration and statistical analysis, but not in a presentable format. To address this, visualisations were made with the non-technical user in mind, making the project findings understandable for everyone.

The rationale to map the business requirements to the Data Visualisations

The factors were addressed one by one to see how much of an impact they had on charges for the individual.

The first set of plots compared smokers v non-smokers: bar charts of the count in each group and the average charge per individual, and the distribution of charges in each group to show the contrast between them.
The second set of plots looked at the distribution of charges and BMI amongst all individuals and the relationship between the two.
The third set of plots looked at the average charge across all categorical fields to find out whether any of them affected charges the way smoking status did. The number of children column was replaced by whether the individual had children. An additional bar chart was plotted to show the relationship between number of dependants and charges.
The fourth set of plots tested the relationship between the numerical values (age and BMI) and the charge amount, then combined both with smoker status.
The fifth set of plots were scatter graphs showing the second and third biggest causes of increased charges (BMI and age) combined with factors that had next to no influence on the amount: region and number of children.
The sixth plot was a heatmap which showed how little impact all factors (excluding smoker status) had on the amount charged to the individual.
The seventh plot was a scatter matrix showing all of this data.
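
The heatmap described above is built on a correlation matrix of the numeric columns. A minimal sketch of that underlying calculation is shown here, using made-up rows and assuming Pearson correlation (the pandas default); the project's notebook may have selected or encoded columns differently.

```python
import pandas as pd

# Made-up rows for illustration only; not real records from the dataset.
df = pd.DataFrame({
    "age": [19, 33, 45, 52, 61, 28],
    "bmi": [27.9, 22.7, 30.1, 25.8, 36.4, 33.0],
    "children": [0, 2, 1, 3, 0, 1],
    "charges": [16884.92, 4449.46, 7281.51, 11090.72, 47055.53, 3756.62],
})

# Smoker status is left out, matching the heatmap described in the text.
numeric_cols = ["age", "bmi", "children", "charges"]
corr = df[numeric_cols].corr()  # Pearson correlation by default

# Correlation of each factor with charges, strongest first.
with_charges = corr["charges"].drop("charges").sort_values(ascending=False)
print(with_charges)
```

The corr DataFrame can then be drawn with seaborn.heatmap(corr, annot=True) to produce an annotated heatmap.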

Analysis techniques used

This project is built on descriptive data analysis: looking at a specific part of the dataset (charges) and understanding how each of the other factors contributes to it, whether by increasing or decreasing it. Future projects could use this data to predict how much an individual will be charged based on the circumstances listed in the dataset.
A systematic approach was taken to analysing the data, first looking at each factor individually, then analysing how the charge amount is affected when two or three factors are combined.
There were some limitations in the data. There was no description of the specific regions, despite there being various health benefits to living in the countryside v the city. There could also be other health factors affecting the individual, and there is no record of how many claims they have made previously.
The plan was to take the approach listed above using notes made from the course LMS, masterclass and data coach sessions. AI was primarily used to help correct the syntax of code that was throwing errors that couldn't be fixed from the notes.

Ethical considerations

The only consideration of this kind that I can think of is that it was a private medical provider, meaning not everyone will have access to the same services due to financial reasons.

Unfixed Bugs

To my knowledge, there are no bugs in the project. The intention throughout was to keep everything simple: clean the data, deal with missing values and anomalies, then present the findings with numerous visualisations.

Development Roadmap

It became apparent that I had not worked on a project like this before. There was some indecision on whether to remove the outliers, as they accounted for 145 of the rows. On the next project I will give more consideration to how I present my findings; everything is in a logical order, but it could be more polished.
Gaps in my knowledge were down to the nature of doing a bootcamp: it is impossible to retain all the information, but I was able to refer to my notes when needed. Failing that, AI was able to fine-tune code when I was getting errors that I couldn't fix.

Main Data Analysis Libraries

Pandas - This was the foundation of the project, used to load the CSV file into a DataFrame to be cleaned and manipulated prior to further analysis.
Matplotlib, Seaborn and Plotly - Used to visualise my findings.

Credits

Code Institute LMS if my notes didn't contain the required information.
Masterclass and data coach session notes
Peer-to-peer support in Discord
README template provided by course facilitator
Code Institute project template
Kaggle - source for the dataset

Content

The image at the top of the README was a free to use vector

Acknowledgements (optional)

Code Institute tutors, masterclass coach Spencer Barriball and data coach Mark Briscoe
Peers in September cohort for support

Conclusion

Analysis of this dataset identified smoker status to be the biggest influencing factor when calculating an individual's medical charges, with smokers paying on average over 20,000 more than non-smokers. The most staggering statistic was the gap between the two groups: the cheapest charge issued to a smoker was more than what 75% of non-smokers paid. Even when other factors such as number of children and BMI were combined, none of them had the effect that smoking did. There were certain limitations in the data, such as the absence of wider medical history and previous claims, that stopped me drilling down into the data further. This analysis could form the foundation of another project that looks to predict the total charges an individual will have to pay based on the information provided in this dataset.
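
The comparison between the cheapest smoker charge and what 75% of non-smokers paid can be checked with a quantile. The sketch below shows the idea on made-up rows, so its numbers are illustrative and are not the project's figures.

```python
import pandas as pd

# Made-up rows for illustration; not real records from the dataset.
df = pd.DataFrame({
    "smoker": ["yes", "yes", "yes", "no", "no", "no", "no", "no"],
    "charges": [13000.0, 24000.0, 41000.0,
                2000.0, 4000.0, 6000.0, 8000.0, 10000.0],
})

smokers = df.loc[df["smoker"] == "yes", "charges"]
non_smokers = df.loc[df["smoker"] == "no", "charges"]

# Lowest charge issued to any smoker.
cheapest_smoker = smokers.min()

# 75th percentile: 75% of non-smokers paid this amount or less.
non_smoker_q3 = non_smokers.quantile(0.75)

# Share of non-smokers who paid less than the cheapest smoker.
share_below = (non_smokers < cheapest_smoker).mean()

print(f"Cheapest smoker charge: {cheapest_smoker:.2f}")
print(f"Non-smoker 75th percentile: {non_smoker_q3:.2f}")
print(f"Share of non-smokers below cheapest smoker: {share_below:.0%}")
```

If cheapest_smoker is greater than non_smoker_q3, the statement in the conclusion holds for the data being checked.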

The influence of smoker status on BMI statistics:

The influence of smoker status on age:

Correlation between factors (excluding smoker status)
