This project analyses healthcare insurance data to understand how various factors affect medical insurance costs. The primary method will be ETL (Extract, Transform and Load) combined with visualisations to identify relationships and trends within the data and provide clear insights to the viewer.
The dataset contains the following column headers:
| Field | Description | Possible Values / Notes | Example |
|---|---|---|---|
| age | Age of the individual in years | Between 18 and 65 | 29 |
| sex | Biological sex of the individual | Categories: "male", "female" | female |
| bmi | Body Mass Index | Continuous value; weight (kg) / height (m)^2 | 27.3 |
| children | Number of children covered | Non-negative integer | 2 |
| smoker | Smoker status | Categories: "yes", "no" | no |
| region | Geographic region | Categories: "northeast", "northwest", etc. | southeast |
| charges | Medical insurance charges ($) | Continuous value, insurance cost billed | 13456.78 |
| bmi_bracket | BMI category bracket | Categories: "underweight", "normal", "overweight", "obese" | obese |
Examine the dataset to see whether the factors listed above directly impact the cost of healthcare.
The project specifically looks at how each factor affects the cost charged to the individual, as well as exploring whether there is a pattern of increase or decrease when multiple factors are present.
| Hypothesis | This will be tested by... | Result |
|---|---|---|
| There will be a considerable difference between insurance charges based on smoker status. | Studying the correlation between smoker status and charges, and plotting the distribution of charges for smokers and non-smokers. | The average charge for a non-smoker is under 10,000, whereas a smoker faces an average charge of over 30,000. |
| Medical charges across regions will be considerably different. | Analysing how the amount charged per individual varies by region. | The bar chart showed that charges do not differ a great deal between regions; as a standalone factor, region isn't significant. |
| Individuals with higher BMI have higher medical insurance charges. | Plotting average charge against BMI bracket, as well as using a correlation heatmap. | Obese individuals were paying a much higher premium than normal and underweight people. Surprisingly, being overweight was on average no more expensive than being of normal weight. |
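The checks behind these hypotheses can be sketched with a few pandas calls. This is a minimal illustration rather than the project's actual notebook code: the six rows below are invented purely so the snippet runs on its own, and only the column names (`smoker`, `region`, `charges`) come from the dataset description above.

```python
import pandas as pd

# Invented sample rows (not real data) using the dataset's column names.
df = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest",
               "northeast", "southeast", "northwest"],
    "charges": [30000.0, 5000.0, 8000.0, 40000.0, 6000.0, 9000.0],
})

# Hypothesis 1: compare the average charge for smokers and non-smokers.
mean_by_smoker = df.groupby("smoker")["charges"].mean()

# Hypothesis 2: compare the average charge per region.
mean_by_region = df.groupby("region")["charges"].mean()

# Encode smoker status as 0/1 so a Pearson correlation with charges
# can be computed (assumption: "yes" maps to 1, "no" to 0).
smoker_flag = (df["smoker"] == "yes").astype(int)
corr = smoker_flag.corr(df["charges"])

print(mean_by_smoker)
print(mean_by_region)
print(f"Correlation between smoker status and charges: {corr:.2f}")
```

On the real dataset, the same `groupby` pattern applied to `bmi_bracket` would give the average charge per BMI category for the third hypothesis.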

The data was collected from Kaggle, a third-party data supplier. The data will be imported and cleaned, ready for analysis. This will consist of:
- Getting an understanding of the data with `.info()`
- Checking for missing values
- Dealing with outliers
- Changing dtypes
- Creating new columns
- Changing current columns
- Checking statistical summary of data (mean, median, std, etc)
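The steps above could look roughly like the sketch below. Several details are assumptions made for illustration and are not taken from the project: the sample rows are invented, rows with missing values are simply dropped, outliers are flagged with the common 1.5 × IQR rule rather than removed, and the BMI bracket cut points (18.5, 25, 30) follow the widely used WHO categories. The function name `clean` and the `charge_outlier` column are hypothetical.

```python
import pandas as pd

# Invented sample rows (not real data); the project itself loads the Kaggle CSV.
raw = pd.DataFrame({
    "age": [19, 33, 45, 52, 28, 61],
    "sex": ["female", "male", "male", "female", "female", "male"],
    "bmi": [17.9, 22.7, 28.9, 33.4, 26.1, 41.0],
    "children": [0, 1, 3, 2, 0, None],
    "smoker": ["yes", "no", "no", "yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest",
               "northeast", "southeast", "northwest"],
    "charges": [16884.92, 1725.55, 8240.59, 47055.53, 3756.62, 13228.85],
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper that walks through the cleaning steps in order."""
    df = df.copy()

    # Get an understanding of the data: columns, dtypes, non-null counts.
    df.info()

    # Check for missing values, then drop incomplete rows (assumed strategy).
    print(df.isna().sum())
    df = df.dropna()

    # Change dtypes: whole-number children, categorical text columns.
    df["children"] = df["children"].astype(int)
    for col in ["sex", "smoker", "region"]:
        df[col] = df[col].astype("category")

    # Create a new column: BMI bracket (cut points are an assumption).
    df["bmi_bracket"] = pd.cut(
        df["bmi"],
        bins=[0, 18.5, 25, 30, float("inf")],
        labels=["underweight", "normal", "overweight", "obese"],
        right=False,  # e.g. a BMI of exactly 25 counts as "overweight"
    )

    # Deal with outliers: flag charges outside 1.5 * IQR (assumed rule).
    q1, q3 = df["charges"].quantile([0.25, 0.75])
    iqr = q3 - q1
    df["charge_outlier"] = (
        (df["charges"] < q1 - 1.5 * iqr) | (df["charges"] > q3 + 1.5 * iqr)
    )
    return df


cleaned = clean(raw)

# Statistical summary (mean, median as the 50% row, std, etc.).
print(cleaned.describe())
```

Flagging outliers in a separate column instead of deleting them keeps the decision reversible, which helps when it is unclear whether extreme charges are errors or genuine cases.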
As per the project plan, the data was cleaned to maximise its integrity. Some of the trends were visible from the initial exploration phase and statistical analysis, but not in a presentable format. To address this, visualisations were made with the non-technical user in mind, making the project findings understandable for everyone.
The factors were addressed one by one to see how much of an impact they had on charges for the individual.
- The first set of plots compared smokers and non-smokers: the count in each group, the average charge per individual, and the distribution of charges within each group, highlighting the contrast between them.
- The second set of plots looked at the distribution of charges and BMI amongst all individuals and the relationship between the two.
- The third set of plots looked at the average charge across all categorical fields to find out whether any of them affected charges the way smoking status did. The number of children column had been replaced by whether the individual had children. An additional bar chart was plotted to see the relationship between the number of dependants and charges.
- The fourth set of plots tested the relationship between the numerical values (age and BMI) and the charge amount, then combined both with smoker status.
- The fifth set of plots were scatter graphs showing the second and third biggest causes of increased charges (BMI and age) combined with the factors that had next to no influence on the amount: region and number of children.
- The sixth plot was a heatmap, which showed how little impact all factors other than smoker status had on the amount charged to the individual.
- The seventh plot was a scatter matrix showing all of this data.
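As an illustration of how the heatmap in the sixth step can be prepared, the sketch below computes a correlation matrix with pandas and draws it with Matplotlib. It is an assumed approach, not the project's own plotting code: the sample rows are invented, `correlation_matrix` and `plot_heatmap` are hypothetical helper names, and smoker status is encoded as 0/1 so it can be included in the matrix.

```python
import pandas as pd

# Invented sample rows (not real data) using the dataset's column names.
sample = pd.DataFrame({
    "age": [19, 33, 45, 52, 28, 61],
    "bmi": [17.9, 22.7, 28.9, 33.4, 26.1, 41.0],
    "children": [0, 1, 3, 2, 0, 1],
    "smoker": ["yes", "no", "no", "yes", "no", "no"],
    "charges": [16884.92, 1725.55, 8240.59, 47055.53, 3756.62, 13228.85],
})


def correlation_matrix(df: pd.DataFrame) -> pd.DataFrame:
    """Pearson correlations between the numeric fields and smoker status."""
    numeric = df.copy()
    # Assumption: encode "yes" as 1 and "no" as 0.
    numeric["smoker"] = (numeric["smoker"] == "yes").astype(int)
    return numeric[["age", "bmi", "children", "smoker", "charges"]].corr()


def plot_heatmap(corr: pd.DataFrame):
    """Draw the correlation matrix as a heatmap and return the figure."""
    # Imported here so the matrix can be computed without Matplotlib installed.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
    ax.set_xticks(range(len(corr.columns)))
    ax.set_xticklabels(corr.columns, rotation=45, ha="right")
    ax.set_yticks(range(len(corr.index)))
    ax.set_yticklabels(corr.index)
    fig.colorbar(image, ax=ax, label="Pearson correlation")
    ax.set_title("Correlation between factors and charges")
    fig.tight_layout()
    return fig


corr_matrix = correlation_matrix(sample)

# Rank the factors by how strongly they correlate with charges.
print(corr_matrix["charges"].sort_values(ascending=False))
```

Calling `plot_heatmap(corr_matrix)` returns a Matplotlib figure that can be shown in a notebook or saved with `fig.savefig(...)`. Seaborn's `heatmap` function would be a natural alternative, since Seaborn is already used in the project.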
- This project has been built on descriptive data analysis: looking at a specific part of a dataset (charges) and understanding how each of the other factors contributes to this total, whether by increasing or decreasing it. Future projects could use this data to predict how much an individual will be charged based on their individual circumstances listed in the dataset.
- A systematic approach was taken to analysing all the data, first by looking at each of the factors individually before going on to analyse how the charge amount is affected when two or three of the factors are combined.
- There were some limitations in the data. There was no description of the specific regions, despite there being various health benefits to living in the countryside rather than in the city. There could also be other health factors affecting the individual, and there is no record of how many claims they have made previously.
- The plan was to take the approach listed above using the notes made from the course LMS, masterclass and data coach sessions. AI was primarily used to help correct the syntax of code that was throwing errors that couldn't be fixed based on notes.
- The only consideration of this kind that I can think of is that it was a private medical provider, meaning not everyone will have access to the same services due to financial reasons.
- To my knowledge, there are no bugs in the project. The intention throughout was to keep everything simple: clean the data, dealing with missing values and anomalies, before presenting the findings with numerous visualisations.
- It became apparent that I had not worked on a project like this before. There was some indecision on whether to remove the outliers, as they accounted for 145 of the rows. On the next project I will give more consideration to how I present my findings; everything is in a logical order but it could be more polished.
- Gaps in my knowledge were a natural consequence of doing a bootcamp, where it is impossible to retain all the information. I was able to refer to my notes when needed and, failing that, AI helped fine-tune code when I was getting errors that I couldn't fix.
Pandas - This was the foundation of the project, used to load the CSV file into a DataFrame to be cleaned and manipulated prior to further analysis.
Matplotlib, Seaborn and Plotly - Used to visualise my findings.
- Code Institute LMS if my notes didn't contain the required information.
- Masterclass and data coach session notes
- Peer-to-peer support in Discord
- README template provided by course facilitator
- Code Institute project template
- Kaggle - source for the dataset
- The image at the top of the README was a free to use vector
- Code Institute tutors, masterclass coach Spencer Barriball and data coach Mark Briscoe
- Peers in September cohort for support
Analysis of this dataset identified smoker status as the biggest influencing factor when calculating an individual's medical charges, with smokers paying on average over 20,000 more than non-smoking individuals. The most staggering statistic was the gap between the two groups: the cheapest charge issued to a smoker was more than what 75% of non-smokers paid. Even when other factors such as number of children and BMI were combined, none of them had the effect that smoking did. There were certain limitations in the data, such as the lack of wider medical history and previous claims, that stopped me drilling down into the data even further. This analysis could form the foundation of another project that looks to calculate the total charges an individual will have to pay based on the information provided in this dataset.
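The two headline comparisons in this conclusion can be computed as in the sketch below. The figures are invented sample values chosen only to make the snippet runnable, so the printed numbers are not the project's results; the variable names are hypothetical.

```python
import pandas as pd

# Invented sample rows (not real data) using the dataset's column names.
df = pd.DataFrame({
    "smoker": ["yes", "yes", "yes", "no", "no", "no", "no", "no"],
    "charges": [13000.0, 32000.0, 45000.0,
                2000.0, 4000.0, 7000.0, 11000.0, 20000.0],
})

smokers = df.loc[df["smoker"] == "yes", "charges"]
non_smokers = df.loc[df["smoker"] == "no", "charges"]

# Difference between the average charge of the two groups.
gap = smokers.mean() - non_smokers.mean()

# Cheapest charge issued to any smoker.
cheapest_smoker = smokers.min()

# Share of non-smokers who paid less than the cheapest smoker.
share_below = (non_smokers < cheapest_smoker).mean()

print(f"Average gap: {gap:.2f}")
print(f"Cheapest smoker charge: {cheapest_smoker:.2f}")
print(f"Non-smokers paying less than that: {share_below:.0%}")
```

Taking the mean of a boolean Series gives the proportion of `True` values, which is a compact way to express "what share of non-smokers paid less than the cheapest smoker".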
The influence of smoker status on BMI statistics: