
StavSteven/project_one_insurance

Project one - Healthcare

Contents

Project Overview

This project analyses healthcare insurance data to understand how various factors affect medical insurance costs. The primary method is ETL (Extract, Transform and Load), followed by visualisations that identify relationships and trends within the data and present clear insights to the viewer.

alt text

Dataset Content

The dataset contains the following column headers:

| Field | Description | Possible Values / Notes | Example |
|---|---|---|---|
| age | Age of the individual in years | Between 18 and 65 | 29 |
| sex | Biological sex of the individual | Categories: "male", "female" | female |
| bmi | Body Mass Index | Continuous value; weight (kg) / height (m)^2 | 27.3 |
| children | Number of children covered | Non-negative integer | 2 |
| smoker | Smoker status | Categories: "yes", "no" | no |
| region | Geographic region | Categories: "northeast", "northwest", etc. | southeast |
| charges | Medical insurance charges ($) | Continuous value, insurance cost billed | 13456.78 |
| bmi_bracket | BMI category bracket | Categories: "underweight", "normal", "overweight", "obese" | obese |
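
The `bmi_bracket` column is derived from `bmi`. A minimal sketch of one way to build it with `pandas.cut` is shown here. The cut-offs (18.5, 25 and 30) are the standard WHO adult categories and are an assumption, since this README does not state the thresholds the project used. The sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows; the real data comes from the Kaggle CSV.
df = pd.DataFrame({"bmi": [17.0, 22.4, 27.3, 33.1]})

# Assumed cut-offs (standard WHO adult categories); the project's exact
# thresholds are not stated in this README.
bins = [0, 18.5, 25, 30, float("inf")]
labels = ["underweight", "normal", "overweight", "obese"]

# right=False makes each interval [lower, upper), so a BMI of exactly 30
# falls in "obese" and a BMI of exactly 25 falls in "overweight".
df["bmi_bracket"] = pd.cut(df["bmi"], bins=bins, labels=labels, right=False)

print(df)
```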

Business Requirements

Examine the dataset to see whether the factors listed above do directly impact the cost of healthcare.
The project specifically looks at how each factor affects the cost charged to the individual, as well as exploring whether there is a pattern of increase or decrease when multiple factors are present.

Hypothesis and how to validate?

Format:

  • Hypothesis.
    This will be tested by...

    Result:

  1. There will be a considerable difference between insurance charges based on smoker status.
    This will be tested by studying the correlation between smoker status and charges, and by plotting the distribution of charges for smokers and non-smokers.

    Result: The average charge for a non-smoker is under $10,000, whereas the average charge for a smoker is over $30,000.

    alt text

  2. Medical charges across regions will be considerably different.
    This will be tested by analysing how the amount charged per individual varies by region.

    Result: The bar chart shows that charges do not differ a great deal between regions. As a standalone factor, region is not significant.

    alt text

  3. Individuals with higher BMI have higher medical insurance charges.
    This will be tested by plotting the average charge against BMI bracket and by using a correlation heatmap.

    Result: Obese individuals were charged considerably more than normal-weight and underweight individuals. Surprisingly, being overweight was on average no more expensive than being of normal weight.

    alt text
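
The three checks above all come down to comparing the average charge across the categories of one column. A minimal sketch of that comparison follows. The helper name `mean_charge_by` and the sample rows are hypothetical and are not taken from the project notebook or the real dataset.

```python
import pandas as pd

# Hypothetical rows for illustration only; not values from the real dataset.
df = pd.DataFrame({
    "smoker": ["yes", "no", "yes", "no", "no", "yes"],
    "region": ["southeast", "southeast", "northwest",
               "northwest", "northeast", "northeast"],
    "bmi_bracket": ["obese", "normal", "overweight",
                    "overweight", "obese", "normal"],
    "charges": [42000.0, 6000.0, 21000.0, 7000.0, 11000.0, 18000.0],
})


def mean_charge_by(frame: pd.DataFrame, column: str) -> pd.Series:
    """Average charge for each category of `column`, highest first."""
    return frame.groupby(column)["charges"].mean().sort_values(ascending=False)


# One summary per hypothesis: smoker status, region, BMI bracket.
for factor in ["smoker", "region", "bmi_bracket"]:
    print(mean_charge_by(df, factor).round(2), "\n")
```

Each resulting Series can be passed straight to a bar chart, which is the form the results above are presented in.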

Project Plan

alt text

The data was collected from Kaggle, a third-party data supplier. The data will be imported and cleaned, ready for analysis. This will consist of:

  • Getting an understanding of the data with .info()
  • Checking for missing values
  • Dealing with outliers
  • Changing dtypes
  • Creating new columns
  • Changing current columns
  • Checking statistical summary of data (mean, median, std, etc)
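
The cleaning steps listed above can be sketched as follows. This is an illustration under stated assumptions and not the project's actual notebook: the rows are hypothetical, filling missing BMI values with the median is one possible treatment, and the 1.5 × IQR rule is an assumed way of flagging outliers.

```python
import pandas as pd

# Hypothetical stand-in for the Kaggle CSV.
raw = pd.DataFrame({
    "age": [19, 33, 45, 52, 61, 28],
    "sex": ["female", "male", "male", "female", "female", "male"],
    "bmi": [27.9, 22.7, 30.1, None, 29.0, 33.0],
    "children": [0, 1, 2, 0, 3, 1],
    "smoker": ["yes", "no", "no", "no", "no", "no"],
    "charges": [16000.0, 4500.0, 8200.0, 10800.0, 13200.0, 95000.0],
})

raw.info()                   # column dtypes and non-null counts
missing = raw.isna().sum()   # missing values per column
print(missing)

# Assumed treatment of missing values: fill BMI with the column median.
clean = raw.copy()
clean["bmi"] = clean["bmi"].fillna(clean["bmi"].median())

# Flag (rather than drop) outliers in charges using the 1.5 * IQR rule.
q1, q3 = clean["charges"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = clean["charges"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean["charge_outlier"] = ~in_range

# Change dtypes: text categories become the pandas category dtype.
for col in ["sex", "smoker"]:
    clean[col] = clean[col].astype("category")

# Statistical summary: mean, std and quartiles (the 50% row is the median).
print(clean.describe())
```

Flagging outliers in a separate column keeps the decision to remove them reversible, which is useful when they make up a sizeable share of the rows.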

As per the project plan, the data will be cleaned to maximise its integrity. Some of the trends were visible during the initial exploration and statistical analysis, but not in a presentable format. To address this, visualisations were made with the non-technical user in mind, so that the project findings are understandable for everyone.

The rationale to map the business requirements to the Data Visualisations

The factors were addressed one by one to see how much of an impact they had on charges for the individual.

  • The first set of plots were bar charts comparing smokers and non-smokers: the count in each group, the average charge per individual, and the distribution of charges in each group, to show the contrast between them.
  • The second set of plots looked at the distribution of charges and BMI across all individuals, and the relationship between the two.
  • The third set of plots looked at the average charge across all categorical fields to find out whether any of them affected charges the way smoker status did. For these plots, the number of children was replaced by whether the individual had children. An additional bar chart was plotted to show the relationship between the number of dependants and charges.
  • The fourth set of plots tested the relationship between the numerical values (age and BMI) and the charge amount, then combined both with smoker status.
  • The fifth set of plots were scatter plots combining the second and third biggest causes of increased charges (BMI and age) with the factors that had next to no influence on the amount: region and number of children.
  • The sixth plot was a heatmap showing how little impact all factors other than smoker status had on the amount charged to the individual.
  • The seventh plot was a scatter matrix showing all of this data.
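
Two of the transformations behind these plots can be sketched briefly: collapsing the child count into a yes/no flag, and building the correlation matrix that a heatmap displays. The column names `has_children` and `smoker_flag` are hypothetical, as are the sample rows. Smoker status is encoded as 0/1 here only so that it can appear in the same matrix as the numeric columns.

```python
import pandas as pd

# Hypothetical rows for illustration only.
df = pd.DataFrame({
    "age": [19, 33, 45, 52, 61, 28],
    "bmi": [27.9, 22.7, 30.1, 25.3, 29.0, 33.0],
    "children": [0, 1, 2, 0, 3, 1],
    "smoker": ["yes", "no", "no", "no", "yes", "no"],
    "charges": [17000.0, 4500.0, 8200.0, 10800.0, 30000.0, 5000.0],
})

# Replace the child count with a yes/no flag (hypothetical column name).
df["has_children"] = (df["children"] > 0).map({True: "yes", False: "no"})

# Encode smoker status as 0/1 so it can sit alongside the numeric columns.
df["smoker_flag"] = (df["smoker"] == "yes").astype(int)

# Pairwise Pearson correlations between the numeric columns.
numeric_cols = ["age", "bmi", "children", "smoker_flag", "charges"]
corr = df[numeric_cols].corr()

# Correlation of every column with charges, strongest first.
print(corr["charges"].sort_values(ascending=False).round(2))
```

The `corr` DataFrame is the kind of input a heatmap function such as `seaborn.heatmap` accepts. Dropping `smoker_flag` from `numeric_cols` gives the matrix that excludes smoker status.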

Analysis techniques used

  • This project is built on descriptive data analysis: it looks at one specific part of the dataset (charges) and examines how each of the other factors contributes to it, whether by increasing or decreasing it. Future projects could use this data to predict how much an individual will be charged based on the circumstances listed in the dataset.
  • A systematic approach was taken to analysing the data: first looking at each factor individually, then analysing how the charge amount is affected when two or three factors are combined.
  • The data had some limitations. The regions were not described in any detail, despite there being various health benefits to living in the countryside rather than in a city. Other factors that are not in the dataset could also affect the charges, such as the individual's wider health or how many claims they have made previously.
  • The plan was to follow the approach above using notes made from the course LMS, masterclass and data coach sessions. AI was used primarily to help correct the syntax of code that raised errors which could not be fixed from those notes.

Ethical considerations

  • The only consideration of this kind that I can think of is that it was a private medical provider, meaning not everyone will have access to the same services due to financial reasons.

Unfixed Bugs

  • To my knowledge, there are no bugs in the project. The intention throughout was to keep everything simple: clean the data, deal with missing values and anomalies, then present the findings with a range of visualisations.

Development Roadmap

  • It became apparent that I had not worked on a project like this before. There was some indecision over whether to remove the outliers, as they accounted for 145 of the rows. On the next project I will give more consideration to how I present my findings. Everything is in a logical order, but it could be more polished.
  • Gaps in my knowledge were a natural result of doing a bootcamp, where it is impossible to retain all the information. I was able to refer to my notes when needed and, failing that, AI helped fine-tune code when I was getting errors that I couldn't fix.

Main Data Analysis Libraries

Pandas - This was the foundation of the project, used to load the CSV file into a DataFrame to be cleaned and manipulated prior to further analysis.
Matplotlib, Seaborn and Plotly - Used to visualise my findings.

Credits

  • Code Institute LMS if my notes didn't contain the required information.
  • Masterclass and data coach session notes
  • Peer to peer support in discord
  • README template provided by course facilitator
  • Code Institute project template
  • Kaggle - source for the dataset

Content

  • The image at the top of the README was a free to use vector

Acknowledgements (optional)

  • Code Institute tutors, masterclass coach Spencer Barriball and data coach Mark Briscoe
  • Peers in September cohort for support

Conclusion

Analysis of this dataset identified smoker status as the biggest influencing factor in an individual's medical charges, with smokers paying on average over $20,000 more than non-smokers. The most striking statistic was the gap between the two groups: the cheapest charge issued to a smoker was higher than what 75% of non-smokers paid. Even when other factors such as number of children and BMI were combined, none of them had the effect that smoking did. The data had certain limitations, such as the absence of wider medical history and previous claims, which stopped me from drilling down into it further. This analysis could form the foundation of another project that aims to calculate the total charges an individual will have to pay based on the information provided in this dataset.
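
The two headline comparisons in this conclusion (the gap in average charge, and the cheapest smoker charge set against the non-smoker distribution) can be computed along the following lines. The rows are hypothetical, so the printed numbers illustrate the calculation and not the project's results.

```python
import pandas as pd

# Hypothetical rows for illustration only; not the real dataset.
df = pd.DataFrame({
    "smoker": ["yes", "yes", "yes", "no", "no", "no", "no", "no"],
    "charges": [15000.0, 28000.0, 41000.0,
                3000.0, 6000.0, 9000.0, 12000.0, 20000.0],
})

smokers = df.loc[df["smoker"] == "yes", "charges"]
non_smokers = df.loc[df["smoker"] == "no", "charges"]

# Difference in average charge between the two groups.
gap = smokers.mean() - non_smokers.mean()

# Share of non-smokers who paid less than the cheapest smoker charge.
cheapest_smoker = smokers.min()
share_below = (non_smokers < cheapest_smoker).mean()

print(f"Average gap: {gap:,.0f}")
print(f"Cheapest smoker charge: {cheapest_smoker:,.0f}")
print(f"Non-smokers paying less than that: {share_below:.0%}")
```

A `share_below` value above 0.75 corresponds to the claim that the cheapest smoker charge exceeds what 75% of non-smokers paid.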

alt text

The influence of smoker status on BMI statistics: alt text | alt text

The influence of smoker status on age: alt text

Correlation between factors (excluding smoker status) alt text
