📊 Predicting Life Expectancy Using Regression
This project develops a predictive model for life expectancy from a Kaggle dataset of health, economic, and demographic variables across multiple countries. Using R, we cleaned and transformed the data, explored it visually, and built multiple linear regression models to predict life expectancy, reducing MSE by 18% through variable transformations.
📁 Dataset
- Source: Kaggle
- Size: ~3,000 observations, 22 variables
- Target Variable: Life expectancy
- Selected Predictors:
  - Status (Developed / Developing)
  - Alcohol (liters per capita)
  - Percentage Expenditure
  - Hepatitis B, Polio (immunization rates)
  - BMI
  - Schooling (average years)
🔧 Workflow Summary
- Data Cleaning
  - Removed irrelevant variables
  - Imputed missing values using mean and k-NN methods
  - Scaled and transformed skewed variables
- Exploratory Data Analysis (EDA)
  - Histograms of predictors and target
  - Pairwise scatterplot matrix using pairs() to inspect correlations
  - Identified skewed distributions (e.g., Alcohol, Expenditure)
- Modeling
  - Built multiple linear regression models
  - Applied log and polynomial transformations
  - Selected features based on statistical significance and diagnostics
  - Achieved an 18% reduction in MSE after applying the transformations
- Validation
  - Checked linear model assumptions using:
    - Residual plots
    - Q-Q plots
  - Assessed goodness of fit with R² and adjusted R²
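The cleaning and EDA steps above can be sketched in base R. This is a minimal illustration, not the project's actual script: the `toy` data frame and its values are made up, `impute_mean()` and `skewness()` are hypothetical helpers, and the k-NN imputation the project also used is not shown.

```r
# Toy data standing in for the Kaggle dataset (values are made up for illustration)
toy <- data.frame(
  Life.expectancy        = c(65.0, 59.9, 74.2, 81.1, 52.3, 77.5),
  Alcohol                = c(0.01, NA, 4.60, 9.80, 0.50, 7.20),
  percentage.expenditure = c(71.3, 73.5, NA, 4500.0, 12.1, 980.4),
  Schooling              = c(10.1, 10.0, 13.2, 16.5, 7.9, NA)
)

# Mean imputation: replace each NA with the mean of the observed values in that column
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}
toy_clean <- as.data.frame(lapply(toy, impute_mean))

# Sample skewness (third standardized moment); positive values mean a long right tail
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}
skew_before <- skewness(toy_clean$percentage.expenditure)

# log1p() compresses the right tail and is defined at zero
toy_clean$log.expenditure <- log1p(toy_clean$percentage.expenditure)
skew_after <- skewness(toy_clean$log.expenditure)

# Pairwise correlations as a numeric complement to the scatterplot matrix
round(cor(toy_clean), 2)
```

Comparing `skew_before` with `skew_after` gives a quick check on whether a transformation actually made a variable more symmetric before it goes into a model.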
📈 Sample Visualizations
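```r
# Histogram of life expectancy
hist(data_clean$Life.expectancy, main = "Life Expectancy", col = "skyblue", xlab = "Years")
# Scatterplot matrix of the numeric columns (pairs() stops on character columns)
pairs(data_clean[sapply(data_clean, is.numeric)], cex = 0.1)
```
🧠 Sample Model Code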
```r
# Fit linear model
model <- lm(Life.expectancy ~ Status + Alcohol + percentage.expenditure + Hepatitis.B +
              Polio + BMI + Schooling, data = data_clean)
# Summary of model
summary(model)
# Residual diagnostics
par(mfrow = c(2, 2))
plot(model)
```
✅ Results
- Final model includes 7 key predictors
- Achieved:
  - R² ≈ 0.78
  - 18% lower MSE after transformations
- Schooling, Alcohol, and Polio were strong positive predictors of life expectancy
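As a rough sketch of how figures like these can be computed, the example below fits a baseline and a log-transformed model and compares them. It runs on synthetic data, so its numbers will not match the results above. The variable names are hypothetical, and using in-sample MSE is an assumption, since the evaluation setup is not described here.

```r
# Synthetic data with a curved relationship (made up for illustration)
set.seed(42)
n <- 200
expenditure <- rexp(n, rate = 1 / 500)   # right-skewed predictor
life <- 50 + 4 * log1p(expenditure) + rnorm(n, sd = 2)
sim <- data.frame(life, expenditure)

# Baseline uses the raw predictor; the alternative uses its log transform
fit_raw <- lm(life ~ expenditure, data = sim)
fit_log <- lm(life ~ log1p(expenditure), data = sim)

# In-sample mean squared error of each fit (assumption: no held-out set)
mse <- function(fit) mean(residuals(fit)^2)
mse_raw <- mse(fit_raw)
mse_log <- mse(fit_log)

# Relative reduction in MSE, as a percentage of the baseline
reduction_pct <- 100 * (mse_raw - mse_log) / mse_raw

# Fit statistics reported by summary()
r2     <- summary(fit_log)$r.squared
adj_r2 <- summary(fit_log)$adj.r.squared
```

Because in-sample MSE cannot increase when terms are added to a nested model, a held-out or cross-validated MSE would be a stricter comparison.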
🔧 Tools Used
Language: R
Packages: dplyr, ggplot2, caret, MASS

