This repository presents a dual analysis of community health outcomes, moving beyond simple prediction to actionable explanation. It combines two complementary analyses:

- A Supervised Machine Learning pipeline to predict Quality of Life (QoL) and identify its key drivers.
- An Unsupervised Clustering analysis to explain patient satisfaction by identifying hidden biomechanical profiles and the service gaps they face.

The result is a complete, data-driven story that moves from "What" is happening (Prediction) to "Why" it's happening and "How" to fix it (Explanation).
View the analysis report here: https://biodatasage.github.io/ff-comm-health/
- Predict Quality of Life scores using patient demographics, biomechanical measures, and service utilization patterns
- Discover hidden patient profiles ("Slow," "Steady," and "Fast Walkers") using unsupervised k-Means clustering.
- Identify key determinants of patient satisfaction and health outcomes
- Explain the "Frustration Gap" we found in Patient Satisfaction for the "Steady Walker" profile.
- Identify a critical "Rehab Gap" as the root cause of this frustration.
- Compare machine learning models (Linear Regression vs Random Forest) for predictive accuracy
- Provide interactive tools for healthcare professionals to explore patient data and generate predictions
- Generate evidence-based recommendations for intervention strategies and program optimization
The dataset is publicly available on Kaggle: Link to dataset
| Characteristic | Details |
|---|---|
| Sample Size | 347 participants |
| Age Range | 18-69 years |
| Variables | 12 features (demographics, clinical, biomechanical, outcomes) |
| Study Type | Cross-sectional community health evaluation |
| Data Format | CSV (community_health_evaluation_dataset.csv) |
Demographics:

- Age (years)
- Gender (Male/Female)
- Socioeconomic Status (SES: 1-4 scale)

Service Utilization:

- Service Type (Consultation, Preventive, Rehabilitation)
- Visit Frequency (Yearly, Monthly, Weekly)

Biomechanical Measures:

- Step Frequency (steps/min)
- Stride Length (meters)
- Joint Angle (degrees)
- EMG Activity Level (Low, Moderate, High)

Outcome Measures:

- Quality of Life Score (0-100)
- Patient Satisfaction (1-10 scale)

Our project uses a two-part analytical approach to tell the full story.
This analysis builds a predictive model to understand the key drivers of QoL.

- Models: Linear Regression, Random Forest.
- Process: 80/20 train-test split with 5-fold cross-validation.
- Finding: The Random Forest model was superior (R² = 0.489), and it identified Patient Satisfaction and Service Type as the two most important predictors of Quality of Life.
- The Problem: This gave us a puzzle. To improve QoL, we have to "improve satisfaction." This isn't an answer; it's a new question.

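The process above can be sketched with caret as follows. This is a minimal illustration, not the project's actual script: the data are synthetic, and the column names (`qol`, `satisfaction`, `service_type`, `age`) and the seed are assumptions made for the example.

```r
library(caret)         # createDataPartition(), train(), trainControl()
library(randomForest)  # backend used by caret's method = "rf"

set.seed(42)  # hypothetical seed, chosen only for this sketch

# Synthetic stand-in data; column names are assumptions, not the real dataset's
n <- 200
demo <- data.frame(
  satisfaction = sample(1:10, n, replace = TRUE),
  service_type = factor(sample(c("Consultation", "Preventive", "Rehabilitation"),
                               n, replace = TRUE)),
  age = sample(18:69, n, replace = TRUE)
)
demo$qol <- 40 + 3 * demo$satisfaction +
  5 * (demo$service_type == "Rehabilitation") + rnorm(n, sd = 8)

# 80/20 train-test split
train_idx <- createDataPartition(demo$qol, p = 0.8, list = FALSE)
train_set <- demo[train_idx, ]
test_set  <- demo[-train_idx, ]

# 5-fold cross-validation on the training set
ctrl <- trainControl(method = "cv", number = 5)

fit_lm <- train(qol ~ ., data = train_set, method = "lm", trControl = ctrl)
fit_rf <- train(qol ~ ., data = train_set, method = "rf", trControl = ctrl,
                importance = TRUE)

# Held-out performance (RMSE, Rsquared, MAE) for each model
perf_lm <- postResample(predict(fit_lm, test_set), test_set$qol)
perf_rf <- postResample(predict(fit_rf, test_set), test_set$qol)
print(rbind(linear_regression = perf_lm, random_forest = perf_rf))

# Variable importance from the Random Forest fit
print(varImp(fit_rf))
```

On synthetic data the two models may rank differently than in the reported results; the sketch only shows the shape of the workflow.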
This analysis was designed to solve the puzzle from Part 1. We investigated Patient Satisfaction to find its root cause.

- Model: Unsupervised k-Means Clustering.
- Process:
  1. Clustering: We clustered patients on their 4 biomechanical variables.
  2. Find k: An Elbow Plot showed k=3 was the optimal number of clusters.
  3. Profile: We profiled the clusters by their Step Frequency and named them: "Slow Walkers," "Steady Walkers," and "Fast Walkers."
  4. Connect: We cross-referenced these clusters with our key outcome (Patient Satisfaction) and the key driver (Service Type).

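The clustering steps above might look roughly like this in base R. The data are synthetic and the column names are hypothetical. Two choices here are assumptions rather than documented details of the project: EMG Activity Level is encoded as an ordinal number (Low = 1, Moderate = 2, High = 3), and all variables are standardized before clustering.

```r
set.seed(42)  # hypothetical seed for this sketch

# Synthetic stand-in for the four biomechanical variables
n <- 150
bio <- data.frame(
  step_frequency = c(rnorm(50, 70, 4), rnorm(50, 85, 4), rnorm(50, 100, 4)),
  stride_length  = runif(n, 0.5, 1.0),
  joint_angle    = runif(n, 10, 30),
  emg_activity   = sample(c("Low", "Moderate", "High"), n, replace = TRUE)
)

# k-means needs numeric input: ordinal-encode EMG (assumption), then standardize
bio$emg_num <- as.integer(factor(bio$emg_activity,
                                 levels = c("Low", "Moderate", "High")))
X <- scale(bio[, c("step_frequency", "stride_length", "joint_angle", "emg_num")])

# Elbow plot: total within-cluster sum of squares for k = 1..8
wss <- sapply(1:8, function(k) kmeans(X, centers = k, nstart = 25)$tot.withinss)
plot(1:8, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")

# Fit the chosen k = 3 solution
km <- kmeans(X, centers = 3, nstart = 25)
bio$cluster <- km$cluster

# Profile clusters by mean step frequency and name them from slowest to fastest
cluster_means <- tapply(bio$step_frequency, bio$cluster, mean)
profile_names <- c("Slow Walkers", "Steady Walkers", "Fast Walkers")
bio$profile <- profile_names[rank(cluster_means)[as.character(bio$cluster)]]

print(table(bio$profile))
```

Because k-means assigns arbitrary cluster numbers, naming the profiles by ranked mean Step Frequency keeps the labels stable across runs.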
This structure is designed for clear, reproducible analysis.
.
├── data/
│   └── RData.rds                    # Dataset
├── scripts/
│   ├── predictive_analysis.qmd      # Script for Part 1 (RF Model)
│   └── clustering_analysis.Rmd      # Script for Part 2 (k-Means Model)
├── app/
│   └── app.R                        # Interactive Shiny dashboard
├── presentations/
│   ├── final_presentation.pdf       # Presentation file
│   └── report.Rmd                   # Script for online report
├── README.md                        # This file
├── LICENSE                          # MIT License
└── figures/                         # Generated plots and visualizations

# Install required packages
install.packages(c(
  "shiny", "tidyverse", "plotly", "DT", "randomForest",
  "caret", "quarto", "rmarkdown", "corrplot", "patchwork"
))

How to Reproduce Our Analysis

The project is split into two scripts, run in order.

- Run the Predictive Analysis (Part 1):
# This script builds the Random Forest model and finds the key QoL predictors.
# Rendering the .qmd executes its R chunks (source() only runs plain .R files).
quarto::quarto_render("scripts/predictive_analysis.qmd")

- Run the Clustering Analysis (Part 2):
# This script runs the k-Means clustering, finds the "Frustration Gap,"
# and identifies the "Rehab Gap."
# Rendering the .Rmd executes its R chunks (requires the rmarkdown package).
rmarkdown::render("scripts/clustering_analysis.Rmd")

- Launching our Dashboard
# Launch the Shiny application
shiny::runApp("app/app.R")

The dashboard provides:

- Real-time QoL predictions based on patient parameters
- Interactive data exploration with filtering and visualization
- Model performance comparison with diagnostic plots
- Clinical interpretation with actionable recommendations

# Render the online report
rmarkdown::render("presentations/report.Rmd")

| Model | RMSE | MAE | R² |
|---|---|---|---|
| Linear Regression | 13.45 | 10.82 | 0.342 |
| Random Forest | 11.23 | 8.97 | 0.489 |
Random Forest outperforms Linear Regression across all metrics, capturing non-linear relationships and interactions between predictors.
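For reference, the three metrics in the table can be computed as in the sketch below. The helper names and the toy vectors are illustrative only. The R² shown is the coefficient-of-determination form; the source does not state which definition was used, and caret's default `Rsquared` is the squared correlation instead, so values may differ slightly.

```r
# Root mean squared error: penalizes large errors more heavily
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Mean absolute error: average size of the error in QoL points
mae <- function(actual, predicted) mean(abs(actual - predicted))

# R-squared as 1 - (residual sum of squares / total sum of squares)
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Toy values, made up for illustration (not project data)
actual    <- c(60, 72, 55, 80, 68)
predicted <- c(58, 75, 60, 77, 70)

print(c(RMSE = rmse(actual, predicted),
        MAE  = mae(actual, predicted),
        R2   = r_squared(actual, predicted)))
```

Lower RMSE and MAE and higher R² indicate a better fit, which is the pattern the table shows for Random Forest.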
- Patient Satisfaction (most important)
- Service Type
- Visit Frequency
- EMG Activity Level
- Age
Finding 1 (Prediction): Patient Satisfaction and Service Type are the most important predictors of Quality of Life.
Finding 2 (Clustering): We found 3 patient profiles: "Slow," "Steady," and "Fast Walkers."
Finding 3 (The "Frustration Gap"): The "Steady Walkers" (Cluster 2) are the least satisfied patient group.
Finding 4 (The "Rehab Gap"): This "frustrated" group also receives the least "Rehab" (26.7%), while the most satisfied group (Cluster 3) receives the most (37.1%).
Finding 5 (The Solution): The "one-size-fits-all" rehab model is failing. Our EMG plot shows a clear triage solution:

- "Slow Walkers" (40% "Low EMG") need Strength Training.
- "Steady Walkers" ("Mod/High EMG") need Physical Therapy for pain/balance.

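The cross-referencing behind Findings 3 and 4 can be approximated as below: mean satisfaction per profile, and the share of each service type within each profile. The data frame is randomly generated and its column names are assumptions, so the numbers it prints will not match the percentages reported above.

```r
set.seed(1)  # hypothetical seed for this sketch

# Synthetic stand-in: one row per patient with an assigned profile
patients <- data.frame(
  profile = sample(c("Slow Walkers", "Steady Walkers", "Fast Walkers"),
                   120, replace = TRUE),
  service_type = sample(c("Consultation", "Preventive", "Rehabilitation"),
                        120, replace = TRUE),
  satisfaction = sample(1:10, 120, replace = TRUE)
)

# "Frustration Gap": mean Patient Satisfaction by profile
satisfaction_by_profile <- aggregate(satisfaction ~ profile,
                                     data = patients, FUN = mean)
print(satisfaction_by_profile)

# "Rehab Gap": share of each service type within each profile (rows sum to 1)
service_share <- prop.table(table(patients$profile, patients$service_type),
                            margin = 1)
print(round(100 * service_share, 1))
```

Using `margin = 1` normalizes within each profile, so the Rehabilitation column can be compared directly across groups of different sizes.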
- Summary statistics and key performance indicators
- Demographic distribution visualizations
- Service utilization patterns
- Interactive input controls for patient parameters
- Real-time predictions from both models
- Clinical interpretation with color-coded risk levels
- Feature importance visualization
- Confidence intervals based on similar patients
- Side-by-side model comparison
- Actual vs Predicted scatter plots
- Residual diagnostics
- Subgroup performance analysis
- Interactive filtering by demographics and service type
- Custom scatter plots and distributions
- Exportable data tables
- Risk Stratification: Identify patients at risk for poor QoL outcomes
- Intervention Planning: Prioritize resources based on predicted outcomes
- Service Optimization: Compare effectiveness of different service types
- Patient Counseling: Set realistic expectations based on similar cases
- Hypothesis Testing: Explore relationships between variables
- Model Benchmarking: Compare with alternative approaches
- Feature Selection: Identify most impactful predictors
- Subgroup Analysis: Examine disparities across populations
- Program Evaluation: Assess effectiveness of interventions
- Resource Allocation: Optimize service delivery models
- Quality Improvement: Monitor outcomes over time
- Evidence-Based Decision Making: Data-driven policy development
All analyses are fully reproducible. The code includes:

- Fixed random seeds for model training
- Explicit package version requirements
- Clear documentation of preprocessing steps
- Detailed comments throughout

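As a small illustration of the first two points, fixing the seed makes random steps such as sampling or data splitting repeatable, and `sessionInfo()` records the package versions in use. The seed value here is arbitrary and not necessarily the one used in the project scripts.

```r
# Same seed, same random draw: the basis for reproducible splits and model fits
set.seed(2024)  # hypothetical seed value
first_draw <- sample(1:347, 5)

set.seed(2024)
second_draw <- sample(1:347, 5)

print(identical(first_draw, second_draw))

# Record the R version and loaded package versions alongside the results
print(sessionInfo())
```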
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request. For major changes:
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
- Authors: Derrick Nyarko, Daniel Adediran, Alejandra Ramirez, Julie Cha, Jiro Claveria, and Kass Fernandez
- Project Repository: https://github.com/BioDataSage/ff-comm-health
- Analysis Report: https://biodatasage.github.io/ff-comm-health/