Author: Ethan Daniel Lobo
An end-to-end Machine Learning architecture designed to predict high sleep latency (restless nights) using non-linear biological thresholds and behavioral data.
## 🔐 Data Confidentiality & Reproducibility Disclaimer
Due to data privacy restrictions and the handling of protected physiological metrics, the original wearable datasets used in this research cannot be published.
To ensure this Machine Learning pipeline remains fully reproducible for reviewers, a synthetic data generator (`00_generate_mock_data.py`) has been included. This script generates localized mock data with mathematically injected correlations that mimic the true biological signals (e.g., elevated Heart Rate Z-Scores correlating with high latency) and the 2026 Concept Drift observed in the live project. Running the provided notebook against the mock data will execute the full pipeline and yield metric trade-offs representative of the real-world findings.
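The details of `00_generate_mock_data.py` are not shown here, but the injection idea can be sketched in a few lines: sample physiological features, then drive the probability of a high-latency night through a logistic link so the HR Z-Score correlation (and a year-over-year drift offset) is guaranteed to exist in the mock data. All column names, coefficients, and the `drift` parameter below are illustrative assumptions, not the project's real schema.

```python
import numpy as np
import pandas as pd

def generate_mock_days(n_days: int, drift: float = 0.0, seed: int = 42) -> pd.DataFrame:
    """Generate synthetic nightly records with an injected HR-latency link.

    `drift` shifts the latency log-odds baseline to mimic year-over-year
    concept drift. Coefficients and column names are illustrative only.
    """
    rng = np.random.default_rng(seed)
    hr_z = rng.normal(0.0, 1.0, n_days)              # HR z-score vs. rolling baseline
    temp_vol = np.abs(rng.normal(0.3, 0.1, n_days))  # temperature volatility
    # Injected correlation: higher HR z-scores raise the log-odds of high latency.
    logits = 1.2 * hr_z + 0.8 * temp_vol - 1.0 + drift
    p_high = 1.0 / (1.0 + np.exp(-logits))
    high_latency = rng.binomial(1, p_high)
    return pd.DataFrame({"hr_zscore": hr_z,
                         "temp_volatility": temp_vol,
                         "high_latency": high_latency})

df_2025 = generate_mock_days(365, drift=0.0)
df_2026 = generate_mock_days(365, drift=0.7, seed=7)  # shifted baseline = concept drift
```

Because the label is generated *through* the features, any reasonable classifier trained on the 2025 frame will recover the planted signal, while the `drift` offset reproduces the 2026 baseline shift.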
This project processes raw wearable API data and transforms it into a predictive classification engine, contained entirely within `sleep_latency_master_pipeline.ipynb`.
- Data Engineering: Engineered rolling 14-day physiological baselines (Heart Rate Z-Scores, Temperature Volatility) to solve the wearable "cold start" problem.
- Exploratory Data Analysis: Demonstrated that standard linear correlation (-0.02) failed to capture non-linear biological thresholds driving sleep latency.
- The Classification Pivot: Transitioned the target variable from continuous Regression (predicting exact minutes) to binary Classification (Normal Sleep vs. High Latency) to account for human biological noise.
- Model Optimization: Prevented tree-based overfitting via hyperparameter tuning and used `scale_pos_weight` to counteract heavy class imbalance (77% normal sleep).
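The two mechanical pieces above — rolling 14-day baselines and the imbalance weight — can be sketched as follows. This is a minimal illustration, not the project's code: the `resting_hr` column, the `min_periods` choice for softening the cold start, and the demo data are all assumptions.

```python
import numpy as np
import pandas as pd

# Toy daily series standing in for wearable API output (illustrative values).
rng = np.random.default_rng(0)
df = pd.DataFrame({"resting_hr": rng.normal(60, 3, 60)})

# 14-day rolling baseline; min_periods < window softens the "cold start"
# by emitting a z-score once at least 7 days of history exist.
roll = df["resting_hr"].rolling(window=14, min_periods=7)
df["hr_zscore"] = (df["resting_hr"] - roll.mean()) / roll.std()

# Class-imbalance weight for XGBoost: negatives / positives.
# With ~77% normal sleep, scale_pos_weight ≈ 0.77 / 0.23 ≈ 3.3.
y = rng.binomial(1, 0.23, 60)
scale_pos_weight = (y == 0).sum() / max((y == 1).sum(), 1)
```

Note that this sketch includes the current day in its own baseline window; a stricter variant would `.shift(1)` the rolling statistics so each night is scored only against prior days.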
When models frozen at their 2025 training state were tested against unseen 2026 data, a significant baseline shift (Concept Drift) was observed.
| Model | Accuracy | Precision | Recall (Class 1) |
|---|---|---|---|
| 1. Logistic Baseline | 59.1% | 29.0% | 38.1% |
| 2. Polynomial Logistic | 64.2% | 28.3% | 23.3% |
| 3. Random Forest | 65.5% | 27.3% | 18.6% |
| 4. XGBoost | 63.5% | 29.9% | 28.6% |
Insights:
- XGBoost achieved the optimal balance of accuracy and precision.
- The simple Logistic Baseline outperformed the complex tree models in Recall on unseen future data, indicating that rigid, highly tuned models struggle to generalize when a user's environmental stress baselines shift fundamentally year-over-year.
- Solution Implemented: A rolling retraining window was simulated within the pipeline, boosting XGBoost's Recall to 44% by allowing it to recalibrate to the new year's baselines.
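A rolling retraining window is essentially walk-forward evaluation: refit on the most recent slice of history, score the next slice, slide forward, repeat. Below is a generic sketch of that loop, using `LogisticRegression` as a stand-in for the pipeline's XGBoost model; the function name, window sizes, and demo data are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def rolling_retrain(df, features, target, train_days=180, step=30):
    """Walk-forward loop: train on the most recent `train_days`, predict the
    next `step` days, then slide forward. Illustrative sketch only."""
    preds, truth = [], []
    start = train_days
    while start + step <= len(df):
        train = df.iloc[start - train_days:start]
        test = df.iloc[start:start + step]
        model = LogisticRegression().fit(train[features], train[target])
        preds.extend(model.predict(test[features]))
        truth.extend(test[target])
        start += step
    return np.array(preds), np.array(truth)

# Hypothetical usage on synthetic data (400 "days", one feature):
rng = np.random.default_rng(1)
x = rng.normal(0, 1, 400)
y = (rng.random(400) < 1 / (1 + np.exp(-x))).astype(int)
demo = pd.DataFrame({"hr_zscore": x, "high_latency": y})
preds, truth = rolling_retrain(demo, ["hr_zscore"], "high_latency")
```

Because each refit only ever sees recent history, the model continuously recalibrates to the newest baselines, which is what lets Recall recover after a year-over-year shift.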
- Clone the repository to your local machine.
- Install the required dependencies: `pip install -r requirements.txt`
- Run the mock data generator to create the synthetic 2025 and 2026 datasets: `python 00_generate_mock_data.py`
- Open and run `sleep_latency_master_pipeline.ipynb` from top to bottom to execute the end-to-end pipeline.