


Predicting Sleep Latency from Physiological Wearable Data

Author: Ethan Daniel Lobo

An end-to-end Machine Learning architecture designed to predict high sleep latency (restless nights) using non-linear biological thresholds and behavioral data.

🔐 Data Confidentiality & Reproducibility Disclaimer

Due to data privacy restrictions and the handling of protected physiological metrics, the original wearable datasets used in this research cannot be published.

To ensure this Machine Learning pipeline remains fully reproducible for reviewers, a synthetic data generator (00_generate_mock_data.py) has been included. This script generates mock data locally, with mathematically injected correlations that mimic the true biological signals (e.g., elevated Heart Rate Z-Scores correlating with high latency) and the 2026 Concept Drift observed in the live project.

Running the provided notebook against the mock data will successfully execute the pipeline and yield metric trade-offs representative of the real-world findings.
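To illustrate what "mathematically injected correlations" means in practice, here is a minimal sketch of such a generator. The function names, coefficients, and drift offset below are illustrative assumptions, not the contents of the actual 00_generate_mock_data.py script:

```python
import random

def generate_mock_night(year, rng):
    """One synthetic night: an elevated HR z-score raises the odds of
    high sleep latency, and 2026 gets a shifted baseline (concept drift).
    All coefficients here are illustrative, not the project's real values."""
    hr_z = rng.gauss(0, 1)
    # Inject the target correlation: only positive HR deviations add risk.
    prob_high_latency = 0.23 + 0.15 * max(hr_z, 0)
    # Inject concept drift: shift the 2026 baseline upward.
    if year == 2026:
        prob_high_latency += 0.10
    high_latency = 1 if rng.random() < min(prob_high_latency, 1.0) else 0
    return {"year": year, "hr_zscore": hr_z, "high_latency": high_latency}

rng = random.Random(42)
nights_2025 = [generate_mock_night(2025, rng) for _ in range(1000)]
nights_2026 = [generate_mock_night(2026, rng) for _ in range(1000)]
rate_2025 = sum(n["high_latency"] for n in nights_2025) / 1000
rate_2026 = sum(n["high_latency"] for n in nights_2026) / 1000
```

A model fit on data like nights_2025 and scored on nights_2026 will see the same kind of baseline shift described in the Key Findings below.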

Project Architecture

This project processes raw wearable API data and transforms it into a predictive classification engine, contained entirely within sleep_latency_master_pipeline.ipynb.

  1. Data Engineering: Engineered rolling 14-day physiological baselines (Heart Rate Z-Scores, Temperature Volatility) to solve the wearable "cold start" problem.
  2. Exploratory Data Analysis: Demonstrated that standard linear correlation (-0.02) failed to capture non-linear biological thresholds driving sleep latency.
  3. The Classification Pivot: Transitioned the target variable from continuous Regression (predicting exact minutes) to binary Classification (Normal Sleep vs. High Latency) to account for human biological noise.
  4. Model Optimization: Prevented tree-based overfitting via hyperparameter tuning and utilized scale_pos_weight to counteract heavy class imbalance (77% normal sleep).
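Steps 1 and 4 above can be sketched briefly. The heart-rate values below are made up, and including the current night in its own rolling window is a simplifying assumption; the imbalance weight follows the standard negatives-to-positives ratio implied by 77% normal sleep:

```python
import numpy as np
import pandas as pd

# Illustrative nightly resting heart-rate series (values are made up).
hr = pd.Series([58, 60, 57, 61, 59, 63, 58, 60, 62, 59,
                61, 58, 60, 64, 70, 59, 61, 60, 62, 58], dtype=float)

# Rolling 14-day personal baseline; min_periods softens the "cold start"
# by using whatever history exists during the first two weeks.
roll_mean = hr.rolling(window=14, min_periods=3).mean()
roll_std = hr.rolling(window=14, min_periods=3).std()
hr_zscore = (hr - roll_mean) / roll_std  # night 14's spike (70 bpm) stands out

# Class-imbalance weight for XGBoost: negatives / positives.
# With 77% "normal sleep" nights: 0.77 / 0.23 ≈ 3.35.
scale_pos_weight = 0.77 / 0.23
```

The z-score expresses each night relative to that user's own recent baseline rather than a population average, which is what lets a single threshold generalize across users.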

Key Findings: The Concept Drift Challenge

When testing models locked in a 2025 training state against unseen 2026 data, a significant baseline shift (Concept Drift) was observed.

Model                  Accuracy   Precision   Recall (Class 1)
Logistic Baseline      59.1%      29.0%       38.1%
Polynomial Logistic    64.2%      28.3%       23.3%
Random Forest          65.5%      27.3%       18.6%
XGBoost                63.5%      29.9%       28.6%

Insights:

  • XGBoost achieved the best balance of accuracy (63.5%) and precision (29.9%, the highest of the four models).
  • The simple Logistic Baseline outperformed the complex tree models in Recall on unseen future data, suggesting that rigid, complex models struggle to generalize when a user's environmental stress baselines shift fundamentally year-over-year.
  • Solution Implemented: A rolling retraining window was simulated within the pipeline, boosting XGBoost's Recall to 44% by allowing it to recalibrate to the new year's baselines.
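The rolling retraining simulated above can be sketched as a window-splitting helper. This is a minimal illustration of the scheme, not the notebook's actual implementation; window sizes are placeholders:

```python
def rolling_retrain_splits(n_samples, train_window, test_window):
    """Yield (train_idx, test_idx) pairs for rolling retraining: the model
    is refit on the most recent `train_window` samples before scoring each
    subsequent `test_window` block, so it can recalibrate to shifting
    baselines instead of staying locked in a single training state."""
    start = train_window
    while start + test_window <= n_samples:
        train_idx = list(range(start - train_window, start))
        test_idx = list(range(start, start + test_window))
        yield train_idx, test_idx
        start += test_window

# Example: 12 nights, retrain on the latest 6 before each 3-night block.
splits = list(rolling_retrain_splits(n_samples=12, train_window=6, test_window=3))
```

Each refit discards the oldest nights, which is what allows a 2025-trained model to absorb the 2026 baseline shift instead of scoring it with stale thresholds.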

How to Run Locally

  1. Clone the repository to your local machine.
  2. Install the required dependencies:
    pip install -r requirements.txt
  3. Run the mock data generator to create the synthetic 2025 and 2026 datasets:
    python 00_generate_mock_data.py
  4. Open and run sleep_latency_master_pipeline.ipynb from top to bottom to execute the end-to-end pipeline.

About

Predicting sleep latency from physiological data using XGBoost and Random Forest, featuring rolling baseline engineering and concept drift recalibration.
