Author: Ethan Daniel Lobo
An end-to-end Machine Learning architecture designed to predict high sleep latency (restless nights) using non-linear biological thresholds and behavioral data.
## 🔐 Data Confidentiality & Reproducibility Disclaimer
Due to data privacy restrictions and the handling of protected physiological metrics, the original wearable datasets used in this research cannot be published.
To ensure this Machine Learning pipeline remains fully reproducible for reviewers, a synthetic data generator (`00_generate_mock_data.py`) has been included. This script generates localized mock data with mathematically injected correlations that mimic the true biological signals (e.g., elevated Heart Rate Z-Scores correlating with high latency) and the 2026 Concept Drift observed in the live project. Running the provided notebook against the mock data will execute the full pipeline and yield metric trade-offs representative of the real-world findings.
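The details of `00_generate_mock_data.py` are not shown here, but the injection idea can be sketched in a few lines: sample physiological features, then drive the probability of a high-latency night through a logistic link so the HR Z-Score correlation (and a year-over-year drift offset) is guaranteed to exist in the mock data. All column names, coefficients, and the `drift` parameter below are illustrative assumptions, not the project's real schema.

```python
import numpy as np
import pandas as pd

def generate_mock_days(n_days: int, drift: float = 0.0, seed: int = 42) -> pd.DataFrame:
    """Generate synthetic nightly records with an injected HR-latency link.

    `drift` shifts the latency log-odds baseline to mimic year-over-year
    concept drift. Coefficients and column names are illustrative only.
    """
    rng = np.random.default_rng(seed)
    hr_z = rng.normal(0.0, 1.0, n_days)              # HR z-score vs. rolling baseline
    temp_vol = np.abs(rng.normal(0.3, 0.1, n_days))  # temperature volatility
    # Injected correlation: higher HR z-scores raise the log-odds of high latency.
    logits = 1.2 * hr_z + 0.8 * temp_vol - 1.0 + drift
    p_high = 1.0 / (1.0 + np.exp(-logits))
    high_latency = rng.binomial(1, p_high)
    return pd.DataFrame({"hr_zscore": hr_z,
                         "temp_volatility": temp_vol,
                         "high_latency": high_latency})

df_2025 = generate_mock_days(365, drift=0.0)
df_2026 = generate_mock_days(365, drift=0.7, seed=7)  # shifted baseline = concept drift
```

Because the label is generated *through* the features, any reasonable classifier trained on the 2025 frame will recover the planted signal, while the `drift` offset reproduces the 2026 baseline shift.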
This project processes raw wearable API data and transforms it into a predictive classification engine, contained entirely within `sleep_latency_master_pipeline.ipynb`.
- Data Engineering: Engineered rolling 14-day physiological baselines (Heart Rate Z-Scores, Temperature Volatility) to solve the wearable "cold start" problem.
- Exploratory Data Analysis: Demonstrated that standard linear correlation (-0.02) failed to capture non-linear biological thresholds driving sleep latency.
- The Classification Pivot: Transitioned the target variable from continuous Regression (predicting exact minutes) to binary Classification (Normal Sleep vs. High Latency) to account for human biological noise.
- Model Optimization: Prevented tree-based overfitting via hyperparameter tuning and used `scale_pos_weight` to counteract heavy class imbalance (77% normal sleep).
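The two mechanical pieces above — rolling 14-day baselines and the imbalance weight — can be sketched as follows. This is a minimal illustration, not the project's code: the `resting_hr` column, the `min_periods` choice for softening the cold start, and the demo data are all assumptions.

```python
import numpy as np
import pandas as pd

# Toy daily series standing in for wearable API output (illustrative values).
rng = np.random.default_rng(0)
df = pd.DataFrame({"resting_hr": rng.normal(60, 3, 60)})

# 14-day rolling baseline; min_periods < window softens the "cold start"
# by emitting a z-score once at least 7 days of history exist.
roll = df["resting_hr"].rolling(window=14, min_periods=7)
df["hr_zscore"] = (df["resting_hr"] - roll.mean()) / roll.std()

# Class-imbalance weight for XGBoost: negatives / positives.
# With ~77% normal sleep, scale_pos_weight ≈ 0.77 / 0.23 ≈ 3.3.
y = rng.binomial(1, 0.23, 60)
scale_pos_weight = (y == 0).sum() / max((y == 1).sum(), 1)
```

Note that this sketch includes the current day in its own baseline window; a stricter variant would `.shift(1)` the rolling statistics so each night is scored only against prior days.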
When models frozen at their 2025 training state were tested against unseen 2026 data, a significant baseline shift (Concept Drift) was observed.
| Model | Accuracy | Precision | Recall (Class 1) |
|---|---|---|---|
| 1. Logistic Baseline | 59.1% | 29.0% | 38.1% |
| 2. Polynomial Logistic | 64.2% | 28.3% | 23.3% |
| 3. Random Forest | 65.5% | 27.3% | 18.6% |
| 4. XGBoost | 63.5% | 29.9% | 28.6% |
Insights:
- XGBoost achieved the optimal balance of accuracy and precision.
- The simple Logistic Baseline outperformed the complex tree models in Recall on unseen future data, indicating that rigid, highly tuned models struggle to generalize when a user's environmental stress baselines shift fundamentally year-over-year.
- Solution Implemented: A rolling retraining window was simulated within the pipeline, boosting XGBoost's Recall to 44% by allowing it to recalibrate to the new year's baselines.
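A rolling retraining window is essentially walk-forward evaluation: refit on the most recent slice of history, score the next slice, slide forward, repeat. Below is a generic sketch of that loop, using `LogisticRegression` as a stand-in for the pipeline's XGBoost model; the function name, window sizes, and demo data are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def rolling_retrain(df, features, target, train_days=180, step=30):
    """Walk-forward loop: train on the most recent `train_days`, predict the
    next `step` days, then slide forward. Illustrative sketch only."""
    preds, truth = [], []
    start = train_days
    while start + step <= len(df):
        train = df.iloc[start - train_days:start]
        test = df.iloc[start:start + step]
        model = LogisticRegression().fit(train[features], train[target])
        preds.extend(model.predict(test[features]))
        truth.extend(test[target])
        start += step
    return np.array(preds), np.array(truth)

# Hypothetical usage on synthetic data (400 "days", one feature):
rng = np.random.default_rng(1)
x = rng.normal(0, 1, 400)
y = (rng.random(400) < 1 / (1 + np.exp(-x))).astype(int)
demo = pd.DataFrame({"hr_zscore": x, "high_latency": y})
preds, truth = rolling_retrain(demo, ["hr_zscore"], "high_latency")
```

Because each refit only ever sees recent history, the model continuously recalibrates to the newest baselines, which is what lets Recall recover after a year-over-year shift.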
- Clone the repository to your local machine.
- Install the required dependencies: `pip install -r requirements.txt`
- Run the mock data generator to create the synthetic 2025 and 2026 datasets: `python 00_generate_mock_data.py`
- Open and run `sleep_latency_master_pipeline.ipynb` from top to bottom to execute the end-to-end pipeline.