Fix temporal data leakage in data loaders#38
Merged
Raynergy-svg merged 5 commits intomainfrom Feb 13, 2026
Merged
Conversation
…ata leakage Co-authored-by: Raynergy-svg <82431565+Raynergy-svg@users.noreply.github.com>
Co-authored-by: Raynergy-svg <82431565+Raynergy-svg@users.noreply.github.com>
Co-authored-by: Raynergy-svg <82431565+Raynergy-svg@users.noreply.github.com>
Copilot
AI
changed the title
[WIP] Expand plan for cloud agent data integrity fixes
Fix temporal data leakage in data loaders
Feb 12, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Data loaders had multiple temporal leakage issues: feature selection used validation/test statistics, RF targets were forward-filled without valid future data, and no gap between train/val/test allowed autocorrelation leakage.
Changes
1. Temporal Split Gap Parameter
Added configurable gap between splits to prevent autocorrelation leakage:
Impact: Prevents temporal autocorrelation between splits. Use
gap=24for H1 data (1 day),gap=288for M5 (24 hours).2. Training-Only Feature Selection
Fixed
load_direction_data()to compute feature variance/correlation from training data only:Impact: Validation/test distributions no longer influence feature selection.
3. RF Target Handling
Removed forward-fill of invalid targets in
load_rf_data():Impact: Eliminates targets computed from forward-filled values.
4. Parameter Consistency
drawdown_horizon = 24inload_rf_data()- now respects function parametergapparameter to all data loaders (load_direction_data,load_xgboost_data,load_rf_data,load_ridge_data)gap=0maintains backward compatibilityExpected Impact
Backward Compatibility
All changes use default parameters that preserve existing behavior. No breaking changes.
Original prompt
Start implementation
The user has attached the following file paths as relevant context:
- .github/copilot-instructions.md
[Chronological Review: The conversation began with the user requesting an expansion of a plan for cloud agent implementation, focusing on data integrity fixes in a machine learning context. The user outlined specific tasks to be completed in the codebase, detailing modifications to functions and parameters in the `modular_data_loaders.py` file. The conversation has been centered around ensuring that the implementation is correct and comprehensive to avoid errors in the cloud agent's execution.][Intent Mapping: The user explicitly requested to "expand the plan to be more comprehensive for cloud agent not to implement wrongly," indicating a need for clarity and thoroughness in the outlined tasks. The user provided detailed steps for data integrity fixes, showing a clear intent to improve the robustness of the machine learning model.]
[Technical Inventory: The conversation involved the following technical components:
modular_data_loaders.py: A key file where multiple functions are modified.temporal_split(),load_direction_data(), andload_rf_data().[Code Archaeology: The user provided specific line numbers and functions within
modular_data_loaders.pythat require changes:temporal_split(): Needs a gap parameter.load_direction_data(): Feature selection to be moved to training-only data.drawdown_horizonparameter.[Progress Assessment: The user has outlined a comprehensive plan with multiple steps to improve data integrity. Each step is clearly defined, but no implementation has been executed yet as the focus is on planning and ensuring correctness.]
[Context Validation: All critical information for continuation is captured, including specific tasks, file references, and the rationale behind each change.]
[Recent Commands Analysis: The last agent command involved summarizing the conversation history, triggered by the user's request to expand the plan. No tool results were returned as this was a request for a summary rather than an execution of a specific tool.]
1. Conversation Overview: - Primary Objectives: The user requested to "expand the plan to be more comprehensive for cloud agent not to implement wrongly," focusing on data integrity fixes in the machine learning model. - Session Context: The conversation has revolved around outlining specific modifications to the codebase, particularly in `modular_data_loaders.py`, to ensure accurate implementation for the cloud agent. - User Intent Evolution: The user's needs have remained consistent, emphasizing the importance of a detailed and correct implementation plan.- Technical Foundation:
- Functions:
- Codebase Status:
- Purpose: Central to data loading and processing for machine learning.
- Current State: Requires multiple modifications as outlined by the user.
- Key Code Segments: Functions like
- Dependencies: Changes will affect how data is split and processed across various machine learning tasks.
- Problem Resolution:
- Issues Encountered: Potential inaccuracies in data processing and feature selection.
- Solutions Implemented: The user has proposed a detailed plan to address these issues, though no changes have been made yet.
- Debugging Context: The focus is on preventing future errors in the cloud agent's implementation.
- Lessons Learned: The importance of comprehensive planning in machine learning workflows.
- Progress Tracking:
- Completed Tasks: None yet, as the focus is on planning.
- Partially Complete Work: The user has outlined a detailed plan for data integrity fixes.
- Validated Outcomes: No outcomes have been validated yet as implementation is pending.
- Active Work State:
- Current Focus: The user was working on expanding the plan for data integrity fixes in the machine learning model.
- Recent Context: The user provided a detailed breakdown of tasks to be completed in
- Working Code: Specific functions and parameters in
- Immediate Context: The user was preparing a comprehensive plan to ensure correct implementation for the cloud agent.
- Recent Operations:
- Last Agent Commands: The last command was to summarize the conversation history.
- Tool Results Summary: No specific ...
modular_data_loaders.py: The primary file for modifications, focusing on data loading and processing functions.temporal_split(),load_direction_data(),load_rf_data(), and others mentioned for specific changes.modular_data_loaders.py:temporal_split()andload_direction_data()are critical for the planned changes.modular_data_loaders.py.modular_data_loaders.pywere discussed for modification.Created from VS Code.
✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.