[BUG] Memory Scalability Issue with Large Number of Exogenous Variables and Large-Scale Dataloader #1357

@strakehyr

What happened + What you expected to happen

When using the NeuralForecast.fit() method with the large-scale dataloader (by passing a list of Parquet files to the df argument), GPU memory usage appears to scale inefficiently with the number of exogenous variables. I could train a model with these specs back on neuralforecast 1.7. With the updates since then I had to switch to the large-scale dataloading method, and even so the problem arises only when exogenous variables are enabled. Note that I can train TSMixer with this many exogenous variables in other libraries, so there is very likely a memory-handling issue here.

Training a standard model (e.g., TSMixerx with default hyperparameters) on a large dataset with many time steps but no exogenous variables works as expected, successfully streaming data with a low memory footprint. However, adding a large number of exogenous variables (~80-90) to hist_exog_list and futr_exog_list causes a CUDA out of memory error on a high-VRAM GPU (48 GB), even when the model's architecture and batch_size remain unchanged.

This suggests that the windowing process for exogenous variables may not be as memory-efficient as it is for the target variable y when using the file-based dataloader, potentially loading the entire feature set for a series into memory at once.
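For scale, here is my own back-of-envelope arithmetic (using the numbers from the reproduction script below, not measurements of the actual dataloader). A single windowed batch should be tiny; materializing every overlapping window of a series, by contrast, would easily exhaust 48 GB:

input_size = 96          # lookback window
h = 24                   # forecast horizon
n_exog = 90              # exogenous features
batch_size = 32
n_timesteps = 800_000    # steps per series
bytes_per_float = 4      # float32

window_len = input_size + h
per_window = window_len * (1 + n_exog) * bytes_per_float   # y plus exog channels
per_batch = batch_size * per_window
print(f"one batch of windows: ~{per_batch / 1e6:.1f} MB")  # ~1.4 MB

all_windows = (n_timesteps - window_len + 1) * per_window
print(f"all windows of one series: ~{all_windows / 1e9:.0f} GB")  # ~35 GB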

Versions / Dependencies

neuralforecast = 3.0.2
pytorch_lightning = 2.5.2
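
For completeness, these were read from the installed distributions:

from importlib.metadata import version
print(version("neuralforecast"))       # 3.0.2
print(version("pytorch-lightning"))    # 2.5.2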

Reproduction script


import numpy as np
import pandas as pd
import tempfile
import os
from neuralforecast import NeuralForecast
from neuralforecast.models import TSMixerx

# 1. Define dataset parameters
n_series = 10
n_timesteps = 800_000
n_exog = 90 # The high number of features that causes the issue
h = 24      # Forecast horizon
input_size = 96 # Lookback window

# 2. Create a large, synthetic DataFrame
print("Creating synthetic DataFrame...")
ids = [f'series_{i}' for i in range(n_series)]
df_list = []
for unique_id in ids:
    # Create timestamps
    start_date = pd.to_datetime('2020-01-01')
    timestamps = pd.date_range(start=start_date, periods=n_timesteps, freq='15min')
    
    # Create data
    y = np.random.randn(n_timesteps)
    
    # Create DataFrame for this series
    series_df = pd.DataFrame({
        'ds': timestamps,
        'unique_id': unique_id,
        'y': y
    })
    
    # Add exogenous features
    for j in range(n_exog):
        series_df[f'exog_{j}'] = np.random.randn(n_timesteps)
        
    df_list.append(series_df)

Y_df = pd.concat(df_list).reset_index(drop=True)

# Ensure data types are memory-efficient
numeric_cols = Y_df.select_dtypes(include=np.number).columns
Y_df[numeric_cols] = Y_df[numeric_cols].astype(np.float32)
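# Rough footprint at float32: 8M rows x 91 numeric columns ~= 2.9 GB in RAM
# (my estimate; the float64 intermediates built above peak higher).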
print("DataFrame created. Shape:", Y_df.shape)


# 3. Save as partitioned Parquet files for large-scale loading
print("Saving data to partitioned Parquet files...")
tmpdir = tempfile.TemporaryDirectory()
Y_df.to_parquet(tmpdir.name, partition_cols=['unique_id'], index=False)
files_list = [f"{tmpdir.name}/{d}" for d in os.listdir(tmpdir.name) if not d.startswith('.')]
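# files_list now holds one Hive-style partition directory per series,
# e.g. f"{tmpdir.name}/unique_id=series_0" (pandas writes partition_cols
# as directories of Parquet files).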
static_df = pd.DataFrame({'unique_id': ids})


# 4. Define model and NeuralForecast object
exog_list = [f'exog_{j}' for j in range(n_exog)]

# Using standard, non-intensive hyperparameters
model = TSMixerx(
    h=h,
    input_size=input_size,
    n_series=n_series,
    hist_exog_list=exog_list, # This is the key part that causes the issue
    # futr_exog_list=exog_list, # Can also be used here
    revin=True,
    scaler_type='standard',
    batch_size=32 # Using a reasonable batch size
)

nf = NeuralForecast(models=[model], freq='15min')

# 5. Attempt to fit the model
# This step is expected to cause a CUDA OOM error due to memory handling of exogenous variables
try:
    print("Attempting to fit the model...")
    nf.fit(
        df=files_list,
        static_df=static_df,
        val_size=h * 10 # A small validation set
    )
except Exception as e:
    print(f"\nTraining failed as expected. Error: {e}")
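
# --- Optional diagnostic (my own sketch, not part of the failing path) ---
# Comparing peak GPU memory with and without the exogenous list makes the
# scaling visible. Uses PyTorch's built-in CUDA memory counters; run this
# before tmpdir.cleanup() so the Parquet files still exist.
import torch

def peak_memory_gb(hist_exog):
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    probe_model = TSMixerx(h=h, input_size=input_size, n_series=n_series,
                           hist_exog_list=hist_exog, scaler_type='standard',
                           batch_size=32, max_steps=10)  # a few steps suffice
    nf_probe = NeuralForecast(models=[probe_model], freq='15min')
    nf_probe.fit(df=files_list, static_df=static_df)
    return torch.cuda.max_memory_allocated() / 1e9

print("peak GB, no exog:", peak_memory_gb(None))
print("peak GB, 90 exog:", peak_memory_gb(exog_list))  # expected to OOM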

# Clean up
tmpdir.cleanup()

Issue Severity

None
