StreamingDataset intermittently fails due to lack of index.json #337
Comments
Hi! Thanks for your contribution, great first issue!
Thank you for sharing this with us. Could we kindly ask you to share a full reproducible example?
@Borda What's the best way to reproduce and share -- so we can pick up this thread and continue debugging, given we have confidential datasets? I'm having the same issue, streaming data down from R2 using the S3 API. I'm running on 4 8xH100 nodes. Each node has its own independent
One thing to note -- there is kind of a workaround: if I go on a node that failed to download an index.json and retry the resume, it gets past it. It's a bit like whack-a-mole, since I'm using a CombinedDataset with K datasets, so I'm iteratively retrying until every dataset's index.json has been downloaded. My gut is that there is some weird race condition being triggered by multi-node DDP training.
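Here's roughly what I mean, as a pre-flight sketch rather than a fix: before launching training, check from the node that every dataset's index.json is actually reachable, so a missing one can be retried up front instead of discovered mid-startup. The endpoint URL, bucket, and prefixes below are placeholders, and it assumes plain boto3 access to the same S3-compatible (R2) endpoint the StreamingDatasets use.

```python
import boto3
from botocore.exceptions import ClientError

# Placeholders -- substitute your own R2/S3 endpoint, bucket, and dataset prefixes.
ENDPOINT_URL = "https://<account-id>.r2.cloudflarestorage.com"
BUCKET = "my-training-data"
DATASET_PREFIXES = ["datasets/ds_0", "datasets/ds_1", "datasets/ds_2"]

s3 = boto3.client("s3", endpoint_url=ENDPOINT_URL)


def missing_indexes(prefixes):
    """Return the index.json keys that cannot be found in the bucket."""
    missing = []
    for prefix in prefixes:
        key = f"{prefix}/index.json"
        try:
            s3.head_object(Bucket=BUCKET, Key=key)  # cheap existence check
        except ClientError:
            missing.append(key)
    return missing


if __name__ == "__main__":
    bad = missing_indexes(DATASET_PREFIXES)
    if bad:
        raise SystemExit(f"index.json missing or unreachable for: {bad}")
    print("All index.json files reachable -- safe to launch training.")
```

This only checks that the remote files exist; it doesn't address whatever race is happening in the local cache download itself.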
I'm going to proceed with the workaround for now -- but I'm not sure if this race condition only affects index.json files, or if it also persists for the individual chunks (the chunk files themselves).
We have lift off 🚀. After using the workaround I mentioned above to iteratively retry training resumes, I was able to get all the index.json files to download and training to start. There's still some sort of bug / race condition to tackle -- but glad that there's at least a temporary workaround :)
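In case it helps anyone else, the "iteratively retry" part can be scripted. This is only a sketch -- the training command, retry count, and sleep are placeholders for whatever your launcher actually looks like:

```python
import subprocess
import sys
import time

# Hypothetical training command -- replace with your real launcher and arguments.
TRAIN_CMD = [sys.executable, "train.py", "--resume"]
MAX_ATTEMPTS = 5

for attempt in range(1, MAX_ATTEMPTS + 1):
    print(f"Launching training (attempt {attempt}/{MAX_ATTEMPTS})")
    result = subprocess.run(TRAIN_CMD)
    if result.returncode == 0:
        break  # training got past startup and exited cleanly
    # Give the object store / other nodes a moment before retrying the resume.
    time.sleep(30)
else:
    raise SystemExit("Training still failing after repeated resume attempts.")
```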
🐛 Bug
My training job intermittently fails with an error complaining that a dataset's index.json file cannot be found.
I see this occasionally when attempting to train in standard single-node DDP setups, but now that I've started using dual-node DDP I'm seeing it much more often (at least one node will fail ~80% of the time -- I want to say this is much higher than would be the case if the node-level failures were independent). My usual solution has been to just re-run the job, but this is now very impractical.
Code sample
My setup is very standard AFAIK. I have a `LightningDataModule` with `StreamingDataLoader`s on top of `StreamingDataset`s pointing to a collection of files in an S3 bucket.
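A minimal sketch of that shape (not my exact code -- the bucket paths, batch size, and dataset layout are placeholders, and it assumes the top-level litdata imports):

```python
import lightning as L
from litdata import StreamingDataset, StreamingDataLoader


class MyDataModule(L.LightningDataModule):
    """Hypothetical DataModule: StreamingDataLoaders over StreamingDatasets in S3."""

    def __init__(self, train_dir="s3://my-bucket/train",
                 val_dir="s3://my-bucket/val", batch_size=32):
        super().__init__()
        self.train_dir = train_dir
        self.val_dir = val_dir
        self.batch_size = batch_size

    def setup(self, stage=None):
        # Each StreamingDataset points at an optimized dataset (index.json + chunks).
        self.train_ds = StreamingDataset(self.train_dir, shuffle=True)
        self.val_ds = StreamingDataset(self.val_dir)

    def train_dataloader(self):
        return StreamingDataLoader(self.train_ds, batch_size=self.batch_size)

    def val_dataloader(self):
        return StreamingDataLoader(self.val_ds, batch_size=self.batch_size)
```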
Environment
- How installed (`conda`, `pip`, source): `pip`