StreamingDataset intermittently fails due to lack of index.json #337
Comments
Hi! Thanks for your contribution, great first issue!
Thank you for sharing this with us. Could we kindly ask you to share a full reproducible example?
@Borda What's the best way to reproduce and share -- so we can pick up this thread and continue debugging, given we have confidential datasets? I'm having the same issue, streaming data down from R2 using the S3 API. I'm running on 4 8xH100 nodes. Each node has its own independent
One thing to note -- there is kind of a workaround: if I go on a node that failed to download an index.json and retry the resume, it gets past it. It's a bit like whack-a-mole, since I'm using a CombinedDataset with K datasets, so I'm iteratively retrying until every dataset's index.json has been downloaded. My gut is that there is some weird race condition being triggered by multi-node DDP training.
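Here's roughly what I mean, as a pre-flight sketch rather than a fix: before launching training, check from the node that every dataset's index.json is actually reachable, so a missing one can be retried up front instead of discovered mid-startup. The endpoint URL, bucket, and prefixes below are placeholders, and it assumes plain boto3 access to the same S3-compatible (R2) endpoint the StreamingDatasets use.

```python
import boto3
from botocore.exceptions import ClientError

# Placeholders -- substitute your own R2/S3 endpoint, bucket, and dataset prefixes.
ENDPOINT_URL = "https://<account-id>.r2.cloudflarestorage.com"
BUCKET = "my-training-data"
DATASET_PREFIXES = ["datasets/ds_0", "datasets/ds_1", "datasets/ds_2"]

s3 = boto3.client("s3", endpoint_url=ENDPOINT_URL)


def missing_indexes(prefixes):
    """Return the index.json keys that cannot be found in the bucket."""
    missing = []
    for prefix in prefixes:
        key = f"{prefix}/index.json"
        try:
            s3.head_object(Bucket=BUCKET, Key=key)  # cheap existence check
        except ClientError:
            missing.append(key)
    return missing


if __name__ == "__main__":
    bad = missing_indexes(DATASET_PREFIXES)
    if bad:
        raise SystemExit(f"index.json missing or unreachable for: {bad}")
    print("All index.json files reachable -- safe to launch training.")
```

This only checks that the remote files exist; it doesn't address whatever race is happening in the local cache download itself.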
I'm going to proceed with the workaround for now -- but I'm not sure if this race condition only affects index.json files, or if it also persists for the individual chunks (the chunk files themselves).
We have lift off 🚀. After using the workaround I mentioned above to iteratively retry training resumes, I was able to get all the index.json files to download and training to start. There's still some sort of bug / race condition to tackle -- but glad that there's at least a temporary workaround :)
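In case it helps anyone else, the "iteratively retry" part can be scripted. This is only a sketch -- the training command, retry count, and sleep are placeholders for whatever your launcher actually looks like:

```python
import subprocess
import sys
import time

# Hypothetical training command -- replace with your real launcher and arguments.
TRAIN_CMD = [sys.executable, "train.py", "--resume"]
MAX_ATTEMPTS = 5

for attempt in range(1, MAX_ATTEMPTS + 1):
    print(f"Launching training (attempt {attempt}/{MAX_ATTEMPTS})")
    result = subprocess.run(TRAIN_CMD)
    if result.returncode == 0:
        break  # training got past startup and exited cleanly
    # Give the object store / other nodes a moment before retrying the resume.
    time.sleep(30)
else:
    raise SystemExit("Training still failing after repeated resume attempts.")
```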
🐛 Bug
My training job intermittently fails with an error complaining that a dataset's index.json file cannot be found.
I see this occasionally when attempting to train in standard single-node DDP setups, but now that I've started using dual-node DDP I'm seeing it much more often (at least one node will fail ~80% of the time -- I want to say this is much higher than would be the case if the node-level failures were independent). My usual solution has been to just re-run the job, but this is now very impractical.
Code sample
My setup is very standard AFAIK. I have a `LightningDataModule` with `StreamingDataLoader`s on top of `StreamingDataset`s pointing to a collection of files in an S3 bucket.
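A minimal sketch of that shape (not my exact code -- the bucket paths, batch size, and dataset layout are placeholders, and it assumes the top-level litdata imports):

```python
import lightning as L
from litdata import StreamingDataset, StreamingDataLoader


class MyDataModule(L.LightningDataModule):
    """Hypothetical DataModule: StreamingDataLoaders over StreamingDatasets in S3."""

    def __init__(self, train_dir="s3://my-bucket/train",
                 val_dir="s3://my-bucket/val", batch_size=32):
        super().__init__()
        self.train_dir = train_dir
        self.val_dir = val_dir
        self.batch_size = batch_size

    def setup(self, stage=None):
        # Each StreamingDataset points at an optimized dataset (index.json + chunks).
        self.train_ds = StreamingDataset(self.train_dir, shuffle=True)
        self.val_ds = StreamingDataset(self.val_dir)

    def train_dataloader(self):
        return StreamingDataLoader(self.train_ds, batch_size=self.batch_size)

    def val_dataloader(self):
        return StreamingDataLoader(self.val_ds, batch_size=self.batch_size)
```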
Environment
- How installed (`conda`, `pip`, source): `pip`