🐛 Bug

Tokenizing a dataset for LLM pre-training using `optimize` with more than one worker leads to a segmentation fault when the dataset is loaded with a `StreamingDataLoader` and `batch_size > 1`. I think this might be a follow-on to #366, in that the `StreamingDataLoader` errors are linked to the inputs to `optimize`.

To Reproduce

Steps to reproduce the behavior:

1. Run `optimize` with `item_loader=TokensLoader()` and `num_workers>=2`.
2. Load the result with a `StreamingDataLoader` using `batch_size>1`.
3. The `StreamingDataLoader` throws a segmentation fault.

Interestingly, streaming tokens works if:

- `num_workers>1` in `optimize` and `batch_size==1` when loading, or
- `num_workers<2` in `optimize` and `batch_size>=1` when loading.

MWE to give the error below:
Code sample
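A minimal sketch of the two steps above, assuming litdata's `optimize`/`StreamingDataset`/`StreamingDataLoader` API; the tokenizer is a stand-in that emits random token ids, and the output path, chunk size, and `block_size` are illustrative assumptions rather than values from the original report:

```python
import numpy as np

from litdata import StreamingDataLoader, StreamingDataset, optimize
from litdata.streaming.item_loader import TokensLoader


def tokenize(idx):
    # Stand-in for a real tokenizer: yield one fixed-length array of fake token ids.
    yield np.random.randint(0, 50_000, size=(1024,), dtype=np.int64)


if __name__ == "__main__":
    # Step 1: tokenize with more than one worker (num_workers >= 2 triggers the bug).
    optimize(
        fn=tokenize,
        inputs=list(range(100)),
        output_dir="optimized_tokens",
        chunk_size=2049 * 512,  # tokens per chunk (illustrative value)
        num_workers=2,
        item_loader=TokensLoader(),
    )

    # Step 2: stream the tokens back with batch_size > 1 -> segmentation fault.
    dataset = StreamingDataset(
        "optimized_tokens",
        item_loader=TokensLoader(block_size=2048),
    )
    loader = StreamingDataLoader(dataset, batch_size=4)
    for batch in loader:
        pass
```

Per the observations above, the same script runs cleanly if either `num_workers` is dropped below 2 in `optimize` or `batch_size` is set to 1 in the loader.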
The error printed after the segmentation fault seems to be related to multiprocessing errors.
Expected behavior
Datasets created with multiple `optimize` workers should stream correctly at all batch sizes.
Additional context
Install method (`conda`, `pip`, source): `pip`