I am trying to train a ResNet model on the ImageNet-1000 dataset stored in S3. Training starts fine and usually completes a few epochs. Then, at a random point in some epoch, PyTorch fails while trying to open or read an arbitrary S3 object (image file):
Traceback (most recent call last):
File "/opt/pytorch/test_s3.py", line 134, in <module>
for i, (images, labels) in enumerate(tqdm(train_dataloader, desc=f'Epoch {epoch + 1}')):
File "/opt/pytorch/venv/lib64/python3.11/site-packages/tqdm/std.py", line 1181, in __iter__
for obj in iterable:
File "/opt/pytorch/venv/lib64/python3.11/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
data = self._next_data()
^^^^^^^^^^^^^^^^^
File "/opt/pytorch/venv/lib64/python3.11/site-packages/torch/utils/data/dataloader.py", line 1324, in _next_data
return self._process_data(data)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/pytorch/venv/lib64/python3.11/site-packages/torch/utils/data/dataloader.py", line 1370, in _process_data
data.reraise()
File "/opt/pytorch/venv/lib64/python3.11/site-packages/torch/_utils.py", line 706, in reraise
raise exception
PIL.UnidentifiedImageError: Caught UnidentifiedImageError in DataLoader worker process 1.
Original Traceback (most recent call last):
File "/opt/pytorch/venv/lib64/python3.11/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
data = fetcher.fetch(index) # type: ignore[possibly-undefined]
^^^^^^^^^^^^^^^^^^^^
File "/opt/pytorch/venv/lib64/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/pytorch/venv/lib64/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
~~~~~~~~~~~~^^^^^
File "/opt/pytorch/venv/lib64/python3.11/site-packages/s3torchconnector/s3map_dataset.py", line 144, in __getitem__
return self._transform(self._get_object(i))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/pytorch/test_s3.py", line 44, in transform_image
img = Image.open(object)
^^^^^^^^^^^^^^^^^^
File "/opt/pytorch/venv/lib64/python3.11/site-packages/PIL/Image.py", line 3298, in open
raise UnidentifiedImageError(msg)
PIL.UnidentifiedImageError: cannot identify image file <s3torchconnector.s3reader.S3Reader object at 0x7f1f4abe9150>
I caught the exception and printed the state of the S3Reader object: its _size and _position attributes are both 0. The failure is random and can affect any object during any epoch.
I have only observed this behavior when num_workers for the train data loader is set to a value greater than 1. With num_workers set to 0 or 1, I have not seen the issue.
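One way to tell an empty read apart from a genuine decode failure is to pull the object's bytes into memory before handing them to PIL. The helper below is only a diagnostic sketch, not a fix: `buffer_object` and `EmptyObjectError` are hypothetical names, and the attributes it reports (`bucket`, `key`, `_size`, `_position`) are looked up defensively because they may differ between reader types and library versions.

```python
import io


class EmptyObjectError(RuntimeError):
    """Raised when a reader yields zero bytes for an object."""


def buffer_object(reader):
    """Read the whole object into memory and fail loudly if it is empty.

    `reader` is any file-like object with a read() method, for example
    the object passed to the dataset transform.
    """
    data = reader.read()
    if not data:
        # These attribute names are assumptions; missing ones are reported
        # as "<unavailable>" instead of raising AttributeError.
        details = {
            name: getattr(reader, name, "<unavailable>")
            for name in ("bucket", "key", "_size", "_position")
        }
        raise EmptyObjectError(f"read() returned 0 bytes: {details}")
    # A seekable in-memory buffer that an image decoder can consume.
    return io.BytesIO(data)
```

In the transform from the traceback, this would be used as `img = Image.open(buffer_object(object))`. If the error then changes from UnidentifiedImageError to EmptyObjectError, the problem lies in the read rather than in image decoding.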
We have not encountered this problem before. Could you provide a minimal script that reproduces it, to help our investigation? We are looking into it in the meantime.
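A reproduction script might benefit from recording which indices fail and whether a second attempt on the same index succeeds. The wrapper below is a hypothetical sketch (the name `RetryProbe` and its parameters are not part of any library); it works with any map-style dataset that implements `__getitem__` and `__len__`.

```python
import logging
import time

log = logging.getLogger(__name__)


class RetryProbe:
    """Wrap a map-style dataset, retry failed lookups, and record them."""

    def __init__(self, dataset, retries=2, delay=0.0, errors=(Exception,)):
        self.dataset = dataset
        self.retries = retries    # extra attempts after the first failure
        self.delay = delay        # seconds to wait between attempts
        self.errors = errors      # exception types treated as retryable
        # With DataLoader worker processes, each worker holds its own copy
        # of this list, so rely on the log output in that case.
        self.failures = []        # entries: (index, attempt, repr(error))

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        for attempt in range(self.retries + 1):
            try:
                return self.dataset[index]
            except self.errors as exc:
                self.failures.append((index, attempt, repr(exc)))
                log.warning("index %s failed on attempt %s: %r",
                            index, attempt, exc)
                if attempt == self.retries:
                    raise  # give up and surface the original error
                time.sleep(self.delay)
```

Passing `RetryProbe(dataset)` to the data loader in place of the dataset would show whether the failures are transient (a retry succeeds) or persistent for a given object.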
Hi @aqibahmad, just a quick heads-up: I'm still investigating your issue. Setting up the dataset in S3 has been taking some time (there are many files to upload), and I expect to have some results within the next few days.
Hi again @aqibahmad, unfortunately I am unable to reproduce your issue as-is. Could you provide more details:
What type of instance (machine/host) are you running this code on?
Have you checked whether you are running out of memory?
Do you experience the same kind of problem with other datasets? Also, if you reduce the size of the dataset (say, by restricting it to a single folder, e.g., s3://<bucket>/train/<folder>/), do you still experience the issue?
Can you reproduce the same problem with S3IterableDataset?
Thanks,
-Matthieu
EDIT [16/01/25]: if you are running the script on an EC2 instance, could you also clarify whether you authenticated with an IAM role attached to the instance or with the AWS environment variables?
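A quick way to answer the credentials question is to check which of the standard AWS credential environment variables are set in the training process. The sketch below only inspects the environment; it does not query the instance metadata service, so it cannot confirm that an instance role is in use. `credential_hints` is a hypothetical helper name.

```python
import os


def credential_hints(env=None):
    """Describe which standard AWS credential env vars are set."""
    env = os.environ if env is None else env
    hints = []
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        hints.append("access key pair set in environment")
        if env.get("AWS_SESSION_TOKEN"):
            hints.append("session token set (temporary credentials)")
    if env.get("AWS_PROFILE"):
        hints.append("named profile selected via AWS_PROFILE")
    if not hints:
        # Nothing in the environment: credentials would have to come from
        # another source, such as a config file or an attached IAM role.
        hints.append("no credential env vars set")
    return hints


if __name__ == "__main__":
    for hint in credential_hints():
        print(hint)
```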
s3torchconnector version
s3torchconnector-1.3.0
s3torchconnectorclient version
s3torchconnectorclient-1.3.0
AWS Region
No response
Describe the running environment
OS: AlmaLinux 8.10
Python: Python 3.11.9