PIL UnidentifiedImageError training with higher 'num_workers' #293

Open
aqibahmad opened this issue Jan 7, 2025 · 4 comments
Labels: bug Something isn't working

@aqibahmad
s3torchconnector version

s3torchconnector-1.3.0

s3torchconnectorclient version

s3torchconnectorclient-1.3.0

AWS Region

No response

Describe the running environment

OS: AlmaLinux 8.10
Python: Python 3.11.9

What happened?

I am trying to train a ResNet model on the ImageNet-1000 dataset stored in S3. Training begins fine and usually completes a few epochs, but at a random point during some epoch, PyTorch errors out while trying to open/read a particular S3 object (image file):

Traceback (most recent call last):
  File "/opt/pytorch/test_s3.py", line 134, in <module>
    for i, (images, labels) in enumerate(tqdm(train_dataloader, desc=f'Epoch {epoch + 1}')):
  File "/opt/pytorch/venv/lib64/python3.11/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/opt/pytorch/venv/lib64/python3.11/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/venv/lib64/python3.11/site-packages/torch/utils/data/dataloader.py", line 1324, in _next_data
    return self._process_data(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/venv/lib64/python3.11/site-packages/torch/utils/data/dataloader.py", line 1370, in _process_data
    data.reraise()
  File "/opt/pytorch/venv/lib64/python3.11/site-packages/torch/_utils.py", line 706, in reraise
    raise exception
PIL.UnidentifiedImageError: Caught UnidentifiedImageError in DataLoader worker process 1.
Original Traceback (most recent call last):
  File "/opt/pytorch/venv/lib64/python3.11/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
           ^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/venv/lib64/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/venv/lib64/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
            ~~~~~~~~~~~~^^^^^
  File "/opt/pytorch/venv/lib64/python3.11/site-packages/s3torchconnector/s3map_dataset.py", line 144, in __getitem__
    return self._transform(self._get_object(i))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/test_s3.py", line 44, in transform_image
    img = Image.open(object)
          ^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/venv/lib64/python3.11/site-packages/PIL/Image.py", line 3298, in open
    raise UnidentifiedImageError(msg)
PIL.UnidentifiedImageError: cannot identify image file <s3torchconnector.s3reader.S3Reader object at 0x7f1f4abe9150>

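For reference, the failing data-loading path boils down to roughly the following sketch (the bucket/prefix, region, batch size, and preprocessing here are illustrative placeholders rather than my exact script):

from PIL import Image
from torch.utils.data import DataLoader
from torchvision import transforms
from s3torchconnector import S3MapDataset

# Placeholder values for illustration only
DATASET_URI = "s3://imagenet/train/"
REGION = "eu-west-1"

preprocess = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ToTensor(),
])

def transform_image(obj):
    # obj is an s3torchconnector S3Reader (file-like); PIL reads the image bytes from it
    img = Image.open(obj)
    return preprocess(img.convert("RGB"))

train_dataset = S3MapDataset.from_prefix(DATASET_URI, region=REGION, transform=transform_image)
# The error only appears when num_workers > 1
train_dataloader = DataLoader(train_dataset, batch_size=256, num_workers=8)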
I tried catching the exception to print the state of the S3Reader object; its _size and _position attributes are both 0. This is random and can happen with any object during any epoch.

exception caught! 
Object Key: train/n03743016/n03743016_1737.JPEG 
 
{'state_after': {'_bucket': 'imagenet', '_buffer': <_io.BytesIO object at 0x7f1f4abe3290>,
 '_get_object_info': functools.partial(<function _identity at 0x7f1e9bf8d8a0>,
 PyObjectInfo { key: "train/n03743016/n03743016_1737.JPEG", etag: "\"2f6e8e68b8ccef44c7f341149e8$ 43e5\"",
 size: 122938, last_modified: 1734842473, storage_class: Some("STANDARD"), restore_status: None }),
 '_get_stream': functools.partial(<bound method S3Client._get_object_stream of <s3torchconnector._s3client._s3client.S3Client object at 0x7f1e9c8af0d0>>,
 'imagenet', 'train/n03743016/n03743016_1737.JPEG'),
 '_key': 'train/n03743016/n03743016_1737.JPEG', '_position': 0, '_size': 0,
 '_stream': <s3torchconnectorclient._mountpoint_s3_client.GetObjectStream object at 0x7f1ea423b690>}
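That dump came from instrumentation along these lines (a simplified sketch; the vars() call approximates what I actually printed):

from PIL import Image, UnidentifiedImageError

def transform_image(obj):
    try:
        img = Image.open(obj)
    except UnidentifiedImageError:
        # Dump the S3Reader's internal state at the moment of failure
        print("exception caught!")
        print("Object Key:", obj.key)
        print({"state_after": vars(obj)})
        raise
    return preprocess(img.convert("RGB"))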

I have only observed this behavior when num_workers for the train DataLoader is set to a value greater than 1. With num_workers at 0 or 1, I don't see the issue.

Relevant log output

No response

aqibahmad added the bug label Jan 7, 2025
@matthieu-d4r (Contributor) commented Jan 8, 2025

Hi @aqibahmad, thanks for opening this issue.

Since we have not encountered this problem before, would you be able to provide or craft a simple repro script to help our investigation? We are looking into it in the meantime.

@aqibahmad (Author)

Hi @matthieu-d4r, thanks for looking into this.

I've provided a script at https://github.com/aqibahmad/pytorch-s3/blob/main/train_imagenet_s3.py

The dataset was downloaded from https://www.image-net.org/challenges/LSVRC/2012/index.php

@matthieu-d4r (Contributor)

Hi @aqibahmad, just a quick heads-up: I'm still investigating your issue; setting up the dataset in S3 has been taking some time (there are many files to upload), and I expect to have results in the next few days.

matthieu-d4r self-assigned this Jan 13, 2025
@matthieu-d4r (Contributor) commented Jan 14, 2025

Hi again @aqibahmad, unfortunately I am unable to reproduce your issue as-is. Could you provide more details on the following:

  • what type of instance (machine/host) are you running this code on?
  • have you checked that you are not running out of memory?
  • do you experience the same kind of problem with other datasets? Also, if you reduce the dataset size (say, restrict it to a single folder, e.g., s3://<bucket>/train/<folder>/), do you still experience the issue?
  • can you reproduce the same problem with S3IterableDataset? (see the sketch after this list)

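For that last point, a minimal swap of the dataset class would look something like this (a sketch only; the prefix, region, and your existing transform_image are assumed from your script):

from s3torchconnector import S3IterableDataset
from torch.utils.data import DataLoader

# Same placeholder values as in your script
DATASET_URI = "s3://imagenet/train/"
REGION = "eu-west-1"

train_dataset = S3IterableDataset.from_prefix(DATASET_URI, region=REGION, transform=transform_image)
# Note: with an IterableDataset, each DataLoader worker iterates the prefix independently;
# for this test we only care whether the UnidentifiedImageError still occurs.
train_dataloader = DataLoader(train_dataset, batch_size=256, num_workers=8)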
Thanks,
-Matthieu


EDIT [16/01/25]: could you also clarify (if you are running the script on an EC2 instance) whether you authenticated with an IAM role attached to the instance or with the AWS environment variables?
