PIL UnidentifiedImageError training with higher 'num_workers' #293

Open
aqibahmad opened this issue Jan 7, 2025 · 4 comments
Labels: bug Something isn't working

@aqibahmad
s3torchconnector version

s3torchconnector-1.3.0

s3torchconnectorclient version

s3torchconnectorclient-1.3.0

AWS Region

No response

Describe the running environment

OS: AlmaLinux 8.10
Python: Python 3.11.9

What happened?

I am trying to train a ResNet model on the ImageNet-1000 dataset stored in S3. Training begins fine and usually completes a few epochs, but at a random point during some epoch, PyTorch errors out while trying to open/read a particular S3 object (image file):

Traceback (most recent call last):
  File "/opt/pytorch/test_s3.py", line 134, in <module>
    for i, (images, labels) in enumerate(tqdm(train_dataloader, desc=f'Epoch {epoch + 1}')):
  File "/opt/pytorch/venv/lib64/python3.11/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/opt/pytorch/venv/lib64/python3.11/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/venv/lib64/python3.11/site-packages/torch/utils/data/dataloader.py", line 1324, in _next_data
    return self._process_data(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/venv/lib64/python3.11/site-packages/torch/utils/data/dataloader.py", line 1370, in _process_data
    data.reraise()
  File "/opt/pytorch/venv/lib64/python3.11/site-packages/torch/_utils.py", line 706, in reraise
    raise exception
PIL.UnidentifiedImageError: Caught UnidentifiedImageError in DataLoader worker process 1.
Original Traceback (most recent call last):
  File "/opt/pytorch/venv/lib64/python3.11/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
           ^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/venv/lib64/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/venv/lib64/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
            ~~~~~~~~~~~~^^^^^
  File "/opt/pytorch/venv/lib64/python3.11/site-packages/s3torchconnector/s3map_dataset.py", line 144, in __getitem__
    return self._transform(self._get_object(i))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/test_s3.py", line 44, in transform_image
    img = Image.open(object)
          ^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/venv/lib64/python3.11/site-packages/PIL/Image.py", line 3298, in open
    raise UnidentifiedImageError(msg)
PIL.UnidentifiedImageError: cannot identify image file <s3torchconnector.s3reader.S3Reader object at 0x7f1f4abe9150>

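For reference, the failing data-loading path boils down to roughly the following sketch (the bucket/prefix, region, batch size, and preprocessing here are illustrative placeholders rather than my exact script):

from PIL import Image
from torch.utils.data import DataLoader
from torchvision import transforms
from s3torchconnector import S3MapDataset

# Placeholder values for illustration only
DATASET_URI = "s3://imagenet/train/"
REGION = "eu-west-1"

preprocess = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ToTensor(),
])

def transform_image(obj):
    # obj is an s3torchconnector S3Reader (file-like); PIL reads the image bytes from it
    img = Image.open(obj)
    return preprocess(img.convert("RGB"))

train_dataset = S3MapDataset.from_prefix(DATASET_URI, region=REGION, transform=transform_image)
# The error only appears when num_workers > 1
train_dataloader = DataLoader(train_dataset, batch_size=256, num_workers=8)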
I tried catching the exception to print the state of the S3Reader object; its _size and _position attributes are both 0. This is random and can happen with any object during any epoch.

exception caught! 
Object Key: train/n03743016/n03743016_1737.JPEG 
 
{'state_after': {'_bucket': 'imagenet', '_buffer': <_io.BytesIO object at 0x7f1f4abe3290>,
 '_get_object_info': functools.partial(<function _identity at 0x7f1e9bf8d8a0>,
 PyObjectInfo { key: "train/n03743016/n03743016_1737.JPEG", etag: "\"2f6e8e68b8ccef44c7f341149e8$ 43e5\"",
 size: 122938, last_modified: 1734842473, storage_class: Some("STANDARD"), restore_status: None }),
 '_get_stream': functools.partial(<bound method S3Client._get_object_stream of <s3torchconnector._s3client._s3client.S3Client object at 0x7f1e9c8af0d0>>,
 'imagenet', 'train/n03743016/n03743016_1737.JPEG'),
 '_key': 'train/n03743016/n03743016_1737.JPEG', '_position': 0, '_size': 0,
 '_stream': <s3torchconnectorclient._mountpoint_s3_client.GetObjectStream object at 0x7f1ea423b690>}
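That dump came from instrumentation along these lines (a simplified sketch; the vars() call approximates what I actually printed):

from PIL import Image, UnidentifiedImageError

def transform_image(obj):
    try:
        img = Image.open(obj)
    except UnidentifiedImageError:
        # Dump the S3Reader's internal state at the moment of failure
        print("exception caught!")
        print("Object Key:", obj.key)
        print({"state_after": vars(obj)})
        raise
    return preprocess(img.convert("RGB"))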

I have only observed this behavior when num_workers for the train DataLoader is set to a value greater than 1. With num_workers at 0 or 1, I don't see the issue.

Relevant log output

No response

aqibahmad added the bug label Jan 7, 2025
@matthieu-d4r (Contributor) commented Jan 8, 2025

Hi @aqibahmad, thanks for opening this issue.

Since we have not encountered this problem before, would you be able to provide or craft a simple repro script to help our investigation? We are looking into it in the meantime.

@aqibahmad (Author)

Hi @matthieu-d4r, thanks for looking into this.

I've provided a script at https://github.com/aqibahmad/pytorch-s3/blob/main/train_imagenet_s3.py

The dataset was downloaded from https://www.image-net.org/challenges/LSVRC/2012/index.php

@matthieu-d4r (Contributor)

Hi @aqibahmad, just a quick heads-up: I'm still investigating your issue; setting up the dataset in S3 has been taking some time (there are many files to upload), and I expect to have results in the next few days.

matthieu-d4r self-assigned this Jan 13, 2025
@matthieu-d4r (Contributor) commented Jan 14, 2025

Hi again @aqibahmad, unfortunately I am unable to reproduce your issue as-is. Could you provide more details on the following:

  • what type of instance (machine/host) are you running this code on?
  • have you checked that you are not running out of memory?
  • do you experience the same kind of problem with other datasets? Also, if you reduce the dataset size (say, restrict it to a single folder, e.g., s3://<bucket>/train/<folder>/), do you still experience the issue?
  • can you reproduce the same problem with S3IterableDataset? (see the sketch after this list)

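For that last point, a minimal swap of the dataset class would look something like this (a sketch only; the prefix, region, and your existing transform_image are assumed from your script):

from s3torchconnector import S3IterableDataset
from torch.utils.data import DataLoader

# Same placeholder values as in your script
DATASET_URI = "s3://imagenet/train/"
REGION = "eu-west-1"

train_dataset = S3IterableDataset.from_prefix(DATASET_URI, region=REGION, transform=transform_image)
# Note: with an IterableDataset, each DataLoader worker iterates the prefix independently;
# for this test we only care whether the UnidentifiedImageError still occurs.
train_dataloader = DataLoader(train_dataset, batch_size=256, num_workers=8)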
Thanks,
-Matthieu


EDIT [16/01/25]: could you also clarify (if you are running the script on an EC2 instance) whether you authenticated with an IAM role attached to the instance or with the AWS environment variables?
