Description
Pillow cannot save an image in the CMYK colour mode as a PNG file. When a multimodal dataset containing a CMYK image is preprocessed, the data preprocessing fails with the error below. I hit this while preprocessing the dataset huggingface.co/datasets/microsoft/cats_vs_dogs, at image 14907 (which can be viewed on Hugging Face with the query SELECT * FROM train LIMIT 1 OFFSET 14907).
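The underlying Pillow behaviour can be reproduced without the dataset or the preprocessing script. The snippet below is a minimal sketch, assuming only that Pillow is installed; the 8x8 blank image is a stand-in for the real CMYK image in the dataset.

```python
import io

from PIL import Image

# A small blank CMYK image stands in for the dataset image (assumption).
cmyk_image = Image.new("CMYK", (8, 8))

# Saving a CMYK image directly as PNG is rejected by Pillow's PNG writer.
error_message = None
try:
    cmyk_image.save(io.BytesIO(), format="PNG")
except OSError as exc:
    error_message = str(exc)
print(error_message)

# Converting to RGB first lets the same image be written as PNG.
rgb_buffer = io.BytesIO()
cmyk_image.convert("RGB").save(rgb_buffer, format="PNG")
print(len(rgb_buffer.getvalue()) > 0)
```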
Full error stack
(vision_tut_env) bash-4.4$ python ~/modelzoo/src/cerebras/modelzoo/data_preparation/data_preprocessing/preprocess_data.py --config cat_v_dog_preproc.yaml
INFO:cerebras.modelzoo.data_preparation.data_preprocessing.data_preprocessor:
Writing data to ./preprocessed/cats_dogs.
INFO:cerebras.modelzoo.data_preparation.data_preprocessing.data_preprocessor:Initializing finetuning mode
INFO:cerebras.modelzoo.data_preparation.data_preprocessing.data_preprocessor:
Chunk size : 1.00 MB.
Input directory already contains file(s). Do you want to delete the folder and download the dataset again? (yes/no): yes
Downloading data: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 330M/330M [00:02<00:00, 161MB/s]
Downloading data: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 391M/391M [00:02<00:00, 185MB/s]
Generating train split: 100%|█████████████████████████████████████████████████████████████████████████████| 23410/23410 [00:01<00:00, 18355.72 examples/s]
Map (num_proc=2): 64%|██████████████████████████████████████████████████████▊ | 14907/23410 [10:02<05:43, 24.75 examples/s]
INFO:/modelzoo/src/cerebras/modelzoo/data_preparation/data_preprocessing/preprocess_data.py:Received signal 15. Preparing to exit gracefully...
multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/vision_tutorial/vision_tut_env/lib64/python3.8/site-packages/PIL/PngImagePlugin.py", line 1299, in _save
rawmode, mode = _OUTMODES[mode]
KeyError: 'CMYK'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/vision_tutorial/vision_tut_env/lib64/python3.8/site-packages/multiprocess/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/vision_tutorial/vision_tut_env/lib64/python3.8/site-packages/datasets/utils/py_utils.py", line 678, in _write_generator_to_queue
for i, result in enumerate(func(**kwargs)):
File "/vision_tutorial/vision_tut_env/lib64/python3.8/site-packages/datasets/arrow_dataset.py", line 3517, in _map_single
example = apply_function_on_filtered_inputs(example, i, offset=offset)
File "/vision_tutorial/vision_tut_env/lib64/python3.8/site-packages/datasets/arrow_dataset.py", line 3416, in apply_function_on_filtered_inputs
processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
File "/modelzoo/src/cerebras/modelzoo/data_preparation/data_preprocessing/utils.py", line 1994, in save_image_locally
image_data.save(image_path)
File "/vision_tutorial/vision_tut_env/lib64/python3.8/site-packages/PIL/Image.py", line 2431, in save
save_handler(self, fp, filename)
File "/vision_tutorial/vision_tut_env/lib64/python3.8/site-packages/PIL/PngImagePlugin.py", line 1302, in _save
raise OSError(msg) from e
OSError: cannot write mode CMYK as PNG
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/modelzoo/src/cerebras/modelzoo/data_preparation/data_preprocessing/preprocess_data.py", line 115, in <module>
main()
File "/modelzoo/src/cerebras/modelzoo/data_preparation/data_preprocessing/preprocess_data.py", line 95, in main
preprocess_data(params)
File "/modelzoo/src/cerebras/modelzoo/data_preparation/data_preprocessing/preprocess_data.py", line 102, in preprocess_data
dataset_processor = DataPreprocessor(updated_params, exit_event)
File "/modelzoo/src/cerebras/modelzoo/data_preparation/data_preprocessing/data_preprocessor.py", line 102, in __init__
self.process_params()
File "/modelzoo/src/cerebras/modelzoo/data_preparation/data_preprocessing/data_preprocessor.py", line 114, in process_params
self.handle_input_files()
File "/modelzoo/src/cerebras/modelzoo/data_preparation/data_preprocessing/data_preprocessor.py", line 160, in handle_input_files
input_dir = load_dataset_wrapper(input_data_params, **kwargs)
File "/modelzoo/src/cerebras/modelzoo/data_preparation/data_preprocessing/utils.py", line 2054, in load_dataset_wrapper
dataset = dataset.map(
File "/vision_tutorial/vision_tut_env/lib64/python3.8/site-packages/datasets/arrow_dataset.py", line 602, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/vision_tutorial/vision_tut_env/lib64/python3.8/site-packages/datasets/arrow_dataset.py", line 567, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/vision_tutorial/vision_tut_env/lib64/python3.8/site-packages/datasets/arrow_dataset.py", line 3248, in map
for rank, done, content in iflatmap_unordered(
File "/vision_tutorial/vision_tut_env/lib64/python3.8/site-packages/datasets/utils/py_utils.py", line 718, in iflatmap_unordered
[async_result.get(timeout=0.05) for async_result in async_results]
File "/vision_tutorial/vision_tut_env/lib64/python3.8/site-packages/datasets/utils/py_utils.py", line 718, in <listcomp>
[async_result.get(timeout=0.05) for async_result in async_results]
File "/vision_tutorial/vision_tut_env/lib64/python3.8/site-packages/multiprocess/pool.py", line 768, in get
raise self._value
OSError: cannot write mode CMYK as PNG
I have been able to work around this issue by adjusting modelzoo/src/cerebras/modelzoo/data_preparation/data_preprocessing/utils.py at line 1992 to add the following conversion:
image_path = os.path.join(image_dir, f"{idx}.png")
if isinstance(image_data, Image.Image):
    # PNG cannot store CMYK, so convert to RGB before saving.
    if image_data.mode == "CMYK":
        image_data = image_data.convert("RGB")
    image_data.save(image_path, format="PNG")
    example[image_key] = f"{idx}.png"
elif isinstance(image_data, str):
    example[image_key] = image_data
else:
    raise ValueError(
        f" Image data format - {type(image_data)} is not supported"
    )
If this change is adopted, a similar change would be needed in the branch of the same function that processes lists.
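One way to avoid duplicating the conversion in both branches would be to factor it into a small helper. The following is only a sketch: `to_png_safe` and `save_images_as_png` are hypothetical names, the `{idx}_{pos}.png` naming for list entries is an assumption, and none of it reflects how the list branch in utils.py is actually written.

```python
import os
from typing import List, Union

from PIL import Image


def to_png_safe(image: Image.Image) -> Image.Image:
    """Return an image that Pillow can write as PNG (hypothetical helper).

    Only CMYK is handled here, matching the workaround above.
    """
    return image.convert("RGB") if image.mode == "CMYK" else image


def save_images_as_png(
    images: Union[Image.Image, List[Image.Image]], image_dir: str, idx: int
) -> Union[str, List[str]]:
    """Save one image or a list of images as PNG and return the file name(s)."""
    single = isinstance(images, Image.Image)
    items = [images] if single else list(images)
    names = []
    for pos, image in enumerate(items):
        # File naming for list entries is an assumption for illustration.
        name = f"{idx}.png" if single else f"{idx}_{pos}.png"
        to_png_safe(image).save(os.path.join(image_dir, name), format="PNG")
        names.append(name)
    return names[0] if single else names
```

Both the single-image branch and the list branch could then call the same helper, so the CMYK check lives in one place.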
Please let me know if I am doing something incorrectly, or if there is a better way to get around this issue.