
Data preprocessing fails for multimodal images that have an image mode of "CMYK" #65

@BenjaminCarpenter480


Pillow cannot save an image with the colour mode CMYK as a PNG file. When preprocessing a multimodal dataset that contains a CMYK image, data preprocessing fails with the error below. I hit this while preprocessing the dataset huggingface.co/datasets/microsoft/cats_vs_dogs, at image 14907 (viewable on Hugging Face with the query SELECT * FROM train LIMIT 1 OFFSET 14907).
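The failure can be reproduced outside the preprocessing pipeline. The sketch below is a minimal illustration using only Pillow; the helper name `can_save_as_png` is hypothetical and not part of modelzoo. It uses a synthetic CMYK image rather than the actual dataset sample.

```python
import io

from PIL import Image


def can_save_as_png(image):
    """Return True if Pillow can encode `image` as PNG, False otherwise."""
    try:
        # Write to an in-memory buffer so nothing touches the disk.
        image.save(io.BytesIO(), format="PNG")
        return True
    except OSError:
        # Pillow raises OSError ("cannot write mode CMYK as PNG")
        # for modes the PNG writer does not support.
        return False


# Synthetic stand-in for the offending dataset image.
cmyk_image = Image.new("CMYK", (4, 4))

print(can_save_as_png(cmyk_image))                 # CMYK is rejected
print(can_save_as_png(cmyk_image.convert("RGB")))  # RGB is accepted
```

Converting to RGB before saving is the same step the workaround further down applies inside `save_image_locally`.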

Full error stack
(vision_tut_env) bash-4.4$ python ~/modelzoo/src/cerebras/modelzoo/data_preparation/data_preprocessing/preprocess_data.py --config cat_v_dog_preproc.yaml
INFO:cerebras.modelzoo.data_preparation.data_preprocessing.data_preprocessor:
Writing data to ./preprocessed/cats_dogs.

INFO:cerebras.modelzoo.data_preparation.data_preprocessing.data_preprocessor:Initializing finetuning mode
INFO:cerebras.modelzoo.data_preparation.data_preprocessing.data_preprocessor:
Chunk size : 1.00 MB.

Input directory already contains file(s). Do you want to delete the folder and download the dataset again? (yes/no): yes
Downloading data: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 330M/330M [00:02<00:00, 161MB/s]
Downloading data: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 391M/391M [00:02<00:00, 185MB/s]
Generating train split: 100%|█████████████████████████████████████████████████████████████████████████████| 23410/23410 [00:01<00:00, 18355.72 examples/s]
Map (num_proc=2):  64%|██████████████████████████████████████████████████████▊                               | 14907/23410 [10:02<05:43, 24.75 examples/s]
INFO:/modelzoo/src/cerebras/modelzoo/data_preparation/data_preprocessing/preprocess_data.py:Received signal 15. Preparing to exit gracefully...
multiprocess.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/vision_tutorial/vision_tut_env/lib64/python3.8/site-packages/PIL/PngImagePlugin.py", line 1299, in _save
    rawmode, mode = _OUTMODES[mode]
KeyError: 'CMYK'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/vision_tutorial/vision_tut_env/lib64/python3.8/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/vision_tutorial/vision_tut_env/lib64/python3.8/site-packages/datasets/utils/py_utils.py", line 678, in _write_generator_to_queue
    for i, result in enumerate(func(**kwargs)):
  File "/vision_tutorial/vision_tut_env/lib64/python3.8/site-packages/datasets/arrow_dataset.py", line 3517, in _map_single
    example = apply_function_on_filtered_inputs(example, i, offset=offset)
  File "/vision_tutorial/vision_tut_env/lib64/python3.8/site-packages/datasets/arrow_dataset.py", line 3416, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/modelzoo/src/cerebras/modelzoo/data_preparation/data_preprocessing/utils.py", line 1994, in save_image_locally
    image_data.save(image_path)
  File "/vision_tutorial/vision_tut_env/lib64/python3.8/site-packages/PIL/Image.py", line 2431, in save
    save_handler(self, fp, filename)
  File "/vision_tutorial/vision_tut_env/lib64/python3.8/site-packages/PIL/PngImagePlugin.py", line 1302, in _save
    raise OSError(msg) from e
OSError: cannot write mode CMYK as PNG
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/modelzoo/src/cerebras/modelzoo/data_preparation/data_preprocessing/preprocess_data.py", line 115, in <module>
    main()
  File "/modelzoo/src/cerebras/modelzoo/data_preparation/data_preprocessing/preprocess_data.py", line 95, in main
    preprocess_data(params)
  File "/modelzoo/src/cerebras/modelzoo/data_preparation/data_preprocessing/preprocess_data.py", line 102, in preprocess_data
    dataset_processor = DataPreprocessor(updated_params, exit_event)
  File "/modelzoo/src/cerebras/modelzoo/data_preparation/data_preprocessing/data_preprocessor.py", line 102, in __init__
    self.process_params()
  File "/modelzoo/src/cerebras/modelzoo/data_preparation/data_preprocessing/data_preprocessor.py", line 114, in process_params
    self.handle_input_files()
  File "/modelzoo/src/cerebras/modelzoo/data_preparation/data_preprocessing/data_preprocessor.py", line 160, in handle_input_files
    input_dir = load_dataset_wrapper(input_data_params, **kwargs)
  File "/modelzoo/src/cerebras/modelzoo/data_preparation/data_preprocessing/utils.py", line 2054, in load_dataset_wrapper
    dataset = dataset.map(
  File "/vision_tutorial/vision_tut_env/lib64/python3.8/site-packages/datasets/arrow_dataset.py", line 602, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/vision_tutorial/vision_tut_env/lib64/python3.8/site-packages/datasets/arrow_dataset.py", line 567, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/vision_tutorial/vision_tut_env/lib64/python3.8/site-packages/datasets/arrow_dataset.py", line 3248, in map
    for rank, done, content in iflatmap_unordered(
  File "/vision_tutorial/vision_tut_env/lib64/python3.8/site-packages/datasets/utils/py_utils.py", line 718, in iflatmap_unordered
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "/vision_tutorial/vision_tut_env/lib64/python3.8/site-packages/datasets/utils/py_utils.py", line 718, in <listcomp>
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "/vision_tutorial/vision_tut_env/lib64/python3.8/site-packages/multiprocess/pool.py", line 768, in get
    raise self._value
OSError: cannot write mode CMYK as PNG

I have been able to work around this issue by editing modelzoo/src/cerebras/modelzoo/data_preparation/data_preprocessing/utils.py at line 1992 to add the following conversion:

image_path = os.path.join(image_dir, f"{idx}.png")
if isinstance(image_data, Image.Image):
	if image_data.mode == "CMYK":
		image_data = image_data.convert("RGB")
	image_data.save(image_path, format="PNG")
	example[image_key] = f"{idx}.png"
elif isinstance(image_data, str):
	example[image_key] = image_data
else:
	raise ValueError(
		f" Image data format - {type(image_data)} is not supported"
	)

If this change is adopted, a similar change would be needed in the branch that processes lists in the same function.
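One way to cover both branches is to put the conversion into a small helper and call it wherever an image is saved. The following is only a sketch under assumptions: `PNG_SAFE_MODES`, `to_png_safe`, and `save_images_as_png` are hypothetical names, the `{idx}_{i}.png` naming for list entries is made up for illustration, and the mode set is a conservative guess rather than the full list Pillow's PNG writer accepts. It does not reproduce the actual structure of `save_image_locally`.

```python
import os

from PIL import Image

# Assumption: a conservative subset of the modes Pillow can write as PNG.
PNG_SAFE_MODES = {"1", "L", "LA", "I", "I;16", "P", "RGB", "RGBA"}


def to_png_safe(image):
    """Return `image` unchanged if its mode is PNG-safe, else an RGB copy."""
    if image.mode in PNG_SAFE_MODES:
        return image
    # Covers CMYK; other unusual modes may need their own handling.
    return image.convert("RGB")


def save_images_as_png(image_data, image_dir, idx):
    """Save one image or a list of images; return the file name(s) written."""
    if isinstance(image_data, Image.Image):
        name = f"{idx}.png"
        to_png_safe(image_data).save(os.path.join(image_dir, name), format="PNG")
        return name
    if isinstance(image_data, list):
        names = []
        for i, image in enumerate(image_data):
            # Hypothetical naming scheme for list entries.
            name = f"{idx}_{i}.png"
            to_png_safe(image).save(os.path.join(image_dir, name), format="PNG")
            names.append(name)
        return names
    raise ValueError(f"Image data format - {type(image_data)} is not supported")
```

Keeping the mode check in one place means the single-image and list branches cannot drift apart if further unsupported modes turn up later.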

Please let me know if I am doing something incorrectly, or if there is a better way to get around this issue.
