lionstory opened this issue on May 22, 2022 · 2 comments · May be fixed by #4484
Labels
bug 🦗 Something isn't working
External Pull requests and issues from people who do not regularly contribute to modin
P3 Very minor bugs, or features we can hopefully add some day.
System information
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): CentOS 7.6.1810
Modin version (modin.__version__): 0.12.1
Python version: 3.7.6
Ray version: 1.12.0 (Ray cluster: 1 head node + 2 worker nodes)
Code we can use to reproduce:

import os
os.environ["MODIN_ENGINE"] = "ray"
import ray
ray.init(address="auto")
import modin.pandas as mpd
from collections import Counter
import pandas as pd
import time

# modin_csv_parquet_perf is the @ray.remote task shown later in this issue.

if __name__ == '__main__':
    print('''This cluster consists of
    {} nodes in total
    {} CPU resources in total
    '''.format(len(ray.nodes()), ray.cluster_resources()['CPU']))
    CSV_FILE = '/home/term/wanlu/500w50f'
    object_ids = [modin_csv_parquet_perf.remote(CSV_FILE) for _ in range(1)]
    ip_addresses = ray.get(object_ids)
    print('Tasks executed:')
    for ip_address, num_tasks in Counter(ip_addresses).items():
        print('    {} tasks on {}'.format(num_tasks, ip_address))
Describe the problem
modin_df.to_parquet writes the dataframe (5 million records) to parquet files that are automatically partitioned across the 3 Ray nodes. But when I call mpd.read_parquet, I can only read the parquet records (2,499,994) that are on the head Ray node where the script was run. How can I get all 5 million records from the 3 nodes? I don't see any documentation about these details. Many thanks.
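The mismatch can be pictured with a toy, pure-Python sketch (the directory names and the per-worker row counts are made up for illustration; only the 2,499,994 and 5,000,000 figures come from the report above): each node writes its partitions to its own local disk, so a read performed on the head node only sees that node's files.

```python
import os
import tempfile

# Simulate three Ray nodes, each holding only its own share of the
# 5,000,000 rows on its local disk (the worker split below is illustrative).
rows_per_node = {"head": 2_499_994, "worker1": 1_250_003, "worker2": 1_250_003}

root = tempfile.mkdtemp()
for node, rows in rows_per_node.items():
    node_dir = os.path.join(root, node)  # stands in for that node's local path
    os.makedirs(node_dir)
    with open(os.path.join(node_dir, "part0.txt"), "w") as f:
        f.write(str(rows))

def rows_visible_from(node: str) -> int:
    # A reader running on `node` can only open files on that node's disk.
    with open(os.path.join(root, node, "part0.txt")) as f:
        return int(f.read())

print(rows_visible_from("head"))    # 2499994, not 5000000
print(sum(rows_per_node.values()))  # all 5000000 rows do exist, spread over 3 disks
```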
Source code / logs
[term@dev-ctb-xs-196-65 wanlu]$ python3 testmodinpandas.py
This cluster consists of
3 nodes in total
24.0 CPU resources in total
Hi Leon! Thank you so much for opening this issue! When working in a distributed setting, you want to use a distributed filesystem, such as AWS S3, or an NFS, rather than specifying a local file path, so that all of your files end up in the same place, and a distributed read will work. There's not really a good way to read a file split across multiple devices, and we'll be putting in a PR soon to warn users about using local file paths in a distributed setting! Hope that helps!
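As a sketch of the kind of check the upcoming PR could add (the helper name and the scheme list here are hypothetical, not Modin's actual API), one could warn whenever a multi-node write targets a path that has no shared-filesystem scheme:

```python
import warnings
from urllib.parse import urlparse

# Hypothetical allow-list of shared/remote filesystem schemes; a real
# implementation might consult fsspec instead of a hard-coded set.
REMOTE_SCHEMES = {"s3", "s3a", "gs", "gcs", "hdfs", "abfs", "http", "https"}

def check_output_path(path: str, num_nodes: int) -> bool:
    """Return True if `path` looks safe for a write from `num_nodes` nodes."""
    scheme = urlparse(path).scheme
    if num_nodes > 1 and scheme not in REMOTE_SCHEMES:
        warnings.warn(
            f"{path!r} looks like a node-local path; on a {num_nodes}-node "
            "cluster each node writes its partitions to its own disk, and a "
            "later read will only see the files on one node."
        )
        return False
    return True

check_output_path("/home/term/wanlu/500w50f", num_nodes=3)  # warns, returns False
check_output_path("s3://my-bucket/500w50f", num_nodes=3)    # returns True
```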
RehanSD added a commit to RehanSD/modin that referenced this issue on May 23, 2022
RehanSD, thanks a lot for the reply.
vnlitvinov added the P3 (Very minor bugs, or features we can hopefully add some day.) label on Sep 6, 2022
mvashishtha changed the title from "how to read all parquet partition files that wrote by modin df.to_parquet" to "BUG: users shouldn't be able to specify a local path when writing output on a cluster" on Oct 12, 2022
anmyachev added the External (Pull requests and issues from people who do not regularly contribute to modin) label on Apr 19, 2023
The @ray.remote task referenced in the reproduction script above:

@ray.remote
def modin_csv_parquet_perf(csv_file_prefix: str):
    # modin read csv
    start_time = time.time()
    modin_df = mpd.read_csv(csv_file_prefix + '.csv')
    print('===>Time cost(s) - modin read_csv: ' + str(time.time() - start_time))
    print(f'Modin csv DF len = {len(modin_df)}')
    # (the to_parquet / read_parquet timing code is elided in the original report)
(modin_csv_parquet_perf pid=2961) ===>Time cost(s) - modin read_csv: 12.558054447174072
(modin_csv_parquet_perf pid=2961) Modin csv DF len = 5000000
(modin_csv_parquet_perf pid=2961) ===>Time cost(s) - modin to_parquet: 8.699442386627197
Tasks executed:
1 tasks on None
(modin_csv_parquet_perf pid=2961) ===>Time cost(s) - modin read_parquet: 6.788694381713867
(modin_csv_parquet_perf pid=2961) Modin parquet DF len = 2499994
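As an aside on the "1 tasks on None" line above: the modin_csv_parquet_perf task never returns a value, so ray.get gives back [None] and the Counter tallies None instead of a node IP (one possible fix is ending the task with return ray.util.get_node_ip_address()). A minimal stdlib illustration, no Ray needed:

```python
from collections import Counter

def task_without_return():
    # Stands in for a remote task whose body has no return statement.
    pass

ip_addresses = [task_without_return() for _ in range(1)]  # -> [None]
for ip_address, num_tasks in Counter(ip_addresses).items():
    print('    {} tasks on {}'.format(num_tasks, ip_address))  # 1 tasks on None
```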