lionstory opened this issue on May 22, 2022 · 2 comments · May be fixed by #4484
Labels
bug 🦗 Something isn't working
External Pull requests and issues from people who do not regularly contribute to modin
P3 Very minor bugs, or features we can hopefully add some day.
System information
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): CentOS 7.6.1810
Modin version (modin.__version__): 0.12.1
Python version: 3.7.6
Ray version: 1.12.0 (Ray cluster: 1 head node + 2 worker nodes)
Code we can use to reproduce:

import os
os.environ["MODIN_ENGINE"] = "ray"
import ray
ray.init(address="auto")
import modin.pandas as mpd
from collections import Counter
import pandas as pd
import time

# modin_csv_parquet_perf is the @ray.remote task shown later in this issue.

if __name__ == '__main__':
    print('''This cluster consists of
    {} nodes in total
    {} CPU resources in total
    '''.format(len(ray.nodes()), ray.cluster_resources()['CPU']))
    CSV_FILE = '/home/term/wanlu/500w50f'
    object_ids = [modin_csv_parquet_perf.remote(CSV_FILE) for _ in range(1)]
    ip_addresses = ray.get(object_ids)
    print('Tasks executed:')
    for ip_address, num_tasks in Counter(ip_addresses).items():
        print('    {} tasks on {}'.format(num_tasks, ip_address))
Describe the problem
modin_df.to_parquet writes the dataframe (5 million records) to parquet files that are automatically partitioned across the 3 Ray nodes. But when I call mpd.read_parquet, I can only read the parquet records (2,499,994) that are on the head Ray node where the script was run. How can I get all 5 million records from the 3 nodes? I don't see any documentation about these details. Many thanks.
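The mismatch can be pictured with a toy, pure-Python sketch (the directory names and the per-worker row counts are made up for illustration; only the 2,499,994 and 5,000,000 figures come from the report above): each node writes its partitions to its own local disk, so a read performed on the head node only sees that node's files.

```python
import os
import tempfile

# Simulate three Ray nodes, each holding only its own share of the
# 5,000,000 rows on its local disk (the worker split below is illustrative).
rows_per_node = {"head": 2_499_994, "worker1": 1_250_003, "worker2": 1_250_003}

root = tempfile.mkdtemp()
for node, rows in rows_per_node.items():
    node_dir = os.path.join(root, node)  # stands in for that node's local path
    os.makedirs(node_dir)
    with open(os.path.join(node_dir, "part0.txt"), "w") as f:
        f.write(str(rows))

def rows_visible_from(node: str) -> int:
    # A reader running on `node` can only open files on that node's disk.
    with open(os.path.join(root, node, "part0.txt")) as f:
        return int(f.read())

print(rows_visible_from("head"))    # 2499994, not 5000000
print(sum(rows_per_node.values()))  # all 5000000 rows do exist, spread over 3 disks
```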
Source code / logs
[term@dev-ctb-xs-196-65 wanlu]$ python3 testmodinpandas.py
This cluster consists of
3 nodes in total
24.0 CPU resources in total
Hi Leon! Thank you so much for opening this issue! When working in a distributed setting, you want to use a distributed filesystem, such as AWS S3, or an NFS, rather than specifying a local file path, so that all of your files end up in the same place, and a distributed read will work. There's not really a good way to read a file split across multiple devices, and we'll be putting in a PR soon to warn users about using local file paths in a distributed setting! Hope that helps!
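As a sketch of the kind of check the upcoming PR could add (the helper name and the scheme list here are hypothetical, not Modin's actual API), one could warn whenever a multi-node write targets a path that has no shared-filesystem scheme:

```python
import warnings
from urllib.parse import urlparse

# Hypothetical allow-list of shared/remote filesystem schemes; a real
# implementation might consult fsspec instead of a hard-coded set.
REMOTE_SCHEMES = {"s3", "s3a", "gs", "gcs", "hdfs", "abfs", "http", "https"}

def check_output_path(path: str, num_nodes: int) -> bool:
    """Return True if `path` looks safe for a write from `num_nodes` nodes."""
    scheme = urlparse(path).scheme
    if num_nodes > 1 and scheme not in REMOTE_SCHEMES:
        warnings.warn(
            f"{path!r} looks like a node-local path; on a {num_nodes}-node "
            "cluster each node writes its partitions to its own disk, and a "
            "later read will only see the files on one node."
        )
        return False
    return True

check_output_path("/home/term/wanlu/500w50f", num_nodes=3)  # warns, returns False
check_output_path("s3://my-bucket/500w50f", num_nodes=3)    # returns True
```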
RehanSD added a commit to RehanSD/modin that referenced this issue on May 23, 2022
RehanSD, thanks a lot for the reply.
vnlitvinov added the P3 (Very minor bugs, or features we can hopefully add some day.) label on Sep 6, 2022
mvashishtha changed the title from "how to read all parquet partition files that wrote by modin df.to_parquet" to "BUG: users shouldn't be able to specify a local path when writing output on a cluster" on Oct 12, 2022
anmyachev added the External (Pull requests and issues from people who do not regularly contribute to modin) label on Apr 19, 2023
The @ray.remote task referenced in the reproduction script above:

@ray.remote
def modin_csv_parquet_perf(csv_file_prefix: str):
    # modin read csv
    start_time = time.time()
    modin_df = mpd.read_csv(csv_file_prefix + '.csv')
    print('===>Time cost(s) - modin read_csv: ' + str(time.time() - start_time))
    print(f'Modin csv DF len = {len(modin_df)}')
    # (the to_parquet / read_parquet timing code is elided in the original report)
(modin_csv_parquet_perf pid=2961) ===>Time cost(s) - modin read_csv: 12.558054447174072
(modin_csv_parquet_perf pid=2961) Modin csv DF len = 5000000
(modin_csv_parquet_perf pid=2961) ===>Time cost(s) - modin to_parquet: 8.699442386627197
Tasks executed:
1 tasks on None
(modin_csv_parquet_perf pid=2961) ===>Time cost(s) - modin read_parquet: 6.788694381713867
(modin_csv_parquet_perf pid=2961) Modin parquet DF len = 2499994
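As an aside on the "1 tasks on None" line above: the modin_csv_parquet_perf task never returns a value, so ray.get gives back [None] and the Counter tallies None instead of a node IP (one possible fix is ending the task with return ray.util.get_node_ip_address()). A minimal stdlib illustration, no Ray needed:

```python
from collections import Counter

def task_without_return():
    # Stands in for a remote task whose body has no return statement.
    pass

ip_addresses = [task_without_return() for _ in range(1)]  # -> [None]
for ip_address, num_tasks in Counter(ip_addresses).items():
    print('    {} tasks on {}'.format(num_tasks, ip_address))  # 1 tasks on None
```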