FEAT-#6990: Implement lazy execution for the Ray virtual partitions. #6991

AndreyPavlenko · 2024-03-01T20:32:34Z

What do these changes do?

first commit message and PR title follow format outlined here

NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.
passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
signed commit with git commit -s
Resolves Implement lazy execution for the Ray virtual partitions. #6990
tests added and passing
module layout described at docs/development/architecture.rst is up-to-date

modin/core/execution/ray/implementations/pandas_on_ray/partitioning/virtual_partition.py

… partitions.

anmyachev

Judging by the annotations, we need to write a lot more tests to cover most of the changes.

anmyachev · 2024-05-06T10:31:42Z

modin/core/execution/ray/common/__init__.py

 from .utils import initialize_ray

 __all__ = [
    "initialize_ray",
    "RayWrapper",
    "MaterializationHook",
    "SignalActor",
+    "RayObjectRefTypes",


This item has been deleted

anmyachev · 2024-05-06T10:33:24Z

modin/core/execution/ray/common/engine_wrapper.py

@@ -214,7 +214,7 @@ def wait(cls, obj_ids, num_returns=None):
        num_returns : int, optional
        """
        if not isinstance(obj_ids, Sequence):
-            obj_ids = list(obj_ids)
+            obj_ids = list(obj_ids) if isinstance(obj_ids, Iterable) else [obj_ids]


Can be deleted.
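
For reference, the changed line normalizes the argument of `wait` roughly as sketched below. This is a standalone illustration, not Modin code: `normalize_obj_ids` is a hypothetical helper name, and plain Python objects stand in for Ray object references.

```python
from collections.abc import Iterable, Sequence


def normalize_obj_ids(obj_ids):
    """Return ``obj_ids`` as a sequence (hypothetical helper, not a Modin API)."""
    if not isinstance(obj_ids, Sequence):
        # Non-sequence iterables (generators, sets) are expanded into a list;
        # a single non-iterable object is wrapped into a one-element list.
        obj_ids = list(obj_ids) if isinstance(obj_ids, Iterable) else [obj_ids]
    return obj_ids
```

With the previous version of the line, a single non-iterable object would have been passed to `list()` and raised a `TypeError`.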

anmyachev · 2024-05-06T10:34:15Z

modin/config/envvars.py

@@ -868,7 +868,7 @@ class LazyExecution(EnvironmentVariable, type=str):
    """

    varname = "MODIN_LAZY_EXECUTION"
-    choices = ("Auto", "On", "Off")
+    choices = ("Auto", "On", "Off", "Axis")


Why introduce a new mode?

anmyachev · 2024-05-06T10:51:12Z

modin/core/execution/ray/implementations/pandas_on_ray/partitioning/partition.py

+            try:
+                ref = ray.get(ref, timeout=0)
+            except ray.exceptions.GetTimeoutError:
+                return False


If an object has been calculated and placed in distributed storage, will materialization occur here?

If this approach can be effective, then it is worth considering the possibility of using it in other places.
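
The distinction behind this question can be sketched with standard-library futures. This is only an analogy, not Ray code: `concurrent.futures` stands in for Ray object references, and both helper names are hypothetical. Requesting the result with a zero timeout also hands the value to the caller when it is ready, whereas a readiness probe does not touch the value. Ray's `ray.wait` accepts `timeout` and `fetch_local` arguments, which may be relevant when only readiness is needed.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeoutError


def is_ready_by_fetching(future):
    """Probe readiness by requesting the result with a zero timeout."""
    try:
        future.result(timeout=0)  # if ready, the value is also returned here
    except FutureTimeoutError:
        return False
    return True


def is_ready_without_fetching(future):
    """Probe readiness without touching the result."""
    return future.done()


gate = threading.Event()
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(gate.wait)  # stays pending until the gate opens
    before = (is_ready_by_fetching(future), is_ready_without_fetching(future))
    gate.set()
    future.result()  # block until the task has finished
    after = (is_ready_by_fetching(future), is_ready_without_fetching(future))
```

Both probes agree on readiness; they differ only in whether the value is retrieved as a side effect.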

anmyachev · 2024-05-06T10:55:02Z

modin/core/execution/ray/implementations/pandas_on_ray/partitioning/partition.py

@@ -419,7 +424,7 @@ def eager_exec(self, func, *args, length=None, width=None, **kwargs):
 LazyExecution.subscribe(_configure_lazy_exec)


-class SlicerHook(MaterializationHook):
+class SlicerHook(MaterializationHook, DeferredExecution):


What is the idea behind this change?

anmyachev · 2024-05-06T10:58:50Z

modin/core/execution/ray/implementations/pandas_on_ray/partitioning/lazy_virtual_partition.py

+from .partition import PandasOnRayDataframePartition
+
+
+class PandasOnRayDataframeVirtualPartition(BaseDataframeAxisPartition):


Why not such inheritance?

Suggested change

class PandasOnRayDataframeVirtualPartition(BaseDataframeAxisPartition):

class PandasOnRayDataframeVirtualPartition(PandasDataframeAxisPartition):

anmyachev · 2024-05-06T11:07:45Z

modin/core/execution/ray/implementations/pandas_on_ray/partitioning/partition_manager.py

@@ -42,6 +54,82 @@ class PandasOnRayDataframePartitionManager(GenericRayDataframePartitionManager):
    _execution_wrapper = RayWrapper
    materialize_futures = RayWrapper.materialize

+    if LazyExecution.get() in ("On", "Axis"):


Whether this function is used is decided once, at first import, with no way to change it afterwards. As far as I remember, everywhere else the choice is made on each call.
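
A minimal sketch of this concern, using hypothetical stand-ins (`LazyFlag`, `ImportTimeManager`, `CallTimeManager`) rather than Modin's `LazyExecution` and partition manager: a condition in a class body is evaluated once, when the class is created, so a later configuration change is not picked up, while a check inside the method is re-evaluated on every call.

```python
class LazyFlag:
    """Hypothetical stand-in for a runtime-configurable setting."""

    _value = "Off"

    @classmethod
    def get(cls):
        return cls._value

    @classmethod
    def put(cls, value):
        cls._value = value


class ImportTimeManager:
    # The condition runs once, while the class body is executed.
    if LazyFlag.get() in ("On", "Axis"):

        @classmethod
        def mode(cls):
            return "lazy"

    else:

        @classmethod
        def mode(cls):
            return "eager"


class CallTimeManager:
    @classmethod
    def mode(cls):
        # The condition is re-evaluated on every call.
        return "lazy" if LazyFlag.get() in ("On", "Axis") else "eager"


# Changing the setting after the classes exist affects only the second one.
LazyFlag.put("On")
```

After the `put` call, `ImportTimeManager.mode()` still reports the value chosen at definition time, while `CallTimeManager.mode()` follows the new setting.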

anmyachev · 2024-05-06T11:09:38Z

modin/core/execution/ray/implementations/pandas_on_ray/partitioning/partition_manager.py

+
+        @classmethod
+        @_inherit_docstrings(GenericRayDataframePartitionManager.get_indices)
+        def get_indices(cls, axis, partitions, index_func=None):


Have you tried making lazy changes to the already existing get_indices? (without overriding)

When we call this get_indices, do we trigger the entire lazy execution tree? If so, do we keep the result the consumers depend on?

E.g., if we had a lazy apply and computed indices, would we keep the result of the apply?

This function tries to avoid concatenating the partitions. That is possible when all the partitions are the result of a deferred split operation. See the description of the find_non_split_block() function, which contains an example of such an execution tree. If we can find the non-split partition in the tree, we can take the index from it and thus avoid the concatenation.
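
The idea described in the reply can be sketched as follows. This is an illustration under stated assumptions, not the PR's `find_non_split_block()`: `Block`, `SplitPiece` and `find_common_source` are hypothetical names, and a tuple of labels stands in for a pandas index.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Block:
    """Hypothetical un-split block; ``index`` stands in for its row labels."""

    index: tuple


@dataclass(frozen=True)
class SplitPiece:
    """Hypothetical deferred split result: piece ``position`` out of ``total``."""

    source: Block
    position: int
    total: int


def find_common_source(pieces: List[SplitPiece]) -> Optional[Block]:
    """Return the source block if ``pieces`` are all of its splits, in order."""
    if not pieces:
        return None
    first = pieces[0]
    if len(pieces) != first.total:
        return None  # some pieces are missing, the full block is not covered
    for pos, piece in enumerate(pieces):
        if (
            piece.source is not first.source
            or piece.total != first.total
            or piece.position != pos
        ):
            return None
    return first.source


block = Block(index=("a", "b", "c", "d"))
pieces = [SplitPiece(block, i, 2) for i in range(2)]
source = find_common_source(pieces)
# When a common source exists, its index can be used directly,
# without concatenating the pieces first.
index = source.index if source is not None else None
```

When no common source is found, a caller would fall back to the regular path that concatenates the pieces.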

anmyachev · 2024-05-06T11:11:16Z

modin/core/execution/ray/implementations/pandas_on_ray/partitioning/partition_manager.py

+        PandasOnRayDataframeColumnPartition,
+        PandasOnRayDataframeRowPartition,
+        PandasOnRayDataframeVirtualPartition,


Have you tried making changes to existing classes?

anmyachev · 2024-05-06T11:14:43Z

modin/core/execution/ray/implementations/pandas_on_ray/partitioning/lazy_virtual_partition.py

+    axis = 0
+
+    @remote_function
+    def _remote_concat(dfs):  # pragma: no cover  # noqa: GL08


Are you sure that the concat works as intended, given the message about the naming of the function arguments?

anmyachev · 2024-05-06T11:20:35Z

modin/core/execution/ray/common/deferred_execution.py

@@ -29,13 +29,20 @@

 import pandas
 import ray
+import ray.exceptions


It seems that a lot of the changes in this file are not directly related to this pull request, so it would be great to move them into a separate pull request.

YarShev · 2024-05-23T18:34:33Z

modin/core/execution/ray/implementations/pandas_on_ray/partitioning/partition_manager.py

@@ -12,24 +12,36 @@
 # governing permissions and limitations under the License.

 """Module houses class that implements ``GenericRayDataframePartitionManager`` using Ray."""
+import math


Suggested change

import math

import math

YarShev · 2024-05-23T18:35:51Z

modin/core/execution/ray/implementations/pandas_on_ray/partitioning/partition_manager.py

-    PandasOnRayDataframeRowPartition,
-)
+
+if LazyExecution.get() in ("On", "Axis"):


This logic should probably be placed in modin/core/execution/ray/implementations/pandas_on_ray/partitioning/__init__.py.

YarShev · 2024-05-23T18:47:07Z

modin/core/execution/ray/implementations/pandas_on_ray/partitioning/lazy_virtual_partition.py

+# governing permissions and limitations under the License.
+
+"""Module houses classes responsible for storing a virtual partition and applying a function to it."""
+import math


Suggested change

import math

import math

YarShev · 2024-05-23T18:48:16Z

modin/core/execution/ray/implementations/pandas_on_ray/partitioning/lazy_virtual_partition.py

+    """
+
+    partition_type = PandasOnRayDataframePartition
+    instance_type = ray.ObjectRef


Suggested change

instance_type = ray.ObjectRef

@anmyachev, can this be removed?

YarShev · 2024-05-23T18:53:23Z

modin/core/execution/ray/implementations/pandas_on_ray/partitioning/lazy_virtual_partition.py

+        list of lengths or None
+            Estimated chunk lengths, that could be different form the real ones.
+        bool
+            Whether the specified partitions represent the full block or just the


Can you elaborate a little on this?

YarShev · 2024-05-23T19:54:33Z

modin/core/execution/ray/implementations/pandas_on_ray/partitioning/lazy_virtual_partition.py

+        manual_partition=False,
+        **kwargs,
+    ) -> Union[List[PandasOnRayDataframePartition], PandasOnRayDataframePartition]:
+        if not manual_partition:


Why does this parameter have an effect only when it is False? Should we copy the related logic from the base class?

YarShev · 2024-05-24T15:38:46Z

modin/core/execution/ray/implementations/pandas_on_ray/partitioning/lazy_virtual_partition.py

+        lengths: Union[List[Union[ObjectRefType, int]], None],
+    ):
+        self.num_splits = num_splits
+        self.skip_chunks = set()


Let's add a comment explaining what this is for.

YarShev · 2024-05-24T15:49:25Z

modin/core/execution/ray/implementations/pandas_on_ray/partitioning/lazy_virtual_partition.py

+                        PandasOnRayDataframeColumnPartition
+                        if self.axis
+                        else PandasOnRayDataframeRowPartition


Suggested change

PandasOnRayDataframeColumnPartition
if self.axis
else PandasOnRayDataframeRowPartition

PandasOnRayDataframeRowPartition
if self.axis
else PandasOnRayDataframeColumnPartition

Should this be so?

arunjose696 · 2024-06-05T09:00:47Z

modin/core/execution/ray/common/deferred_execution.py

@@ -391,22 +408,24 @@ def _deconstruct_list(
        """
        for obj in lst:
            if isinstance(obj, DeferredExecution):
-                if out_pos := getattr(obj, "out_pos", None):
+                if obj.has_result:
+                    obj = obj.data


Suggested change

obj = obj.data

out_append(obj.data)

I think it would be better to append obj.data in this if branch and remove the continue statements in all the else statements.

If obj.data is a list, we need to deconstruct it as well. So we assign it to obj and fall through to the isinstance(obj, ListOrTuple) check.
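
A simplified sketch of the control flow described in this reply, assuming a hypothetical `Deferred` class and string tags in place of the real `DeferredExecution` and `_Tag` values (the `"LIST"` tag in particular is an assumption): reassigning `obj` and using a plain `if` for the list check lets an already computed list result be deconstructed like any other list.

```python
class Deferred:
    """Hypothetical stand-in for a deferred node; ``data`` is set once computed."""

    def __init__(self, data=None):
        self.data = data

    @property
    def has_result(self):
        return self.data is not None


def deconstruct(lst, out):
    """Flatten ``lst`` into ``out``, marking nested lists and pending nodes."""
    for obj in lst:
        if isinstance(obj, Deferred):
            if obj.has_result:
                # Replace the node with its result and fall through,
                # so that a list result is deconstructed below.
                obj = obj.data
            else:
                out.append("CHAIN")  # placeholder for a pending node
                continue
        if isinstance(obj, (list, tuple)):
            out.append("LIST")
            deconstruct(obj, out)
            out.append("END")
        else:
            out.append(obj)
    return out


flat = deconstruct([Deferred([1, 2]), 3, Deferred()], [])
```

With `elif` in place of the second `if`, the computed list `[1, 2]` would never reach the list branch.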

arunjose696 · 2024-06-05T09:01:15Z

modin/core/execution/ray/common/deferred_execution.py

+                    if obj.subscribers == 0:
+                        output[out_pos + 1] = 0
+                        result_consumers.remove(obj)
+                    continue


Suggested change

continue

arunjose696 · 2024-06-05T09:01:35Z

modin/core/execution/ray/common/deferred_execution.py

                else:
                    out_append(_Tag.CHAIN)
                    yield cls._deconstruct_chain(obj, output, stack, result_consumers)
                    out_append(_Tag.END)
-            elif isinstance(obj, ListOrTuple):
+                    continue


Suggested change

continue

arunjose696 · 2024-06-05T09:02:15Z

modin/core/execution/ray/common/deferred_execution.py

-            elif isinstance(obj, ListOrTuple):
+                    continue
+
+            if isinstance(obj, ListOrTuple):


Suggested change

if isinstance(obj, ListOrTuple):

elif isinstance(obj, ListOrTuple):

arunjose696 · 2024-06-05T09:06:54Z

modin/core/execution/ray/common/deferred_execution.py

+                    out_append(_Tag.REF)
+                    out_append(out_pos)
+                    output[out_pos] = out_pos
+                    if obj.subscribers == 0:
+                        output[out_pos + 1] = 0
+                        result_consumers.remove(obj)


As this code is duplicated

modin/modin/core/execution/ray/common/deferred_execution.py

Lines 326 to 333 in 92fe2f7

    if de.subscribers == 0:
        # We may have subscribed to the same node multiple times.
        # It could happen, for example, if it's passed to the args
        # multiple times, or it's one of the parent nodes and also
        # passed to the args. In this case, there are no multiple
        # subscribers, and we don't need to return the result.
        output[out_pos + 1] = 0
        result_consumers.remove(de)

and we have reason for this deconstruct_chain, could it be reused?

I don't think it makes sense to create a separate function just in order to reuse 3 lines of trivial code. Besides, it will cost a function call. Probably, a comment should be added here.

Yeah, a comment should be sufficient.

AndreyPavlenko force-pushed the issue-6990 branch 2 times, most recently from bf2943d to ea540cc Compare March 1, 2024 20:38

github-advanced-security bot found potential problems Mar 1, 2024

modin/core/execution/ray/implementations/pandas_on_ray/partitioning/virtual_partition.py Fixed

YarShev mentioned this pull request Mar 13, 2024

FEAT-#7004: use generators when returning from _deploy_ray_func remote function. #7005

Merged

7 tasks

AndreyPavlenko force-pushed the issue-6990 branch 12 times, most recently from 2e3390b to d98324d Compare March 16, 2024 16:54

github-advanced-security bot found potential problems Mar 16, 2024

modin/core/execution/ray/implementations/pandas_on_ray/partitioning/virtual_partition.py Fixed

AndreyPavlenko force-pushed the issue-6990 branch 13 times, most recently from 128509d to 8ce0b34 Compare March 20, 2024 20:17

AndreyPavlenko force-pushed the issue-6990 branch 10 times, most recently from 8cc6583 to b09a944 Compare April 12, 2024 19:35

AndreyPavlenko marked this pull request as ready for review April 12, 2024 20:59

AndreyPavlenko requested review from devin-petersohn, mvashishtha, RehanSD, YarShev, vnlitvinov, anmyachev, dchigarev and a team as code owners April 12, 2024 20:59

FEAT-modin-project#6990: Implement lazy execution for the Ray virtual…

92fe2f7

… partitions.

AndreyPavlenko force-pushed the issue-6990 branch from b09a944 to 92fe2f7 Compare April 16, 2024 13:21

anmyachev reviewed May 6, 2024

YarShev reviewed May 23, 2024

YarShev reviewed May 24, 2024

arunjose696 reviewed Jun 5, 2024
