[Feat] Runtime optimizations #20
Draft
e-strauss wants to merge 11 commits into
This commit introduces physical planning to the logical optimizer, enabling parallel execution of independent estimator tasks. Key changes include:

* **Physical Planning Optimization**: A new optimization pass identifies independent estimator operations in the DAG and groups them into a `ParallelBlockOp` for concurrent execution.
* **Parallel Execution Engine**: Implementation of `ParallelBlockOp` uses `joblib` to process estimator tasks in parallel, bypassing the Python Global Interpreter Lock (GIL).
* **Scheduler Refactoring**: The `Scheduler` has been refactored into a base class with a `SequentialScheduler` implementation, and the internal logic was updated to handle optimized DAG sinks rather than flat lists.
* **Configurable Activation**: A new `physical_planning` flag was added to the configuration and environment variables to toggle this optimization.
* **Bug Fixes and Improvements**: Updates were made to estimator processing to ensure data is picklable for multiprocessing and to improve the display of performance statistics.
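To make the idea of "independent operations" concrete, here is a minimal sketch of one way to group DAG nodes so that no op in a group depends on another op in the same group. It is not the PR's optimization pass: the function name `independent_groups`, the dependency-dict representation, and the op names are all assumptions made for illustration.

```python
def independent_groups(deps):
    """Group ops into levels; ops within one level do not depend on each other.

    `deps` maps an op name to the set of op names it depends on
    (hypothetical representation). Every dependency must also be a key.
    """
    remaining = {op: set(d) for op, d in deps.items()}
    groups = []
    while remaining:
        # Ops whose dependencies have all been scheduled are ready to run.
        ready = sorted(op for op, d in remaining.items() if not d)
        if not ready:
            raise ValueError("cycle or unknown dependency in DAG")
        groups.append(ready)
        for op in ready:
            del remaining[op]
        for d in remaining.values():
            d.difference_update(ready)
    return groups


# Hypothetical DAG: two encoders both read from "load" and feed "concat".
example_deps = {
    "load": set(),
    "encode_a": {"load"},
    "encode_b": {"load"},
    "concat": {"encode_a", "encode_b"},
    "fit": {"concat"},
}
example_groups = independent_groups(example_deps)
```

In this example `encode_a` and `encode_b` end up in the same group, which is the kind of group a parallel block could wrap.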
This follow-up commit refactors the parallel execution implementation, moving it from a compile-time DAG transformation to a runtime scheduling strategy.

* **Parallel Scheduler Implementation**: A fully-featured `ParallelScheduler` was added that can execute independent operations concurrently using either thread-based or process-based parallelism, configurable via a `backend` parameter.
* **Configuration Change**: The boolean `physical_planning` flag was replaced with a more flexible `scheduler_parallelism` option that accepts `"threading"`, `"process"`, or `"auto"` to select the parallel execution backend.
* **Architecture Relocation**: The physical planning logic was moved from the logical optimizer to the runtime layer, making it a runtime concern rather than a compile-time DAG rewrite; it now marks ops with a `parallel_group` ID instead of restructuring the graph.
* **Estimator/Transformer Distinction**: Operations are now explicitly categorized into `EstimatorOp` (predictors) and `TransformerOp`, each with dedicated processing functions that handle their specific fit/predict vs fit_transform/transform semantics.
* **DAG Linearization**: The parallel scheduler linearizes the DAG into sequential blocks where some blocks are lists of independent ops that can be executed in parallel.
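The linearized plan described above (sequential blocks, some of which are lists of independent ops) can be illustrated with a small thread-based runner. This is a sketch under assumptions, not the `ParallelScheduler` from the PR: `run_plan`, the plan format (zero-argument callables), and `max_workers` are hypothetical, and only the thread backend is shown, using the standard library's `concurrent.futures`.

```python
from concurrent.futures import ThreadPoolExecutor


def run_plan(plan, max_workers=4):
    """Execute a linearized plan block by block.

    A block is either a single zero-argument callable (run inline) or a
    list of such callables (submitted to the pool and awaited together).
    """
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for block in plan:
            if isinstance(block, list):
                # Independent ops: run concurrently, keep submission order.
                futures = [pool.submit(op) for op in block]
                results.append([f.result() for f in futures])
            else:
                # Sequential op: finishes before the next block starts.
                results.append(block())
    return results


# Hypothetical plan: one op, then two independent ops, then one op.
example_plan = [lambda: 1, [lambda: 2, lambda: 3], lambda: 4]
example_results = run_plan(example_plan)
```

Threads only give a speedup when the ops release the GIL for most of their work; a process pool avoids that limit but requires ops and data to be picklable, which is presumably why the backend is configurable.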
- Made Polars support configurable at runtime via FLAGS.force_polars
- Added automatic sin/cos UDF rewriting to native Polars ops
- Added compatibility layer to convert Polars→Pandas for unsupported estimators
- Fixed estimator cloning for repeated fits (e.g., cross-validation)
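As an illustration of the sin/cos rewriting item, the sketch below shows one simple way to detect that a user-defined function is really a known math function and map it to the name of a native op. It is an assumption about how such a rewrite could work, not the PR's code; `NATIVE_EQUIVALENTS` and `native_op_name` are hypothetical names, and the example stays dependency-free by returning the op name instead of building a Polars expression.

```python
import math

# Hypothetical lookup table: callables with a native column-level equivalent.
# Polars expressions expose matching `.sin()` and `.cos()` methods.
NATIVE_EQUIVALENTS = {math.sin: "sin", math.cos: "cos"}


def native_op_name(func):
    """Return the native op name for `func`, or None to keep it as a UDF.

    Matching is by identity, so a lambda wrapping `math.sin` is not
    recognized and would still run as a Python UDF.
    """
    return NATIVE_EQUIVALENTS.get(func)


sin_name = native_op_name(math.sin)
sqrt_name = native_op_name(math.sqrt)
wrapped_name = native_op_name(lambda x: math.sin(x))
```

With Polars available, the returned name could be applied as `getattr(pl.col("x"), name)()`, which keeps the computation inside the engine instead of calling back into Python per element.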
Add memory estimation function for transformer operations that returns size multipliers based on estimator type (TableVectorizer: 10x, StringEncoder: 3x). Filter parallelization candidates to only include operations with known memory estimates, focusing on transformer operations while temporarily disabling estimator parallelization.
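A minimal sketch of the memory-estimation filter described above, assuming the lookup is keyed on the estimator's class name. The two multipliers are the ones quoted in the commit message; `estimate_memory_multiplier`, `parallel_candidates`, and the stand-in classes are hypothetical and do not come from the PR.

```python
from typing import Optional

# Multipliers quoted in the commit message; any other type is "unknown".
MEMORY_MULTIPLIERS = {"TableVectorizer": 10.0, "StringEncoder": 3.0}


def estimate_memory_multiplier(estimator) -> Optional[float]:
    """Return the output-size multiplier for `estimator`, or None if unknown."""
    return MEMORY_MULTIPLIERS.get(type(estimator).__name__)


def parallel_candidates(estimators):
    """Keep only estimators whose memory footprint can be estimated."""
    return [e for e in estimators
            if estimate_memory_multiplier(e) is not None]


# Stand-in classes for illustration only (not the real library classes).
class TableVectorizer: ...
class StringEncoder: ...
class Ridge: ...


ops = [TableVectorizer(), Ridge(), StringEncoder()]
candidates = parallel_candidates(ops)
candidate_names = [type(c).__name__ for c in candidates]
```

Here `Ridge` is dropped because it has no known estimate, mirroring the commit's choice to parallelize only operations whose memory use can be bounded.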
7c07432 to
6b52268
Adding: