[Feat] Runtime optimizations by e-strauss · Pull Request #20 · deem-data/stratum

e-strauss · 2026-03-03T22:15:26Z

Adding:

- ParallelScheduler
- More DataframeOps
- Intermediate Clean Up for Scheduler
- Minor fixes

This commit introduces physical planning to the logical optimizer, enabling parallel execution of independent estimator tasks. Key changes include:

* **Physical Planning Optimization**: A new optimization pass identifies independent estimator operations in the DAG and groups them into a `ParallelBlockOp` for concurrent execution.
* **Parallel Execution Engine**: Implementation of `ParallelBlockOp` uses `joblib` to process estimator tasks in parallel, bypassing the Python Global Interpreter Lock (GIL).
* **Scheduler Refactoring**: The `Scheduler` has been refactored into a base class with a `SequentialScheduler` implementation, and the internal logic was updated to handle optimized DAG sinks rather than flat lists.
* **Configurable Activation**: A new `physical_planning` flag was added to the configuration and environment variables to toggle this optimization.
* **Bug Fixes and Improvements**: Updates were made to estimator processing to ensure data is picklable for multiprocessing and to improve the display of performance statistics.
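To illustrate the scheduler refactoring described above, here is a minimal sketch of a base class with a sequential implementation that starts from DAG sinks instead of a flat op list. It is not the code from this PR: the `Op` dataclass, its fields, and the `run` signature are assumptions made for the example.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass(eq=False)  # eq=False keeps identity hashing, so ops can be dict keys
class Op:
    """Hypothetical DAG node: a function applied to the outputs of its inputs."""
    name: str
    fn: Callable[..., Any]
    inputs: list[Op] = field(default_factory=list)


class Scheduler(ABC):
    """Base class: subclasses decide how the ops reachable from the sinks run."""

    @abstractmethod
    def run(self, sinks: list[Op]) -> dict[str, Any]:
        ...


class SequentialScheduler(Scheduler):
    """Evaluates each sink depth-first, computing every shared op only once."""

    def run(self, sinks: list[Op]) -> dict[str, Any]:
        results: dict[Op, Any] = {}

        def evaluate(op: Op) -> Any:
            if op not in results:
                args = [evaluate(parent) for parent in op.inputs]
                results[op] = op.fn(*args)
            return results[op]

        return {sink.name: evaluate(sink) for sink in sinks}
```

Starting from the sinks means only ops that contribute to a requested output are executed, and the memo table prevents an op shared by several sinks from being recomputed.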

This follow-up commit refactors the parallel execution implementation, moving it from a compile-time DAG transformation to a runtime scheduling strategy.

* **Parallel Scheduler Implementation**: A fully-featured `ParallelScheduler` was added that can execute independent operations concurrently using either thread-based or process-based parallelism, configurable via a `backend` parameter.
* **Configuration Change**: The boolean `physical_planning` flag was replaced with a more flexible `scheduler_parallelism` option that accepts `"threading"`, `"process"`, or `"auto"` to select the parallel execution backend.
* **Architecture Relocation**: The physical planning logic was moved from the logical optimizer to the runtime layer, making it a runtime concern rather than a compile-time DAG rewrite; it now marks ops with a `parallel_group` ID instead of restructuring the graph.
* **Estimator/Transformer Distinction**: Operations are now explicitly categorized into `EstimatorOp` (predictors) and `TransformerOp`, each with dedicated processing functions that handle their specific fit/predict vs fit_transform/transform semantics.
* **DAG Linearization**: The parallel scheduler linearizes the DAG into sequential blocks where some blocks are lists of independent ops that can be executed in parallel.
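The DAG linearization idea can be sketched with the standard library alone. The example below is an assumption-laden illustration, not the `ParallelScheduler` from this PR: `linearize` and `run_blocks` are hypothetical names, the DAG is a plain dict mapping each op name to its dependencies (every op, including sources, must appear as a key), and only a thread-based backend is shown.

```python
from concurrent.futures import ThreadPoolExecutor


def linearize(deps):
    """Split a DAG into ordered blocks; ops inside one block are independent."""
    remaining = {op: set(d) for op, d in deps.items()}
    blocks = []
    while remaining:
        # Ops whose dependencies are all satisfied form the next block.
        ready = sorted(op for op, d in remaining.items() if not d)
        if not ready:
            raise ValueError("cycle or unknown dependency in DAG")
        blocks.append(ready)
        for op in ready:
            del remaining[op]
        for d in remaining.values():
            d.difference_update(ready)
    return blocks


def run_blocks(blocks, fns, deps, max_workers=4):
    """Run blocks in order; blocks with several ops go through a thread pool."""
    results = {}

    def call(op):
        # All dependencies live in earlier blocks, so their results exist.
        return fns[op](*(results[d] for d in deps[op]))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for block in blocks:
            if len(block) == 1:
                results[block[0]] = call(block[0])
            else:
                values = list(pool.map(call, block))
                results.update(zip(block, values))
    return results


deps = {"load": [], "enc_a": ["load"], "enc_b": ["load"], "join": ["enc_a", "enc_b"]}
fns = {
    "load": lambda: 2,
    "enc_a": lambda x: x + 1,
    "enc_b": lambda x: x * 10,
    "join": lambda a, b: a + b,
}
blocks = linearize(deps)
results = run_blocks(blocks, fns, deps)
```

Here `enc_a` and `enc_b` end up in the same block because both depend only on `load`. A process-based backend would additionally require the functions and their arguments to be picklable, which is why this sketch stays with threads.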

- Made Polars support configurable at runtime via `FLAGS.force_polars`
- Added automatic sin/cos UDF rewriting to native Polars ops
- Added compatibility layer to convert Polars→Pandas for unsupported estimators
- Fixed estimator cloning for repeated fits (e.g., cross-validation)

Add memory estimation function for transformer operations that returns size multipliers based on estimator type (TableVectorizer: 10x, StringEncoder: 3x). Filter parallelization candidates to only include operations with known memory estimates, focusing on transformer operations while temporarily disabling estimator parallelization.
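A minimal sketch of how such a multiplier lookup and candidate filter could look, using only the two multipliers stated above. The function names, the lookup by class name, and the empty stand-in classes are assumptions for illustration; the PR's actual implementation may differ.

```python
# Size multipliers taken from the commit message; everything else is hypothetical.
MEMORY_MULTIPLIERS = {
    "TableVectorizer": 10.0,
    "StringEncoder": 3.0,
}


def estimate_memory_multiplier(estimator):
    """Return the size multiplier for an estimator, or None if it is unknown."""
    return MEMORY_MULTIPLIERS.get(type(estimator).__name__)


def parallel_candidates(estimators):
    """Keep only estimators whose memory footprint can be estimated."""
    return [e for e in estimators if estimate_memory_multiplier(e) is not None]


# Empty stand-in classes so the example runs without the real libraries.
class TableVectorizer:
    pass


class StringEncoder:
    pass


class SomeOtherTransformer:
    pass


ops = [TableVectorizer(), SomeOtherTransformer(), StringEncoder()]
candidates = parallel_candidates(ops)
```

Returning `None` for unknown types keeps the filter conservative: an op without an estimate is simply never scheduled in parallel.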

e-strauss and others added 10 commits March 9, 2026 13:37

[Minor] more constraints on dependencies + numeric selector for uk ho…

b878b33

…using experiment

caching implementation + paper exp

de60787

[Feat] GC for scheduler

7b7a6da

minor fixes + exp updates

295a1dd

rm benchmark files

0264a3f

fixed failing tests

6b52268

e-strauss force-pushed the Runtime-Optimizations branch from 7c07432 to 6b52268 March 9, 2026 17:56

fix hashes

5493741

e-strauss wants to merge 11 commits into main from Runtime-Optimizations
