[BufferPool] BufferPool API design and necessary scheduler / runtime adaptions #37

Closed
e-strauss wants to merge 1 commit into deem-data:main from e-strauss:buffer-pool-squashed

e-strauss (Collaborator) commented Apr 10, 2026

This PR changes how stratum handles intermediates. Previously, all intermediates were stored directly in the Ops, and each Op accessed these fields directly on its input operators. These changes decouple intermediate storage from Op execution by storing the intermediates in the BufferPool. The BufferPool API consists mainly of three methods:

PUT(Op, Intermediate) -> None
GET(Op) -> Intermediate
RELEASE(Op) -> None
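
As a rough illustration of the three methods above, a dictionary-backed pool could look like the following. This is a minimal sketch and not the stratum implementation: the class name `SimpleBufferPool`, the lowercase method names, and the use of plain strings in place of real Op objects are assumptions made for the example.

```python
from typing import Any, Dict, Hashable


class SimpleBufferPool:
    """Hypothetical dict-backed pool mapping an op to its intermediate."""

    def __init__(self) -> None:
        self._buffers: Dict[Hashable, Any] = {}

    def put(self, op: Hashable, intermediate: Any) -> None:
        # Store the result produced by `op`.
        self._buffers[op] = intermediate

    def get(self, op: Hashable) -> Any:
        # Raises KeyError if `op` was never stored or was already released.
        return self._buffers[op]

    def release(self, op: Hashable) -> None:
        # Drop the pool's reference so the intermediate can be garbage collected.
        self._buffers.pop(op, None)

    def __len__(self) -> int:
        return len(self._buffers)


# Strings stand in for Op objects here (assumption for the example).
pool = SimpleBufferPool()
pool.put("load", [1, 2, 3])
pool.put("double", [x * 2 for x in pool.get("load")])
pool.release("load")  # "load" has no further consumers
```

Because consumers only ever go through the pool, the decision of when an intermediate disappears sits in one place instead of being spread over the Ops.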

In the optimization step, after linearizing the DAG, we run an analysis pass that determines the point at which an Op's intermediate can be released. It looks at all consumers of an Op, paying particular attention to Ops whose output is consumed multiple times. For example, the inputs of the split operation are needed for all CV folds.
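
The idea of such a pass can be sketched as a last-consumer computation over the linearized order. The function below is an illustration under assumptions, not the code in this PR: `plan_releases`, the `inputs` mapping, and the op names are hypothetical, and fold handling is left out.

```python
from typing import Dict, Hashable, List, Sequence


def plan_releases(
    order: Sequence[Hashable],
    inputs: Dict[Hashable, Sequence[Hashable]],
) -> Dict[Hashable, List[Hashable]]:
    """For each op in `order`, list the producers that can be released after it runs.

    `order` is a linearized (topologically sorted) DAG and `inputs[op]` lists
    the ops whose outputs `op` consumes.
    """
    last_consumer: Dict[Hashable, Hashable] = {}
    for op in order:
        for producer in inputs.get(op, ()):
            # Later ops overwrite earlier ones, so the final value is the last consumer.
            last_consumer[producer] = op

    release_after: Dict[Hashable, List[Hashable]] = {op: [] for op in order}
    for producer, consumer in last_consumer.items():
        release_after[consumer].append(producer)
    return release_after


# "read" is consumed twice, so it may only be released after its second consumer.
order = ["read", "drop_y", "select_y", "fit"]
inputs = {"drop_y": ["read"], "select_y": ["read"], "fit": ["drop_y", "select_y"]}
plan = plan_releases(order, inputs)
```

In this toy plan, `read` is released after `select_y` rather than after `drop_y`. An intermediate that is needed again in every CV fold, such as the split inputs, would additionally have to be excluded from release within a single fold.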

Right now, the buffer pool is merely a simple HashMap from Ops to intermediates. The next steps are to add memory thresholds and intermediate sizes in order to build eviction logic and serialization, so that we avoid running out of memory.

Benchmark

Releasing unnecessary intermediates lowers memory consumption. This can be seen in the following plots of the reserved physical memory of the stratum process for this benchmark use case:

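from time import sleep

import numpy as np
import pandas as pd
import skrub
from sklearn.dummy import DummyRegressor


def dummy_func(x, t: float = 0.1):
    # Copy the frame via positional indexing, then sleep to simulate work.
    indices = np.arange(len(x))
    out = x.iloc[indices]
    sleep(t)
    return out


def main():
    # `n` is not defined in this snippet; it must be set by the surrounding benchmark script.
    df = skrub.as_data_op(f"input_{n}.csv").skb.apply_func(pd.read_csv)
    X = df.drop("y", axis=1).skb.mark_as_X()
    y = df["y"].skb.mark_as_y()
    for i in range(5):  # copy the input frame 5 times to simulate some processing
        X = X.skb.apply_func(dummy_func, t=0.3)
    model = DummyRegressor()
    pred = X.skb.apply(model, y=y)
    pred.skb.make_grid_search()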

With input release planning, we can free the intermediate of our custom UDF as soon as it has been processed, which keeps the required memory low.

Without planning

[Screenshot 2026-04-10 at 17 04 10: reserved physical memory of the stratum process without release planning]

With release planning

[Screenshot 2026-04-10 at 17 04 06: reserved physical memory of the stratum process with release planning]

Most important changes:

  • Introduce DAG linearization and input release planning in the optimizer pipeline
  • Add BufferManager and BufferPool with static consumer counting, pinning, and fold-aware lifecycle management
  • Refactor Op.process() to return computed results instead of mutating internal state; extract argument resolution helpers
  • Update scheduler to plan buffer lifecycles ahead of execution and flush stale buffers across folds
  • Adjust scheduler and buffer pool to support pinned op planning data and deterministic intermediate release
  • Add runtime components for buffer lifecycle tracking and handle-based storage
  • Update API and tests to work with linearized execution plans and new release semantics
  • Include comprehensive test suite for buffer pool (registration, retrieval, pinning, and split-phase scenarios) and benchmark for intermediate release
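
To make the combination of static consumer counting and pinning from the list above more concrete, here is a small sketch. It is an assumption-laden illustration rather than the BufferManager or BufferPool added in this PR: the class `CountingBufferPool`, its methods, and the op names are hypothetical.

```python
from typing import Any, Dict, Hashable, Set


class CountingBufferPool:
    """Hypothetical pool that frees an intermediate once all consumers have read it."""

    def __init__(self, consumer_counts: Dict[Hashable, int]) -> None:
        # Number of consumers per op, known statically before execution.
        self._remaining: Dict[Hashable, int] = dict(consumer_counts)
        self._buffers: Dict[Hashable, Any] = {}
        self._pinned: Set[Hashable] = set()

    def pin(self, op: Hashable) -> None:
        # Pinned entries are kept until unpinned, e.g. inputs reused by every fold.
        self._pinned.add(op)

    def unpin(self, op: Hashable) -> None:
        self._pinned.discard(op)
        if self._remaining.get(op, 0) <= 0:
            self._buffers.pop(op, None)

    def put(self, op: Hashable, value: Any) -> None:
        self._buffers[op] = value

    def consume(self, op: Hashable) -> Any:
        # Hand out the value and free it after the last expected consumer.
        value = self._buffers[op]
        self._remaining[op] = self._remaining.get(op, 0) - 1
        if self._remaining[op] <= 0 and op not in self._pinned:
            del self._buffers[op]
        return value

    def __contains__(self, op: Hashable) -> bool:
        return op in self._buffers


pool = CountingBufferPool({"X": 2, "split_input": 1})
pool.pin("split_input")
pool.put("X", "frame")
pool.put("split_input", "full data")

pool.consume("X")
x_alive_after_first = "X" in pool    # one consumer still pending
pool.consume("X")
x_alive_after_second = "X" in pool   # all consumers done, entry freed

pool.consume("split_input")          # fold 1
pool.consume("split_input")          # fold 2 still works because the op is pinned
pinned_alive = "split_input" in pool
pool.unpin("split_input")
unpinned_alive = "split_input" in pool
```

Counting makes the release deterministic, since it depends only on the plan and not on the garbage collector, while pinning covers intermediates whose lifetime spans several folds.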

codecov Bot commented Apr 10, 2026

Codecov Report

❌ Patch coverage is 97.08029% with 8 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| stratum/runtime/_scheduler.py | 92.50% | 3 Missing ⚠️ |
| stratum/optimizer/_linearization.py | 93.54% | 1 Missing and 1 partial ⚠️ |
| stratum/optimizer/ir/_dataframe_ops.py | 96.77% | 2 Missing ⚠️ |
| stratum/optimizer/_op_utils.py | 85.71% | 1 Missing ⚠️ |

| Flag | Coverage Δ |
|---|---|
| unittests | 89.50% <97.08%> (+0.74%) ⬆️ |

| Files with missing lines | Coverage Δ |
|---|---|
| stratum/_api.py | 93.10% <100.00%> (ø) |
| stratum/optimizer/_input_release_planning.py | 100.00% <100.00%> (ø) |
| stratum/optimizer/_optimize.py | 92.85% <100.00%> (+1.32%) ⬆️ |
| stratum/optimizer/ir/_numeric_ops.py | 100.00% <100.00%> (ø) |
| stratum/optimizer/ir/_ops.py | 98.02% <100.00%> (+0.04%) ⬆️ |
| stratum/runtime/_buffer_pool.py | 100.00% <100.00%> (ø) |
| stratum/optimizer/_op_utils.py | 86.02% <85.71%> (+0.31%) ⬆️ |
| stratum/optimizer/_linearization.py | 93.54% <93.54%> (ø) |
| stratum/optimizer/ir/_dataframe_ops.py | 94.52% <96.77%> (-0.09%) ⬇️ |
| stratum/runtime/_scheduler.py | 87.58% <92.50%> (+3.37%) ⬆️ |

e-strauss force-pushed the buffer-pool-squashed branch from 50b4d66 to 8d217b9 on April 16, 2026 12:25
e-strauss force-pushed the buffer-pool-squashed branch from 8d217b9 to e2ec0c0 on April 16, 2026 12:35
e-strauss closed this in d0f7af4 on Apr 23, 2026
e-strauss deleted the buffer-pool-squashed branch on April 29, 2026 12:04