[StandardScaler] Rust- and numpy-backed implementation of StandardScaler #30

Open
e-strauss wants to merge 2 commits into deem-data:main from e-strauss:standard-scaler

@e-strauss e-strauss commented Mar 13, 2026

Rust- and numpy-backed implementation of StandardScaler

  • Introduces a Rust-backed and numpy-backed StandardScaler
  • Adds comprehensive tests and a benchmark script to validate correctness and measure performance.
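
For readers unfamiliar with the operation, the computation that both backends implement can be sketched in plain Python. This is only an illustration, not the PR's Rust or NumPy code: the function names `fit` and `transform` and the list-of-lists input are assumptions. It follows scikit-learn's convention of using the population standard deviation (ddof=0) and a scale of 1.0 for constant columns.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def fit(X: Matrix) -> Tuple[List[float], List[float]]:
    """Return per-column means and population standard deviations (ddof=0)."""
    n_rows, n_cols = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n_rows for j in range(n_cols)]
    stds = []
    for j in range(n_cols):
        var = sum((row[j] - means[j]) ** 2 for row in X) / n_rows
        std = math.sqrt(var)
        # Constant columns get a scale of 1.0 to avoid dividing by zero,
        # mirroring what scikit-learn's StandardScaler does.
        stds.append(std if std > 0.0 else 1.0)
    return means, stds


def transform(X: Matrix, means: List[float], stds: List[float]) -> Matrix:
    """Center each column on its mean and divide by its scale."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]


# Tiny example: the second column is constant and maps to all zeros.
X = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
means, stds = fit(X)
Z = transform(X, means, stds)
```

After `transform`, each non-constant column has zero mean and unit variance.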

This PR does not yet integrate the scalers into the stratum optimizer (e.g. for operator selection).

The benchmark measures how the number of rows and columns of the input X affects runtime. Each data point is the average of 5 repeated runs.
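
The measurement scheme described above (a grid over rows and columns, each cell averaged over 5 runs) could look roughly like the following. This is a hedged sketch, not the contents of `benchmarks/bench_standard_scaler.py`: the helper names `time_fit_transform` and `run_grid`, the synthetic input, and the `column_sums` stand-in workload are all hypothetical.

```python
import time
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple


def time_fit_transform(fit_transform: Callable[[List[List[float]]], object],
                       n_rows: int, n_cols: int, repeats: int = 5) -> float:
    """Average wall-clock seconds of `fit_transform` over `repeats` runs."""
    # Deterministic synthetic input; a real benchmark would likely use random data.
    X = [[float((i * n_cols + j) % 97) for j in range(n_cols)]
         for i in range(n_rows)]
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fit_transform(X)
        timings.append(time.perf_counter() - start)
    return mean(timings)


def run_grid(fit_transform: Callable[[List[List[float]]], object],
             row_counts: Sequence[int], col_counts: Sequence[int],
             repeats: int = 5) -> Dict[Tuple[int, int], float]:
    """Time every (rows, cols) combination and return the averaged runtimes."""
    return {(r, c): time_fit_transform(fit_transform, r, c, repeats)
            for r in row_counts for c in col_counts}


def column_sums(X: List[List[float]]) -> List[float]:
    """Hypothetical stand-in workload; swap in a scaler's fit-and-transform call."""
    return [sum(col) for col in zip(*X)]


results = run_grid(column_sums, row_counts=[100, 1000], col_counts=[2, 8])
```

Each entry in `results` corresponds to one data point in a plot: the row count selects the sub-figure and the column count is the x-axis value.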

Benchmark results:

  • x-axis: number of columns
  • y-axis: runtime per fit and transform
  • each sub-figure uses a fixed number of rows

[Figure: standard_scaler_benchmark_rows]

Micro-benchmark: number of concurrent tasks for the Rust backend

[Figure: standard_scaler_n_jobs_benchmark_rows]

Parallel StandardScaler benchmark

[Figure: parallel_scalers_benchmark_rows]

codecov Bot commented Mar 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

Flag | Coverage Δ
unittests | 81.70% <100.00%> (+0.95%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines | Coverage Δ
stratum/_rust_backend.py | 93.33% <100.00%> (+0.35%) ⬆️
stratum/adapters/standard_scaler.py | 100.00% <100.00%> (ø)

@e-strauss e-strauss marked this pull request as ready for review March 13, 2026 17:14
@e-strauss e-strauss requested a review from phaniarnab March 13, 2026 17:14
Comment on lines +18 to +20
    let mut compute = || {
        let stats: Vec<(f32, f32)> = (0..n_cols)
            .into_par_iter()

This is good, but we may want to use SIMD (std::simd) for the kernels in the future.

Comment thread _rust/src/standard_scaler.rs Outdated
Comment thread benchmarks/bench_standard_scaler.py Outdated
Establishes a standardized framework for commit messages and issue referencing to maintain an organized and scannable project history. This provides clear instructions for contributors on how to categorize changes and link them to trackable work items.
@e-strauss e-strauss changed the title from "[StandardScaler] Alternative Rust- and numpy-backed implementation of StandardScaler" to "[StandardScaler] Rust- and numpy-backed implementation of StandardScaler" on Mar 26, 2026
Introduces a parallel Rust `StandardScaler` kernel (`fit` + `transform`) and a NumPy-backed `StandardScaler`, wired through the Python adapter with sklearn-compatible fallback behavior.
Adds correctness and fallback tests for both backends, plus a comprehensive benchmark comparing the backends.
@e-strauss (Collaborator, Author)

@phaniarnab
The Rust scaler now releases the GIL during fit and transform, which allows us to run the standard scaler in parallel threads. The improvement is visible in the third benchmark plot, where I compare the Rust scaler against the NumPy implementation and scikit-learn's. The other two scalers also benefit from the parallelization, since a single NumPy or scikit-learn scaler does not leverage all cores. This is also why a single Rust scaler performs much better than the other implementations: the Rust scaler is data-parallel over row chunks.
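
To illustrate what "data-parallel over row chunks" can mean for the fit step, here is a minimal Python sketch. It is not the PR's Rust kernel: the names `chunk_stats`, `merge`, and `parallel_mean_std` are hypothetical, and the merge uses the standard pairwise formula for combining partial means and sums of squared deviations (Chan et al.), which the actual implementation may or may not use. In standard CPython, pure-Python threads hold the GIL, so this version shows only the structure and will not speed anything up. A native kernel that releases the GIL is what makes the threads run concurrently.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

# (count, mean, M2), where M2 is the sum of squared deviations from the mean.
Stats = Tuple[int, float, float]


def chunk_stats(values: List[float]) -> Stats:
    """Compute partial statistics for one chunk of rows of a single column."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)
    return n, mean, m2


def merge(a: Stats, b: Stats) -> Stats:
    """Combine two partial results into statistics for the union of both chunks."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


def parallel_mean_std(column: List[float], n_chunks: int = 4) -> Tuple[float, float]:
    """Mean and population std of a non-empty column, computed chunk by chunk."""
    size = math.ceil(len(column) / n_chunks)
    chunks = [column[i:i + size] for i in range(0, len(column), size)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(chunk_stats, chunks))
    total = partials[0]
    for part in partials[1:]:
        total = merge(total, part)
    n, mean, m2 = total
    return mean, math.sqrt(m2 / n)


column = [float(i) for i in range(1, 11)]
mean, std = parallel_mean_std(column, n_chunks=3)
```

Because each chunk is processed independently and the merge is associative, the number of chunks affects only how the work is split, not the result (up to floating-point rounding).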
