Skip to content

Conversation

asp2286
Copy link
Contributor

@asp2286 asp2286 commented Oct 12, 2025

📖 Overview

This PR introduces a new pluggable random number generation (RNG) infrastructure into ML.NET, replacing the previous System.Random dependency with a deterministic, high-performance SIMD-accelerated Mersenne Twister (MT19937) implementation.

The goal is to provide a faster, deterministic, and cross-platform reproducible RNG foundation for all stochastic algorithms (e.g., Random Forest, KMeans++, Isolation Forest, etc.) while maintaining full backward compatibility.


🚀 Key Changes

  • New interfaces:

    • IRandomSource — unified, injectable RNG abstraction
    • IRandomBulkSource — efficient vectorized bulk fill API
  • New RNG backend:

    • MersenneTwister — pure C# MT19937 implementation
    • MersenneTwisterRandomSource — SIMD-optimized version using
      • System.Runtime.Intrinsics.X86.Avx2
      • System.Runtime.Intrinsics.Arm.AdvSimd
        with automatic scalar fallback
  • Integration:

    • Updated HostEnvironmentBase, ConsoleEnvironment, LocalEnvironment, and MLContext
    • New RandomSource property available on all IHost and MLContext instances
    • Backward-compatible Rand property retained and wired through adapters
  • Adapters for compatibility:

    • RandomSourceAdapter
    • RandomFromRandomSource
    • RandomShim
  • Testing and validation:

    • Determinism tests for same-seed consistency
    • Mixed consumption tests (Rand + RandomSource)
    • Cross-platform reproducibility (Windows/macOS/ARM64)
    • Performance microbenchmarks

⚡ Performance Impact

The new RNG is up to 5× faster in real workloads.
It eliminates per-call allocations and leverages vectorized bit-generation via SIMD instructions.

📊 Benchmark results (Isolation Forest prototype)
Environment Library Mean Fit Time Speedup
.NET 9 + ML.NET (SIMD MT19937) C# 0.21 s 🟢 5× faster
Python 3.11 + scikit-learn 1.5 C/Python 1.05 s

All benchmarks use identical seeds and datasets.
Deterministic equivalence confirmed across runs and architectures.


🔬 Determinism & Reproducibility

  • Bit-for-bit identical sequences across architectures (x86 ↔ ARM64)
  • Fallback to scalar path ensures deterministic output when SIMD unavailable
  • Each IHost and MLContext obtains an independent deterministic stream
  • Legacy IHost.Rand remains functional and maps to new RNG internally

🧠 Motivation

This refactor lays the foundation for future high-performance stochastic algorithms in ML.NET.
Reliable, cross-platform determinism and reproducible random streams are critical for modern ML workloads, testing, and research reproducibility.

It also unlocks future optimizations for:

  • Tree-based ensembles (RandomForest, IsolationForest, GBDT)
  • Sampling-based learners (KMeans++, NaiveBayes)
  • Data shuffling, augmentation, and stochastic pipelines

🔜 Next Steps

In the next PR, I will introduce a native Isolation Forest implementation built entirely in C# using this RNG backend.

Preliminary testing shows the Isolation Forest algorithm using MersenneTwisterRandomSource performs ~5× faster than scikit-learn’s Python version while producing numerically consistent anomaly scores.

This follow-up contribution will:

  • Add the new IsolationForestTrainer to ML.NET
  • Include SHAP-style explainability support
  • Extend anomaly detection benchmarks and documentation

✅ Checklist

  • Added RNG abstraction layer (IRandomSource, IRandomBulkSource)
  • Implemented SIMD-accelerated Mersenne Twister (MT19937)
  • Integrated with host environments and MLContext
  • Ensured deterministic results across platforms
  • Added backward-compatibility adapters (Rand, RandomShim, etc.)
  • Added extensive unit tests
  • Benchmarked vs scikit-learn
  • Upcoming: Isolation Forest algorithm using this RNG

🧾 References


🧩 Example usage

var mlContext = new MLContext(seed: 2024);
var rng = mlContext.RandomSource;

// deterministic sequence
uint a = rng.NextUInt();
uint b = rng.NextUInt();

…st/MLContext

- Introduce internal IRandomSource and IRandomBulkSource
- Add adapters/shims: RandomSourceAdapter, RandomFromRandomSource, RandomShim
- Implement SIMD-backed MersenneTwisterRandomSource (MT19937) and core MersenneTwister
- Wire IRandomSource through HostEnvironmentBase, ConsoleEnvironment, LocalEnvironment, MLContext
- Add tests for determinism and mixed-call consumption
@asp2286
Copy link
Contributor Author

asp2286 commented Oct 12, 2025

/azp list

Copy link

Commenter does not have sufficient privileges for PR 7525 in repo dotnet/machinelearning

Copy link

codecov bot commented Oct 12, 2025

Codecov Report

❌ Patch coverage is 73.11828% with 250 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.03%. Comparing base (694bc60) to head (bfd8543).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
src/Microsoft.ML.Core/Utilities/MersenneTwister.cs 68.28% 60 Missing and 12 partials ⚠️
...eAnalyzer.Tests/Helpers/CompatibleXUnitVerifier.cs 44.44% 53 Missing and 12 partials ⚠️
...icrosoft.ML.Core/Utilities/ResourceManagerUtils.cs 58.75% 26 Missing and 7 partials ⚠️
...oft.ML.Core/Utilities/FuncInstanceMethodInfo3`3.cs 0.00% 13 Missing ⚠️
...oft.ML.Core/Utilities/FuncInstanceMethodInfo3`4.cs 0.00% 13 Missing ⚠️
...oft.ML.Core/Utilities/FuncInstanceMethodInfo1`2.cs 30.76% 6 Missing and 3 partials ⚠️
...oft.ML.Core/Utilities/FuncInstanceMethodInfo1`3.cs 30.76% 6 Missing and 3 partials ⚠️
...oft.ML.Core/Utilities/FuncInstanceMethodInfo1`4.cs 30.76% 6 Missing and 3 partials ⚠️
...oft.ML.Core/Utilities/FuncInstanceMethodInfo2`4.cs 30.76% 6 Missing and 3 partials ⚠️
...rosoft.ML.Core/Utilities/RandomFromRandomSource.cs 73.68% 3 Missing and 2 partials ⚠️
... and 5 more
Additional details and impacted files
@@           Coverage Diff            @@
##             main    #7525    +/-   ##
========================================
  Coverage   69.02%   69.03%            
========================================
  Files        1482     1490     +8     
  Lines      274092   274934   +842     
  Branches    28266    28375   +109     
========================================
+ Hits       189200   189792   +592     
- Misses      77503    77713   +210     
- Partials     7389     7429    +40     
Flag Coverage Δ
Debug 69.03% <73.11%> (+<0.01%) ⬆️
production 63.30% <65.73%> (-0.01%) ⬇️
test 89.43% <83.20%> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
src/Microsoft.ML.Core/Data/IHostEnvironment.cs 97.56% <ø> (ø)
src/Microsoft.ML.Core/Utilities/Random.cs 82.69% <ø> (-2.57%) ⬇️
...rc/Microsoft.ML.Data/Utilities/LocalEnvironment.cs 86.20% <100.00%> (+0.24%) ⬆️
test/Microsoft.ML.AutoML.Tests/AutoFitTests.cs 92.87% <100.00%> (ø)
...yzer.Tests/Helpers/AdditionalMetadataReferences.cs 100.00% <100.00%> (ø)
...ft.ML.Core.Tests/UnitTests/MersenneTwisterTests.cs 100.00% <100.00%> (ø)
...osoft.ML.Core.Tests/UnitTests/RandomSourceTests.cs 100.00% <100.00%> (ø)
...Microsoft.ML.Core/Utilities/RandomSourceAdapter.cs 95.65% <95.65%> (ø)
...eAnalyzer.Tests/Helpers/CSharpCodeFixVerifier`2.cs 70.00% <66.66%> (ø)
src/Microsoft.ML.Data/MLContext.cs 82.60% <62.50%> (-3.11%) ⬇️
... and 12 more

... and 7 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…FromRandomSource, and ResourceManagerUtils logic; fix Windows native cmake script; add APICompat suppression
@asp2286
Copy link
Contributor Author

asp2286 commented Oct 12, 2025

/azp run

Copy link

Commenter does not have sufficient privileges for PR 7525 in repo dotnet/machinelearning

@asp2286
Copy link
Contributor Author

asp2286 commented Oct 12, 2025

/azp run MachineLearning-CI

Copy link

Commenter does not have sufficient privileges for PR 7525 in repo dotnet/machinelearning

@asp2286
Copy link
Contributor Author

asp2286 commented Oct 14, 2025

@artl93 @ericstj Could you please review this PR when you have a moment? Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant