Change default storage mode to SharedStorage on UMA devices #717

KaanKesginLW · 2025-12-04T19:31:29Z

Summary

Change Metal.jl's default storage mode from PrivateStorage to SharedStorage on Apple Silicon (unified memory architecture) devices. Intel Macs with discrete GPUs retain PrivateStorage as default.

This change:

- Aligns with Apple's official guidance for unified memory platforms
- Enables zero-copy CPU access via unsafe_wrap(Array, gpu_arr) (1.8x to 6x faster than copying with Array())
- Shows no performance regression at production sizes (verified with 67 benchmarks)

Motivation

1. Apple's Guidance for Unified Memory

Apple's Metal Best Practices Guide explicitly states:

iOS/tvOS (unified memory): "The Shared mode is usually the correct choice"

Apple Silicon Macs have the same unified memory architecture as iOS/tvOS. The current PrivateStorage default was based on guidance intended for discrete GPU systems where CPU and GPU have separate memory pools.

2. Performance is Equivalent (67 Tests with Statistical Significance)

Benchmarks on an M2 Max, compared using 95% confidence intervals, show no significant performance difference at production sizes:

Metric	Count
Total tests	67
Ties (no significant difference)	52 (78%)
SharedStorage wins	7 (10%)
PrivateStorage wins	8 (12%)

Key findings:

- At sizes ≥512 MB: virtually all operations are ties
- copyto!, fill!: Identical at all sizes
- MPS matmul: Identical at all sizes (4/4 ties)
- Element-wise at ≥1 GB: All ties

Benchmark breakdown by size

Size	Private Wins	Shared Wins	Ties
64 MB	1	2	6
128 MB	2	2	5
256 MB	0	2	7
512 MB	0	0	10
1024 MB	3*	0	7
2048 MB	0	1	9
4096 MB	0	0	10

*The three PrivateStorage wins at 1024 MB disappeared when the benchmarks were rerun with more iterations, which points to measurement noise.
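The win/tie classification can be sketched as an interval-overlap test. This is a minimal illustration, assuming a normal-approximation 95% confidence interval for the mean; the PR does not say which interval or test it used. The helper names `ci95` and `classify` and the sample timings are hypothetical.

```julia
using Statistics

# 95% confidence interval for the mean via the normal approximation (z = 1.96).
# Assumption: the PR does not state which interval construction it used.
function ci95(samples::AbstractVector{<:Real})
    m = mean(samples)
    half = 1.96 * std(samples) / sqrt(length(samples))
    return (m - half, m + half)
end

# Classify a PrivateStorage-vs-SharedStorage timing pair (lower time is better).
function classify(private_times, shared_times)
    plo, phi = ci95(private_times)
    slo, shi = ci95(shared_times)
    if phi < slo
        return :private_wins   # private interval lies entirely below shared
    elseif shi < plo
        return :shared_wins    # shared interval lies entirely below private
    else
        return :tie            # intervals overlap: no significant difference
    end
end

# Made-up timings in milliseconds, for illustration only.
overlapping = classify([10.0, 10.2, 9.9, 10.1], [10.1, 9.8, 10.3, 10.0])
separated   = classify([10.0, 10.2, 9.9, 10.1], [12.0, 12.1, 11.9, 12.2])
```

With few iterations the intervals are wide and unstable, so a borderline result can flip between a win and a tie from run to run, which is consistent with the 1024 MB footnote.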

3. SharedStorage Enables Faster CPU Access

SharedStorage enables unsafe_wrap(Array, gpu_arr) for zero-copy CPU access, avoiding the allocation + copy overhead of Array():

Size	Private: `Array()` + use	Shared: `unsafe_wrap()` + use	Speedup
512 MB	17.0 ms	9.0 ms	1.9x
1 GB	33.8 ms	18.1 ms	1.9x
2 GB	65.1 ms	36.4 ms	1.8x
4 GB	442.5 ms	73.9 ms	6.0x
8 GB	888.6 ms	147.3 ms	6.0x

Benchmarked on M2 Max, median of 5 runs. Both paths include sum() to measure actual data access.

The speedup comes from avoiding the allocation and the memory copy: with SharedStorage, the CPU and GPU share the same physical memory on Apple Silicon.

Implementation

The default preference changes from "private" to "auto", which detects UMA at module load:

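# Resolved once at module load from the "default_storage" preference.
const DefaultStorageMode = let str = @load_preference("default_storage", "auto")
    if str == "private"
        PrivateStorage
    elseif str == "shared"
        SharedStorage
    elseif str == "auto"
        # Query the device only on macOS with at least one Metal device;
        # otherwise loading the module would fail on Linux and Windows.
        if Sys.isapple() && !isempty(devices())
            # Unified memory (Apple Silicon) selects SharedStorage.
            MTLDevice(1).hasUnifiedMemory ? SharedStorage : PrivateStorage
        else
            PrivateStorage
        end
    else
        error("unknown default storage mode: $str")
    end
end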

Preference options:

- "auto" (new default): SharedStorage on UMA, PrivateStorage otherwise
- "shared": Force SharedStorage
- "private": Force PrivateStorage (previous behavior)
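The preference options can be sketched as a pure function that runs on any platform without Metal.jl. The name `resolve_storage_mode`, the symbol return values, and the boolean keyword arguments (standing in for `Sys.isapple()`, `!isempty(devices())`, and `hasUnifiedMemory`) are illustrative assumptions, not part of the PR.

```julia
# Hypothetical helper mirroring the preference logic. Symbols stand in for the
# PrivateStorage / SharedStorage types so that no GPU package is required.
function resolve_storage_mode(pref::AbstractString;
                              is_apple::Bool, has_device::Bool, unified::Bool)
    if pref == "private"
        return :PrivateStorage
    elseif pref == "shared"
        return :SharedStorage
    elseif pref == "auto"
        # Trust the unified-memory flag only when a Metal device can be queried.
        return (is_apple && has_device && unified) ? :SharedStorage : :PrivateStorage
    else
        error("unknown default storage mode: $pref")
    end
end

# Apple Silicon: a device is present and reports unified memory.
apple_silicon = resolve_storage_mode("auto"; is_apple=true, has_device=true, unified=true)
# Intel Mac with a discrete GPU: a device is present, but memory is not unified.
intel_mac = resolve_storage_mode("auto"; is_apple=true, has_device=true, unified=false)
# Linux CI runner: no Metal devices, so "auto" falls back to private storage.
linux_ci = resolve_storage_mode("auto"; is_apple=false, has_device=false, unified=false)
```

Keeping the platform checks as plain inputs makes every branch testable on a machine without a Metal device, which the module-level constant cannot offer.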

Backward Compatibility

- Preference override: Users can set default_storage = "private" in LocalPreferences.toml
- Explicit storage: MtlArray{T,N,PrivateStorage}(...) always works
- Test suite ready: Already handles non-Private default (see test/runtests.jl:85-88)
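As a sketch of the preference override, a LocalPreferences.toml placed next to the active project's Project.toml could look like the fragment below. This assumes the usual Preferences.jl layout, in which a package's preferences live in a table named after the package. The key and value come from the PR text.

```toml
# LocalPreferences.toml (next to the active Project.toml)
[Metal]
default_storage = "private"  # or "shared" / "auto"
```

Because the preference is read when the module loads, a change would presumably take effect only after restarting Julia.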

- Add 'auto' mode (new default) that detects unified memory architecture
- Apple Silicon (UMA): defaults to SharedStorage (zero-copy CPU access)
- Intel discrete GPU: defaults to PrivateStorage
- User preference override still works via LocalPreferences.toml
- Add tests for default storage mode selection

github-actions

Metal Benchmarks

Benchmark suite	Current: `fa51600`	Previous: `239fa4d`	Ratio
`latency/precompile`	`24210071833` ns	`24383716541` ns	`0.99`
`latency/ttfp`	`2285845834` ns	`2324081375` ns	`0.98`
`latency/import`	`1404204333.5` ns	`1427504083` ns	`0.98`
`integration/metaldevrt`	`836834` ns	`837292` ns	`1.00`
`integration/byval/slices=1`	`1562604.5` ns	`1598354` ns	`0.98`
`integration/byval/slices=3`	`8870458` ns	`19021791.5` ns	`0.47`
`integration/byval/reference`	`1546625` ns	`1590708.5` ns	`0.97`
`integration/byval/slices=2`	`2609583` ns	`2727250` ns	`0.96`
`kernel/indexing`	`580875` ns	`459062.5` ns	`1.27`
`kernel/indexing_checked`	`600833` ns	`463104.5` ns	`1.30`
`kernel/launch`	`12042` ns	`11625` ns	`1.04`
`kernel/rand`	`563084` ns	`526667` ns	`1.07`
`array/construct`	`6083` ns	`5958` ns	`1.02`
`array/broadcast`	`601375` ns	`545375` ns	`1.10`
`array/random/randn/Float32`	`823417` ns	`886167` ns	`0.93`
`array/random/randn!/Float32`	`614375` ns	`578875` ns	`1.06`
`array/random/rand!/Int64`	`552709` ns	`539083` ns	`1.03`
`array/random/rand!/Float32`	`581583` ns	`533229.5` ns	`1.09`
`array/random/rand/Int64`	`777520.5` ns	`887000` ns	`0.88`
`array/random/rand/Float32`	`607958` ns	`840959` ns	`0.72`
`array/accumulate/Int64/1d`	`1256209` ns	`1292146` ns	`0.97`
`array/accumulate/Int64/dims=1`	`1825750` ns	`1865375` ns	`0.98`
`array/accumulate/Int64/dims=2`	`2132083` ns	`2215437` ns	`0.96`
`array/accumulate/Int64/dims=1L`	`11449208` ns	`12096125` ns	`0.95`
`array/accumulate/Int64/dims=2L`	`9721166.5` ns	`10003417` ns	`0.97`
`array/accumulate/Float32/1d`	`1113083` ns	`1086042` ns	`1.02`
`array/accumulate/Float32/dims=1`	`1546896` ns	`1581542` ns	`0.98`
`array/accumulate/Float32/dims=2`	`1861500` ns	`1998167` ns	`0.93`
`array/accumulate/Float32/dims=1L`	`9774271` ns	`10248396` ns	`0.95`
`array/accumulate/Float32/dims=2L`	`7288584` ns	`7422792` ns	`0.98`
`array/reductions/reduce/Int64/1d`	`993271` ns	`1312917` ns	`0.76`
`array/reductions/reduce/Int64/dims=1`	`1088500` ns	`1120125` ns	`0.97`
`array/reductions/reduce/Int64/dims=2`	`1121959` ns	`1153917` ns	`0.97`
`array/reductions/reduce/Int64/dims=1L`	`2028708` ns	`2041417` ns	`0.99`
`array/reductions/reduce/Int64/dims=2L`	`4263729` ns	`3778125` ns	`1.13`
`array/reductions/reduce/Float32/1d`	`531792` ns	`796167` ns	`0.67`
`array/reductions/reduce/Float32/dims=1`	`825750` ns	`794000` ns	`1.04`
`array/reductions/reduce/Float32/dims=2`	`845562` ns	`818562.5` ns	`1.03`
`array/reductions/reduce/Float32/dims=1L`	`1325166.5` ns	`1329000` ns	`1.00`
`array/reductions/reduce/Float32/dims=2L`	`1823187.5` ns	`1796708.5` ns	`1.01`
`array/reductions/mapreduce/Int64/1d`	`1174334` ns	`1298666` ns	`0.90`
`array/reductions/mapreduce/Int64/dims=1`	`1136125` ns	`1086313` ns	`1.05`
`array/reductions/mapreduce/Int64/dims=2`	`1172020.5` ns	`1122666` ns	`1.04`
`array/reductions/mapreduce/Int64/dims=1L`	`2000479.5` ns	`2025395.5` ns	`0.99`
`array/reductions/mapreduce/Int64/dims=2L`	`3654250` ns	`3647583` ns	`1.00`
`array/reductions/mapreduce/Float32/1d`	`597250` ns	`774083.5` ns	`0.77`
`array/reductions/mapreduce/Float32/dims=1`	`822750` ns	`791417` ns	`1.04`
`array/reductions/mapreduce/Float32/dims=2`	`849791.5` ns	`826542` ns	`1.03`
`array/reductions/mapreduce/Float32/dims=1L`	`1323666.5` ns	`1322667` ns	`1.00`
`array/reductions/mapreduce/Float32/dims=2L`	`1817145.5` ns	`1817916.5` ns	`1.00`
`array/private/copyto!/gpu_to_gpu`	`630625` ns	`533917` ns	`1.18`
`array/private/copyto!/cpu_to_gpu`	`801208` ns	`690271` ns	`1.16`
`array/private/copyto!/gpu_to_cpu`	`799625` ns	`668542` ns	`1.20`
`array/private/iteration/findall/int`	`1580042` ns	`1565687.5` ns	`1.01`
`array/private/iteration/findall/bool`	`1428292` ns	`1465333.5` ns	`0.97`
`array/private/iteration/findfirst/int`	`2078917` ns	`2079042` ns	`1.00`
`array/private/iteration/findfirst/bool`	`2020417` ns	`2020083` ns	`1.00`
`array/private/iteration/scalar`	`5337208` ns	`2787125` ns	`1.91`
`array/private/iteration/logical`	`2552625` ns	`2599208` ns	`0.98`
`array/private/iteration/findmin/1d`	`2207541` ns	`2265458` ns	`0.97`
`array/private/iteration/findmin/2d`	`1516833` ns	`1528791` ns	`0.99`
`array/private/copy`	`602937.5` ns	`847041.5` ns	`0.71`
`array/shared/copyto!/gpu_to_gpu`	`83416` ns	`84333` ns	`0.99`
`array/shared/copyto!/cpu_to_gpu`	`85792` ns	`83042` ns	`1.03`
`array/shared/copyto!/gpu_to_cpu`	`82542` ns	`83479.5` ns	`0.99`
`array/shared/iteration/findall/int`	`1573042` ns	`1558208` ns	`1.01`
`array/shared/iteration/findall/bool`	`1436542` ns	`1470708` ns	`0.98`
`array/shared/iteration/findfirst/int`	`1636312.5` ns	`1682792` ns	`0.97`
`array/shared/iteration/findfirst/bool`	`1597291` ns	`1644334` ns	`0.97`
`array/shared/iteration/scalar`	`211583` ns	`202000` ns	`1.05`
`array/shared/iteration/logical`	`2427000` ns	`2368458` ns	`1.02`
`array/shared/iteration/findmin/1d`	`1816667` ns	`1845542` ns	`0.98`
`array/shared/iteration/findmin/2d`	`1517792` ns	`1521583` ns	`1.00`
`array/shared/copy`	`255979` ns	`210959` ns	`1.21`
`array/permutedims/4d`	`2356625` ns	`2473375` ns	`0.95`
`array/permutedims/2d`	`1146000` ns	`1178666.5` ns	`0.97`
`array/permutedims/3d`	`1655042` ns	`1780750` ns	`0.93`
`metal/synchronization/stream`	`19292` ns	`19334` ns	`1.00`
`metal/synchronization/context`	`20000` ns	`20000` ns	`1`

This comment was automatically generated by workflow using github-action-benchmark.

codecov · 2025-12-05T09:23:10Z

Codecov Report

❌ Patch coverage is 93.18182% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.08%. Comparing base (dce2026) to head (fa51600).
⚠️ Report is 2 commits behind head on main.

Files with missing lines	Patch %	Lines
src/array.jl	93.18%	6 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #717      +/-   ##
==========================================
+ Coverage   80.91%   81.08%   +0.17%     
==========================================
  Files          62       62              
  Lines        2829     2829              
==========================================
+ Hits         2289     2294       +5     
+ Misses        540      535       -5

KaanKesginLW force-pushed the feature/storage-mode-optimization branch from 2f591e1 to f696653 Compare December 4, 2025 19:37

Fix DefaultStorageMode to handle non-Apple platforms

d2ccf89

Guard MTLDevice(1).hasUnifiedMemory with Sys.isapple() and a devices() check to prevent module load failure on Ubuntu and Windows.

github-actions bot reviewed Dec 4, 2025

Retrigger CI for flaky mag-api-validation test

fa51600
