Skip to content

Conversation

@KaanKesginLW
Copy link
Contributor

@KaanKesginLW KaanKesginLW commented Dec 4, 2025

Summary

Change Metal.jl's default storage mode from PrivateStorage to SharedStorage on Apple Silicon (unified memory architecture) devices. Intel Macs with discrete GPUs retain PrivateStorage as default.

This change:

  • Aligns with Apple's official guidance for unified memory platforms
  • Enables zero-copy CPU access via unsafe_wrap(Array, gpu_arr) (1.8x - 6x faster)
  • Has no performance regression (verified with 67 benchmarks)

Motivation

1. Apple's Guidance for Unified Memory

Apple's Metal Best Practices Guide explicitly states:

iOS/tvOS (unified memory): "The Shared mode is usually the correct choice"

Apple Silicon Macs have the same unified memory architecture as iOS/tvOS. The current PrivateStorage default was based on guidance intended for discrete GPU systems where CPU and GPU have separate memory pools.

2. Performance is Equivalent (67 Tests with Statistical Significance)

Comprehensive benchmarks on M2 Max with 95% confidence interval testing show no performance difference at production sizes:

Metric Count
Total tests 67
Ties (no significant difference) 52 (78%)
SharedStorage wins 7 (10%)
PrivateStorage wins 8 (12%)

Key findings:

  • At sizes ≥512 MB: virtually all operations are ties
  • copyto!, fill!: Identical at all sizes
  • MPS matmul: Identical at all sizes (4/4 ties)
  • Element-wise at ≥1 GB: All ties
Benchmark breakdown by size
Size Private Wins Shared Wins Ties
64 MB 1 2 6
128 MB 2 2 5
256 MB 0 2 7
512 MB 0 0 10
1024 MB 3* 0 7
2048 MB 0 1 9
4096 MB 0 0 10

*1024 MB "regressions" disappeared with more iterations (measurement noise)

3. SharedStorage Enables Faster CPU Access

SharedStorage enables unsafe_wrap(Array, gpu_arr) for zero-copy CPU access, avoiding the allocation + copy overhead of Array():

Size Private: Array() + use Shared: unsafe_wrap() + use Speedup
512 MB 17.0 ms 9.0 ms 1.9x
1 GB 33.8 ms 18.1 ms 1.9x
2 GB 65.1 ms 36.4 ms 1.8x
4 GB 442.5 ms 73.9 ms 6.0x
8 GB 888.6 ms 147.3 ms 6.0x

Benchmarked on M2 Max, median of 5 runs. Both paths include sum() to measure actual data access.

The speedup comes from avoiding allocation and memory copy - with SharedStorage, CPU and GPU share the same physical memory on Apple Silicon.

Implementation

The default preference changes from "private" to "auto", which detects UMA at module load:

const DefaultStorageMode = let str = @load_preference("default_storage", "auto")
    if str == "private"
        PrivateStorage
    elseif str == "shared"
        SharedStorage
    elseif str == "auto"
        if Sys.isapple() && !isempty(devices())
            MTLDevice(1).hasUnifiedMemory ? SharedStorage : PrivateStorage
        else
            PrivateStorage
        end
    else
        error("unknown default storage mode: $str")
    end
end

Preference options:

  • "auto" (new default): SharedStorage on UMA, PrivateStorage otherwise
  • "shared": Force SharedStorage
  • "private": Force PrivateStorage (previous behavior)

Backward Compatibility

  1. Preference override: Users can set default_storage = "private" in LocalPreferences.toml
  2. Explicit storage: MtlArray{T,N,PrivateStorage}(...) always works
  3. Test suite ready: Already handles non-Private default (see test/runtests.jl:85-88)

Related

@github-actions
Copy link
Contributor

github-actions bot commented Dec 4, 2025

Your PR no longer requires formatting changes. Thank you for your contribution!

- Add 'auto' mode (new default) that detects unified memory architecture
- Apple Silicon (UMA): defaults to SharedStorage (zero-copy CPU access)
- Intel discrete GPU: defaults to PrivateStorage
- User preference override still works via LocalPreferences.toml
- Add tests for default storage mode selection
@KaanKesginLW KaanKesginLW force-pushed the feature/storage-mode-optimization branch from 2f591e1 to f696653 Compare December 4, 2025 19:37
Guard MTLDevice(1).hasUnifiedMemory with Sys.isapple() and devices()
check to prevent module load failure on ubuntu/windows.
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Metal Benchmarks

Benchmark suite Current: fa51600 Previous: 239fa4d Ratio
latency/precompile 24210071833 ns 24383716541 ns 0.99
latency/ttfp 2285845834 ns 2324081375 ns 0.98
latency/import 1404204333.5 ns 1427504083 ns 0.98
integration/metaldevrt 836834 ns 837292 ns 1.00
integration/byval/slices=1 1562604.5 ns 1598354 ns 0.98
integration/byval/slices=3 8870458 ns 19021791.5 ns 0.47
integration/byval/reference 1546625 ns 1590708.5 ns 0.97
integration/byval/slices=2 2609583 ns 2727250 ns 0.96
kernel/indexing 580875 ns 459062.5 ns 1.27
kernel/indexing_checked 600833 ns 463104.5 ns 1.30
kernel/launch 12042 ns 11625 ns 1.04
kernel/rand 563084 ns 526667 ns 1.07
array/construct 6083 ns 5958 ns 1.02
array/broadcast 601375 ns 545375 ns 1.10
array/random/randn/Float32 823417 ns 886167 ns 0.93
array/random/randn!/Float32 614375 ns 578875 ns 1.06
array/random/rand!/Int64 552709 ns 539083 ns 1.03
array/random/rand!/Float32 581583 ns 533229.5 ns 1.09
array/random/rand/Int64 777520.5 ns 887000 ns 0.88
array/random/rand/Float32 607958 ns 840959 ns 0.72
array/accumulate/Int64/1d 1256209 ns 1292146 ns 0.97
array/accumulate/Int64/dims=1 1825750 ns 1865375 ns 0.98
array/accumulate/Int64/dims=2 2132083 ns 2215437 ns 0.96
array/accumulate/Int64/dims=1L 11449208 ns 12096125 ns 0.95
array/accumulate/Int64/dims=2L 9721166.5 ns 10003417 ns 0.97
array/accumulate/Float32/1d 1113083 ns 1086042 ns 1.02
array/accumulate/Float32/dims=1 1546896 ns 1581542 ns 0.98
array/accumulate/Float32/dims=2 1861500 ns 1998167 ns 0.93
array/accumulate/Float32/dims=1L 9774271 ns 10248396 ns 0.95
array/accumulate/Float32/dims=2L 7288584 ns 7422792 ns 0.98
array/reductions/reduce/Int64/1d 993271 ns 1312917 ns 0.76
array/reductions/reduce/Int64/dims=1 1088500 ns 1120125 ns 0.97
array/reductions/reduce/Int64/dims=2 1121959 ns 1153917 ns 0.97
array/reductions/reduce/Int64/dims=1L 2028708 ns 2041417 ns 0.99
array/reductions/reduce/Int64/dims=2L 4263729 ns 3778125 ns 1.13
array/reductions/reduce/Float32/1d 531792 ns 796167 ns 0.67
array/reductions/reduce/Float32/dims=1 825750 ns 794000 ns 1.04
array/reductions/reduce/Float32/dims=2 845562 ns 818562.5 ns 1.03
array/reductions/reduce/Float32/dims=1L 1325166.5 ns 1329000 ns 1.00
array/reductions/reduce/Float32/dims=2L 1823187.5 ns 1796708.5 ns 1.01
array/reductions/mapreduce/Int64/1d 1174334 ns 1298666 ns 0.90
array/reductions/mapreduce/Int64/dims=1 1136125 ns 1086313 ns 1.05
array/reductions/mapreduce/Int64/dims=2 1172020.5 ns 1122666 ns 1.04
array/reductions/mapreduce/Int64/dims=1L 2000479.5 ns 2025395.5 ns 0.99
array/reductions/mapreduce/Int64/dims=2L 3654250 ns 3647583 ns 1.00
array/reductions/mapreduce/Float32/1d 597250 ns 774083.5 ns 0.77
array/reductions/mapreduce/Float32/dims=1 822750 ns 791417 ns 1.04
array/reductions/mapreduce/Float32/dims=2 849791.5 ns 826542 ns 1.03
array/reductions/mapreduce/Float32/dims=1L 1323666.5 ns 1322667 ns 1.00
array/reductions/mapreduce/Float32/dims=2L 1817145.5 ns 1817916.5 ns 1.00
array/private/copyto!/gpu_to_gpu 630625 ns 533917 ns 1.18
array/private/copyto!/cpu_to_gpu 801208 ns 690271 ns 1.16
array/private/copyto!/gpu_to_cpu 799625 ns 668542 ns 1.20
array/private/iteration/findall/int 1580042 ns 1565687.5 ns 1.01
array/private/iteration/findall/bool 1428292 ns 1465333.5 ns 0.97
array/private/iteration/findfirst/int 2078917 ns 2079042 ns 1.00
array/private/iteration/findfirst/bool 2020417 ns 2020083 ns 1.00
array/private/iteration/scalar 5337208 ns 2787125 ns 1.91
array/private/iteration/logical 2552625 ns 2599208 ns 0.98
array/private/iteration/findmin/1d 2207541 ns 2265458 ns 0.97
array/private/iteration/findmin/2d 1516833 ns 1528791 ns 0.99
array/private/copy 602937.5 ns 847041.5 ns 0.71
array/shared/copyto!/gpu_to_gpu 83416 ns 84333 ns 0.99
array/shared/copyto!/cpu_to_gpu 85792 ns 83042 ns 1.03
array/shared/copyto!/gpu_to_cpu 82542 ns 83479.5 ns 0.99
array/shared/iteration/findall/int 1573042 ns 1558208 ns 1.01
array/shared/iteration/findall/bool 1436542 ns 1470708 ns 0.98
array/shared/iteration/findfirst/int 1636312.5 ns 1682792 ns 0.97
array/shared/iteration/findfirst/bool 1597291 ns 1644334 ns 0.97
array/shared/iteration/scalar 211583 ns 202000 ns 1.05
array/shared/iteration/logical 2427000 ns 2368458 ns 1.02
array/shared/iteration/findmin/1d 1816667 ns 1845542 ns 0.98
array/shared/iteration/findmin/2d 1517792 ns 1521583 ns 1.00
array/shared/copy 255979 ns 210959 ns 1.21
array/permutedims/4d 2356625 ns 2473375 ns 0.95
array/permutedims/2d 1146000 ns 1178666.5 ns 0.97
array/permutedims/3d 1655042 ns 1780750 ns 0.93
metal/synchronization/stream 19292 ns 19334 ns 1.00
metal/synchronization/context 20000 ns 20000 ns 1

This comment was automatically generated by workflow using github-action-benchmark.

@codecov
Copy link

codecov bot commented Dec 5, 2025

Codecov Report

❌ Patch coverage is 93.18182% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.08%. Comparing base (dce2026) to head (fa51600).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
src/array.jl 93.18% 6 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #717      +/-   ##
==========================================
+ Coverage   80.91%   81.08%   +0.17%     
==========================================
  Files          62       62              
  Lines        2829     2829              
==========================================
+ Hits         2289     2294       +5     
+ Misses        540      535       -5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant