Change default storage mode to SharedStorage on UMA devices #717
- Add 'auto' mode (new default) that detects unified memory architecture
- Apple Silicon (UMA): defaults to SharedStorage (zero-copy CPU access)
- Intel discrete GPU: defaults to PrivateStorage
- User preference override still works via LocalPreferences.toml
- Add tests for default storage mode selection
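As an illustration of the preference override mentioned in this commit, the sketch below parses a LocalPreferences.toml fragment with Julia's TOML standard library. The default_storage key comes from this PR. The [Metal] table name is an assumption based on how Preferences.jl keys settings by package name, and the fallback to "auto" only mirrors the new default; it is not Metal.jl's actual loading code.

```julia
using TOML

# Contents of a LocalPreferences.toml that pins the previous behavior.
# Assumption: preferences live under a [Metal] table, following the
# Preferences.jl convention of keying settings by package name.
local_prefs = """
[Metal]
default_storage = "private"
"""

prefs = TOML.parse(local_prefs)

# Fall back to "auto" when the key is absent, mirroring the new default.
metal_prefs = get(prefs, "Metal", Dict{String,Any}())
mode = get(metal_prefs, "default_storage", "auto")
println(mode)
```

Removing the default_storage line from the fragment makes the lookup fall through to "auto".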
Force-pushed from 2f591e1 to f696653.
Guard MTLDevice(1).hasUnifiedMemory with a Sys.isapple() and devices() check to prevent module load failure on Ubuntu/Windows.
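The guard described in this commit can be sketched as a pure function, so the short-circuit order is visible without a GPU. Everything here is hypothetical: resolve_storage_mode, its keyword arguments, and the returned symbols are illustrative names, not Metal.jl's code. The has_unified_memory callback stands in for the MTLDevice(1).hasUnifiedMemory query.

```julia
# Hypothetical sketch; the function name and arguments are illustrative only.
function resolve_storage_mode(pref::AbstractString;
                              is_apple::Bool = Sys.isapple(),
                              device_count::Integer = 0,
                              has_unified_memory::Function = () -> false)
    # Explicit preferences win and never touch the device.
    pref == "shared"  && return :SharedStorage
    pref == "private" && return :PrivateStorage
    pref == "auto" || throw(ArgumentError("unknown storage preference: $pref"))

    # Short-circuit evaluation: the device query runs only on Apple platforms
    # with at least one device, so it cannot fail on Linux or Windows.
    uma = is_apple && device_count > 0 && has_unified_memory()
    return uma ? :SharedStorage : :PrivateStorage
end

# Simulated Apple Silicon machine.
println(resolve_storage_mode("auto"; is_apple = true, device_count = 1,
                             has_unified_memory = () -> true))

# Simulated Linux CI runner: the callback would throw if it were called.
println(resolve_storage_mode("auto"; is_apple = false, device_count = 0,
                             has_unified_memory = () -> error("no Metal device")))
```

Passing the platform facts as arguments keeps the decision testable on any machine, which is the same concern the guard addresses at module load.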
Metal Benchmarks
| Benchmark suite | Current: fa51600 | Previous: 239fa4d | Ratio |
|---|---|---|---|
| latency/precompile | 24210071833 ns | 24383716541 ns | 0.99 |
| latency/ttfp | 2285845834 ns | 2324081375 ns | 0.98 |
| latency/import | 1404204333.5 ns | 1427504083 ns | 0.98 |
| integration/metaldevrt | 836834 ns | 837292 ns | 1.00 |
| integration/byval/slices=1 | 1562604.5 ns | 1598354 ns | 0.98 |
| integration/byval/slices=3 | 8870458 ns | 19021791.5 ns | 0.47 |
| integration/byval/reference | 1546625 ns | 1590708.5 ns | 0.97 |
| integration/byval/slices=2 | 2609583 ns | 2727250 ns | 0.96 |
| kernel/indexing | 580875 ns | 459062.5 ns | 1.27 |
| kernel/indexing_checked | 600833 ns | 463104.5 ns | 1.30 |
| kernel/launch | 12042 ns | 11625 ns | 1.04 |
| kernel/rand | 563084 ns | 526667 ns | 1.07 |
| array/construct | 6083 ns | 5958 ns | 1.02 |
| array/broadcast | 601375 ns | 545375 ns | 1.10 |
| array/random/randn/Float32 | 823417 ns | 886167 ns | 0.93 |
| array/random/randn!/Float32 | 614375 ns | 578875 ns | 1.06 |
| array/random/rand!/Int64 | 552709 ns | 539083 ns | 1.03 |
| array/random/rand!/Float32 | 581583 ns | 533229.5 ns | 1.09 |
| array/random/rand/Int64 | 777520.5 ns | 887000 ns | 0.88 |
| array/random/rand/Float32 | 607958 ns | 840959 ns | 0.72 |
| array/accumulate/Int64/1d | 1256209 ns | 1292146 ns | 0.97 |
| array/accumulate/Int64/dims=1 | 1825750 ns | 1865375 ns | 0.98 |
| array/accumulate/Int64/dims=2 | 2132083 ns | 2215437 ns | 0.96 |
| array/accumulate/Int64/dims=1L | 11449208 ns | 12096125 ns | 0.95 |
| array/accumulate/Int64/dims=2L | 9721166.5 ns | 10003417 ns | 0.97 |
| array/accumulate/Float32/1d | 1113083 ns | 1086042 ns | 1.02 |
| array/accumulate/Float32/dims=1 | 1546896 ns | 1581542 ns | 0.98 |
| array/accumulate/Float32/dims=2 | 1861500 ns | 1998167 ns | 0.93 |
| array/accumulate/Float32/dims=1L | 9774271 ns | 10248396 ns | 0.95 |
| array/accumulate/Float32/dims=2L | 7288584 ns | 7422792 ns | 0.98 |
| array/reductions/reduce/Int64/1d | 993271 ns | 1312917 ns | 0.76 |
| array/reductions/reduce/Int64/dims=1 | 1088500 ns | 1120125 ns | 0.97 |
| array/reductions/reduce/Int64/dims=2 | 1121959 ns | 1153917 ns | 0.97 |
| array/reductions/reduce/Int64/dims=1L | 2028708 ns | 2041417 ns | 0.99 |
| array/reductions/reduce/Int64/dims=2L | 4263729 ns | 3778125 ns | 1.13 |
| array/reductions/reduce/Float32/1d | 531792 ns | 796167 ns | 0.67 |
| array/reductions/reduce/Float32/dims=1 | 825750 ns | 794000 ns | 1.04 |
| array/reductions/reduce/Float32/dims=2 | 845562 ns | 818562.5 ns | 1.03 |
| array/reductions/reduce/Float32/dims=1L | 1325166.5 ns | 1329000 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2L | 1823187.5 ns | 1796708.5 ns | 1.01 |
| array/reductions/mapreduce/Int64/1d | 1174334 ns | 1298666 ns | 0.90 |
| array/reductions/mapreduce/Int64/dims=1 | 1136125 ns | 1086313 ns | 1.05 |
| array/reductions/mapreduce/Int64/dims=2 | 1172020.5 ns | 1122666 ns | 1.04 |
| array/reductions/mapreduce/Int64/dims=1L | 2000479.5 ns | 2025395.5 ns | 0.99 |
| array/reductions/mapreduce/Int64/dims=2L | 3654250 ns | 3647583 ns | 1.00 |
| array/reductions/mapreduce/Float32/1d | 597250 ns | 774083.5 ns | 0.77 |
| array/reductions/mapreduce/Float32/dims=1 | 822750 ns | 791417 ns | 1.04 |
| array/reductions/mapreduce/Float32/dims=2 | 849791.5 ns | 826542 ns | 1.03 |
| array/reductions/mapreduce/Float32/dims=1L | 1323666.5 ns | 1322667 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2L | 1817145.5 ns | 1817916.5 ns | 1.00 |
| array/private/copyto!/gpu_to_gpu | 630625 ns | 533917 ns | 1.18 |
| array/private/copyto!/cpu_to_gpu | 801208 ns | 690271 ns | 1.16 |
| array/private/copyto!/gpu_to_cpu | 799625 ns | 668542 ns | 1.20 |
| array/private/iteration/findall/int | 1580042 ns | 1565687.5 ns | 1.01 |
| array/private/iteration/findall/bool | 1428292 ns | 1465333.5 ns | 0.97 |
| array/private/iteration/findfirst/int | 2078917 ns | 2079042 ns | 1.00 |
| array/private/iteration/findfirst/bool | 2020417 ns | 2020083 ns | 1.00 |
| array/private/iteration/scalar | 5337208 ns | 2787125 ns | 1.91 |
| array/private/iteration/logical | 2552625 ns | 2599208 ns | 0.98 |
| array/private/iteration/findmin/1d | 2207541 ns | 2265458 ns | 0.97 |
| array/private/iteration/findmin/2d | 1516833 ns | 1528791 ns | 0.99 |
| array/private/copy | 602937.5 ns | 847041.5 ns | 0.71 |
| array/shared/copyto!/gpu_to_gpu | 83416 ns | 84333 ns | 0.99 |
| array/shared/copyto!/cpu_to_gpu | 85792 ns | 83042 ns | 1.03 |
| array/shared/copyto!/gpu_to_cpu | 82542 ns | 83479.5 ns | 0.99 |
| array/shared/iteration/findall/int | 1573042 ns | 1558208 ns | 1.01 |
| array/shared/iteration/findall/bool | 1436542 ns | 1470708 ns | 0.98 |
| array/shared/iteration/findfirst/int | 1636312.5 ns | 1682792 ns | 0.97 |
| array/shared/iteration/findfirst/bool | 1597291 ns | 1644334 ns | 0.97 |
| array/shared/iteration/scalar | 211583 ns | 202000 ns | 1.05 |
| array/shared/iteration/logical | 2427000 ns | 2368458 ns | 1.02 |
| array/shared/iteration/findmin/1d | 1816667 ns | 1845542 ns | 0.98 |
| array/shared/iteration/findmin/2d | 1517792 ns | 1521583 ns | 1.00 |
| array/shared/copy | 255979 ns | 210959 ns | 1.21 |
| array/permutedims/4d | 2356625 ns | 2473375 ns | 0.95 |
| array/permutedims/2d | 1146000 ns | 1178666.5 ns | 0.97 |
| array/permutedims/3d | 1655042 ns | 1780750 ns | 0.93 |
| metal/synchronization/stream | 19292 ns | 19334 ns | 1.00 |
| metal/synchronization/context | 20000 ns | 20000 ns | 1 |
This comment was automatically generated by workflow using github-action-benchmark.
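The Ratio column is consistent with the current time divided by the previous time, so values above 1 mean the current commit is slower. The sketch below recomputes it for two rows of the table. The ratio helper and the 1.5 threshold are hypothetical, chosen only for illustration, and are not taken from the benchmark workflow.

```julia
# Ratio as it appears in the table: current time divided by previous time.
ratio(current_ns, previous_ns) = round(current_ns / previous_ns; digits = 2)

# Two rows copied from the benchmark table (times in nanoseconds).
rows = [
    ("integration/byval/slices=3",     8870458.0, 19021791.5),
    ("array/private/iteration/scalar", 5337208.0, 2787125.0),
]

# Hypothetical threshold for flagging a slowdown.
threshold = 1.5

for (name, current_ns, previous_ns) in rows
    r = ratio(current_ns, previous_ns)
    flag = r > threshold ? "slower" : "ok"
    println(rpad(name, 34), r, "  ", flag)
end
```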
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #717 +/- ##
==========================================
+ Coverage 80.91% 81.08% +0.17%
==========================================
Files 62 62
Lines 2829 2829
==========================================
+ Hits 2289 2294 +5
+ Misses 540 535 -5
Summary
Change Metal.jl's default storage mode from PrivateStorage to SharedStorage on Apple Silicon (unified memory architecture) devices. Intel Macs with discrete GPUs retain PrivateStorage as default.
This change:
- Enables zero-copy CPU access via unsafe_wrap(Array, gpu_arr) (1.8x - 6x faster)
Motivation
1. Apple's Guidance for Unified Memory
Apple's Metal Best Practices Guide explicitly states:
Apple Silicon Macs have the same unified memory architecture as iOS/tvOS. The current PrivateStorage default was based on guidance intended for discrete GPU systems where CPU and GPU have separate memory pools.
2. Performance is Equivalent (67 Tests with Statistical Significance)
Comprehensive benchmarks on M2 Max with 95% confidence interval testing show no performance difference at production sizes:
Key findings:
- copyto!, fill!: Identical at all sizes
Benchmark breakdown by size
*1024 MB "regressions" disappeared with more iterations (measurement noise)
3. SharedStorage Enables Faster CPU Access
SharedStorage enables unsafe_wrap(Array, gpu_arr) for zero-copy CPU access, avoiding the allocation + copy overhead of Array():
[Table: Array() + use vs. unsafe_wrap() + use]
Benchmarked on M2 Max, median of 5 runs. Both paths include sum() to measure actual data access.
The speedup comes from avoiding allocation and memory copy: with SharedStorage, CPU and GPU share the same physical memory on Apple Silicon.
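The difference between copying and wrapping can be illustrated without a GPU. The sketch below is a CPU-only analogy: it uses Base's pointer-based unsafe_wrap on an ordinary host Array as a stand-in for a shared-storage buffer. It is not the Metal.jl method named above, and the variable names are illustrative.

```julia
# CPU-only analogy: `buf` stands in for a buffer in shared storage.
buf = collect(Float32, 1:8)

# Copy path: allocates a new array and copies every element.
copied = copy(buf)

# Wrap path: a second Array over the same memory, with no allocation of
# element storage and no copy. GC.@preserve keeps `buf` alive while the
# raw pointer is in use.
GC.@preserve buf begin
    wrapped = unsafe_wrap(Array, pointer(buf), length(buf))
    wrapped[1] = 42f0   # write through the wrapper
end

println(buf[1])     # the write is visible in the original buffer
println(copied[1])  # the earlier copy is unaffected
```

The wrapped array aliases the original memory, which is why the wrap path skips the allocation and copy cost that the copy path pays.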
Implementation
The default preference changes from "private" to "auto", which detects UMA at module load:
Preference options:
- "auto" (new default): SharedStorage on UMA, PrivateStorage otherwise
- "shared": Force SharedStorage
- "private": Force PrivateStorage (previous behavior)
Backward Compatibility
- default_storage = "private" in LocalPreferences.toml
- MtlArray{T,N,PrivateStorage}(...) always works
- test/runtests.jl:85-88
Related