
Conversation

Contributor

@KaanKesginLW KaanKesginLW commented Dec 5, 2025

Problem

Broadcast operations silently produce incorrect results for arrays with more than ~4.29 billion elements (~16GB for Float32). Elements beyond the UInt32 limit receive zeros or wrong values due to integer overflow.

Reproduction:

using Metal
arr = MtlArray{Float32}(undef, 4_500_000_000)  # 18GB
arr .= 42.0f0
Array(arr)[end]  # Returns 0.0 instead of 42.0

Root Cause

Metal Shading Language's thread_position_in_grid attribute only supports uint (32-bit) or ushort (16-bit) types per Apple's MSL specification. This is consistent with other GPU frameworks: CUDA's threadIdx/blockIdx components are also 32-bit.

The current broadcast kernels use this UInt32 value directly in arithmetic:

i = thread_position_in_grid().x      # UInt32
stride = threads_per_grid().x        # UInt32
while i <= length(dest)
    dest[i] = bc[i]
    i += stride                      # UInt32 + UInt32 = OVERFLOW at 4.29B!
end
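The wraparound can be reproduced on the CPU with plain UInt32 arithmetic, no GPU required (a minimal sketch; the specific values are illustrative):

```julia
# UInt32 addition wraps modulo 2^32 - exactly what happens to the kernel index
# once a thread's grid-stride walk crosses the 4,294,967,295 boundary.
i = typemax(UInt32) - UInt32(2)   # 4294967293
stride = UInt32(1024)
i += stride                       # silently wraps past typemax(UInt32)
@assert i == UInt32(1021)         # 4294967293 + 1024 - 2^32 == 1021
```

The wrapped index is a small, valid-looking value, which is why the kernel writes to the wrong element instead of trapping.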

Solution

Convert to Int64 immediately and use 64-bit arithmetic throughout:

i = Int64(thread_position_in_grid().x)
stride = Int64(threads_per_grid().x)
len = Int64(length(dest))
while i <= len
    dest[i] = bc[i]
    i += stride                      # Int64, no overflow
end
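The same arithmetic done after the Int64 conversion proceeds past the 32-bit boundary without wrapping (a CPU sketch mirroring the converted kernel values):

```julia
# After widening the UInt32 hardware values to Int64, the addition is exact:
# Int64 can represent indices up to 2^63 - 1, far beyond any allocatable array.
i = Int64(typemax(UInt32) - UInt32(2))   # 4294967293
stride = Int64(1024)
i += stride
@assert i == 4_294_968_317               # past 2^32, no wraparound
```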

Why this is the correct fix (not a hack):

  1. Standard GPU pattern: This is exactly how CUDA and other GPU frameworks handle >4B elements. See NVIDIA's blog on grid-stride loops. The thread ID only needs to be unique (up to 2^32 threads is plenty), while the index computation uses 64-bit to reach any element. The RAPIDS cuDF project had to fix the same overflow issue.

  2. No performance overhead: Benchmarked at 1B elements, the Int64 version is within noise of the UInt32 version. Apple Silicon GPUs handle 64-bit integer arithmetic efficiently (native ARM64).

  3. Matches Julia conventions: length() returns Int64, array indexing uses Int, so Int64 indices are natural.
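The coverage argument in point 1 can be checked on the CPU by simulating the grid-stride loop for every virtual thread and confirming each element is visited exactly once (a sketch with hypothetical thread counts; no Metal required):

```julia
# Simulate a grid-stride loop: `nthreads` threads cooperatively covering `len`
# elements. `tid` plays the role of thread_position_in_grid().x (1-based),
# and the stride equals threads_per_grid().x, as in the fixed kernel.
function gridstride_visits(len::Int64, nthreads::Int64)
    visits = zeros(Int, len)
    for tid in 1:nthreads
        i = tid                  # converted to Int64 up front in the real kernel
        while i <= len
            visits[i] += 1
            i += nthreads        # 64-bit addition: no wraparound
        end
    end
    return visits
end

@assert all(==(1), gridstride_visits(Int64(10_000), Int64(256)))
@assert all(==(1), gridstride_visits(Int64(7), Int64(256)))  # fewer elements than threads
```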

Why Int64 instead of UInt64?

Both would work correctly for this use case (we only add to indices, never subtract). However, Int64 is preferred because:

  • Julia's array indexing is optimized for signed integers (Int)
  • length() returns Int64, avoiding mixed-sign comparisons
  • Benchmarking shows Int64 is actually ~30% faster than UInt64 for this kernel (likely due to Julia's indexing optimizations)

Testing

  • All 319 existing broadcast tests pass
  • Verified working at 18GB (4.5B elements) on M2 Max
  • Tested edge cases: 1 element, 1 warp, Int32 max, Int32 max+1, etc.

Why This Matters

Apple Silicon Macs now have up to 192GB unified memory (M3 Ultra). Scientific computing users need to work with arrays larger than 16GB, and silent data corruption is the worst possible failure mode.

Contributor

github-actions bot commented Dec 5, 2025

Your PR requires formatting changes to meet the project's style guidelines.
Please consider running Runic (git runic main) to apply these changes.

diff --git a/src/broadcast.jl b/src/broadcast.jl
index b0107775..84380a7b 100644
--- a/src/broadcast.jl
+++ b/src/broadcast.jl
@@ -66,12 +66,12 @@ end
     if _broadcast_shapes[Is] > BROADCAST_SPECIALIZATION_THRESHOLD
         ## COV_EXCL_START
         function broadcast_cartesian_static(dest, bc, Is)
-             # Use Int64 to avoid UInt32 overflow for arrays > 4.29B elements (16GB Float32)
-             # Metal's thread_position_in_grid() returns UInt32, but we need 64-bit indexing
-             i = Int64(thread_position_in_grid().x)
-             stride = Int64(threads_per_grid().x)
-             len = Int64(length(dest))
-             while i <= len
+            # Use Int64 to avoid UInt32 overflow for arrays > 4.29B elements (16GB Float32)
+            # Metal's thread_position_in_grid() returns UInt32, but we need 64-bit indexing
+            i = Int64(thread_position_in_grid().x)
+            stride = Int64(threads_per_grid().x)
+            len = Int64(length(dest))
+            while i <= len
                 I = @inbounds Is[i]
                 @inbounds dest[I] = bc[I]
                 i += stride
@@ -94,11 +94,11 @@ end
        (isa(IndexStyle(dest), IndexLinear) && isa(IndexStyle(bc), IndexLinear))
         ## COV_EXCL_START
         function broadcast_linear(dest, bc)
-             # Use Int64 to avoid UInt32 overflow for arrays > 4.29B elements (16GB Float32)
-             i = Int64(thread_position_in_grid().x)
-             stride = Int64(threads_per_grid().x)
-             len = Int64(length(dest))
-             while i <= len
+            # Use Int64 to avoid UInt32 overflow for arrays > 4.29B elements (16GB Float32)
+            i = Int64(thread_position_in_grid().x)
+            stride = Int64(threads_per_grid().x)
+            len = Int64(length(dest))
+            while i <= len
                  @inbounds dest[i] = bc[i]
                  i += stride
              end
@@ -113,13 +113,13 @@ end
     elseif ndims(dest) == 2
         ## COV_EXCL_START
         function broadcast_2d(dest, bc)
-             # Use Int64 to avoid UInt32 overflow for large 2D arrays
-             pos = thread_position_in_grid_2d()
+            # Use Int64 to avoid UInt32 overflow for large 2D arrays
+            pos = thread_position_in_grid_2d()
              stride = threads_per_grid_2d()
-             is = (Int64(pos.x), Int64(pos.y))
-             strides = (Int64(stride.x), Int64(stride.y))
-             dims = (Int64(size(dest, 1)), Int64(size(dest, 2)))
-             while is[1] <= dims[1] && is[2] <= dims[2]
+            is = (Int64(pos.x), Int64(pos.y))
+            strides = (Int64(stride.x), Int64(stride.y))
+            dims = (Int64(size(dest, 1)), Int64(size(dest, 2)))
+            while is[1] <= dims[1] && is[2] <= dims[2]
                 I = CartesianIndex(is)
                 @inbounds dest[I] = bc[I]
                 is = (is[1] + strides[1], is[2] + strides[2])
@@ -136,13 +136,13 @@ end
     elseif ndims(dest) == 3
         ## COV_EXCL_START
         function broadcast_3d(dest, bc)
-             # Use Int64 to avoid UInt32 overflow for large 3D arrays
-             pos = thread_position_in_grid_3d()
+            # Use Int64 to avoid UInt32 overflow for large 3D arrays
+            pos = thread_position_in_grid_3d()
              stride = threads_per_grid_3d()
-             is = (Int64(pos.x), Int64(pos.y), Int64(pos.z))
-             strides = (Int64(stride.x), Int64(stride.y), Int64(stride.z))
-             dims = (Int64(size(dest, 1)), Int64(size(dest, 2)), Int64(size(dest, 3)))
-             while is[1] <= dims[1] && is[2] <= dims[2] && is[3] <= dims[3]
+            is = (Int64(pos.x), Int64(pos.y), Int64(pos.z))
+            strides = (Int64(stride.x), Int64(stride.y), Int64(stride.z))
+            dims = (Int64(size(dest, 1)), Int64(size(dest, 2)), Int64(size(dest, 3)))
+            while is[1] <= dims[1] && is[2] <= dims[2] && is[3] <= dims[3]
                 I = CartesianIndex(is)
                 @inbounds dest[I] = bc[I]
                 is = (is[1] + strides[1], is[2] + strides[2], is[3] + strides[3])
@@ -161,11 +161,11 @@ end
     else
         ## COV_EXCL_START
         function broadcast_cartesian(dest, bc)
-             # Use Int64 to avoid UInt32 overflow for arrays > 4.29B elements (16GB Float32)
-             i = Int64(thread_position_in_grid().x)
-             stride = Int64(threads_per_grid().x)
-             len = Int64(length(dest))
-             while i <= len
+            # Use Int64 to avoid UInt32 overflow for arrays > 4.29B elements (16GB Float32)
+            i = Int64(thread_position_in_grid().x)
+            stride = Int64(threads_per_grid().x)
+            len = Int64(length(dest))
+            while i <= len
                 I = @inbounds CartesianIndices(dest)[i]
                 @inbounds dest[I] = bc[I]
                 i += stride


codecov bot commented Dec 5, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 80.93%. Comparing base (239fa4d) to head (88bc25b).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #719      +/-   ##
==========================================
- Coverage   80.96%   80.93%   -0.04%     
==========================================
  Files          62       62              
  Lines        2837     2837              
==========================================
- Hits         2297     2296       -1     
- Misses        540      541       +1     

Contributor

@github-actions github-actions bot left a comment


Metal Benchmarks

Benchmark suite Current: 88bc25b Previous: 239fa4d Ratio
latency/precompile 24364681000 ns 24383716541 ns 1.00
latency/ttfp 2331944125.5 ns 2324081375 ns 1.00
latency/import 1431538250 ns 1427504083 ns 1.00
integration/metaldevrt 835562.5 ns 837292 ns 1.00
integration/byval/slices=1 1558000 ns 1598354 ns 0.97
integration/byval/slices=3 20215208.5 ns 19021791.5 ns 1.06
integration/byval/reference 1557479 ns 1590708.5 ns 0.98
integration/byval/slices=2 2683416 ns 2727250 ns 0.98
kernel/indexing 498895.5 ns 459062.5 ns 1.09
kernel/indexing_checked 503750 ns 463104.5 ns 1.09
kernel/launch 12250 ns 11625 ns 1.05
kernel/rand 517104.5 ns 526667 ns 0.98
array/construct 6291 ns 5958 ns 1.06
array/broadcast 937458.5 ns 545375 ns 1.72
array/random/randn/Float32 911667 ns 886167 ns 1.03
array/random/randn!/Float32 579375 ns 578875 ns 1.00
array/random/rand!/Int64 532792 ns 539083 ns 0.99
array/random/rand!/Float32 546583 ns 533229.5 ns 1.03
array/random/rand/Int64 843437.5 ns 887000 ns 0.95
array/random/rand/Float32 839791.5 ns 840959 ns 1.00
array/accumulate/Int64/1d 1303250 ns 1292146 ns 1.01
array/accumulate/Int64/dims=1 1857687.5 ns 1865375 ns 1.00
array/accumulate/Int64/dims=2 2219458 ns 2215437 ns 1.00
array/accumulate/Int64/dims=1L 12220750 ns 12096125 ns 1.01
array/accumulate/Int64/dims=2L 9905875 ns 10003417 ns 0.99
array/accumulate/Float32/1d 1077854.5 ns 1086042 ns 0.99
array/accumulate/Float32/dims=1 1592416.5 ns 1581542 ns 1.01
array/accumulate/Float32/dims=2 1957542 ns 1998167 ns 0.98
array/accumulate/Float32/dims=1L 10348500.5 ns 10248396 ns 1.01
array/accumulate/Float32/dims=2L 7390792 ns 7422792 ns 1.00
array/reductions/reduce/Int64/1d 1281812.5 ns 1312917 ns 0.98
array/reductions/reduce/Int64/dims=1 1110479 ns 1120125 ns 0.99
array/reductions/reduce/Int64/dims=2 1139959 ns 1153917 ns 0.99
array/reductions/reduce/Int64/dims=1L 2042354.5 ns 2041417 ns 1.00
array/reductions/reduce/Int64/dims=2L 4051500 ns 3778125 ns 1.07
array/reductions/reduce/Float32/1d 757000 ns 796167 ns 0.95
array/reductions/reduce/Float32/dims=1 797333 ns 794000 ns 1.00
array/reductions/reduce/Float32/dims=2 831917 ns 818562.5 ns 1.02
array/reductions/reduce/Float32/dims=1L 1331417 ns 1329000 ns 1.00
array/reductions/reduce/Float32/dims=2L 1797583.5 ns 1796708.5 ns 1.00
array/reductions/mapreduce/Int64/1d 1288541 ns 1298666 ns 0.99
array/reductions/mapreduce/Int64/dims=1 1085833 ns 1086313 ns 1.00
array/reductions/mapreduce/Int64/dims=2 1131166 ns 1122666 ns 1.01
array/reductions/mapreduce/Int64/dims=1L 2005209 ns 2025395.5 ns 0.99
array/reductions/mapreduce/Int64/dims=2L 3622562.5 ns 3647583 ns 0.99
array/reductions/mapreduce/Float32/1d 805125 ns 774083.5 ns 1.04
array/reductions/mapreduce/Float32/dims=1 796562 ns 791417 ns 1.01
array/reductions/mapreduce/Float32/dims=2 824687.5 ns 826542 ns 1.00
array/reductions/mapreduce/Float32/dims=1L 1341291.5 ns 1322667 ns 1.01
array/reductions/mapreduce/Float32/dims=2L 1791895.5 ns 1817916.5 ns 0.99
array/private/copyto!/gpu_to_gpu 550187.5 ns 533917 ns 1.03
array/private/copyto!/cpu_to_gpu 767083 ns 690271 ns 1.11
array/private/copyto!/gpu_to_cpu 729687.5 ns 668542 ns 1.09
array/private/iteration/findall/int 1570041.5 ns 1565687.5 ns 1.00
array/private/iteration/findall/bool 1475250 ns 1465333.5 ns 1.01
array/private/iteration/findfirst/int 2076187.5 ns 2079042 ns 1.00
array/private/iteration/findfirst/bool 2013999.5 ns 2020083 ns 1.00
array/private/iteration/scalar 3163271 ns 2787125 ns 1.13
array/private/iteration/logical 2656395.5 ns 2599208 ns 1.02
array/private/iteration/findmin/1d 2251312.5 ns 2265458 ns 0.99
array/private/iteration/findmin/2d 1537104.5 ns 1528791 ns 1.01
array/private/copy 831583 ns 847041.5 ns 0.98
array/shared/copyto!/gpu_to_gpu 84708 ns 84333 ns 1.00
array/shared/copyto!/cpu_to_gpu 84395.5 ns 83042 ns 1.02
array/shared/copyto!/gpu_to_cpu 83417 ns 83479.5 ns 1.00
array/shared/iteration/findall/int 1560333 ns 1558208 ns 1.00
array/shared/iteration/findall/bool 1485541 ns 1470708 ns 1.01
array/shared/iteration/findfirst/int 1694958 ns 1682792 ns 1.01
array/shared/iteration/findfirst/bool 1637083 ns 1644334 ns 1.00
array/shared/iteration/scalar 206375 ns 202000 ns 1.02
array/shared/iteration/logical 2295792 ns 2368458 ns 0.97
array/shared/iteration/findmin/1d 1881250 ns 1845542 ns 1.02
array/shared/iteration/findmin/2d 1550334 ns 1521583 ns 1.02
array/shared/copy 213292 ns 210959 ns 1.01
array/permutedims/4d 2467167 ns 2473375 ns 1.00
array/permutedims/2d 1183792 ns 1178666.5 ns 1.00
array/permutedims/3d 1758812.5 ns 1780750 ns 0.99
metal/synchronization/stream 19458 ns 19334 ns 1.01
metal/synchronization/context 19375 ns 20000 ns 0.97

This comment was automatically generated by workflow using github-action-benchmark.


Development

Successfully merging this pull request may close these issues.

Prevent grid stride loop overflow in libcudf kernels
