Add native BFloat16 atomic_add! support #3007
base: master
Conversation
Your PR requires formatting changes to meet the project's style guidelines. Suggested changes:

diff --git a/src/device/intrinsics/atomics.jl b/src/device/intrinsics/atomics.jl
index 9a5249114..dfde517ef 100644
--- a/src/device/intrinsics/atomics.jl
+++ b/src/device/intrinsics/atomics.jl
@@ -442,7 +442,8 @@ end
atomic_arrayset(A, Base._to_linear_index(A, Is...), op, convert(T, val))
# native atomics
-for (op,impl,typ) in [(:(+), :(atomic_add!), [:UInt32,:Int32,:UInt64,:Int64,:Float32,:Float16,:BFloat16]),
+for (op, impl, typ) in [
+ (:(+), :(atomic_add!), [:UInt32, :Int32, :UInt64, :Int64, :Float32, :Float16, :BFloat16]),
(:(-), :(atomic_sub!), [:UInt32,:Int32,:UInt64,:Int64,:Float32]),
(:(&), :(atomic_and!), [:UInt32,:Int32,:UInt64,:Int64]),
(:(|), :(atomic_or!), [:UInt32,:Int32,:UInt64,:Int64]),
diff --git a/test/core/device/intrinsics/atomics.jl b/test/core/device/intrinsics/atomics.jl
index 77f39aeea..41f6e1518 100644
--- a/test/core/device/intrinsics/atomics.jl
+++ b/test/core/device/intrinsics/atomics.jl
@@ -10,7 +10,7 @@ using BFloat16s: BFloat16
types = [Int32, Int64, UInt32, UInt64, Float32]
capability(device()) >= v"6.0" && push!(types, Float64)
capability(device()) >= v"7.0" && push!(types, Float16)
- capability(device()) >= v"9.0" && push!(types, BFloat16)
+ capability(device()) >= v"9.0" && push!(types, BFloat16)
@testset for T in types
a = CuArray(T[0])
@@ -20,9 +20,9 @@ using BFloat16s: BFloat16
return
end
- nthreads = T == BFloat16 ? 128 : 1024 # BFloat16(256) + 1 == 256
- @cuda threads=nthreads kernel(a, one(T))
- @test Array(a)[1] == nthreads
+ nthreads = T == BFloat16 ? 128 : 1024 # BFloat16(256) + 1 == 256
+ @cuda threads = nthreads kernel(a, one(T))
+ @test Array(a)[1] == nthreads
end
end
@@ -214,7 +214,7 @@ end
@testset "add" begin
types = [Int32, Int64, UInt32, UInt64, Float32, Float64]
capability(device()) >= v"7.0" && append!(types, [Int16, UInt16, Float16])
- capability(device()) >= v"9.0" && push!(types, BFloat16)
+ capability(device()) >= v"9.0" && push!(types, BFloat16)
@testset for T in types
a = CuArray([zero(T)])
@@ -225,9 +225,9 @@ end
return
end
- nthreads = T == BFloat16 ? 64 : 1024
- @cuda threads=nthreads kernel(T, a)
- @test Array(a)[1] == 2 * nthreads
+ nthreads = T == BFloat16 ? 64 : 1024
+ @cuda threads = nthreads kernel(T, a)
+ @test Array(a)[1] == 2 * nthreads
end
end
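
For reference, a minimal usage sketch of what the change enables (illustrative only, not part of the PR; it assumes a device with compute capability >= 9.0 and the BFloat16s package, and the kernel name is made up). With BFloat16 in the native atomics list, CUDA.atomic_add! can be called on a BFloat16 pointer inside a kernel, mirroring the test above:

using CUDA, Test
using BFloat16s: BFloat16

# Each thread atomically adds one(BFloat16) into the single-element array.
function add_one_kernel(a, b)
    CUDA.atomic_add!(pointer(a), b)
    return
end

a = CuArray(BFloat16[0])
nthreads = 128                         # stay below 256: BFloat16(256) + 1 == 256
@cuda threads=nthreads add_one_kernel(a, one(BFloat16))
@test Array(a)[1] == nthreads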
Codecov Report

✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##           master    #3007      +/-   ##
===========================================
+ Coverage   76.53%   89.35%   +12.82%
===========================================
  Files         148      148
  Lines       12860    12947      +87
===========================================
+ Hits         9842    11569    +1727
+ Misses       3018     1378    -1640

☔ View full report in Codecov by Sentry.
I ran the example from the first comment again on master and am confused by it working. Something may have been wrong with my setup.

EDIT: I had to backtrack to understand what was happening.

julia> Zygote.gradient(x .|> BFloat16) do x
           ONIONop.flash_attention(x, x, x; causal=true) |> sum
       end[1]
ERROR: InvalidIRError: compiling MethodInstance for ONIONop.gpu__flash_attention_bwd!(::KernelAbstractions.CompilerMetadata{…}, ::Type{…}, ::Type{…}, ::Type{…}, ::Type{…}, ::Type{…}, ::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceArray{…}, ::BFloat16, ::Nothing, ::Nothing, ::Val{…}, ::Val{…}, ::Val{…}, ::Val{…}) resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to jl_f_throw_methoderror)
Stacktrace:
 [1] modify!
   @ ~/.julia/packages/Atomix/0UMek/ext/AtomixCUDAExt.jl:38
 [2] macro expansion
   @ ~/.julia/packages/KernelAbstractions/X5fk1/src/extras/loopinfo.jl:31
 [3] macro expansion
   @ ~/.julia/packages/ONIONop/m70QD/src/attention/attention_bwd.jl:103
 [4] gpu__flash_attention_bwd!
   @ ./none:0

In the midst of having a partially complete implementation, I must've forgotten to remove a line when comparing.
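
For context, a hedged sketch of the code path the stack trace points at (the kernel below is made up for illustration and is not taken from ONIONop): an Atomix.@atomic update inside a KernelAbstractions kernel on a BFloat16 CuArray goes through AtomixCUDAExt's modify!, which in turn needs a native CUDA.atomic_add! for BFloat16, i.e. the method this PR adds.

using CUDA, KernelAbstractions, Atomix
using BFloat16s: BFloat16

# Hypothetical kernel: every work-item atomically accumulates into acc[1].
@kernel function atomic_accumulate!(acc, xs)
    i = @index(Global)
    Atomix.@atomic acc[1] += xs[i]     # dispatches through AtomixCUDAExt.modify!
end

xs  = CUDA.fill(one(BFloat16), 64)
acc = CUDA.zeros(BFloat16, 1)
atomic_accumulate!(CUDABackend())(acc, xs; ndrange=length(xs))
KernelAbstractions.synchronize(CUDABackend())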
Note that sequentially accumulating ones in BFloat16 stalls at 256 (the 8-bit significand cannot represent 257), so the tests use fewer threads for BFloat16; see the snippet below.
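
A quick illustration of that limit (assuming the BFloat16s package): BFloat16 integers are exact only up to 256, and adding one past that point rounds back down.

using BFloat16s: BFloat16

# 256 is exactly representable in BFloat16, but 257 is not, so the
# accumulation of ones stalls once the counter reaches 256:
BFloat16(255) + BFloat16(1) == BFloat16(256)   # true
BFloat16(256) + BFloat16(1) == BFloat16(256)   # true: 257 rounds back to 256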
The CI runners don't seem to have the required compute capability of 9.0, so they won't exercise the new tests.