
Conversation

AntonOresten (Contributor) commented Jan 2, 2026

Note that sequentially accumulating ones in BFloat16 precision stops making progress after 256 (BFloat16(256) + 1 == 256), so the tests use fewer threads for BFloat16.

The CI runners don't seem to have the required compute capability of 9.0 to actually hit these tests.
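
As a quick illustration of that precision limit (a sketch using the BFloat16s.jl package, not taken from the PR): BFloat16 has 8 significand bits, so integers above 256 are no longer exactly representable and the accumulator saturates.

julia> using BFloat16s: BFloat16

julia> BFloat16(256) + BFloat16(1) == BFloat16(256)  # 257 is not representable; the tie rounds back down
true

julia> BFloat16(128) + BFloat16(1) == BFloat16(129)  # integers up to 256 are exact
true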

github-actions bot commented Jan 2, 2026

Your PR requires formatting changes to meet the project's style guidelines.
Please consider running Runic (git runic master) to apply these changes.

Suggested changes:
diff --git a/src/device/intrinsics/atomics.jl b/src/device/intrinsics/atomics.jl
index 9a5249114..dfde517ef 100644
--- a/src/device/intrinsics/atomics.jl
+++ b/src/device/intrinsics/atomics.jl
@@ -442,7 +442,8 @@ end
     atomic_arrayset(A, Base._to_linear_index(A, Is...), op, convert(T, val))
 
 # native atomics
-for (op,impl,typ) in [(:(+), :(atomic_add!), [:UInt32,:Int32,:UInt64,:Int64,:Float32,:Float16,:BFloat16]),
+for (op, impl, typ) in [
+        (:(+), :(atomic_add!), [:UInt32, :Int32, :UInt64, :Int64, :Float32, :Float16, :BFloat16]),
                       (:(-), :(atomic_sub!), [:UInt32,:Int32,:UInt64,:Int64,:Float32]),
                       (:(&), :(atomic_and!), [:UInt32,:Int32,:UInt64,:Int64]),
                       (:(|), :(atomic_or!),  [:UInt32,:Int32,:UInt64,:Int64]),
diff --git a/test/core/device/intrinsics/atomics.jl b/test/core/device/intrinsics/atomics.jl
index 77f39aeea..41f6e1518 100644
--- a/test/core/device/intrinsics/atomics.jl
+++ b/test/core/device/intrinsics/atomics.jl
@@ -10,7 +10,7 @@ using BFloat16s: BFloat16
     types = [Int32, Int64, UInt32, UInt64, Float32]
     capability(device()) >= v"6.0" && push!(types, Float64)
     capability(device()) >= v"7.0" && push!(types, Float16)
-    capability(device()) >= v"9.0" && push!(types, BFloat16)
+        capability(device()) >= v"9.0" && push!(types, BFloat16)
 
     @testset for T in types
         a = CuArray(T[0])
@@ -20,9 +20,9 @@ using BFloat16s: BFloat16
             return
         end
 
-        nthreads = T == BFloat16 ? 128 : 1024 # BFloat16(256) + 1 == 256
-        @cuda threads=nthreads kernel(a, one(T))
-        @test Array(a)[1] == nthreads
+            nthreads = T == BFloat16 ? 128 : 1024 # BFloat16(256) + 1 == 256
+            @cuda threads = nthreads kernel(a, one(T))
+            @test Array(a)[1] == nthreads
     end
 end
 
@@ -214,7 +214,7 @@ end
 @testset "add" begin
     types = [Int32, Int64, UInt32, UInt64, Float32, Float64]
     capability(device()) >= v"7.0" && append!(types, [Int16, UInt16, Float16])
-    capability(device()) >= v"9.0" && push!(types, BFloat16)
+        capability(device()) >= v"9.0" && push!(types, BFloat16)
 
     @testset for T in types
         a = CuArray([zero(T)])
@@ -225,9 +225,9 @@ end
             return
         end
 
-        nthreads = T == BFloat16 ? 64 : 1024
-        @cuda threads=nthreads kernel(T, a)
-        @test Array(a)[1] == 2 * nthreads
+            nthreads = T == BFloat16 ? 64 : 1024
+            @cuda threads = nthreads kernel(T, a)
+            @test Array(a)[1] == 2 * nthreads
     end
 end
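
For context on the line Runic reflows above: the type list in src/device/intrinsics/atomics.jl is what this PR extends with :BFloat16. The loop body isn't shown in the diff, so purely as a hedged sketch, the kind of method such a loop generates would look roughly like:

# Rough sketch only; the exact generated signature in CUDA.jl may differ.
@inline atomic_arrayset(A::AbstractArray{BFloat16}, I::Integer, ::typeof(+), val::BFloat16) =
    atomic_add!(pointer(A, I), val)   # native BFloat16 atomic add, requires CC >= 9.0

Adding :BFloat16 to the (+, atomic_add!) entry is what lets an atomic += on a BFloat16 array dispatch to the native atomic_add!.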
 

codecov bot commented Jan 2, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.35%. Comparing base (ca67075) to head (9672643).
⚠️ Report is 3 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #3007       +/-   ##
===========================================
+ Coverage   76.53%   89.35%   +12.82%     
===========================================
  Files         148      148               
  Lines       12860    12947       +87     
===========================================
+ Hits         9842    11569     +1727     
+ Misses       3018     1378     -1640     

AntonOresten marked this pull request as draft January 3, 2026 20:44
AntonOresten (Contributor, Author) commented Jan 3, 2026

I ran the example from the first comment again on master and am confused that it works now. Something may have been wrong with my setup.

EDIT: I had to backtrack to understand what was happening. CUDA.@atomic seems to have worked with BFloat16 even before, but KA.@atomic did not: it goes through modify! from Atomix.jl, and BFloat16 wasn't hooked up to the necessary interfaces:

julia> Zygote.gradient(x .|> BFloat16) do x
           ONIONop.flash_attention(x, x, x; causal=true) |> sum
       end[1]
ERROR: InvalidIRError: compiling MethodInstance for ONIONop.gpu__flash_attention_bwd!(::KernelAbstractions.CompilerMetadata{…}, ::Type{…}, ::Type{…}, ::Type{…}, ::Type{…}, ::Type{…}, ::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceArray{…}, ::BFloat16, ::Nothing, ::Nothing, ::Val{…}, ::Val{…}, ::Val{…}, ::Val{…}) resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to jl_f_throw_methoderror)
Stacktrace:
 [1] modify!
   @ ~/.julia/packages/Atomix/0UMek/ext/AtomixCUDAExt.jl:38
 [2] macro expansion
   @ ~/.julia/packages/KernelAbstractions/X5fk1/src/extras/loopinfo.jl:31
 [3] macro expansion
   @ ~/.julia/packages/ONIONop/m70QD/src/attention/attention_bwd.jl:103
 [4] gpu__flash_attention_bwd!
   @ ./none:0

While I had a partially complete implementation checked out, I must have forgotten to remove a line when comparing.
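
For reference, a minimal sketch of the two code paths described above (not from the PR; kernel names and launch sizes are made up, and it assumes CUDA.jl, KernelAbstractions.jl, Atomix.jl, and BFloat16s.jl are available):

using CUDA, KernelAbstractions, Atomix
using BFloat16s: BFloat16

# Path 1: CUDA.@atomic lowers to CUDA.jl's own atomic intrinsics
# (atomic_add!), which this PR extends to BFloat16.
function cuda_path!(a)
    CUDA.@atomic a[1] += one(BFloat16)
    return
end

# Path 2: KA.@atomic goes through Atomix.modify!, whose CUDA method
# (AtomixCUDAExt) is where the MethodError in the stacktrace surfaced.
@kernel function ka_path!(a)
    Atomix.@atomic a[1] += one(BFloat16)
end

a = CuArray(BFloat16[0])
@cuda threads=128 cuda_path!(a)                 # worked even before this PR
ka_path!(CUDABackend(), 128)(a; ndrange = 128)  # previously threw in modify!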
