Add native BFloat16 atomic_add! support #3007
base: master
Conversation
Your PR requires formatting changes to meet the project's style guidelines. Suggested changes:

diff --git a/src/device/intrinsics/atomics.jl b/src/device/intrinsics/atomics.jl
index 9a5249114..dfde517ef 100644
--- a/src/device/intrinsics/atomics.jl
+++ b/src/device/intrinsics/atomics.jl
@@ -442,7 +442,8 @@ end
atomic_arrayset(A, Base._to_linear_index(A, Is...), op, convert(T, val))
# native atomics
-for (op,impl,typ) in [(:(+), :(atomic_add!), [:UInt32,:Int32,:UInt64,:Int64,:Float32,:Float16,:BFloat16]),
+for (op, impl, typ) in [
+ (:(+), :(atomic_add!), [:UInt32, :Int32, :UInt64, :Int64, :Float32, :Float16, :BFloat16]),
(:(-), :(atomic_sub!), [:UInt32,:Int32,:UInt64,:Int64,:Float32]),
(:(&), :(atomic_and!), [:UInt32,:Int32,:UInt64,:Int64]),
(:(|), :(atomic_or!), [:UInt32,:Int32,:UInt64,:Int64]),
diff --git a/test/core/device/intrinsics/atomics.jl b/test/core/device/intrinsics/atomics.jl
index 77f39aeea..41f6e1518 100644
--- a/test/core/device/intrinsics/atomics.jl
+++ b/test/core/device/intrinsics/atomics.jl
@@ -10,7 +10,7 @@ using BFloat16s: BFloat16
types = [Int32, Int64, UInt32, UInt64, Float32]
capability(device()) >= v"6.0" && push!(types, Float64)
capability(device()) >= v"7.0" && push!(types, Float16)
- capability(device()) >= v"9.0" && push!(types, BFloat16)
+ capability(device()) >= v"9.0" && push!(types, BFloat16)
@testset for T in types
a = CuArray(T[0])
@@ -20,9 +20,9 @@ using BFloat16s: BFloat16
return
end
- nthreads = T == BFloat16 ? 128 : 1024 # BFloat16(256) + 1 == 256
- @cuda threads=nthreads kernel(a, one(T))
- @test Array(a)[1] == nthreads
+ nthreads = T == BFloat16 ? 128 : 1024 # BFloat16(256) + 1 == 256
+ @cuda threads = nthreads kernel(a, one(T))
+ @test Array(a)[1] == nthreads
end
end
@@ -214,7 +214,7 @@ end
@testset "add" begin
types = [Int32, Int64, UInt32, UInt64, Float32, Float64]
capability(device()) >= v"7.0" && append!(types, [Int16, UInt16, Float16])
- capability(device()) >= v"9.0" && push!(types, BFloat16)
+ capability(device()) >= v"9.0" && push!(types, BFloat16)
@testset for T in types
a = CuArray([zero(T)])
@@ -225,9 +225,9 @@ end
return
end
- nthreads = T == BFloat16 ? 64 : 1024
- @cuda threads=nthreads kernel(T, a)
- @test Array(a)[1] == 2 * nthreads
+ nthreads = T == BFloat16 ? 64 : 1024
+ @cuda threads = nthreads kernel(T, a)
+ @test Array(a)[1] == 2 * nthreads
end
end
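
For reference, a minimal usage sketch of what the change enables (illustrative only, not part of the PR; it assumes a device with compute capability >= 9.0 and the BFloat16s package, and the kernel name is made up). With BFloat16 in the native atomics list, CUDA.atomic_add! can be called on a BFloat16 pointer inside a kernel, mirroring the test above:

using CUDA, Test
using BFloat16s: BFloat16

# Each thread atomically adds one(BFloat16) into the single-element array.
function add_one_kernel(a, b)
    CUDA.atomic_add!(pointer(a), b)
    return
end

a = CuArray(BFloat16[0])
nthreads = 128                         # stay below 256: BFloat16(256) + 1 == 256
@cuda threads=nthreads add_one_kernel(a, one(BFloat16))
@test Array(a)[1] == nthreads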
Codecov Report

✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##           master    #3007      +/-   ##
===========================================
+ Coverage   76.53%   89.35%   +12.82%
===========================================
  Files         148      148
  Lines       12860    12947      +87
===========================================
+ Hits         9842    11569    +1727
+ Misses       3018     1378    -1640

☔ View full report in Codecov by Sentry.
I ran the example from the first comment again on master and am confused by it working. Something may have been wrong with my setup.

EDIT: I had to backtrack to understand what was happening.

julia> Zygote.gradient(x .|> BFloat16) do x
           ONIONop.flash_attention(x, x, x; causal=true) |> sum
       end[1]
ERROR: InvalidIRError: compiling MethodInstance for ONIONop.gpu__flash_attention_bwd!(::KernelAbstractions.CompilerMetadata{…}, ::Type{…}, ::Type{…}, ::Type{…}, ::Type{…}, ::Type{…}, ::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceArray{…}, ::CuDeviceArray{…}, ::BFloat16, ::Nothing, ::Nothing, ::Val{…}, ::Val{…}, ::Val{…}, ::Val{…}) resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to jl_f_throw_methoderror)
Stacktrace:
 [1] modify!
   @ ~/.julia/packages/Atomix/0UMek/ext/AtomixCUDAExt.jl:38
 [2] macro expansion
   @ ~/.julia/packages/KernelAbstractions/X5fk1/src/extras/loopinfo.jl:31
 [3] macro expansion
   @ ~/.julia/packages/ONIONop/m70QD/src/attention/attention_bwd.jl:103
 [4] gpu__flash_attention_bwd!
   @ ./none:0

In the midst of having a partially complete implementation, I must've forgotten to remove a line when comparing.
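
For context, a hedged sketch of the code path the stack trace points at (the kernel below is made up for illustration and is not taken from ONIONop): an Atomix.@atomic update inside a KernelAbstractions kernel on a BFloat16 CuArray goes through AtomixCUDAExt's modify!, which in turn needs a native CUDA.atomic_add! for BFloat16, i.e. the method this PR adds.

using CUDA, KernelAbstractions, Atomix
using BFloat16s: BFloat16

# Hypothetical kernel: every work-item atomically accumulates into acc[1].
@kernel function atomic_accumulate!(acc, xs)
    i = @index(Global)
    Atomix.@atomic acc[1] += xs[i]     # dispatches through AtomixCUDAExt.modify!
end

xs  = CUDA.fill(one(BFloat16), 64)
acc = CUDA.zeros(BFloat16, 1)
atomic_accumulate!(CUDABackend())(acc, xs; ndrange=length(xs))
KernelAbstractions.synchronize(CUDABackend())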
Note that sequentially accumulating ones in BFloat16 stalls at 256 (the 8-bit significand cannot represent 257), so the tests use fewer threads for BFloat16; see the snippet below.
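
A quick illustration of that limit (assuming the BFloat16s package): BFloat16 integers are exact only up to 256, and adding one past that point rounds back down.

using BFloat16s: BFloat16

# 256 is exactly representable in BFloat16, but 257 is not, so the
# accumulation of ones stalls once the counter reaches 256:
BFloat16(255) + BFloat16(1) == BFloat16(256)   # true
BFloat16(256) + BFloat16(1) == BFloat16(256)   # true: 257 rounds back to 256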
The CI runners don't seem to have the required compute capability of 9.0, so they won't exercise the new tests.