Different result for personal argmax on CPU and GPU if array size is large enough #476

Open
rveltz opened this issue Nov 5, 2024 · 8 comments

@rveltz

rveltz commented Nov 5, 2024

Hi,
I tried writing an argmax kernel with KernelAbstractions.jl for a particle simulation. Sadly, the results from Metal and from the CPU differ.
Basically, I have a `field::Array{Float32, 4}` and I want to compute `argmax(field[x1, x2, x3, :])` in parallel for many (basically `Nnmc`) index triples `(x1, x2, x3)`. In the code below the triple is fixed to `x1, x2, x3 = (1, 1, 1)`.

I found that the argmax differs depending on whether the code runs on the CPU or on Metal, and only when `field` is large enough. This is the bulk of the issue.

```julia
using Revise, LinearAlgebra
using Metal
using KernelAbstractions

function _sample_gpu(field;
                     Nnmc = 1000,
                     TA = Array)
    result = TA(zeros(Float32, 2, Nnmc))
    npb = size(field, 4)
    # launch the kernel on the backend matching `result`
    backend = get_backend(result)
    nth = backend isa KernelAbstractions.GPU ? 256 : 8
    kernel! = _sample_mtl!(backend, nth)
    kernel!(result,
            TA(field),
            npb,
            ndrange = Nnmc)
    result
end

@kernel function _sample_mtl!(result,
                              @Const(field),
                              nd)
    nₙₘ = @index(Global)
    voxel₁ = voxel₂ = voxel₃ = 1
    # compute argmax of field[voxel₁, voxel₂, voxel₃, :]
    _val_max::Float32 = 0f0
    ind_u = 0
    for ii in axes(field, 4)
        val = field[voxel₁, voxel₂, voxel₃, ii]
        if val > _val_max
            _val_max = val
            ind_u = ii
        end
    end

    result[1, nₙₘ] = nₙₘ
    # save argmax
    result[2, nₙₘ] = ind_u
end

all_od = Float32.(rand(Float32, 100, 108, 100, 1000));
res_a = _sample_gpu(all_od)

res_g = _sample_gpu(all_od, TA = MtlArray) |> Array

norm(res_g[2, :] - res_a[2, :], Inf)
# returns 232.0f0
```

If `field` is smaller, the discrepancy disappears:

```julia
all_od = Float32.(rand(Float32, 100, 107, 100, 1000));
res_a = _sample_gpu(all_od)

res_g = _sample_gpu(all_od, TA = MtlArray) |> Array

norm(res_g[2, :] - res_a[2, :], Inf)
# returns 0.0f0
```
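As a sanity check (a minimal sketch, assuming `all_od`, `res_a`, and `res_g` from the snippets above): since the voxel is fixed at `(1, 1, 1)`, every entry of `result[2, :]` should hold the same index, which Julia's built-in `argmax` gives directly on the CPU:

```julia
# CPU reference: index of the maximum along the 4th dimension at voxel (1, 1, 1).
# `rand` produces values in [0, 1), so the kernel's strict `>` comparison starting
# from 0f0 should find the same (first) maximal index as Base.argmax.
ref = argmax(@view all_od[1, 1, 1, :])

all(res_a[2, :] .== ref)  # expected to hold on the CPU backend
all(res_g[2, :] .== ref)  # false on the affected macOS 14 setups
```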
@christiangnrd
Contributor

Can you share the output of

```julia
using Metal
Metal.versioninfo()
```

?

@rveltz
Author

rveltz commented Nov 6, 2024

```julia
julia> using Metal

julia> Metal.versioninfo()
macOS 14.7.0, Darwin 23.6.0

Toolchain:
- Julia: 1.10.5
- LLVM: 15.0.7

Julia packages:
- Metal.jl: 1.4.2
- GPUArrays: 10.3.1
- GPUCompiler: 0.27.8
- KernelAbstractions: 0.9.29
- ObjectiveC: 3.1.0
- LLVM: 9.1.3
- LLVMDowngrader_jll: 0.3.0+2

1 device:
- Apple M2 Max (64.000 KiB allocated)
```

@christiangnrd
Contributor

I was able to reproduce this on macOS 14 but not 15.2 (developer beta).

@rveltz
Copy link
Author

rveltz commented Nov 6, 2024

Thank you for trying this.

Wow, that's stiff. I guess I have to wait for the transition, unless something obvious in my kernel can be fixed.

@christiangnrd
Contributor

Interestingly, the threshold for correct behaviour seems to be 4 GiB:

```julia
# Broken on macOS 14
julia> all_od = Float32.(rand(Float32, 100, 108, 100, 1000)); all_od |> sizeof |> Base.format_bytes
"4.023 GiB"

# Not broken on macOS 14
julia> all_od = Float32.(rand(Float32, 100, 107, 100, 1000)); all_od |> sizeof |> Base.format_bytes
"3.986 GiB"
```
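For what it's worth, the two sizes straddle the 2^32-byte mark, so the threshold may really be a 4 GiB buffer/offset limit rather than anything shape-specific (back-of-the-envelope arithmetic; a `Float32` is 4 bytes):

```julia
julia> 100 * 108 * 100 * 1000 * 4  # broken case, bytes
4320000000

julia> 100 * 107 * 100 * 1000 * 4  # working case, bytes
4280000000

julia> 2^32                        # 4 GiB
4294967296
```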

@rveltz
Author

rveltz commented Nov 6, 2024

Did you check this on an M2 or an M3?

@tgymnich
Member

tgymnich commented Nov 6, 2024

Works for me on an M1 with macOS 15.1:

```julia
julia> norm(res_g[2,:] - res_a[2,:], Inf)
       # returns 232.0f0
0.0f0

julia> Metal.versioninfo()
macOS 15.1.0, Darwin 24.1.0

Toolchain:
- Julia: 1.11.1
- LLVM: 16.0.6

Julia packages:
- Metal.jl: 1.4.0
- GPUArrays: 11.1.0
- GPUCompiler: 1.0.1
- KernelAbstractions: 0.9.29
- ObjectiveC: 3.1.0
- LLVM: 9.1.3
- LLVMDowngrader_jll: 0.3.0+2

1 device:
- Apple M1 Max (384.000 KiB allocated)
```

@christiangnrd
Contributor

> Did you check this on an M2 or an M3?

M2 Max like you.

If anyone reading this has an M3 and is still on macOS 14, could you please try running the MWE and report back whether it is broken?
