
likely vectorization discrepancy between julia and clang triple-nested-loop gemms #29445


Closed
Sacha0 opened this issue Sep 30, 2018 · 3 comments
Labels
compiler:simd (instruction-level vectorization), performance (Must go faster)

Comments

@Sacha0
Member

Sacha0 commented Sep 30, 2018

A performance discrepancy between julia and clang pji-ordered triple-nested-loop gemm implementations is evident in https://github.com/Sacha0/TripleNestedLoopDemo.jl. Though I haven't had the bandwidth to check yet, I suspect the discrepancy comes from vectorization differences. Repro code:

function gemm_pji!(C, A, B)
    for p in 1:size(A, 2),
         j in 1:size(C, 2),
          i in 1:size(C, 1)
        @inbounds C[i, j] += A[i, p] * B[p, j]
    end
    return C
end

gemm_pji_ccode = """
void gemm_pji_c(double* C, double* A, double* B, int m, int k, int n) {
    for ( int p = 0; p < k; ++p )
        for ( int j = 0; j < n; ++j )
            for ( int i = 0; i < m; ++i )
                C[j*m + i] += A[p*m + i] * B[j*k + p];
}
""";

using Libdl
const CGemmLib = tempname()
open(`clang -fPIC -O3 -xc -shared -o $(CGemmLib * "." * Libdl.dlext) -`, "w") do f
    print(f, gemm_pji_ccode) 
end

gemm_pji_c!(C::Matrix{Float64}, A::Matrix{Float64}, B::Matrix{Float64}) =
    (ccall(("gemm_pji_c", CGemmLib), Cvoid,
            (Ptr{Float64}, Ptr{Float64}, Ptr{Float64}, Cint, Cint, Cint),
            C, A, B, size(C, 1), size(A, 2), size(C, 2)); return C)

using Test
let
    m, n, k = 48*3, 48*2, 48
    C = rand(m, n)
    A = rand(m, k)
    B = rand(k, n)
    Cref = A * B
    @test gemm_pji!(fill!(C, 0), A, B) ≈ Cref
    @test gemm_pji_c!(fill!(C, 0), A, B) ≈ Cref
end

using BenchmarkTools
mnk = 48;
A = rand(mnk, mnk);
B = rand(mnk, mnk);
C = rand(mnk, mnk);
@benchmark gemm_pji!($C, $A, $B)
@benchmark gemm_pji_c!($C, $A, $B)

yielding

julia> @benchmark gemm_pji!($C, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     66.311 μs (0.00% GC)
  median time:      66.397 μs (0.00% GC)
  mean time:        67.214 μs (0.00% GC)
  maximum time:     244.067 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark gemm_pji_c!($C, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     32.923 μs (0.00% GC)
  median time:      32.973 μs (0.00% GC)
  mean time:        34.007 μs (0.00% GC)
  maximum time:     133.430 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

i.e., almost precisely a factor-of-two discrepancy. Best!
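
One quick way to test the vectorization hypothesis (a sketch, assuming only the InteractiveUtils standard library; not part of the measurements above) is to look for vector types such as <4 x double> in the generated code:

using InteractiveUtils  # standard library providing code_llvm / code_native

# Dump the optimized LLVM IR for the Julia kernel. If LLVM's loop vectorizer
# fired, the inner loop should operate on vector types such as <4 x double>.
code_llvm(stdout, gemm_pji!, Tuple{Matrix{Float64}, Matrix{Float64}, Matrix{Float64}})

# The IR clang produces for the C kernel can be obtained for a side-by-side
# comparison by feeding the same source string to
#   clang -O3 -xc -S -emit-llvm -o - -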

@nlw0
Contributor

nlw0 commented Apr 18, 2019

I was working on #31442, where the first idea was to take the same approach as #30320, where vectorization was enabled by unrolling the loop in chunks of 4 steps. I now think there is actually an issue in the code generation that is preventing the vectorization, and it deserves more attention because it may affect many other functions. Only now did it occur to me to look for other similar open issues, and I found this one.

One hypothesis raised before is that this could be something down in LLVM, but your test here using clang is a very compelling argument that it must be something at a slightly higher level. I suspect we would see the same result if we ran this test with the maximum function.

I imagine either something is missing in the non-optimized IR produced by Julia, or, if this really is something missing in LLVM, clang must be doing some automatic vectorization by itself, in which case Julia would probably have to start doing the same. I don't think the latter is likely, though.

Other existing vectorization issues seem to be #29441, #28331 and #30290, apart from the two I mentioned and this one here. Could there be a common cause for some of them? And can anyone give an idea of how one could go about working on these issues? For instance, what is the best way to produce the non-optimized IR, change the code generation, and test just the LLVM optimizations given the initial IR?

EDIT: Answering my own question, read the fine manual... https://docs.julialang.org/en/v1.1/devdocs/llvm/
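
For reference, on a recent Julia both the unoptimized and the optimized IR for the kernel can be produced directly from the REPL (a minimal sketch; the devdocs linked above describe the compiler pipeline in more detail):

using InteractiveUtils

# Argument types of the kernel we want to inspect.
const GEMM_SIG = Tuple{Matrix{Float64}, Matrix{Float64}, Matrix{Float64}}

# IR as emitted by Julia's codegen, before LLVM's optimization passes run.
code_llvm(stdout, gemm_pji!, GEMM_SIG; optimize=false)

# IR after the standard LLVM pass pipeline, for comparison.
code_llvm(stdout, gemm_pji!, GEMM_SIG; optimize=true)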

@vchuravy
Member

Looking briefly at this, the function vectorizes just fine (4x vectorization, 4x unroll), but we have a rather costly alias check that we perform every 16 iterations.

Adding an @aliasscope annotation, we get:

julia> @benchmark gemm_pji!($C, $A, $B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     14.339 μs (0.00% GC)
  median time:      15.321 μs (0.00% GC)
  mean time:        15.621 μs (0.00% GC)
  maximum time:     92.682 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark gemm_pji_c!($C, $A, $B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     27.485 μs (0.00% GC)
  median time:      29.610 μs (0.00% GC)
  mean time:        29.633 μs (0.00% GC)
  maximum time:     148.273 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Interestingly, the restrict annotation didn't help the C code.

import Base.Experimental: Const, @aliasscope

function gemm_pji!(C, A, B)
    for p in 1:size(A, 2),
         j in 1:size(C, 2),
          i in 1:size(C, 1)
          @aliasscope @inbounds C[i, j] += Const(A)[i, p] * Const(B)[p, j]
    end
    return C
end

gemm_pji_ccode = """
void gemm_pji_c(double* restrict C, double* restrict A, double* restrict B, int m, int k, int n) {
    for ( int p = 0; p < k; ++p )
        for ( int j = 0; j < n; ++j )
            for ( int i = 0; i < m; ++i )
                C[j*m + i] += A[p*m + i] * B[j*k + p];
}
""";

So more investigation is needed, but it is not just a case of failed vectorization.
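
For completeness, another way to convey the same no-aliasing promise (a hypothetical sketch, not benchmarked above) is @simd ivdep on the innermost loop, which also asserts that iterations are independent and should let LLVM drop the runtime alias check:

function gemm_pji_ivdep!(C, A, B)
    for p in 1:size(A, 2), j in 1:size(C, 2)
        # ivdep asserts that iterations of this loop do not alias one another,
        # so the vectorized body can run without a runtime memory check.
        @simd ivdep for i in 1:size(C, 1)
            @inbounds C[i, j] += A[i, p] * B[p, j]
        end
    end
    return C
end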

@adienes
Member

adienes commented May 13, 2025

These are within ~10% runtime of one another now, which seems close enough to me to close, but feel free to re-open if we think that last bit is achievable (and this MWE is the one to do it).

adienes closed this as completed on May 13, 2025