
likely vectorization discrepancy between julia and clang triple-nested-loop gemms #29445


Closed
Sacha0 opened this issue Sep 30, 2018 · 3 comments
Labels
compiler:simd (instruction-level vectorization), performance (Must go faster)

Comments

@Sacha0
Member

Sacha0 commented Sep 30, 2018

A performance discrepancy between julia and clang pji-ordered triple-nested-loop gemm implementations is evident in https://github.com/Sacha0/TripleNestedLoopDemo.jl. Though I haven't had the bandwidth to check yet, I suspect the discrepancy comes from vectorization differences. Repro code:

function gemm_pji!(C, A, B)
    for p in 1:size(A, 2),
         j in 1:size(C, 2),
          i in 1:size(C, 1)
        @inbounds C[i, j] += A[i, p] * B[p, j]
    end
    return C
end

gemm_pji_ccode = """
void gemm_pji_c(double* C, double* A, double* B, int m, int k, int n) {
    for ( int p = 0; p < k; ++p )
        for ( int j = 0; j < n; ++j )
            for ( int i = 0; i < m; ++i )
                C[j*m + i] += A[p*m + i] * B[j*k + p];
}
""";

using Libdl
const CGemmLib = tempname()
open(`clang -fPIC -O3 -xc -shared -o $(CGemmLib * "." * Libdl.dlext) -`, "w") do f
    print(f, gemm_pji_ccode) 
end

gemm_pji_c!(C::Matrix{Float64}, A::Matrix{Float64}, B::Matrix{Float64}) =
    (ccall(("gemm_pji_c", CGemmLib), Cvoid,
            (Ptr{Float64}, Ptr{Float64}, Ptr{Float64}, Cint, Cint, Cint),
            C, A, B, size(C, 1), size(A, 2), size(C, 2)); return C)

using Test
let
    m, n, k = 48*3, 48*2, 48
    C = rand(m, n)
    A = rand(m, k)
    B = rand(k, n)
    Cref = A * B
    @test gemm_pji!(fill!(C, 0), A, B) ≈ Cref
    @test gemm_pji_c!(fill!(C, 0), A, B) ≈ Cref
end

using BenchmarkTools
mnk = 48;
A = rand(mnk, mnk);
B = rand(mnk, mnk);
C = rand(mnk, mnk);
@benchmark gemm_pji!($C, $A, $B)
@benchmark gemm_pji_c!($C, $A, $B)

yielding

julia> @benchmark gemm_pji!($C, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     66.311 μs (0.00% GC)
  median time:      66.397 μs (0.00% GC)
  mean time:        67.214 μs (0.00% GC)
  maximum time:     244.067 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark gemm_pji_c!($C, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     32.923 μs (0.00% GC)
  median time:      32.973 μs (0.00% GC)
  mean time:        34.007 μs (0.00% GC)
  maximum time:     133.430 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

i.e., almost precisely a factor-of-two discrepancy. Best!
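
One quick way to test the vectorization hypothesis (a sketch, assuming only the InteractiveUtils standard library; not part of the measurements above) is to look for vector types such as <4 x double> in the generated code:

using InteractiveUtils  # standard library providing code_llvm / code_native

# Dump the optimized LLVM IR for the Julia kernel. If LLVM's loop vectorizer
# fired, the inner loop should operate on vector types such as <4 x double>.
code_llvm(stdout, gemm_pji!, Tuple{Matrix{Float64}, Matrix{Float64}, Matrix{Float64}})

# The IR clang produces for the C kernel can be obtained for a side-by-side
# comparison by feeding the same source string to
#   clang -O3 -xc -S -emit-llvm -o - -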

@nlw0
Contributor

nlw0 commented Apr 18, 2019

I was working on #31442, where the first idea was to take the same approach as #30320, where vectorization was enabled by unrolling the loop in chunks of 4 steps. I now think there is actually an issue in the code generation that is preventing the vectorization, and it deserves more attention because it may affect many other functions. Only now did it occur to me to look for other similar open issues, and I found this one.

One hypothesis raised before is that this could be something down in LLVM, but your test here using clang is a very compelling argument that it must be something at a slightly higher level. I suspect we would see the same result if we ran this test with the maximum function.

I imagine either something is missing in the non-optimized IR produced by Julia, or, if this really is something missing in LLVM, clang must be doing some automatic vectorization by itself, in which case Julia would probably have to start doing the same. I don't think the latter is likely, though.

Other existing vectorization issues seem to be #29441, #28331 and #30290, apart from the two I mentioned and this one here. Could there be a common cause for some of them? And can anyone give an idea of how one could go about working on these issues? For instance, what is the best way to produce the non-optimized IR, change the code generation, and test just the LLVM optimizations given the initial IR?

EDIT: Answering my own question, read the fine manual... https://docs.julialang.org/en/v1.1/devdocs/llvm/
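
For reference, on a recent Julia both the unoptimized and the optimized IR for the kernel can be produced directly from the REPL (a minimal sketch; the devdocs linked above describe the compiler pipeline in more detail):

using InteractiveUtils

# Argument types of the kernel we want to inspect.
const GEMM_SIG = Tuple{Matrix{Float64}, Matrix{Float64}, Matrix{Float64}}

# IR as emitted by Julia's codegen, before LLVM's optimization passes run.
code_llvm(stdout, gemm_pji!, GEMM_SIG; optimize=false)

# IR after the standard LLVM pass pipeline, for comparison.
code_llvm(stdout, gemm_pji!, GEMM_SIG; optimize=true)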

@vchuravy
Member

Looking briefly at this, the function vectorizes just fine (4x vectorization, 4x unroll), but we have a rather costly alias check that we perform every 16 iterations.

Adding an @aliasscope annotation, we get:

julia> @benchmark gemm_pji!($C, $A, $B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     14.339 μs (0.00% GC)
  median time:      15.321 μs (0.00% GC)
  mean time:        15.621 μs (0.00% GC)
  maximum time:     92.682 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark gemm_pji_c!($C, $A, $B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     27.485 μs (0.00% GC)
  median time:      29.610 μs (0.00% GC)
  mean time:        29.633 μs (0.00% GC)
  maximum time:     148.273 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Interestingly, the restrict annotation didn't help the C code.

import Base.Experimental: Const, @aliasscope

function gemm_pji!(C, A, B)
    for p in 1:size(A, 2),
         j in 1:size(C, 2),
          i in 1:size(C, 1)
          @aliasscope @inbounds C[i, j] += Const(A)[i, p] * Const(B)[p, j]
    end
    return C
end

gemm_pji_ccode = """
void gemm_pji_c(double* restrict C, double* restrict A, double* restrict B, int m, int k, int n) {
    for ( int p = 0; p < k; ++p )
        for ( int j = 0; j < n; ++j )
            for ( int i = 0; i < m; ++i )
                C[j*m + i] += A[p*m + i] * B[j*k + p];
}
""";

So more investigation is needed, but it is not just a case of failed vectorization.
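
For completeness, another way to convey the same no-aliasing promise (a hypothetical sketch, not benchmarked above) is @simd ivdep on the innermost loop, which also asserts that iterations are independent and should let LLVM drop the runtime alias check:

function gemm_pji_ivdep!(C, A, B)
    for p in 1:size(A, 2), j in 1:size(C, 2)
        # ivdep asserts that iterations of this loop do not alias one another,
        # so the vectorized body can run without a runtime memory check.
        @simd ivdep for i in 1:size(C, 1)
            @inbounds C[i, j] += A[i, p] * B[p, j]
        end
    end
    return C
end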

@adienes
Member

adienes commented May 13, 2025

These are within ~10% runtime of one another now, which seems close enough to me to close, but feel free to re-open if we think that last bit is achievable (and this MWE is the one to do it).

adienes closed this as completed on May 13, 2025