likely vectorization discrepancy between julia and clang triple-nested-loop gemms #29445
Comments
I was working on #31442, where the first idea was to go the same way as #30320, in which vectorization was activated by unrolling the loop in chunks of 4 steps. I now think there is actually an issue in the code generation that is preventing the vectorization, and it deserves more attention because it may affect many other functions. Only now did I think of looking for other similar open issues, and I found this one.

One hypothesis raised before is that this could be something down in LLVM, but your test here using clang is a very compelling argument that it must be something a bit higher level. I suspect we would see the same result if we did this test with I imagine there is either something missing in the non-optimized IR code produced by Julia, or, if this really is something missing in LLVM, then clang must be doing some automatic vectorization by itself, in which case Julia would probably have to start doing the same. I don't think this is likely, though.

Other existing vectorization issues seem to be #29441, #28331 and #30290, apart from the two I mentioned and this one here. Could there be a common cause for some of them? And can anyone give an idea of how one could go about working on these issues? For instance, what is the best way to produce the non-optimized IR, change the code generation, and test just the LLVM optimizations given the initial IR?

EDIT: Answering my own question, read the fine manual... https://docs.julialang.org/en/v1.1/devdocs/llvm/
Looking briefly at this, the function vectorized just fine (4x vectorization, 4x unroll), but we have a rather costly alias check that we do every 16 iterations. Adding an
Interestingly the
So more investigation is needed, but it is not just a case of failed vectorization.
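To make the "alias check" mentioned above concrete, here is a minimal C sketch of the idea, not the code LLVM actually emits. When a compiler cannot prove that two buffers are disjoint, it can guard the vectorized loop with a runtime overlap test and fall back to a scalar loop otherwise. The names `ranges_overlap` and `axpy_checked` are hypothetical and chosen only for illustration; the kernel is a simple `y += alpha * x` rather than the gemm from this issue.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 when the n-element double ranges starting
 * at a and b share at least one byte, 0 otherwise. */
int ranges_overlap(const double *a, const double *b, size_t n)
{
    uintptr_t pa = (uintptr_t)a, pb = (uintptr_t)b;
    uintptr_t bytes = (uintptr_t)(n * sizeof(double));
    return pa < pb + bytes && pb < pa + bytes;
}

/* Hypothetical kernel: y[i] += alpha * x[i], with an explicit overlap guard. */
void axpy_checked(size_t n, double alpha, const double *x, double *y)
{
    assert(x && y);
    if (!ranges_overlap(x, y, n)) {
        /* Disjoint buffers: restrict-qualified views tell the compiler that
         * loads from xr cannot observe stores to yr, so it may vectorize
         * without any further checks. */
        const double *restrict xr = x;
        double *restrict yr = y;
        for (size_t i = 0; i < n; ++i)
            yr[i] += alpha * xr[i];
    } else {
        /* Possibly overlapping buffers: keep strict element-by-element order. */
        for (size_t i = 0; i < n; ++i)
            y[i] += alpha * x[i];
    }
}
```

The guard itself is cheap when it runs once per call. The cost described in the comment above arises when a check of this kind is repeated inside the loop nest, once per short inner-loop pass, instead of being hoisted out of it.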
These are within ~10% runtime of one another now, which seems close enough to me to close, but feel free to re-open if we think that last bit is achievable (and this MWE is the one to do it).
A performance discrepancy between julia and clang pji-ordered triple-nested-loop gemm implementations is evident in https://github.com/Sacha0/TripleNestedLoopDemo.jl. Though I haven't had the bandwidth to check yet, I suspect the discrepancy comes from vectorization differences. Repro code:
yielding
i.e. almost precisely a factor of two discrepancy. Best!
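For readers unfamiliar with the term, a pji-ordered triple-nested-loop gemm can be sketched in C roughly as follows. This is an illustrative sketch, not the repro code from the linked repository: the function name `gemm_pji`, the `idx` helper, and the choice of column-major storage (matching Julia's array layout) are all assumptions made here.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major offset of element (row, col) in a matrix with ld rows. */
static size_t idx(size_t row, size_t col, size_t ld)
{
    return row + col * ld;
}

/* Hypothetical kernel: C (m x n) += A (m x k) * B (k x n).
 * Loop order is p (outermost), j, i (innermost), so the inner loop walks
 * down one column of C and one column of A with unit stride. */
void gemm_pji(size_t m, size_t n, size_t k,
              const double *A, const double *B, double *C)
{
    assert(A && B && C);
    for (size_t p = 0; p < k; ++p)
        for (size_t j = 0; j < n; ++j)
            for (size_t i = 0; i < m; ++i)
                C[idx(i, j, m)] += A[idx(i, p, m)] * B[idx(p, j, k)];
}
```

With this ordering the innermost loop is a unit-stride update of a column of C by a scaled column of A, which is the shape of loop an auto-vectorizer is expected to handle well, so a factor-of-two gap between two compilers on a kernel like this points at code generation rather than at the algorithm.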