[Bug]: rocblas_gemm_ex with m==1 fp16 inputs/outputs f32 compute slower than a quite naive gemv kernel on MI100 #1425
Comments
Yeah, this has been an issue for a while: #1238
I updated the kernel in my reproducer; it saturates memory bandwidth, unlike rocBLAS.
I see that @daineAMD replied to the other issue, so I'm mentioning it here as well, in case that helps in any way.
It's also pretty silly, since just using the gemv kernels in these cases should be trivial. The suboptimality is of course also easily shown with rocBLAS's own tool:
rocblas-bench -f gemm_ex -m 1 -n 16192 -k 16192 --transposeA N --transposeB N -r s --compute_type s -i 50
rocblas-bench -f gemv -r s -m 16192 -n 16192 --lda 16192 -i 50
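For scale: each of these moves a 16192 x 16192 fp32 matrix, i.e. 16192^2 * 4 bytes, roughly 1.05 GB per call, so a memory-bound kernel on an MI100 (roughly 1.2 TB/s peak HBM bandwidth) should finish in about 0.9 ms; times far above that mean bandwidth is being left on the table.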
Also, rocblas_hgemv would be great, since there is an opportunity here to use dual-issue.
Hi @Epliz, thanks for bringing this up. Yes, the disparity between gemm with m == 1/n == 1 and gemv has been brought up in the past, as noted by @IMbackK. Back when it was originally raised, it wasn't clear whether the best approach would be to redirect the gemm call to gemv (which has source kernels in rocBLAS) or to continue to gemm (which is handled within the Tensile library), since performance was somewhat of a mixed bag, and handling this on a case-by-case basis seemed infeasible. Regardless, it's good that this has been brought up again, and I'll discuss with the team what the best approach is. If we can get gemv to outperform gemm in every case, then the changes to redirect to gemv would be straightforward; most of the work lies in ensuring that gemv is actually faster. I'll keep you updated with any progress here. The request for rocblas_hgemv is noted as well. Thanks,
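To make the redirection concrete, here is a minimal caller-side sketch for the fp32, transA == transB == N case (a hypothetical workaround, not how rocBLAS implements anything internally): with m == 1, the gemm is exactly a transposed gemv over B, with the row of A read at stride lda, which is where the large-increment concern discussed below comes from.

#include <rocblas/rocblas.h>

// Hypothetical workaround: route a column-major m == 1 sgemm (transA == transB == N)
// to sgemv. C (1 x n) = alpha * A (1 x k) * B (k x n) + beta * C is the same as
// y = alpha * B^T * x + beta * y, with x the row of A and y the row of C.
rocblas_status gemm_m1_as_gemv(rocblas_handle handle, rocblas_int n, rocblas_int k,
                               const float* alpha,
                               const float* A, rocblas_int lda,
                               const float* B, rocblas_int ldb,
                               const float* beta, float* C, rocblas_int ldc) {
    return rocblas_sgemv(handle, rocblas_operation_transpose, k, n, alpha,
                         B, ldb,   // the k x n matrix, read transposed
                         A, lda,   // x: the single row of A, so incx = lda
                         beta,
                         C, ldc);  // y: the single row of C, so incy = ldc
}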
Hi @daineAMD, Thank you for the detailed comment on this matter. Out of curiosity:
Hi @daineAMD, Following up after a week: has a way forward been decided? If not, can you please proceed with making gemm call gemv for those cases? If the rocBLAS team cannot tackle this task, would a pull request from my side potentially be merged? I can sign whatever contribution agreement you might need.
Hi @Epliz and @IMbackK, sorry for the delay. Looking at my past notes, the areas of most concern were those where the increment parameter (incx) ends up large (with various exceptions), specifically gemm cases where (transA == transB == T && ldb >> 1) and where (transA == transB == N && lda >> 1). The other cases where I'm seeing gemm perform better than gemv are small sizes. I have a ticket to investigate further, to see if we can call gemv in the cases where it outperforms gemm and/or what optimizations can be done to the current gemv to make this easier; I'll be looking at this in the coming weeks. You are free to take a look yourself and open a PR (see the contributing guide if you're interested), but merging a PR will still take some time, since most of the work lies in ensuring there are no performance regressions. Thanks again,
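To reproduce the problematic large-increment scenario directly with rocblas-bench, the (transA == transB == N && lda >> 1) case maps to a transposed gemv with a large incx, so something along these lines (assuming I have the gemv bench options right):
rocblas-bench -f gemv -r s --transposeA T -m 16192 -n 16192 --lda 16192 --incx 16192 -i 50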
Hi @daineAMD, Thank you for your examples; they have been useful in determining when to use gemv in my code to work around this issue and when not to.
Hi @IMbackK, Yes, it's good to keep this topic up to date since it's been delayed for so long; thanks for your reminder. No decisions have been made on a way forward yet. Currently, we are working on some potential optimizations for the gemv function, so I thought it best to hold off on making any changes until I can evaluate the performance of any changes to gemv, in case that makes the decision easier. In the meantime, I've mocked up some changes to potentially allow users to opt in to using gemv kernels from rocblas_gemm_ex().
Also, regarding half-precision gemv support, mixed-precision gemv functions are in rocBLAS as of ROCm 6.0. You can see their definitions in rocblas_functions.h. The precision prefixes represent input-compute-output types (e.g. hss is half-precision input, single-precision compute and output). It looks like they weren't added to the docs until ROCm 6.2, so they should appear in the rocBLAS documentation with ROCm 6.2. Sorry for not mentioning them previously; they slipped my mind. Thanks,
Thanks @daineAMD for the reply. For example, for the Mistral 7B model, the matrix shapes are:
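Assuming the standard Mistral 7B dimensions (hidden size 4096, 32 query heads and 8 KV heads of head dimension 128, intermediate size 14336, vocabulary size 32000), the decode-time m == 1 GEMMs are roughly:
q_proj / o_proj: m = 1, k = 4096, n = 4096
k_proj / v_proj: m = 1, k = 4096, n = 1024
gate_proj / up_proj: m = 1, k = 4096, n = 14336
down_proj: m = 1, k = 14336, n = 4096
lm_head: m = 1, k = 4096, n = 32000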
Hi @daineAMD, Thank you for your quick update.
Having this selectable via rocBLAS's API or an environment variable would work great for me as an interim solution, and presumably also for @Epliz.
Indeed, I was not aware of these functions due to the lack of documentation; thank you for bringing them to my attention! Thus far I have been upcasting to fp32.
Hi @daineAMD, Given the elapsed time, I would like to ask whether a direction forward has been decided internally.
Hi @IMbackK, I have a pull request open which redirects some calls to rocblas_gemm_ex() to use our internal gemv kernels rather than gemm kernels from Tensile, in the cases where I found our kernels to have better performance. This PR is currently under review, and I am in discussion with the Tensile team on how to handle some edge cases regarding the solution selection process. Thanks,
Hi @daineAMD, That sounds great; thank you for working on resolving this! I'm also interested in whether you know how this issue interacts with the recently added support for rocBLAS to cross-dispatch gemm to hipBLASLt.
Hi @daineAMD, Thanks a lot for pushing for this. I would also be happy to give it a try as an early tester, if you want and are able to share it.
Describe the bug
As described in the title, rocblas_gemm_ex seems quite suboptimal on MI100 when m == 1, inputs/outputs are fp16, and compute is fp32.
A quite naive kernel I implemented beats it.
Causes ROCm/pytorch#1408 in pytorch.
It makes LLM inference on Mistral 7B fp16 slower than it could easily be.
To Reproduce
Here is a C++ reproducer:
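A minimal sketch of such a reproducer (an assumption of the general shape, not the author's actual code): it times rocblas_gemm_ex with m == 1, fp16 inputs/outputs, and fp32 compute against a deliberately naive one-thread-per-output gemv kernel.

// naive_gemv_vs_gemm_ex.cpp (hypothetical sketch)
// Build: hipcc naive_gemv_vs_gemm_ex.cpp -lrocblas -o bench
#include <hip/hip_runtime.h>
#include <hip/hip_fp16.h>
#include <rocblas/rocblas.h>
#include <cstdio>

constexpr int K = 16192;  // reduction size (columns of A, rows of B)
constexpr int N = 16192;  // output size (columns of B and C)

// Naive gemv: one thread per output element, fp16 loads, fp32 accumulation.
__global__ void naive_gemv(const __half* B, const __half* x, __half* y,
                           int k, int n, int ldb) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= n) return;
    float acc = 0.0f;
    for (int i = 0; i < k; ++i)
        acc += __half2float(x[i]) * __half2float(B[(size_t)col * ldb + i]);
    y[col] = __float2half(acc);
}

int main() {
    __half *dB, *dx, *dy;
    hipMalloc(&dB, (size_t)K * N * sizeof(__half));
    hipMalloc(&dx, (size_t)K * sizeof(__half));
    hipMalloc(&dy, (size_t)N * sizeof(__half));
    hipMemset(dB, 0, (size_t)K * N * sizeof(__half));
    hipMemset(dx, 0, (size_t)K * sizeof(__half));

    rocblas_handle handle;
    rocblas_create_handle(&handle);
    float alpha = 1.0f, beta = 0.0f;
    const int iters = 50;

    hipEvent_t start, stop;
    hipEventCreate(&start);
    hipEventCreate(&stop);

    // C (1 x N) = A (1 x K) * B (K x N), fp16 in/out, fp32 compute, column-major.
    hipEventRecord(start, 0);
    for (int it = 0; it < iters; ++it)
        rocblas_gemm_ex(handle, rocblas_operation_none, rocblas_operation_none,
                        1, N, K, &alpha,
                        dx, rocblas_datatype_f16_r, 1,  // A: 1 x K, lda = 1
                        dB, rocblas_datatype_f16_r, K,  // B: K x N, ldb = K
                        &beta,
                        dy, rocblas_datatype_f16_r, 1,  // C: 1 x N, ldc = 1
                        dy, rocblas_datatype_f16_r, 1,  // D: 1 x N, ldd = 1
                        rocblas_datatype_f32_r, rocblas_gemm_algo_standard, 0, 0);
    hipEventRecord(stop, 0);
    hipEventSynchronize(stop);
    float ms_gemm = 0.0f;
    hipEventElapsedTime(&ms_gemm, start, stop);

    // Same math with the naive kernel.
    dim3 block(256), grid((N + 255) / 256);
    hipEventRecord(start, 0);
    for (int it = 0; it < iters; ++it)
        naive_gemv<<<grid, block>>>(dB, dx, dy, K, N, K);
    hipEventRecord(stop, 0);
    hipEventSynchronize(stop);
    float ms_naive = 0.0f;
    hipEventElapsedTime(&ms_naive, start, stop);

    printf("rocblas_gemm_ex: %.3f ms/iter, naive gemv: %.3f ms/iter\n",
           ms_gemm / iters, ms_naive / iters);

    rocblas_destroy_handle(handle);
    hipFree(dB); hipFree(dx); hipFree(dy);
    return 0;
}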
Expected behavior
It should be at least as fast as my naive kernel.
But running the above, I get:
Environment
environment.txt
Additional context
EDIT: replaced the originally included kernel with a better one
EDIT2: replaced the kernel with an even better one