HIP: adjust RDNA3.5 MMQ kernel selection logic #18666
IMbackK
left a comment
otherwise lgtm
}

// For some quantization types MMQ can have lower peak TOPS than hipBLAS
// so it's only faster for sufficiently small batch sizes:
extra spaces
This is intentional since the sentence is spanning multiple lines.
grepping around in the codebase, this is not the style used elsewhere, which makes it a bit awkward. but it's not a big deal
Beinsezii
left a comment
don't have the chance to test at the moment but it looks good. surprised that 3_0 is so much worse in mmq than everything else
for CDNA mmq is also a mixed bag. generally gfx1100, cdna1 and cdna2 have the best tuned tensile kernels, so i think it's more a case of blas doing better there than mmq doing worse.
Probably a visit to q2/q6 perf would help everyone then.
iirc from previous discussions the q2 performance anomaly also exists on cuda + mmq. someone could take a look at those kernels specifically; i haven't because i don't find the q2 variants a very interesting datatype.
For me Q6 is the one that hurts as it's perfect for Mistral 3.2 on 24GiB. Otherwise I probably wouldn't have ever found this problem.
Follow-up to #18537.
I was able to solve the technical issues I was having with my Strix Halo system and tested the performance change:
Details
This PR changes the kernel selection logic to use MMQ if either the performance of the hipBLAS path is worse or the hipBLAS speedup is so small that it is not really worth the increase in memory use.
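
The decision rule described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the struct `PathTiming`, the function `prefer_mmq`, and the 10% threshold `min_blas_speedup` are all hypothetical names and values chosen for the example. The actual selection logic may well be expressed differently, for example as per-type batch size limits.

```cpp
// Hypothetical benchmark result for one quantization type at one batch size.
struct PathTiming {
    double mmq_ms;   // runtime of the MMQ kernel
    double blas_ms;  // runtime of the hipBLAS path
};

// Returns true if MMQ should be selected.
// min_blas_speedup is an assumed cutoff: hipBLAS has to be at least this
// much faster than MMQ to justify its extra memory use.
bool prefer_mmq(const PathTiming & t, double min_blas_speedup = 1.10) {
    if (t.blas_ms >= t.mmq_ms) {
        return true; // hipBLAS is not faster, so MMQ wins outright
    }
    const double blas_speedup = t.mmq_ms / t.blas_ms;
    // hipBLAS is faster, but only keep it if the gain is large enough.
    return blas_speedup < min_blas_speedup;
}
```

With these assumed numbers, a case where hipBLAS is only 5% faster would still select MMQ, while a case where hipBLAS is 2x faster would select hipBLAS.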