Use Linear Layout to describe 2D block loads 1/? #3708
base: main

Conversation
Looks like some of the benchmark cases still have a bug: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/13956151531/job/39067632585
```
offset=32 -> (2, 0)
offset=64 -> (4, 0)
offset=128 -> (8, 0)
- iteration=1 -> (0, 16)
```
Here, you replicated the DPAS layout first in the inner dimension (while you replicated it first in the outer dimension for the A operand). I assume this is defined by the `order` attribute, but if you could make this more explicit, it would help understanding.
Yes, the outer and inner dimensions are determined by order - but you can replicate in either dimension depending on the parameters of the DPAS layout. I'm not sure where to make it more explicit in the doc - any suggestions?
where out dims are: [dim0 (size 16), dim1 (size 8)]
```

For this load we have two iterations in the outer dimension:
I don't understand where this number of iterations comes from. I think it comes from hardware limitations, but could you clarify this point, please?
The number of iterations is computed by taking the maximum size of contiguous DPAS tiles defined by the DPAS layout, subject to hardware limitations. I didn't want to try to enumerate all of the conditions because those are part of the algorithm prior to the linear layout. Maybe @chengjunlu can help?
Left inline comments mostly about code style. Overall this is a nice addition. I appreciate the use of LLVM_DEBUG macros to trace the code, they will make future debugging efforts (if necessary) easier. Thanks Alex.
```
@@ -1445,6 +1716,14 @@ struct LoadOpConversion
    for (int outer = 0; outer < numRepOuter; ++outer) {
      for (int k = 0; k < numRepInner; ++k) {
        for (int rep = 0; rep < repCluster[unsigned(opIdx)]; ++rep) {
          if (loadVals.find({outer * repCluster[unsigned(opIdx)] + rep, k}) ==
              loadVals.end()) {
            // generate a nice error message before the throw below aborts our
```
Is the throw you are referring to in a function called from this loop? Can we abort immediately after emitting the error message?
The `at` method throws if the key cannot be found. If we want to modify that behavior, I think we should track that separately from this PR.
```
LLVM_DEBUG(llvm::dbgs() << "dimOuterStr: " << dimOuterStr << "\n");
LLVM_DEBUG(llvm::dbgs() << "dimInnerStr: " << dimInnerStr << "\n");
```
Suggested change:
```
LLVM_DEBUG({
  llvm::dbgs() << "dimOuterStr: " << dimOuterStr << "\n";
  llvm::dbgs() << "dimInnerStr: " << dimInnerStr << "\n";
});
```
```
auto dimOuterStr = S("dim" + std::to_string(dimOuter));
auto dimInnerStr = S("dim" + std::to_string(dimInner));
```
Suggested change:
```
StringAttr dimOuterStr = S("dim" + std::to_string(dimOuter));
StringAttr dimInnerStr = S("dim" + std::to_string(dimInner));
```
Not a big deal, but using the static type is consistent with what you have already done below at lines 1261-1263.
```
StringAttr kOffset = S("offset");
StringAttr kIteration = S("iteration");
StringAttr kLoad = S("load");
auto createTileLayout = [&](const SmallVector<unsigned> &threadOrder,
```
Not a big deal, but the lambda can receive the argument as a `SmallVectorImpl<unsigned>`. This is customary in LLVM when passing arguments to a function because it permits the caller to pass in SmallVectors of any size. Example:
```
mlir::LogicalResult WarpGroupDotOp::inferReturnTypes(
    MLIRContext *context, std::optional<Location> location, ValueRange operands,
    DictionaryAttr attributes, OpaqueProperties properties, RegionRange regions,
    SmallVectorImpl<Type> &inferredReturnTypes)
```
```
auto outDimNames = standardOutDimNames(ctx, tensorShape.size());
LinearLayout layout = LinearLayout::empty();
SmallVector<StringAttr> kOffsetDims;
auto totalOffsets = 1;
```
Suggested change:
```
unsigned totalOffsets = 1;
```
```
const auto widthDim = threadOrder[rank - 2];
const auto origTileWidth = tileShape[widthDim];
```
Would be good to use the static type rather than `auto`. Just following the LLVM recommendation for the use of `auto` here:

"Use auto if and only if it makes the code more readable or easier to maintain. Don't 'almost always' use auto"

Reference: https://releases.llvm.org/10.0.0/docs/CodingStandards.html#use-auto-type-deduction-to-make-code-more-readable

This comment applies to similar instances of this pattern in this PR.
I see - will change to unsigned. There are probably some inconsistencies in unsigned vs int - shouldn't we get compiler warnings for those? I do like using `auto` when the method type is opaque, but I understand the LLVM developers' rationale.
```
kOffsetDims.push_back(kOffset);

assert(llvm::isPowerOf2_32(tileShape[dim]));
```
Need assert message
This code is based in part on the upstream layout conversion ops which do not have assert messages - e.g. https://github.com/intel/intel-xpu-backend-for-triton/blob/main/lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM.cpp#L293 and https://github.com/intel/intel-xpu-backend-for-triton/blob/main/lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM.cpp#L430 . I am not sure what message we would provide other than the differences in the argument, which should already be covered.
```
{
  size_t offset = 0;
  auto tensorVals =
      tileLayout.apply({{kOffset, offset}, {kIteration, itr}});
  assert(tensorVals.size() == 2);
  llvm::dbgs() << itr << ", " << offset << " : " << tensorVals[0].second
               << ", " << tensorVals[1].second << "\n";
}
```
Suggest making this code block a lambda function so that it can be reused in this loop, where this code pattern appears several times.
```
{
  size_t offset = 0;
  auto tensorVals = tileLayout.apply(
      {{kOffset, offset}, {kIteration, itr}, {kLoad, load}});
  assert(tensorVals.size() == 2);
  llvm::dbgs() << load << ", " << itr << ", " << offset << " : "
               << tensorVals[0].second << ", " << tensorVals[1].second
               << "\n";
}
```
Introduce a lambda function as in comment at line 1387
```
auto offset = tileLayout.apply(
    {{kOffset, 0}, {kIteration, 0}, {kLoad, loadIdx}});
assert(offset.size() == 2);
```
Add assert msg.
@chengjunlu ping. WDYT of this PR?
I am having some issues with the int8 layouts - probably because unlike the bf16/tf32 examples we pack 4 elements into the B matrix "slots". I think I will try encoding the packed elements into the linear layout tile size - so for bf16, the B matrix tile becomes 8x16 because each row has two packed elements, so we only need 8 rows.
Adjusted the layout to better reflect the packing corresponding to the VNNI transform, adjusted the approach for computing offsets from the layout, and then simplified the offset computations to better match the loop variables. I'm going to check benchmarks/CI and then remove the debug statements and respond to the remaining review comments.
This PR introduces a new linear layout in the Triton Load to LLVM lowering for block loads. I split the creation of the layouts out of the larger PR and focused on using the layouts to compute the (x, y) offsets for the 2D block load instructions to ensure correctness of the layout. The shuffle vectors are still being generated using existing loop variables.

The layout describes the block load in terms of three input parameters:
- `offset`: the 1D offset into the loaded data for a single DPAS invocation inside a sub-group
- `iteration`: identifies the DPAS invocation when multiple DPAS invocations share a single load
- `load`: identifies the load index when multiple loads occur for a given operand

The output of the layout function identifies the global (x, y) tensor coordinate within a given load. This was designed to allow composition of the DPAS layout and the load layout to go from offset, iteration, load to block, warp, lane, register or vice versa. Note that I do not currently encode all the information about the load into the layout - I wanted to maintain the surjective properties of the layout, and it's a bit easier to construct this way. So, sometimes a manual offset must be applied depending on the desired layout parameter.
Currently the block load / tile layout is implemented within the existing loop structure. But, the layout was designed to be used to generate the 2D block loads by iterating over layout parameters. The existing loop structure is still in place and debug info can be enabled which prints the previously generated values and the linear layout values for easy debugging. I am planning to generate the shuffle vectors using composition of layouts between the DPAS layout and load layout next.
cc #3008
supersedes #3487