
Use Linear Layout to describe 2D block loads 1/? #3708

Open · wants to merge 14 commits into base: main
Conversation

alexbaden (Contributor)

This PR introduces a new linear layout in the Triton Load to LLVM lowering for block loads. I split the creation of the layouts out of the larger PR and focused on using the layouts to compute the (x,y) offsets for the 2D block load instructions to ensure correctness of the layout. The shuffle vectors are still being generated using existing loop variables.

The layout describes the block load in terms of three input parameters:

  • offset: the 1D offset into the loaded data for a single DPAS invocation inside a sub-group
  • iteration: identifies the DPAS invocation when multiple DPAS invocations share a single load
  • load: the load index when multiple loads occur for a given operand

The output of the layout function identifies the global (x,y) tensor coordinate within a given load. This was designed to allow composition of the DPAS layout and the load layout, to go from offset, iteration, load to block, warp, lane, register or vice versa. Note that I do not currently encode all the information about the load into the layout - I wanted to maintain the surjective properties of the layout, and it is a bit easier to construct this way. So, sometimes a manual offset must be applied, depending on the desired layout parameter.
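As a rough illustration of what such a layout computes (this is not the PR's implementation): a linear layout decomposes each input into bits, assigns each bit a basis vector in the output space, and XORs together the bases of the set bits. The basis values below are made up for illustration, chosen only to be consistent with the offset/iteration debug output quoted later in this thread; the real bases are derived from the DPAS layout and the load shape.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>

// Toy model of the layout's apply function. Each input (offset, iteration,
// load) is decomposed into bits; each bit has a basis vector in (dim0, dim1)
// space; the output coordinate is the XOR of the bases of the set bits.
// All basis values here are illustrative assumptions.
using Coord = std::pair<uint32_t, uint32_t>; // (dim0 = row, dim1 = col)

static constexpr std::array<Coord, 8> kOffsetBases = {{
    {0, 1}, {0, 2}, {0, 4}, {0, 8}, // low bits walk the inner (col) dim
    {1, 0}, {2, 0}, {4, 0}, {8, 0}, // high bits walk the outer (row) dim
}};
static constexpr std::array<Coord, 1> kIterationBases = {{{0, 16}}};
static constexpr std::array<Coord, 1> kLoadBases = {{{16, 0}}};

template <std::size_t N>
static Coord applyBases(uint32_t value, const std::array<Coord, N> &bases,
                        Coord acc) {
  for (std::size_t bit = 0; bit < N; ++bit)
    if (value & (1u << bit)) {
      acc.first ^= bases[bit].first;
      acc.second ^= bases[bit].second;
    }
  return acc;
}

// apply({offset, iteration, load}) -> global (dim0, dim1) tensor coordinate.
Coord apply(uint32_t offset, uint32_t iteration, uint32_t load) {
  Coord c{0, 0};
  c = applyBases(offset, kOffsetBases, c);
  c = applyBases(iteration, kIterationBases, c);
  c = applyBases(load, kLoadBases, c);
  return c;
}
```

With these made-up bases, offset=32 lands at (2, 0) and iteration=1 at (0, 16), matching the shape of the mapping discussed below.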

Currently the block load / tile layout is implemented within the existing loop structure, but the layout was designed to generate the 2D block loads by iterating over the layout parameters. The existing loop structure is still in place, and debug info can be enabled which prints both the previously generated values and the linear layout values for easy comparison. I am planning to generate the shuffle vectors using composition of the DPAS layout and the load layout next.

cc #3008

supersedes #3487

@alexbaden alexbaden requested review from whitneywhtsang, etiotto, chengjunlu and a team March 18, 2025 22:12
@alexbaden alexbaden force-pushed the alex/block_loads_layouts_2 branch from c136dc1 to a44cc6d Compare March 19, 2025 13:08
@alexbaden (Contributor Author)

Looks like some of the benchmark cases still have a bug: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/13956151531/job/39067632585

```
offset=32 -> (2, 0)
offset=64 -> (4, 0)
offset=128 -> (8, 0)
- iteration=1 -> (0, 16)
```
Contributor

Here, you replicated the DPAS layout first in the inner dimension (while for the A operand you replicated it first in the outer dimension). I assume this is defined by the order attribute, but making this more explicit would help understanding.

Contributor Author

Yes, the outer and inner dimensions are determined by order - but you can replicate in either dimension depending on the parameters of the DPAS layout. I'm not sure where to make it more explicit in the doc - any suggestions?
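A minimal sketch of what "determined by order" means here (illustrative, not the PR's code): order lists dimensions fastest-varying first, so swapping it swaps which dimension is replicated first.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Enumerate replication coordinates for a 2D layout. `order` lists dimensions
// fastest-varying first, so order[0] is the inner dimension and replication
// walks it first; `reps` gives the replica count per dimension. Swapping the
// order swaps which dimension advances first, the difference observed between
// the A and B operands above. Names and shapes are illustrative assumptions.
std::vector<std::array<unsigned, 2>>
replicationCoords(const std::array<unsigned, 2> &order,
                  const std::array<unsigned, 2> &reps) {
  std::vector<std::array<unsigned, 2>> coords;
  for (unsigned slow = 0; slow < reps[order[1]]; ++slow)
    for (unsigned fast = 0; fast < reps[order[0]]; ++fast) {
      std::array<unsigned, 2> c{};
      c[order[0]] = fast; // inner dimension advances first
      c[order[1]] = slow;
      coords.push_back(c);
    }
  return coords;
}
```

With order = {1, 0} the second coordinate emitted is (0, 1) (inner dim1 advanced first); with order = {0, 1} it is (1, 0).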

```
where out dims are: [dim0 (size 16), dim1 (size 8)]
```

For this load we have two iterations in the outer dimension:
Contributor

I don't understand where this number of iterations comes from. I think it comes from hardware limitations, but could you clarify this point please?

Contributor Author

The number of iterations is computed by taking the maximum size of contiguous dpas tiles defined by the DPAS layout, subject to hardware limitations. I didn't want to try and enumerate all of the conditions because those are part of the algorithm prior to the linear layout. Maybe @chengjunlu can help?
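A hypothetical sketch of the computation described above. The function name, parameters, and clamp are assumptions, not the PR's actual algorithm; the hardware conditions live in the pre-existing code.

```cpp
#include <algorithm>
#include <cassert>

// Sketch only: the contiguous span of DPAS tiles in the outer dimension gives
// a candidate load height, the hardware's maximum 2D block load height clamps
// it, and the number of DPAS tiles covered by the clamped load is the
// iteration count. All names here are illustrative assumptions.
unsigned computeIterations(unsigned tileHeight, unsigned repClusterOuter,
                           unsigned hwMaxBlockHeight) {
  // Rows covered by the contiguous DPAS tiles for one operand.
  unsigned contiguousRows = tileHeight * repClusterOuter;
  // Clamp to the hardware limit on the height of a single 2D block load.
  unsigned loadRows = std::min(contiguousRows, hwMaxBlockHeight);
  // DPAS invocations whose data arrives in a single load.
  return loadRows / tileHeight;
}
```

For example, four 8-row DPAS tiles fit in one 32-row load (4 iterations), but a 16-row hardware limit would halve that to 2.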

@etiotto (Contributor) left a comment

Left inline comments mostly about code style. Overall this is a nice addition. I appreciate the use of LLVM_DEBUG macros to trace the code, they will make future debugging efforts (if necessary) easier. Thanks Alex.

```
@@ -1445,6 +1716,14 @@ struct LoadOpConversion
    for (int outer = 0; outer < numRepOuter; ++outer) {
      for (int k = 0; k < numRepInner; ++k) {
        for (int rep = 0; rep < repCluster[unsigned(opIdx)]; ++rep) {
          if (loadVals.find({outer * repCluster[unsigned(opIdx)] + rep, k}) ==
              loadVals.end()) {
            // generate a nice error message before the throw below aborts our
```
Contributor

Is the throw you are referring to in a function called from this loop? Can we abort immediately after emitting the error message?

Contributor Author

The at method on the container throws if the key cannot be found. If we want to modify that behavior, I think we should track it separately from this PR.
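For context, a simplified stand-in for the pattern under discussion (not the PR's actual types): std::map::at throws std::out_of_range when the key is missing, so checking with find first lets the code emit a readable diagnostic before the throw aborts compilation.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Stand-in for the loadVals lookup: produce a descriptive error string when
// the key is absent instead of letting at() throw with no context. The
// container type and function name are illustrative assumptions.
std::string
lookupOrDescribe(const std::map<std::pair<int, int>, std::string> &loadVals,
                 std::pair<int, int> key) {
  if (loadVals.find(key) == loadVals.end())
    return "error: no load value for (" + std::to_string(key.first) + ", " +
           std::to_string(key.second) + ")";
  return loadVals.at(key); // safe: presence verified above
}
```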

Comment on lines +1253 to +1254

```
LLVM_DEBUG(llvm::dbgs() << "dimOuterStr: " << dimOuterStr << "\n");
LLVM_DEBUG(llvm::dbgs() << "dimInnerStr: " << dimInnerStr << "\n");
```
Contributor

Suggested change:

```
- LLVM_DEBUG(llvm::dbgs() << "dimOuterStr: " << dimOuterStr << "\n");
- LLVM_DEBUG(llvm::dbgs() << "dimInnerStr: " << dimInnerStr << "\n");
+ LLVM_DEBUG({
+   llvm::dbgs() << "dimOuterStr: " << dimOuterStr << "\n";
+   llvm::dbgs() << "dimInnerStr: " << dimInnerStr << "\n";
+ });
```

Comment on lines +1251 to +1252

```
auto dimOuterStr = S("dim" + std::to_string(dimOuter));
auto dimInnerStr = S("dim" + std::to_string(dimInner));
```
Contributor

Suggested change:

```
- auto dimOuterStr = S("dim" + std::to_string(dimOuter));
- auto dimInnerStr = S("dim" + std::to_string(dimInner));
+ StringAttr dimOuterStr = S("dim" + std::to_string(dimOuter));
+ StringAttr dimInnerStr = S("dim" + std::to_string(dimInner));
```

Not a big deal, but using the static type is consistent with what you have already done below at lines 1261-1263:

```
StringAttr kOffset = S("offset");
StringAttr kIteration = S("iteration");
StringAttr kLoad = S("load");
auto createTileLayout = [&](const SmallVector<unsigned> &threadOrder,
```
Contributor

Not a big deal, but the lambda can receive the argument as a SmallVectorImpl<unsigned>. This is customary in LLVM when passing arguments to a function because it permits the caller to pass in SmallVectors of any size. Example:

```
mlir::LogicalResult WarpGroupDotOp::inferReturnTypes(
    MLIRContext *context, std::optional<Location> location, ValueRange operands,
    DictionaryAttr attributes, OpaqueProperties properties, RegionRange regions,
    SmallVectorImpl<Type> &inferredReturnTypes)
```

```
auto outDimNames = standardOutDimNames(ctx, tensorShape.size());
LinearLayout layout = LinearLayout::empty();
SmallVector<StringAttr> kOffsetDims;
auto totalOffsets = 1;
```
Contributor

Suggested change:

```
- auto totalOffsets = 1;
+ unsigned totalOffsets = 1;
```

Comment on lines 1273 to 1274

```
const auto widthDim = threadOrder[rank - 2];
const auto origTileWidth = tileShape[widthDim];
```
Contributor

Would be good to use the static type rather than auto. Just following the LLVM recommendation for the use of auto here:
"Use auto if and only if it makes the code more readable or easier to maintain. Don’t “almost always” use auto"

Reference in: https://releases.llvm.org/10.0.0/docs/CodingStandards.html#use-auto-type-deduction-to-make-code-more-readable

This comment applies to similar instances of this pattern in this PR.

Contributor Author

I see - will change to unsigned. There are probably some inconsistencies in unsigned vs int - shouldn't we get compiler warnings for those?

I do like using auto when the method type is opaque, but I understand the LLVM developers' rationale.


```
kOffsetDims.push_back(kOffset);

assert(llvm::isPowerOf2_32(tileShape[dim]));
```
Contributor

Need assert message

Contributor Author

This code is based in part on the upstream layout conversion ops which do not have assert messages - e.g. https://github.com/intel/intel-xpu-backend-for-triton/blob/main/lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM.cpp#L293 and https://github.com/intel/intel-xpu-backend-for-triton/blob/main/lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM.cpp#L430 . I am not sure what message we would provide other than the differences in the argument, which should already be covered.

Comment on lines +1380 to +1406

```
{
  size_t offset = 0;
  auto tensorVals =
      tileLayout.apply({{kOffset, offset}, {kIteration, itr}});
  assert(tensorVals.size() == 2);
  llvm::dbgs() << itr << ", " << offset << " : " << tensorVals[0].second
               << ", " << tensorVals[1].second << "\n";
}
```
Contributor

Suggest making this code block a lambda function so that it can be reused in this loop, where this code pattern appears several times.
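A sketch of the suggested refactor with stand-in types: in the PR the helper would capture tileLayout, call tileLayout.apply(...), and be wrapped in LLVM_DEBUG; here a plain std::function stands in for the layout's apply.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <sstream>
#include <string>
#include <utility>

using Coord = std::pair<uint32_t, uint32_t>;

// Hoist the repeated "apply layout and print" block into one helper defined
// before the loop; each call site then becomes a single line. `applyLayout`
// is an illustrative stand-in for tileLayout.apply(...).
std::string
dumpMapping(const std::function<Coord(uint32_t, uint32_t)> &applyLayout,
            uint32_t itr, uint32_t offset) {
  Coord c = applyLayout(offset, itr);
  std::ostringstream os;
  os << itr << ", " << offset << " : " << c.first << ", " << c.second;
  return os.str();
}
```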

Comment on lines +1468 to +1503

```
{
  size_t offset = 0;
  auto tensorVals = tileLayout.apply(
      {{kOffset, offset}, {kIteration, itr}, {kLoad, load}});
  assert(tensorVals.size() == 2);
  llvm::dbgs() << load << ", " << itr << ", " << offset << " : "
               << tensorVals[0].second << ", " << tensorVals[1].second
               << "\n";
}
```
Contributor

Introduce a lambda function as in comment at line 1387


```
auto offset = tileLayout.apply(
    {{kOffset, 0}, {kIteration, 0}, {kLoad, loadIdx}});
assert(offset.size() == 2);
```
Contributor

Add assert msg.

@etiotto (Contributor) commented Mar 20, 2025

@chengjunlu ping. WDYT of this PR ?

@alexbaden (Contributor Author)

I am having some issues with the int8 layouts - probably because, unlike the bf16/tf32 examples, we pack 4 elements into the B matrix "slots". I think I will try encoding the packed elements into the linear layout tile size. For bf16, the B matrix tile becomes 8x16: each row holds two packed elements, so we only need 8 rows.
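The packing arithmetic described here can be sketched as follows (illustrative: a 16x16 logical B tile is assumed, and the names are not from the PR):

```cpp
#include <cassert>
#include <utility>

// Elements are packed 32 / elemBits per 32-bit "slot" (the VNNI-style packing
// discussed above), shrinking the physical row count of the tile. Returns the
// packed (rows, cols) shape. Function name and parameters are assumptions.
std::pair<unsigned, unsigned> packedTileShape(unsigned rows, unsigned cols,
                                              unsigned elemBits) {
  unsigned packFactor = 32 / elemBits; // elements per 32-bit slot
  return {rows / packFactor, cols};
}
```

For bf16 (16-bit) a 16x16 logical tile packs to 8x16, matching the comment above; int8 would pack 4 elements per slot, giving 4x16.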

@alexbaden alexbaden force-pushed the alex/block_loads_layouts_2 branch from 52ba86a to 6f69c41 Compare March 21, 2025 01:03
@alexbaden alexbaden force-pushed the alex/block_loads_layouts_2 branch from 309328e to cb00f2e Compare March 24, 2025 17:11
@alexbaden (Contributor Author)

Adjusted the layout to better reflect the packing performed by the VNNI transform, adjusted the approach for computing offsets from the layout, and simplified the offset computations to better match the loop variables. I'm going to check benchmarks/CI, then remove the debug statements and respond to the remaining review comments.
