
Use Linear Layout to describe 2D block loads 1/? #3708

Open · wants to merge 14 commits into base: main
Conversation

alexbaden (Contributor)

This PR introduces a new linear layout in the Triton Load to LLVM lowering for block loads. I split the creation of the layouts out of the larger PR and focused on using the layouts to compute the (x,y) offsets for the 2D block load instructions to ensure correctness of the layout. The shuffle vectors are still being generated using existing loop variables.

The layout describes the block load in terms of three input parameters:

  • offset: the 1D offset into the loaded data for a single DPAS invocation inside a sub-group
  • iteration: identifies the DPAS invocation when multiple DPAS invocations share a single load
  • load: the load index when multiple loads occur for a given operand

The output of the layout function identifies the global (x,y) tensor coordinate within a given load. This was designed to allow composition of the DPAS layout and the load layout, to go from offset, iteration, load to block, warp, lane, register or vice versa. Note that I do not currently encode all the information about the load into the layout - I wanted to maintain the surjective properties of the layout, and it is a bit easier to construct this way. So, sometimes a manual offset must be applied, depending on the desired layout parameter.
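As a rough illustration of what such a layout computes (this is not the PR's implementation): a linear layout decomposes each input into bits, assigns each bit a basis vector in the output space, and XORs together the bases of the set bits. The basis values below are made up for illustration, chosen only to be consistent with the offset/iteration debug output quoted later in this thread; the real bases are derived from the DPAS layout and the load shape.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>

// Toy model of the layout's apply function. Each input (offset, iteration,
// load) is decomposed into bits; each bit has a basis vector in (dim0, dim1)
// space; the output coordinate is the XOR of the bases of the set bits.
// All basis values here are illustrative assumptions.
using Coord = std::pair<uint32_t, uint32_t>; // (dim0 = row, dim1 = col)

static constexpr std::array<Coord, 8> kOffsetBases = {{
    {0, 1}, {0, 2}, {0, 4}, {0, 8}, // low bits walk the inner (col) dim
    {1, 0}, {2, 0}, {4, 0}, {8, 0}, // high bits walk the outer (row) dim
}};
static constexpr std::array<Coord, 1> kIterationBases = {{{0, 16}}};
static constexpr std::array<Coord, 1> kLoadBases = {{{16, 0}}};

template <std::size_t N>
static Coord applyBases(uint32_t value, const std::array<Coord, N> &bases,
                        Coord acc) {
  for (std::size_t bit = 0; bit < N; ++bit)
    if (value & (1u << bit)) {
      acc.first ^= bases[bit].first;
      acc.second ^= bases[bit].second;
    }
  return acc;
}

// apply({offset, iteration, load}) -> global (dim0, dim1) tensor coordinate.
Coord apply(uint32_t offset, uint32_t iteration, uint32_t load) {
  Coord c{0, 0};
  c = applyBases(offset, kOffsetBases, c);
  c = applyBases(iteration, kIterationBases, c);
  c = applyBases(load, kLoadBases, c);
  return c;
}
```

With these made-up bases, offset=32 lands at (2, 0) and iteration=1 at (0, 16), matching the shape of the mapping discussed below.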

Currently the block load / tile layout is implemented within the existing loop structure, but the layout was designed to generate the 2D block loads by iterating over the layout parameters. The existing loop structure is still in place, and debug info can be enabled which prints both the previously generated values and the linear layout values for easy comparison. I am planning to generate the shuffle vectors using composition of the DPAS layout and the load layout next.

cc #3008

supersedes #3487

@alexbaden alexbaden requested review from whitneywhtsang, etiotto, chengjunlu and a team March 18, 2025 22:12
@alexbaden alexbaden force-pushed the alex/block_loads_layouts_2 branch from c136dc1 to a44cc6d Compare March 19, 2025 13:08
@alexbaden (Contributor Author)

Looks like some of the benchmark cases still have a bug: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/13956151531/job/39067632585

```
offset=32 -> (2, 0)
offset=64 -> (4, 0)
offset=128 -> (8, 0)
- iteration=1 -> (0, 16)
```
Contributor

Here, you replicated the DPAS layout first in the inner dimension (while for the A operand you replicated it first in the outer dimension). I assume this is defined by the order attribute, but making this more explicit would help understanding.

Contributor Author

Yes, the outer and inner dimensions are determined by order - but you can replicate in either dimension depending on the parameters of the DPAS layout. I'm not sure where to make it more explicit in the doc - any suggestions?
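A minimal sketch of what "determined by order" means here (illustrative, not the PR's code): order lists dimensions fastest-varying first, so swapping it swaps which dimension is replicated first.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Enumerate replication coordinates for a 2D layout. `order` lists dimensions
// fastest-varying first, so order[0] is the inner dimension and replication
// walks it first; `reps` gives the replica count per dimension. Swapping the
// order swaps which dimension advances first, the difference observed between
// the A and B operands above. Names and shapes are illustrative assumptions.
std::vector<std::array<unsigned, 2>>
replicationCoords(const std::array<unsigned, 2> &order,
                  const std::array<unsigned, 2> &reps) {
  std::vector<std::array<unsigned, 2>> coords;
  for (unsigned slow = 0; slow < reps[order[1]]; ++slow)
    for (unsigned fast = 0; fast < reps[order[0]]; ++fast) {
      std::array<unsigned, 2> c{};
      c[order[0]] = fast; // inner dimension advances first
      c[order[1]] = slow;
      coords.push_back(c);
    }
  return coords;
}
```

With order = {1, 0} the second coordinate emitted is (0, 1) (inner dim1 advanced first); with order = {0, 1} it is (1, 0).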

```
where out dims are: [dim0 (size 16), dim1 (size 8)]
```

For this load we have two iterations in the outer dimension:
Contributor

I don't understand where this number of iterations comes from. I think it comes from hardware limitations, but could you clarify this point please?

Contributor Author

The number of iterations is computed by taking the maximum size of contiguous dpas tiles defined by the DPAS layout, subject to hardware limitations. I didn't want to try and enumerate all of the conditions because those are part of the algorithm prior to the linear layout. Maybe @chengjunlu can help?
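A hypothetical sketch of the computation described above. The function name, parameters, and clamp are assumptions, not the PR's actual algorithm; the hardware conditions live in the pre-existing code.

```cpp
#include <algorithm>
#include <cassert>

// Sketch only: the contiguous span of DPAS tiles in the outer dimension gives
// a candidate load height, the hardware's maximum 2D block load height clamps
// it, and the number of DPAS tiles covered by the clamped load is the
// iteration count. All names here are illustrative assumptions.
unsigned computeIterations(unsigned tileHeight, unsigned repClusterOuter,
                           unsigned hwMaxBlockHeight) {
  // Rows covered by the contiguous DPAS tiles for one operand.
  unsigned contiguousRows = tileHeight * repClusterOuter;
  // Clamp to the hardware limit on the height of a single 2D block load.
  unsigned loadRows = std::min(contiguousRows, hwMaxBlockHeight);
  // DPAS invocations whose data arrives in a single load.
  return loadRows / tileHeight;
}
```

For example, four 8-row DPAS tiles fit in one 32-row load (4 iterations), but a 16-row hardware limit would halve that to 2.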

@etiotto (Contributor) left a comment

Left inline comments mostly about code style. Overall this is a nice addition. I appreciate the use of LLVM_DEBUG macros to trace the code, they will make future debugging efforts (if necessary) easier. Thanks Alex.

```
@@ -1445,6 +1716,14 @@ struct LoadOpConversion
    for (int outer = 0; outer < numRepOuter; ++outer) {
      for (int k = 0; k < numRepInner; ++k) {
        for (int rep = 0; rep < repCluster[unsigned(opIdx)]; ++rep) {
          if (loadVals.find({outer * repCluster[unsigned(opIdx)] + rep, k}) ==
              loadVals.end()) {
            // generate a nice error message before the throw below aborts our
```
Contributor

Is the throw you are referring to in a function called from this loop? Can we abort immediately after emitting the error message?

Contributor Author

The at method on the container throws if the key cannot be found. If we want to modify that behavior, I think we should track it separately from this PR.
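For context, a simplified stand-in for the pattern under discussion (not the PR's actual types): std::map::at throws std::out_of_range when the key is missing, so checking with find first lets the code emit a readable diagnostic before the throw aborts compilation.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Stand-in for the loadVals lookup: produce a descriptive error string when
// the key is absent instead of letting at() throw with no context. The
// container type and function name are illustrative assumptions.
std::string
lookupOrDescribe(const std::map<std::pair<int, int>, std::string> &loadVals,
                 std::pair<int, int> key) {
  if (loadVals.find(key) == loadVals.end())
    return "error: no load value for (" + std::to_string(key.first) + ", " +
           std::to_string(key.second) + ")";
  return loadVals.at(key); // safe: presence verified above
}
```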

Comment on lines +1253 to +1254

```
LLVM_DEBUG(llvm::dbgs() << "dimOuterStr: " << dimOuterStr << "\n");
LLVM_DEBUG(llvm::dbgs() << "dimInnerStr: " << dimInnerStr << "\n");
```
Contributor

Suggested change:

```
- LLVM_DEBUG(llvm::dbgs() << "dimOuterStr: " << dimOuterStr << "\n");
- LLVM_DEBUG(llvm::dbgs() << "dimInnerStr: " << dimInnerStr << "\n");
+ LLVM_DEBUG({
+   llvm::dbgs() << "dimOuterStr: " << dimOuterStr << "\n";
+   llvm::dbgs() << "dimInnerStr: " << dimInnerStr << "\n";
+ });
```

Comment on lines +1251 to +1252

```
auto dimOuterStr = S("dim" + std::to_string(dimOuter));
auto dimInnerStr = S("dim" + std::to_string(dimInner));
```
Contributor

Suggested change:

```
- auto dimOuterStr = S("dim" + std::to_string(dimOuter));
- auto dimInnerStr = S("dim" + std::to_string(dimInner));
+ StringAttr dimOuterStr = S("dim" + std::to_string(dimOuter));
+ StringAttr dimInnerStr = S("dim" + std::to_string(dimInner));
```

Not a big deal, but using the static type is consistent with what you have already done below at lines 1261-1263:

```
StringAttr kOffset = S("offset");
StringAttr kIteration = S("iteration");
StringAttr kLoad = S("load");
auto createTileLayout = [&](const SmallVector<unsigned> &threadOrder,
```
Contributor

Not a big deal, but the lambda can receive the argument as a SmallVectorImpl<unsigned>. This is customary in LLVM when passing arguments to a function because it permits the caller to pass in SmallVectors of any size. Example:

```
mlir::LogicalResult WarpGroupDotOp::inferReturnTypes(
    MLIRContext *context, std::optional<Location> location, ValueRange operands,
    DictionaryAttr attributes, OpaqueProperties properties, RegionRange regions,
    SmallVectorImpl<Type> &inferredReturnTypes)
```

```
auto outDimNames = standardOutDimNames(ctx, tensorShape.size());
LinearLayout layout = LinearLayout::empty();
SmallVector<StringAttr> kOffsetDims;
auto totalOffsets = 1;
```
Contributor

Suggested change:

```
- auto totalOffsets = 1;
+ unsigned totalOffsets = 1;
```

Comment on lines 1273 to 1274

```
const auto widthDim = threadOrder[rank - 2];
const auto origTileWidth = tileShape[widthDim];
```
Contributor

Would be good to use the static type rather than auto. Just following the LLVM recommendation for the use of auto here:
"Use auto if and only if it makes the code more readable or easier to maintain. Don’t “almost always” use auto"

Reference in: https://releases.llvm.org/10.0.0/docs/CodingStandards.html#use-auto-type-deduction-to-make-code-more-readable

This comment applies to similar instances of this pattern in this PR.

Contributor Author

I see - will change to unsigned. There are probably some inconsistencies in unsigned vs int - shouldn't we get compiler warnings for those?

I do like using auto when the method type is opaque, but I understand the LLVM developers' rationale.


```
kOffsetDims.push_back(kOffset);

assert(llvm::isPowerOf2_32(tileShape[dim]));
```
Contributor

Need assert message

Contributor Author

This code is based in part on the upstream layout conversion ops which do not have assert messages - e.g. https://github.com/intel/intel-xpu-backend-for-triton/blob/main/lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM.cpp#L293 and https://github.com/intel/intel-xpu-backend-for-triton/blob/main/lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM.cpp#L430 . I am not sure what message we would provide other than the differences in the argument, which should already be covered.

Comment on lines +1380 to +1406

```
{
  size_t offset = 0;
  auto tensorVals =
      tileLayout.apply({{kOffset, offset}, {kIteration, itr}});
  assert(tensorVals.size() == 2);
  llvm::dbgs() << itr << ", " << offset << " : " << tensorVals[0].second
               << ", " << tensorVals[1].second << "\n";
}
```
Contributor

Suggest making this code block a lambda function so that it can be reused in this loop, where this code pattern appears several times.
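A sketch of the suggested refactor with stand-in types: in the PR the helper would capture tileLayout, call tileLayout.apply(...), and be wrapped in LLVM_DEBUG; here a plain std::function stands in for the layout's apply.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <sstream>
#include <string>
#include <utility>

using Coord = std::pair<uint32_t, uint32_t>;

// Hoist the repeated "apply layout and print" block into one helper defined
// before the loop; each call site then becomes a single line. `applyLayout`
// is an illustrative stand-in for tileLayout.apply(...).
std::string
dumpMapping(const std::function<Coord(uint32_t, uint32_t)> &applyLayout,
            uint32_t itr, uint32_t offset) {
  Coord c = applyLayout(offset, itr);
  std::ostringstream os;
  os << itr << ", " << offset << " : " << c.first << ", " << c.second;
  return os.str();
}
```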

Comment on lines +1468 to +1503

```
{
  size_t offset = 0;
  auto tensorVals = tileLayout.apply(
      {{kOffset, offset}, {kIteration, itr}, {kLoad, load}});
  assert(tensorVals.size() == 2);
  llvm::dbgs() << load << ", " << itr << ", " << offset << " : "
               << tensorVals[0].second << ", " << tensorVals[1].second
               << "\n";
}
```
Contributor

Introduce a lambda function as in comment at line 1387


```
auto offset = tileLayout.apply(
    {{kOffset, 0}, {kIteration, 0}, {kLoad, loadIdx}});
assert(offset.size() == 2);
```
Contributor

Add assert msg.

@etiotto (Contributor) commented Mar 20, 2025

@chengjunlu ping. WDYT of this PR ?

@alexbaden (Contributor Author)

I am having some issues with the int8 layouts - probably because, unlike the bf16/tf32 examples, we pack 4 elements into the B matrix "slots". I think I will try encoding the packed elements into the linear layout tile size. For bf16, the B matrix tile becomes 8x16: each row holds two packed elements, so we only need 8 rows.
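The packing arithmetic described here can be sketched as follows (illustrative: a 16x16 logical B tile is assumed, and the names are not from the PR):

```cpp
#include <cassert>
#include <utility>

// Elements are packed 32 / elemBits per 32-bit "slot" (the VNNI-style packing
// discussed above), shrinking the physical row count of the tile. Returns the
// packed (rows, cols) shape. Function name and parameters are assumptions.
std::pair<unsigned, unsigned> packedTileShape(unsigned rows, unsigned cols,
                                              unsigned elemBits) {
  unsigned packFactor = 32 / elemBits; // elements per 32-bit slot
  return {rows / packFactor, cols};
}
```

For bf16 (16-bit) a 16x16 logical tile packs to 8x16, matching the comment above; int8 would pack 4 elements per slot, giving 4x16.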

@alexbaden alexbaden force-pushed the alex/block_loads_layouts_2 branch from 52ba86a to 6f69c41 Compare March 21, 2025 01:03
@alexbaden alexbaden force-pushed the alex/block_loads_layouts_2 branch from 309328e to cb00f2e Compare March 24, 2025 17:11
@alexbaden (Contributor Author)

Adjusted the layout to better reflect the packing performed by the VNNI transform, adjusted the approach for computing offsets from the layout, and simplified the offset computations to better match the loop variables. I'm going to check benchmarks/CI, then remove the debug statements and respond to the remaining review comments.
