
Commit 52ba86a

format
1 parent 0dfa313 commit 52ba86a

File tree

docs/BLOCK_LOADS_LAYOUT.md

1 file changed: +30 −30 lines changed
@@ -1,14 +1,14 @@
# Using Linear Layout for Intel 2D Block Loads

The Intel Xe2/Xe3 Triton backend relies on the [`Dot Product Accumulate Systolic`](https://github.com/intel/intel-graphics-compiler/blob/master/documentation/visa/instructions/DPAS.md) (DPAS) instruction for cooperative subgroup MMA operations. Cooperative load instructions are important for maximizing performance when using DPAS to accelerate MMA. A typical GEMM kernel will generate several DPAS instructions per subgroup. Even with a 2D block load instruction, loading the data for each individual DPAS instruction would be inefficient. Therefore, when lowering to LLVM, the Intel backend expands load instructions into 2D block loads that load the data for all contiguous blocks used by the DPAS instructions in a subgroup, up to hardware limitations.

The lowering of `tt.dot` operations to `DPAS` instructions and of tensor loads to 2D block loads starts by identifying and lowering the tensor loads. The lowering occurs during the `Triton GPU to LLVM` conversion pass. For each `TTGIR::LoadOp` operation we emit both an LLVM IR function call to the appropriate 2D block load instruction(s) and a set of shuffle vectors that transform the output of the block load into the required input for the DPAS instruction. This transformation is expected to take place only in registers and is represented using LLVM IR virtual registers.

The 2D block load size is primarily determined by the tensor layout attached to the Triton `tt.dot` operation. However, there are numerous special cases and hardware limitations that apply to the load. The algorithm starts with the tile size required by DPAS and expands that tile based on the parameters of the DPAS layout, subject to hardware limitations. Below we present example layouts for simple `AxB` and `AxBT` GEMM kernels.

## `AxB`

We will first consider the `AxB` GEMM kernel with `A` matrix dimensions `[1024, 5120]` and `B` matrix dimensions `[5120, 4096]`. The Triton kernel is directed to use block size `256` in the `M` and `N` dimensions and block size `32` in the `K` dimension (`5120`). The Triton backend automatically partitions the input and output matrices across workgroups/subgroups during the `Triton GPU IR` lowering phase.
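
The kernel source itself is not reproduced in this document; a minimal Triton GEMM sketch with these block sizes might look like the following. The kernel name, argument order, and the commented launch are illustrative assumptions (float16 inputs with a float32 output buffer; no bounds masks are needed because the block sizes divide the matrix dimensions evenly):

```python
import triton
import triton.language as tl


@triton.jit
def gemm_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                stride_am, stride_ak, stride_bk, stride_bn,
                stride_cm, stride_cn,
                BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    # Pointers to the current [BLOCK_M, BLOCK_K] and [BLOCK_K, BLOCK_N] tiles.
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs)      # tensor loads the backend lowers to 2D block loads
        b = tl.load(b_ptrs)
        acc = tl.dot(a, b, acc)  # lowered to DPAS instructions on Xe2/Xe3
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc)        # assumes a float32 C buffer


# Illustrative launch for the shapes above:
# grid = (1024 // 256, 4096 // 256)
# gemm_kernel[grid](a, b, c, 1024, 4096, 5120,
#                   a.stride(0), a.stride(1), b.stride(0), b.stride(1),
#                   c.stride(0), c.stride(1),
#                   BLOCK_M=256, BLOCK_N=256, BLOCK_K=32)
```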

### `A` matrix

@@ -18,7 +18,7 @@ We will start with the A matrix load, which is identical for both the `B` and `B
tensor<256x32xf16, #ttg.dot_op<{opIdx = 0, parent = #triton_intel_gpu.dpas<{repeatCount = 8, systolicDepth = 8, executionSize = 16, opsPerChan = 2, threadsPerWarp = 16, warpsPerCTA = [8, 4], repCluster = [4, 2], A = [32, 16], B = [16, 32], C = [32, 32]}>, kWidth = 1}>>
```

Note that the tensor type describes the expected data layout feeding into the `tt.dot` / DPAS instructions. While this type is attached to the load instruction, the data loaded may not match the layout described above. The lowering of the `LoadOp` to a 2D block load includes an implicit conversion to make the loaded data match the desired tensor type. This conversion is instantiated using shuffle vectors.

We can use the `triton-tensor-layout` utility to print the DPAS layout with a hardware-centric view (i.e. the register/lane/warp mapping to tensor coordinates) using the following command:
```
@@ -93,7 +93,7 @@ Warp0:
( 30,16), ( 30,17), ( 30,18), ( 30,19), ( 30,20), ( 30,21), ( 30,22), ( 30,23), ( 30,24), ( 30,25), ( 30,26), ( 30,27), ( 30,28), ( 30,29), ( 30,30), ( 30,31)
( 31,16), ( 31,17), ( 31,18), ( 31,19), ( 31,20), ( 31,21), ( 31,22), ( 31,23), ( 31,24), ( 31,25), ( 31,26), ( 31,27), ( 31,28), ( 31,29), ( 31,30), ( 31,31)
```
Because we are only interested in a single subgroup (warp), the other warps in the layout are omitted.

For a `bf16 x bf16` DPAS (with an `fp32` accumulator) the `A` matrix input to each instruction is expected to be `8x16` with the following subgroup layout:

@@ -108,12 +108,12 @@ For a `bf16 x bf16` DPAS (with `fp32` accumulator) the A matrix input to each in
| 6, 0 | 6, 1 | 6, 2 | 6, 3 | 6, 4 | 6, 5 | 6, 6 | 6, 7 | 6, 8 | 6, 9 | 6, 10 | 6, 11 | 6, 12 | 6, 13 | 6, 14 | 6, 15 |
| 7, 0 | 7, 1 | 7, 2 | 7, 3 | 7, 4 | 7, 5 | 7, 6 | 7, 7 | 7, 8 | 7, 9 | 7, 10 | 7, 11 | 7, 12 | 7, 13 | 7, 14 | 7, 15 |

We now have all the information we need to determine both the number of 2D block loads for the `A` matrix and their sizes.

Up to this point the load for the `A` matrix is represented as a single load per block, with the `TTGIR` layouts mapping that block to the underlying hardware primitives (workgroups and subgroups/warps). We need to translate that load into one or more load instructions per subgroup (warp). We start with the DPAS tile size, which as shown above is `8x16`. We create a linear layout which takes a single input parameter, `offset`, a 1D row-wise index into the tile, and outputs the 2D tensor coordinate corresponding to that index:

```
Block load tile layout:
- offset=1 -> (0, 1)
offset=2 -> (0, 2)
offset=4 -> (0, 4)
@@ -123,7 +123,7 @@ Block load tile layout:
offset=64 -> (4, 0)
where out dims are: [dim0 (size 8), dim1 (size 16)]
```
The indices for the first SIMD lane / work-item (every 16th offset) are printed below:
```
0 : 0, 0
16 : 1, 0
@@ -135,10 +135,10 @@ The indices for the first SIMD lane / work-item (every 16th offset) are printed
112 : 7, 0
```
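
The layout above is linear: every power-of-two `offset` has a basis vector, and an arbitrary offset maps to the combination (XOR, which here coincides with addition) of the bases of its set bits. The following minimal sketch reproduces both listings; the bases elided above are assumed to continue the pattern (`offset=8 -> (0, 8)`, `offset=16 -> (1, 0)`, `offset=32 -> (2, 0)`), consistent with the first-lane indices:

```python
# Bases of the 8x16 DPAS-tile layout: bit i of `offset` contributes BASES[i]
# to the output coordinate (combined with XOR).
BASES = [(0, 1), (0, 2), (0, 4), (0, 8),   # low 4 bits select the column
         (1, 0), (2, 0), (4, 0)]           # high 3 bits select the row

def apply_layout(offset):
    row, col = 0, 0
    for bit, (r, c) in enumerate(BASES):
        if offset & (1 << bit):
            row ^= r
            col ^= c
    return row, col

assert apply_layout(1) == (0, 1) and apply_layout(64) == (4, 0)
# Every 16th offset walks down the first column, matching the listing above:
# 0 -> (0, 0), 16 -> (1, 0), ..., 112 -> (7, 0).
for offset in range(0, 128, 16):
    print(offset, ":", apply_layout(offset))
```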

During lowering of the `tt.dot` operation to DPAS, multiple DPAS instructions will be generated according to the TTGIR DPAS layout. We need to enlarge the 2D block load tile we created above so that it loads as much data as possible for the DPAS instructions in the subgroup. We do this by adding additional parameters, starting with `iteration`. Each `iteration` corresponds to a DPAS instruction, and each DPAS instruction operates on a DPAS tile. Specifically, for each iteration we will generate a shuffle vector per work-item which outputs the registers in the correct order for DPAS. The number of iterations is determined primarily by the repetitions attribute of the DPAS layout, subject to hardware restrictions. For the GEMM kernel `A` matrix we have `4` iterations across the outer dimension and `2` iterations across the inner dimension. After adding iterations our load has increased in size from the DPAS tile size (`8x16`) to `32x32`.

```
Block load tile layout after adding iterations:
- offset=1 -> (0, 1)
offset=2 -> (0, 2)
offset=4 -> (0, 4)
@@ -152,7 +152,7 @@ Block load tile layout after adding iterations:
where out dims are: [dim0 (size 32), dim1 (size 32)]
```

The DPAS layout is replicated first in the outer dimension, then in the inner dimension. Referring to the DPAS layout we can see that the first contiguous block processed by DPAS starts at `(0,0)` and ends at `(31,15)`. The block load layout behaves similarly: the third iteration ends at `(31,15)` and the fourth starts at `(0,16)`.

```
0, 0 : 0, 0
@@ -176,7 +176,7 @@ The DPAS layout is replicated first in the outer dimension, then in the inner di
In some cases we may need multiple block loads, e.g. if the hardware restrictions on load size are exceeded or if we have non-contiguous replications in the layout. For the `A` matrix there is only one load:

```
Block load tile layout after adding loads:
- offset=1 -> (0, 1)
offset=2 -> (0, 2)
offset=4 -> (0, 4)
@@ -191,7 +191,7 @@ Block load tile layout after adding loads:
where out dims are: [dim0 (size 32), dim1 (size 32)]
```

We now have a linear layout which maps DPAS instructions to locations in the 2D load output. We can use this layout to determine the global offset for each load, to compute the local load offsets for each DPAS instruction, and to compute the shuffle vectors which translate the output of the load into the required layout for DPAS.
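
As a concrete model of that mapping, the layout can be evaluated with one input per parameter. The `iteration` bases below are assumptions inferred from the listings above (only a few bases are printed in this document), and the final comment describes the shuffle-vector computation only conceptually:

```python
# (iteration, offset) -> (row, col) inside the 32x32 A-matrix block load,
# where each iteration is one DPAS instruction and bases are XOR-combined.
OFFSET_BASES = [(0, 1), (0, 2), (0, 4), (0, 8), (1, 0), (2, 0), (4, 0)]
ITER_BASES = [(8, 0), (16, 0), (0, 16)]   # 4 outer repetitions, then 2 inner

def xor_combine(bases, value):
    row = col = 0
    for bit, (r, c) in enumerate(bases):
        if value & (1 << bit):
            row ^= r
            col ^= c
    return row, col

def a_load_layout(iteration, offset):
    ir, ic = xor_combine(ITER_BASES, iteration)
    orow, ocol = xor_combine(OFFSET_BASES, offset)
    return ir ^ orow, ic ^ ocol

# Origin of each DPAS tile inside the load: iteration 3 ends at (31, 15) and
# iteration 4 starts at (0, 16), matching the replication order described above.
for it in range(8):
    print(it, "->", a_load_layout(it, 0))

# Conceptually, the shuffle vector for (iteration, lane) gathers the loaded
# elements whose coordinates are a_load_layout(iteration, reg * 16 + lane)
# for reg = 0..7, i.e. the registers fed to that DPAS instruction.
```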

### `B` Matrix

@@ -351,7 +351,7 @@ The DPAS tile layout for `B` is also different from `A`. The `B` tile is expecte
As before, we start with the block load layout corresponding to a single `16x16` DPAS tile:

```
Block load tile layout:
- offset=1 -> (0, 1)
offset=2 -> (0, 2)
offset=4 -> (0, 4)
@@ -384,10 +384,10 @@ Note that the load layout does not encode the vnni transform. The 16th element (
240 : 15, 0
```

We compute iterations as before. This layout has two iterations in both the outer and inner dimensions:

```
Block load tile layout after adding iterations:
- offset=1 -> (0, 1)
offset=2 -> (0, 2)
offset=4 -> (0, 4)
@@ -403,7 +403,7 @@ where out dims are: [dim0 (size 32), dim1 (size 32)]

Finally, as suspected when examining the DPAS layout, there are two contiguous blocks, so we will need multiple loads. In this case hardware limitations are not a factor, so we generate two loads:
```
Block load tile layout after adding loads:
- offset=1 -> (0, 1)
offset=2 -> (0, 2)
offset=4 -> (0, 4)
@@ -451,18 +451,18 @@ where out dims are: [dim0 (size 32), dim1 (size 64)]
1, 3, 255 : 31, 63
```

We want our linear layout functions to be `surjective`; that is, every output value should be covered by some input value. This allows us to invert and compose the layouts and makes modifying the layout using intermediate layouts easier. Therefore we omit the replication stride from the second load, i.e. the second load's indices start immediately after the first load's. We keep track of the stride and apply it after evaluating the layout function when we want a global coordinate.
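
Surjectivity is easy to check by enumeration. A small sketch (the helper name and the simple row-major example layout are illustrative, not backend code):

```python
from itertools import product

def is_surjective(layout_fn, in_sizes, out_shape):
    # Assumes layout_fn maps every input tuple into the out_shape grid.
    hit = {layout_fn(*ins) for ins in product(*(range(s) for s in in_sizes))}
    return len(hit) == out_shape[0] * out_shape[1]

# Example: the 8x16 DPAS-tile layout from the A matrix section (row-major offsets).
tile = lambda offset: (offset // 16, offset % 16)
assert is_surjective(tile, (128,), (8, 16))

# Keeping the second load's indices contiguous with the first (rather than
# baking the replication stride into the layout) keeps the combined
# (load, iteration, offset) -> (row, col) mapping surjective over its declared
# output dims, so it remains invertible and composable; the stride is applied
# afterwards whenever a global coordinate is required.
```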

## `AxBT`

Now we will consider the `AxBT` GEMM kernel. We keep the `A` matrix size fixed and transpose the `B` matrix. The `B` matrix input becomes `[4096, 5120]` and the matrix is transposed during execution of the kernel. Because the DPAS instruction does not have a transpose variant, we use the same layouts for the `A` and `B` matrices as before. We rely on the 2D block load to compute the transpose and on the shuffle vectors to put the transposed data into the right registers for DPAS. In the future we may modify the pass pipeline to adjust the overall layout for a dot operation with a transposed `A` or `B` matrix, but for this example it is acceptable to leave the layouts fixed so we can examine the loads.

The `A` matrix is not transposed and its DPAS layout is the same as in the non-transposed case; therefore the loads are the same.

For the `BT` matrix, the transpose is computed during the load. This requires a different layout for the load. As in the non-transposed case, we start from the DPAS tile size. However, the transpose only supports 32-bit matrix elements. Our bf16 elements are 16 bits, so we account for this in the load by reducing the inner dimension by a factor of two:

```
Block load tile layout:
- offset=1 -> (0, 1)
offset=2 -> (0, 2)
offset=4 -> (0, 4)
@@ -473,9 +473,9 @@ Block load tile layout:
where out dims are: [dim0 (size 16), dim1 (size 8)]
```
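
The factor-of-two bookkeeping can be modeled in a few lines. This is only a model of the element-size accounting (using `uint16`/`uint32` stand-ins for bf16/d32 and assuming little-endian packing), not the hardware load path:

```python
import numpy as np

# A 16x16 bf16 DPAS B tile, modeled with uint16 stand-ins for the bf16 bits.
tile_bf16 = np.arange(16 * 16, dtype=np.uint16).reshape(16, 16)

# The transposing 2D block load operates on 32-bit elements, so each pair of
# adjacent 16-bit values along the inner dimension is treated as one d32
# element: the 16x16 bf16 tile is described to the load as a 16x8 d32 tile.
tile_d32 = tile_bf16.reshape(16, 8, 2).view(np.uint32).reshape(16, 8)

# Transposing whole d32 elements keeps each 16-bit pair together, which is
# how the vnni packing for the DPAS B operand falls out automatically.
transposed = tile_d32.T                                   # 8x16 d32
pairs = transposed.copy().view(np.uint16).reshape(8, 16, 2)
assert pairs[3, 5].tolist() == [int(tile_bf16[5, 6]), int(tile_bf16[5, 7])]
print(tile_d32.shape, transposed.shape)                   # (16, 8) (8, 16)
```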

For this load we have two iterations in the outer dimension:
```
Block load tile layout after adding iterations:
- offset=1 -> (0, 1)
offset=2 -> (0, 2)
offset=4 -> (0, 4)
@@ -487,10 +487,10 @@ Block load tile layout after adding iterations:
where out dims are: [dim0 (size 32), dim1 (size 8)]
```

Finally, we compute the loads. Since each transposed block load can move less data than in the non-transposed case, we need more loads.

```
Block load tile layout after adding loads:
- offset=1 -> (0, 1)
offset=2 -> (0, 2)
offset=4 -> (0, 4)
@@ -504,6 +504,6 @@ Block load tile layout after adding loads:
where out dims are: [dim0 (size 64), dim1 (size 16)]
```

The block tile layout after adding loads for `B` (non-transposed) had output size `32, 64`. The transposed layout has output size `64, 16`: essentially the non-transposed layout transposed, with the inner dimension reduced by a factor of two. Because the element size has increased from 16 bits to 32 bits, we are still loading the same amount of data. And because two contiguous values are loaded and then transposed, the vnni transform is computed automatically. So, we emit two additional loads to load the same amount of data with the same layout.

Note that unlike the non-transposed case, the transposed layout does implicitly encode the vnni transform. This is because we need to handle the vnni transform when computing the shuffle vectors that convert the transposed, loaded data into the register format that DPAS expects.
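
As a quick sanity check on the sizes above (simple arithmetic over the numbers already stated, nothing more):

```python
# Non-transposed B: the final layout covers a 32x64 tile of 16-bit elements.
# Transposed BT: the final layout covers a 64x16 tile of 32-bit elements.
bytes_b = 32 * 64 * 2     # 4096 bytes in total
bytes_bt = 64 * 16 * 4    # 4096 bytes in total
assert bytes_b == bytes_bt

rows, cols = 32, 64                   # non-transposed B output dims
assert (cols, rows // 2) == (64, 16)  # transpose, then halve the inner dim
```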
