Define an operation adhering to the following description:
The `triton_intel_gpu.simd_reduce` operation performs a SIMD reduction. Contrary to `tt.reduce`, when performing a warp reduction, the result is non-uniform.
The reduction axis must be chosen such that only a warp reduction is performed, i.e., `sizePerThread[axis]`, `warpsPerCTA[axis]`, and `CTAsPerCGA[axis]` must be 1, and `shape[axis]` and `threadsPerWarp[axis]` must be equal to the sub-group size.
The output type must be compatible with the performed reduction, i.e., derived from the input type by reducing the size per thread along the reduction axis.
Note this is in essence an optimized `N*sub_group_size x sub_group_size -> N*sub_group_size` reduction that involves a transpose, and is a good candidate for generating better code in the optimized reduction pass without going through SLM (shared local memory).