Performance optimization for geometric kernel #1562

xytintel · 2025-04-09T09:43:21Z

Reproduce case:

import torch
from torch.profiler import profile, ProfilerActivity

shape_list = [(8192, 8192)]
backward = False

if __name__ == "__main__":
    for shape in shape_list:
        for dtype in [torch.bfloat16, torch.float16, torch.float32]:
            # use the loop's dtype so each iteration actually profiles the dtype it reports
            input = torch.randn(shape, dtype=dtype, device=torch.device("xpu"))

            # warm up
            input.geometric_(0.5)

            # go
            print(
                "shape:",
                (shape),
                "; datatype:",
                dtype,
                "; P:",
                0.5,
                "; backward:",
                backward,
            )
            with profile(
                activities=[ProfilerActivity.CPU, ProfilerActivity.XPU],
                record_shapes=True,
            ) as prof:
                for i in range(20):
                    input.geometric_(0.5)
            print(prof.key_averages().table(sort_by="xpu_time_total"))

xytintel · 2025-04-14T08:09:08Z

Original:

# shape: (8192, 8192) ; datatype: torch.bfloat16 ; P: 0.5 ; backward: False
# -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
#                                                    Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg      Self XPU    Self XPU %     XPU total  XPU time avg    # of Calls  
# -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
#                                        aten::geometric_        55.27%     974.043us       100.00%       1.762ms      88.113us      14.921ms       100.00%      14.921ms     746.064us            20  
# at::native::xpu::DistributionElementwiseKernelFuncto...         0.00%       0.000us         0.00%       0.000us       0.000us      14.921ms       100.00%      14.921ms     746.064us            20  
#                                   urEnqueueKernelLaunch        44.73%     788.208us        44.73%     788.208us      39.410us       0.000us         0.00%       0.000us       0.000us            20  
# -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
# Self CPU time total: 1.762ms
# Self XPU time total: 14.921ms

# shape: (8192, 8192) ; datatype: torch.float16 ; P: 0.5 ; backward: False
# -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
#                                                    Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg      Self XPU    Self XPU %     XPU total  XPU time avg    # of Calls  
# -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
#                                        aten::geometric_        64.93%     793.297us       100.00%       1.222ms      61.084us      14.586ms       100.00%      14.586ms     729.312us            20  
# at::native::xpu::DistributionElementwiseKernelFuncto...         0.00%       0.000us         0.00%       0.000us       0.000us      14.586ms       100.00%      14.586ms     729.312us            20  
#                                   urEnqueueKernelLaunch        35.07%     428.392us        35.07%     428.392us      21.420us       0.000us         0.00%       0.000us       0.000us            20  
# -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
# Self CPU time total: 1.222ms
# Self XPU time total: 14.586ms

# shape: (8192, 8192) ; datatype: torch.float32 ; P: 0.5 ; backward: False
# -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
#                                                    Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg      Self XPU    Self XPU %     XPU total  XPU time avg    # of Calls  
# -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
#                                        aten::geometric_        64.42%     815.162us       100.00%       1.265ms      63.270us      15.043ms       100.00%      15.043ms     752.160us            20  
# at::native::xpu::DistributionElementwiseKernelFuncto...         0.00%       0.000us         0.00%       0.000us       0.000us      15.043ms       100.00%      15.043ms     752.160us            20  
#                                   urEnqueueKernelLaunch        35.58%     450.228us        35.58%     450.228us      22.511us       0.000us         0.00%       0.000us       0.000us            20  
# -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
# Self CPU time total: 1.265ms
# Self XPU time total: 15.043ms

Optimized:

shape: (8192, 8192) ; datatype: torch.bfloat16 ; P: 0.5 ; backward: False
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg      Self XPU    Self XPU %     XPU total  XPU time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       aten::geometric_        57.44%     859.941us       100.00%       1.497ms      74.860us       9.351ms       100.00%       9.351ms     467.544us            20  
at::native::xpu::DistributionElementwiseKernelFuncto...         0.00%       0.000us         0.00%       0.000us       0.000us       9.351ms       100.00%       9.351ms     467.544us            20  
                                  urEnqueueKernelLaunch        42.56%     637.265us        42.56%     637.265us      31.863us       0.000us         0.00%       0.000us       0.000us            20  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.497ms
Self XPU time total: 9.351ms

shape: (8192, 8192) ; datatype: torch.float16 ; P: 0.5 ; backward: False
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg      Self XPU    Self XPU %     XPU total  XPU time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       aten::geometric_        66.84%     761.641us       100.00%       1.139ms      56.972us       8.922ms       100.00%       8.922ms     446.104us            20  
at::native::xpu::DistributionElementwiseKernelFuncto...         0.00%       0.000us         0.00%       0.000us       0.000us       8.922ms       100.00%       8.922ms     446.104us            20  
                                  urEnqueueKernelLaunch        33.16%     377.805us        33.16%     377.805us      18.890us       0.000us         0.00%       0.000us       0.000us            20  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.139ms
Self XPU time total: 8.922ms

shape: (8192, 8192) ; datatype: torch.float32 ; P: 0.5 ; backward: False
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg      Self XPU    Self XPU %     XPU total  XPU time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       aten::geometric_        66.84%     751.307us       100.00%       1.124ms      56.204us       9.132ms       100.00%       9.132ms     456.584us            20  
at::native::xpu::DistributionElementwiseKernelFuncto...         0.00%       0.000us         0.00%       0.000us       0.000us       9.132ms       100.00%       9.132ms     456.584us            20  
                                  urEnqueueKernelLaunch        33.16%     372.776us        33.16%     372.776us      18.639us       0.000us         0.00%       0.000us       0.000us            20  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.124ms
Self XPU time total: 9.132ms
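
To make the comparison easier to read, the following sketch computes the speedup and the per-call time saved from the "Self XPU time total" lines in the tables above. The numbers are copied from those tables; the names `XPU_TOTAL_MS`, `CALLS`, and `summarize` are illustrative and not part of the PR.

```python
# Self XPU time totals in ms, copied from the profiler tables above
# as (original, optimized). Each total covers 20 calls to geometric_.
XPU_TOTAL_MS = {
    "bfloat16": (14.921, 9.351),
    "float16": (14.586, 8.922),
    "float32": (15.043, 9.132),
}
CALLS = 20


def summarize(totals=XPU_TOTAL_MS, calls=CALLS):
    """Return {dtype: (speedup, time saved per call in microseconds)}."""
    summary = {}
    for dtype, (before_ms, after_ms) in totals.items():
        speedup = before_ms / after_ms
        saved_us = (before_ms - after_ms) * 1000.0 / calls
        summary[dtype] = (speedup, saved_us)
    return summary


if __name__ == "__main__":
    for dtype, (speedup, saved_us) in summarize().items():
        print(f"{dtype}: {speedup:.2f}x faster, {saved_us:.1f} us saved per call")
```

On these figures the optimized kernel is roughly 1.6x faster for all three dtypes, and CPU-side launch overhead is largely unchanged.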

Update DistributionTemplates.h

bf8251c

xytintel added the kernel_optimization label Apr 9, 2025

Merge branch 'main' into xyt/geometric_kernel

a6be165

chunhuanMeng approved these changes Apr 14, 2025

xytintel added this pull request to the merge queue Apr 14, 2025

Merged via the queue into main with commit 5dac48a Apr 14, 2025
5 of 7 checks passed

xytintel deleted the xyt/geometric_kernel branch April 14, 2025 08:10
