
Conversation

cudawarped
Contributor

Changed the current benchmark code to time the kernels rather than the memory initialization, which was previously dominating the results.
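
As a rough illustration of the benchmarking change described above, here is a minimal sketch in CUDA (the benchmarks in this PR are Mojo, so the kernel, sizes, and harness below are illustrative assumptions, not the PR's code): buffers are allocated and initialized once outside the timed region, a warm-up launch runs first, and only the kernel launches sit between the timing events.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in workload; the real kernels being benchmarked are the Mojo
// traditional / simple_warp / functional_warp reductions.
__global__ void workload(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

int main() {
    const int n = 1 << 20, iters = 100;
    float *d_in, *d_out;

    // Setup (allocation + initialization) happens once, outside the timed region.
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));

    // Warm-up launch so first-launch overhead is not measured.
    workload<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Only the kernel launches are timed.
    cudaEventRecord(start);
    for (int it = 0; it < iters; ++it)
        workload<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    std::printf("mean kernel time: %f ms over %d iters\n", ms / iters, iters);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```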

New results on RTX 3070 are:

| name | met (ms) | iters |
| --- | --- | --- |
| traditional_1x | 0.00482591 | 100 |
| simple_warp_1x | 0.00547583 | 100 |
| functional_warp_1x | 0.00544192 | 100 |
| traditional_4x | 0.01407168 | 100 |
| simple_warp_4x | 0.00489632 | 100 |
| functional_warp_4x | 0.01299008 | 100 |
| traditional_32x | 0.00483520 | 100 |
| simple_warp_32x | 0.00474848 | 100 |
| functional_warp_32x | 0.00506655 | 100 |
| traditional_256x | 0.00492608 | 100 |
| simple_warp_256x | 0.00488959 | 100 |
| functional_warp_256x | 0.01050047 | 100 |
| traditional_2048x | 0.01013152 | 100 |
| simple_warp_2048x | 0.00876447 | 100 |
| functional_warp_2048x | 0.00682976 | 100 |
| traditional_16384x | 0.04276159 | 100 |
| simple_warp_16384x | 0.03417824 | 100 |
| functional_warp_16384x | 0.01506495 | 100 |
| traditional_65536x | 0.21025600 | 100 |
| simple_warp_65536x | 0.12864159 | 100 |
| functional_warp_65536x | 0.04772992 | 100 |

The results show that using warp intrinsics is beneficial. It's worth noting that:

  1. The results for small grid sizes (1, …, 256 × warp size) on compute capability 8.6 are just noise, as the device is not saturated (most SMs are idle): only 16 SMs are active, with at most 16 warps per SM due to the resident-block limit. This will also be true at 2048 on larger devices (more SMs) with newer compute capabilities (CC 10.x/12.x allow 32 resident blocks per multiprocessor).
  2. The difference between functional_warp and simple_warp is that the functional version always uses 256 threads per block rather than the 32 used by simple_warp. The resulting reduced occupancy is why simple_warp is slower; if you change n_threads to 256 for simple_warp you get the same result (see the sketch below).
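
To make point 2 concrete, here is a minimal CUDA sketch of a warp-intrinsic sum (again an analogue, not the repo's Mojo kernels; the kernel name and launches are assumptions). The only difference between the two launch configurations in the trailing comment is threads per block: 32 (one warp per block, like simple_warp) versus 256 (eight warps per block, like functional_warp), which is what drives the occupancy gap.

```cuda
#include <cuda_runtime.h>

constexpr int WARP_SIZE = 32;

// Sum reduction using warp shuffle intrinsics: each warp reduces its 32 values
// without shared memory, then lane 0 accumulates the warp's partial sum.
__global__ void warp_sum(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;

    for (int offset = WARP_SIZE / 2; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffff, v, offset);

    if (threadIdx.x % WARP_SIZE == 0)
        atomicAdd(out, v);   // lane 0 of each warp holds the warp's sum
}

// Same kernel, two launch configurations:
//   warp_sum<<<n / 32,  32>>>(d_in, d_out, n);   // one warp per block (simple_warp-style)
//   warp_sum<<<n / 256, 256>>>(d_in, d_out, n);  // eight warps per block (functional_warp-style)
```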


Updated the kernels to write to different memory locations to avoid a race condition and allow testing (previously the functional benchmark was only running a single warp).
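
A minimal CUDA sketch of the race-condition fix described in this commit (the PR's kernels are Mojo; the kernel names and helper below are illustrative assumptions): the "before" kernel has every warp write its partial result to the same location, while the "after" kernel gives each warp its own output slot.

```cuda
#include <cuda_runtime.h>

// Each warp reduces its 32 elements with shuffle intrinsics; the result is
// valid in lane 0 of the warp.
__device__ float warp_partial_sum(const float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;
    for (int offset = 16; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;
}

// Before: every warp's lane 0 writes to the same location, which races as soon
// as the grid contains more than one warp, so only a single warp was really
// exercised by the benchmark.
__global__ void warp_sum_racy(const float* in, float* out, int n) {
    float v = warp_partial_sum(in, n);
    if (threadIdx.x % 32 == 0) out[0] = v;          // data race across warps
}

// After: each warp writes to its own slot, so any grid size can be timed
// without warps overwriting each other's results.
__global__ void warp_sum_per_warp(const float* in, float* out, int n) {
    float v = warp_partial_sum(in, n);
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    if (threadIdx.x % 32 == 0) out[warp_id] = v;    // one slot per warp
}
```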
Collaborator

@ehsanmok ehsanmok left a comment


Great catch 🙏 LGTM!

@ehsanmok ehsanmok merged commit 15c26fb into modular:main Oct 7, 2025