
Conversation

cudawarped
Contributor

Changed the current benchmark code to time the kernels rather than the memory initialization, which was previously dominating the results.
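
As a rough illustration of the benchmarking change described above, here is a minimal sketch in CUDA (the benchmarks in this PR are Mojo, so the kernel, sizes, and harness below are illustrative assumptions, not the PR's code): buffers are allocated and initialized once outside the timed region, a warm-up launch runs first, and only the kernel launches sit between the timing events.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in workload; the real kernels being benchmarked are the Mojo
// traditional / simple_warp / functional_warp reductions.
__global__ void workload(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

int main() {
    const int n = 1 << 20, iters = 100;
    float *d_in, *d_out;

    // Setup (allocation + initialization) happens once, outside the timed region.
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));

    // Warm-up launch so first-launch overhead is not measured.
    workload<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Only the kernel launches are timed.
    cudaEventRecord(start);
    for (int it = 0; it < iters; ++it)
        workload<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    std::printf("mean kernel time: %f ms over %d iters\n", ms / iters, iters);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```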

New results on RTX 3070 are:

| name | met (ms) | iters |
| --- | --- | --- |
| traditional_1x | 0.00482591 | 100 |
| simple_warp_1x | 0.00547583 | 100 |
| functional_warp_1x | 0.00544192 | 100 |
| traditional_4x | 0.01407168 | 100 |
| simple_warp_4x | 0.00489632 | 100 |
| functional_warp_4x | 0.01299008 | 100 |
| traditional_32x | 0.00483520 | 100 |
| simple_warp_32x | 0.00474848 | 100 |
| functional_warp_32x | 0.00506655 | 100 |
| traditional_256x | 0.00492608 | 100 |
| simple_warp_256x | 0.00488959 | 100 |
| functional_warp_256x | 0.01050047 | 100 |
| traditional_2048x | 0.01013152 | 100 |
| simple_warp_2048x | 0.00876447 | 100 |
| functional_warp_2048x | 0.00682976 | 100 |
| traditional_16384x | 0.04276159 | 100 |
| simple_warp_16384x | 0.03417824 | 100 |
| functional_warp_16384x | 0.01506495 | 100 |
| traditional_65536x | 0.21025600 | 100 |
| simple_warp_65536x | 0.12864159 | 100 |
| functional_warp_65536x | 0.04772992 | 100 |

The results show that using warp intrinsics is beneficial. It's worth noting that:

  1. The results for small grid sizes (1, …, 256 × warp size) on compute capability 8.6 are just noise, as the device is not saturated (most SMs are idle): only 16 SMs are active, with at most 16 warps per SM due to the resident-block limit. This will also be true at 2048 on larger devices (more SMs) with newer compute capabilities (CC 10.x/12.x allow 32 resident blocks per multiprocessor).
  2. The difference between functional_warp and simple_warp is that the functional version always uses 256 threads per block rather than the 32 used by simple_warp. The resulting reduced occupancy is why simple_warp is slower; if you change n_threads to 256 for simple_warp you get the same result (see the sketch below).
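
To make point 2 concrete, here is a minimal CUDA sketch of a warp-intrinsic sum (again an analogue, not the repo's Mojo kernels; the kernel name and launches are assumptions). The only difference between the two launch configurations in the trailing comment is threads per block: 32 (one warp per block, like simple_warp) versus 256 (eight warps per block, like functional_warp), which is what drives the occupancy gap.

```cuda
#include <cuda_runtime.h>

constexpr int WARP_SIZE = 32;

// Sum reduction using warp shuffle intrinsics: each warp reduces its 32 values
// without shared memory, then lane 0 accumulates the warp's partial sum.
__global__ void warp_sum(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;

    for (int offset = WARP_SIZE / 2; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffff, v, offset);

    if (threadIdx.x % WARP_SIZE == 0)
        atomicAdd(out, v);   // lane 0 of each warp holds the warp's sum
}

// Same kernel, two launch configurations:
//   warp_sum<<<n / 32,  32>>>(d_in, d_out, n);   // one warp per block (simple_warp-style)
//   warp_sum<<<n / 256, 256>>>(d_in, d_out, n);  // eight warps per block (functional_warp-style)
```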


Updated the kernels to write to different memory locations to avoid a race condition and allow testing (previously the functional benchmark was only running a single warp).
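
A minimal CUDA sketch of the race-condition fix described in this commit (the PR's kernels are Mojo; the kernel names and helper below are illustrative assumptions): the "before" kernel has every warp write its partial result to the same location, while the "after" kernel gives each warp its own output slot.

```cuda
#include <cuda_runtime.h>

// Each warp reduces its 32 elements with shuffle intrinsics; the result is
// valid in lane 0 of the warp.
__device__ float warp_partial_sum(const float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;
    for (int offset = 16; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;
}

// Before: every warp's lane 0 writes to the same location, which races as soon
// as the grid contains more than one warp, so only a single warp was really
// exercised by the benchmark.
__global__ void warp_sum_racy(const float* in, float* out, int n) {
    float v = warp_partial_sum(in, n);
    if (threadIdx.x % 32 == 0) out[0] = v;          // data race across warps
}

// After: each warp writes to its own slot, so any grid size can be timed
// without warps overwriting each other's results.
__global__ void warp_sum_per_warp(const float* in, float* out, int n) {
    float v = warp_partial_sum(in, n);
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    if (threadIdx.x % 32 == 0) out[warp_id] = v;    // one slot per warp
}
```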
Collaborator

@ehsanmok ehsanmok left a comment


Great catch 🙏 LGTM!

@ehsanmok ehsanmok merged commit 15c26fb into modular:main Oct 7, 2025