[Puzzle 24] Fix benchmark timing #102
Merged
+645
−585
Changed the benchmark code to time only the kernels, not the memory initialization, which was dominating the results.
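For reviewers, here is a minimal host-side sketch of the structure this change aims for: setup runs outside the timed region and only the repeated kernel calls are measured. It is an illustration, not the code in this PR. `bench`, `make_buffer`, and `sum_kernel` are hypothetical names, and a plain Python function stands in for a GPU kernel.

```python
import time


def bench(setup, kernel, iters=10):
    """Return mean seconds per kernel call, excluding setup time."""
    state = setup()   # allocation / initialization: deliberately NOT timed
    kernel(state)     # one warm-up call, also not timed
    start = time.perf_counter()
    for _ in range(iters):
        kernel(state)
    # On a real GPU the device would have to be synchronized here,
    # before the clock is read, because kernel launches are asynchronous.
    elapsed = time.perf_counter() - start
    return elapsed / iters


# Hypothetical stand-ins for "memory initialization" and "kernel".
def make_buffer():
    return [float(i) for i in range(100_000)]


def sum_kernel(buf):
    return sum(buf)


mean_s = bench(make_buffer, sum_kernel)
print(f"mean time per call: {mean_s:.6f} s")
```

If `setup()` were called inside the timed loop instead, its cost would be folded into every measurement, which is the problem described above.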
New results on RTX 3070 are:
The results show that using warp intrinsics is beneficial. It's worth noting that the difference between `functional_warp` and `simple_warp` arises because the functional version always uses 256 threads per block, not 32 like `simple_warp`. The resulting reduced occupancy for `simple_warp` is why it performs worse. If you change `n_threads` to 256 for `simple_warp`, you get the same result.
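The occupancy argument can be checked with a rough back-of-the-envelope calculation. The sketch below assumes the per-SM limits for compute capability 8.6 (the RTX 3070's architecture): 1536 resident threads and 16 resident blocks per SM. It ignores register and shared-memory limits, so treat it as an upper bound rather than a measured value. The function names are hypothetical.

```python
WARP_SIZE = 32
MAX_THREADS_PER_SM = 1536   # assumed limit for compute capability 8.6
MAX_BLOCKS_PER_SM = 16      # assumed limit for compute capability 8.6
MAX_WARPS_PER_SM = MAX_THREADS_PER_SM // WARP_SIZE


def resident_warps(threads_per_block):
    """Upper bound on warps resident on one SM for a given block size."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division
    blocks_by_threads = MAX_THREADS_PER_SM // (warps_per_block * WARP_SIZE)
    blocks = min(MAX_BLOCKS_PER_SM, blocks_by_threads)
    return blocks * warps_per_block


def occupancy(threads_per_block):
    """Theoretical occupancy: resident warps / maximum warps per SM."""
    return resident_warps(threads_per_block) / MAX_WARPS_PER_SM


for n_threads in (32, 256):
    print(f"{n_threads:>3} threads/block: "
          f"{resident_warps(n_threads)} warps, {occupancy(n_threads):.0%}")
```

Under these assumptions, 32 threads per block is capped by the 16-block limit at 16 of 48 warps (about 33%), while 256 threads per block can fill all 48 warps. That is consistent with `simple_warp` matching `functional_warp` once `n_threads` is raised to 256.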