My matrix addition solution had the following code:
index = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
offset = index
I changed it to:
offset = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
This gave me a speedup of 8 ms, which was roughly 10% for that particular challenge.
I am wondering why the timing improved so much. I'd expect the JIT compiler to reduce the first snippet to the second automatically, unless JIT compilation time is being counted when the solutions are timed.
Does anyone know if the compilation time is included?
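One way to check this locally would be to time the first call separately from later calls, since Triton compiles a kernel the first time it is launched. Below is a minimal sketch of that idea, not how the challenge site actually measures solutions. The helper name `time_first_and_warm` and its `sync` parameter are hypothetical; for a GPU kernel, `sync` would presumably be `torch.cuda.synchronize` so that asynchronous launches are fully counted.

```python
import time


def time_first_and_warm(fn, sync=lambda: None, warm_runs=10):
    """Return (first_call_seconds, mean_warm_seconds) for a zero-argument callable.

    `sync` is a hypothetical hook: pass a device synchronization function
    when `fn` launches asynchronous GPU work. It defaults to a no-op.
    """
    # First call: for a JIT-compiled kernel this includes compilation.
    sync()
    start = time.perf_counter()
    fn()
    sync()
    first = time.perf_counter() - start

    # Later calls: compilation is cached, so this approximates pure run time.
    sync()
    start = time.perf_counter()
    for _ in range(warm_runs):
        fn()
    sync()
    warm = (time.perf_counter() - start) / warm_runs

    return first, warm
```

If the gap between the two snippets shows up only in the first-call number and disappears in the warm average, that would suggest compilation time is what changed rather than the generated kernel.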