Description
beanfarmer/beanfarmer_dp4a_noshfl_k.cu
Lines 183 to 189 in 923cc4a
    // I don't understand why this __threadfence_block is needed, but it is.
    // Everything in the kernel should be warp synchronous but for some reason removing
    // this threadfence causes tests to fail. I assume this has something to do with the
    // branching above. I was under the impression that __threadfence_block basically didn't
    // do anything, but it is needed for correctness here and results in a minor performance loss.
    // Substituting in a syncthreads causes significant performance loss.
    __threadfence_block();
The __threadfence_block() call is needed to ensure that the shared memory write has finished before the subsequent read is issued. Without it, the compiler has no reason to assume that there is a dependency between the two, and is free to reorder them. An alternative (possibly preferred) solution is to mark the shared memory as volatile (discussed here: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#volatile-qualifier).
As of CUDA 9, however, the correct approach is to call __syncwarp() between the write and the read that depends on it. This guarantees memory ordering among the participating threads as well as the convergence of the warp, which matters on Volta hardware, where the threads of a warp are no longer guaranteed to execute in lockstep (https://devblogs.nvidia.com/parallelforall/inside-volta/).
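
To illustrate the __syncwarp() pattern in isolation, here is a minimal sketch. It is not the beanfarmer kernel: the kernel name rotate_within_warp, the host helper count_rotate_mismatches, and the single-warp launch are all assumptions made up for this example. Each lane writes one value to shared memory, the warp synchronizes, and each lane then reads the slot written by its neighbour.

```cuda
#include <cuda_runtime.h>

constexpr int kWarpSize = 32;

// Hypothetical kernel: each lane publishes one value to shared memory,
// then reads the value published by the next lane in the same warp.
__global__ void rotate_within_warp(const int* in, int* out)
{
    __shared__ int buf[kWarpSize];
    const int lane = threadIdx.x % kWarpSize;

    buf[lane] = in[lane];                     // write this lane's slot
    __syncwarp();                             // converge the warp; orders the write before the read
    out[lane] = buf[(lane + 1) % kWarpSize];  // read the neighbouring lane's slot
}

// Hypothetical host helper: launches one warp and returns the number of
// output elements that differ from the expected rotation, or -1 on a CUDA error.
int count_rotate_mismatches()
{
    int h_in[kWarpSize], h_out[kWarpSize];
    for (int i = 0; i < kWarpSize; ++i) {
        h_in[i] = 100 + i;
        h_out[i] = -1;
    }

    int* d_in = nullptr;
    int* d_out = nullptr;
    if (cudaMalloc(&d_in, sizeof(h_in)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) {
        cudaFree(d_in);
        return -1;
    }
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    rotate_within_warp<<<1, kWarpSize>>>(d_in, d_out);

    // The blocking copy waits for the kernel and surfaces any launch or execution error.
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < kWarpSize; ++i) {
        if (h_out[i] != h_in[(i + 1) % kWarpSize]) ++mismatches;
    }
    return mismatches;
}
```

Calling count_rotate_mismatches() from main and checking for 0 is enough to try it (compile with nvcc from CUDA 9 or later). In a kernel with more than one warp per block, each warp would need its own region of the shared buffer for a warp-level barrier to be sufficient.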