Regarding the need for threadfence in the kernel

https://github.com/ewanbarr/beanfarmer/blob/923cc4aac24a7212d098e0174c8674794dd26cb2/beanfarmer_dp4a_noshfl_k.cu#L183-L189

The `__threadfence_block()` is needed to ensure that the shared-memory write has completed before the subsequent read is issued. Without it, the compiler has no reason to assume that there is a dependence between the two, and is free to reorder them or to keep the value in a register. An alternative (possibly preferred) solution is to declare the shared memory as `volatile` (discussed here: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#volatile-qualifier).
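
For illustration, here is a minimal sketch of the `volatile` idiom on a single-warp reduction. It is not taken from beanfarmer: the kernel `warp_sum_volatile` and the host helper `run_warp_sum_volatile` are hypothetical names, and the kernel assumes it is launched with exactly one warp of 32 threads. Note that this is the legacy warp-synchronous pattern: `volatile` only stops the compiler from caching or reordering the shared-memory accesses, it does not by itself guarantee that the warp executes in lockstep.

```cuda
#include <cuda_runtime.h>

// Hypothetical example kernel: sums 32 ints with one warp.
// Assumes blockDim.x == 32 and implicit warp-synchronous execution.
__global__ void warp_sum_volatile(const int* in, int* out)
{
    // volatile forces every access to go to shared memory, in program order.
    __shared__ volatile int buf[32];
    const int lane = threadIdx.x;

    buf[lane] = in[lane];

    // Tree reduction: lanes [0, offset) read from lanes [offset, 2*offset).
    for (int offset = 16; offset > 0; offset >>= 1) {
        if (lane < offset) {
            buf[lane] = buf[lane] + buf[lane + offset];
        }
    }

    if (lane == 0) {
        *out = buf[0];
    }
}

// Hypothetical host helper: runs the kernel on 1..32 and returns the sum.
int run_warp_sum_volatile()
{
    int h_in[32];
    int h_out = 0;
    for (int i = 0; i < 32; ++i) {
        h_in[i] = i + 1;
    }

    int* d_in = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    warp_sum_volatile<<<1, 32>>>(d_in, d_out);

    // The blocking copy also waits for the kernel to finish.
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Without the `volatile` qualifier, the compiler would be allowed to hold `buf[lane]` in a register across loop iterations, so other lanes could read stale values.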

As of CUDA 9, however, the correct approach will be to call `__syncwarp()` before the dependent memory operations, which guarantees the ordering of memory accesses among the participating threads as well as the convergence of the warp (this matters on Volta hardware, see https://devblogs.nvidia.com/parallelforall/inside-volta/).
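
A minimal sketch of the `__syncwarp()` approach, again on a single-warp reduction rather than the actual beanfarmer kernel. The names `warp_sum_syncwarp` and `run_warp_sum_syncwarp` are hypothetical, and the kernel assumes a launch with exactly 32 threads, CUDA 9 or later.

```cuda
#include <cuda_runtime.h>

// Hypothetical example kernel: sums 32 ints with one warp.
// Assumes blockDim.x == 32.
__global__ void warp_sum_syncwarp(const int* in, int* out)
{
    __shared__ int buf[32];
    const int lane = threadIdx.x;

    buf[lane] = in[lane];
    // Make the initial writes visible to every lane in the warp.
    __syncwarp();

    for (int offset = 16; offset > 0; offset >>= 1) {
        // Lanes [0, offset) write; the values they read from
        // [offset, 2*offset) were written before the previous __syncwarp().
        if (lane < offset) {
            buf[lane] += buf[lane + offset];
        }
        // Called outside the branch so that all 32 lanes take part.
        __syncwarp();
    }

    if (lane == 0) {
        *out = buf[0];
    }
}

// Hypothetical host helper: runs the kernel on 1..32 and returns the sum.
int run_warp_sum_syncwarp()
{
    int h_in[32];
    int h_out = 0;
    for (int i = 0; i < 32; ++i) {
        h_in[i] = i + 1;
    }

    int* d_in = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    warp_sum_syncwarp<<<1, 32>>>(d_in, d_out);

    // The blocking copy also waits for the kernel to finish.
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Because `__syncwarp()` is a warp-level barrier with memory ordering, the shared array needs neither `volatile` nor a separate `__threadfence_block()` here.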

	// I don't understand why this __threadfence_block is needed, but it is.
	// Everything in the kernel should be warp synchronous but for some reason removing
	// this threadfence causes tests to fail. I assume this has something to do with the
	// branching above. I was under the impression that __threadfence_block basically didn't
	// do anything, but it is needed for correctness here and results in a minor performance loss.
	// Substituting in a syncthreads causes significant performance loss.
	__threadfence_block();
