Regarding the need for threadfence in the kernel #1

@benbarsdell

// I don't understand why this __threadfence_block is needed, but it is.
// Everything in the kernel should be warp synchronous but for some reason removing
// this threadfence causes tests to fail. I assume this has something to do with the
// branching above. I was under the impression that __threadfence_block basically didn't
// do anything, but it is needed for correctness here and results in a minor performance loss.
// Substituting in a syncthreads causes significant performance loss.
__threadfence_block();

The __threadfence_block() is needed to ensure that the shared-memory write has completed before the subsequent read is issued. Without it, the compiler sees no dependence between the two accesses and is free to reorder them. An alternative (possibly preferred) solution is to declare the shared memory as volatile (discussed here: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#volatile-qualifier).

As of CUDA 9, however, the correct approach will be to call __syncwarp() between the write and the read, which guarantees the ordering of the memory operations among the threads of the warp as well as the reconvergence of the warp (this matters on Volta hardware, where threads within a warp can be scheduled independently: https://devblogs.nvidia.com/parallelforall/inside-volta/).
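For illustration, here is a minimal sketch of the __syncwarp() pattern, assuming CUDA 9 or later and a launch with exactly one full warp. The kernel and helper names (warp_sum_kernel, warp_sum_demo) are hypothetical and not taken from this project's code; the sketch only shows where the barrier sits relative to the shared-memory write and the read that depends on it.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel (not from this project): a single warp sums 32 ints
// through shared memory. Assumes a launch with exactly 32 threads.
__global__ void warp_sum_kernel(const int* in, int* out) {
    __shared__ int smem[32];
    const unsigned lane = threadIdx.x;

    smem[lane] = in[lane];
    __syncwarp();  // writes above are visible to the whole warp before any read

    for (unsigned offset = 16; offset > 0; offset >>= 1) {
        // Lanes [0, offset) read from [offset, 2*offset) and write to
        // [0, offset), so reads and writes within one round do not overlap.
        if (lane < offset) {
            smem[lane] += smem[lane + offset];
        }
        // Executed by all 32 lanes (outside the branch): orders this round's
        // writes before the next round's reads and reconverges the warp.
        __syncwarp();
    }

    if (lane == 0) {
        *out = smem[0];
    }
}

// Hypothetical host helper: runs the kernel on the values 0..31 and returns
// the result. Error checking is omitted to keep the sketch short.
int warp_sum_demo() {
    int h_in[32];
    for (int i = 0; i < 32; ++i) {
        h_in[i] = i;
    }

    int* d_in = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    warp_sum_kernel<<<1, 32>>>(d_in, d_out);

    int h_out = -1;
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Called from a main() and built with nvcc, warp_sum_demo() should return 496 (the sum 0 + 1 + ... + 31) on a working device. The barrier is deliberately placed outside the if so that every lane reaches it; calling __syncwarp() with the default full mask from only part of the warp would not be valid.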
