[Offload] Make olLaunchKernel test thread safe #149497

Status: Open. Wants to merge 3 commits into main.
4 changes: 4 additions & 0 deletions offload/include/Shared/APITypes.h
@@ -21,6 +21,7 @@

 #include <cstddef>
 #include <cstdint>
+#include <mutex>

 extern "C" {

@@ -75,6 +76,9 @@ struct __tgt_async_info {
   /// should be freed after finalization.
   llvm::SmallVector<void *, 2> AssociatedAllocations;

+  /// Mutex to guard access to AssociatedAllocations.
+  std::mutex AllocationsMutex;
Comment (Contributor):
Would it make sense to construct __tgt_async_info with or without mutex logic? I understand this mutex is only required for the liboffload use-case, not for libomptarget. Having the mutex here doesn't seem like a problem, but maybe we could have a constant boolean field indicating if operations with this async info require mutex protection or not.
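A minimal sketch of that suggestion, assuming a flag fixed at construction (std::vector stands in for llvm::SmallVector; all names here are hypothetical, not the real `__tgt_async_info` API):

```cpp
#include <mutex>
#include <vector>

// Hypothetical sketch: a constant flag set at construction decides whether
// allocation bookkeeping takes the mutex (true for liboffload queues,
// false for the single-threaded libomptarget path).
struct AsyncInfoSketch {
  const bool RequiresLocking;
  std::mutex AllocationsMutex;
  std::vector<void *> AssociatedAllocations;

  explicit AsyncInfoSketch(bool Lock) : RequiresLocking(Lock) {}

  void addAllocation(void *Ptr) {
    if (RequiresLocking) {
      std::lock_guard<std::mutex> Guard(AllocationsMutex);
      AssociatedAllocations.push_back(Ptr);
    } else {
      // libomptarget path: skip the locking overhead entirely.
      AssociatedAllocations.push_back(Ptr);
    }
  }
};
```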


   /// The kernel launch environment used to issue a kernel. Stored here to
   /// ensure it is a valid location while the transfer to the device is
   /// happening.
8 changes: 1 addition & 7 deletions offload/liboffload/src/OffloadImpl.cpp
@@ -487,16 +487,10 @@ Error olWaitQueue_impl(ol_queue_handle_t Queue) {
  // Host plugin doesn't have a queue set so it's not safe to call synchronize
  // on it, but we have nothing to synchronize in that situation anyway.
  if (Queue->AsyncInfo->Queue) {
-    if (auto Err = Queue->Device->Device->synchronize(Queue->AsyncInfo))
+    if (auto Err = Queue->Device->Device->synchronize(Queue->AsyncInfo, false))
Comment (Contributor):
Please indicate with a comment what the false is doing.

Comment (Contributor):
This code assumes other threads will not release the queue from that async info, right?

      return Err;
  }

-  // Recreate the stream resource so the queue can be reused
-  // TODO: Would be easier for the synchronization to (optionally) not release
-  // it to begin with.
-  if (auto Res = Queue->Device->Device->initAsyncInfo(&Queue->AsyncInfo))
-    return Res;
-
  return Error::success();
}
4 changes: 2 additions & 2 deletions offload/libomptarget/interface.cpp
@@ -116,7 +116,7 @@ targetData(ident_t *Loc, int64_t DeviceId, int32_t ArgNum, void **ArgsBase,
           TargetDataFuncPtrTy TargetDataFunction, const char *RegionTypeMsg,
           const char *RegionName) {
  assert(PM && "Runtime not initialized");
-  static_assert(std::is_convertible_v<TargetAsyncInfoTy, AsyncInfoTy>,
+  static_assert(std::is_convertible_v<TargetAsyncInfoTy &, AsyncInfoTy &>,
                "TargetAsyncInfoTy must be convertible to AsyncInfoTy.");

  TIMESCOPE_WITH_DETAILS_AND_IDENT("Runtime: Data Copy",
@@ -311,7 +311,7 @@ static inline int targetKernel(ident_t *Loc, int64_t DeviceId, int32_t NumTeams,
                               int32_t ThreadLimit, void *HostPtr,
                               KernelArgsTy *KernelArgs) {
  assert(PM && "Runtime not initialized");
-  static_assert(std::is_convertible_v<TargetAsyncInfoTy, AsyncInfoTy>,
+  static_assert(std::is_convertible_v<TargetAsyncInfoTy &, AsyncInfoTy &>,
                "Target AsyncInfoTy must be convertible to AsyncInfoTy.");
  DP("Entering target region for device %" PRId64 " with entry point " DPxMOD
     "\n",
14 changes: 11 additions & 3 deletions offload/plugins-nextgen/amdgpu/src/rtl.cpp
@@ -2227,6 +2227,7 @@ struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
  /// Get the stream of the asynchronous info structure or get a new one.
  Error getStream(AsyncInfoWrapperTy &AsyncInfoWrapper,
                  AMDGPUStreamTy *&Stream) {
+    std::lock_guard<std::mutex> StreamLock{StreamMutex};
Comment (Contributor):
Do we only need this when we create a new one?

Comment (Contributor Author):
Multiple threads can call getStream, see that the stream doesn't exist and create a new one. This can result in multiple streams being created in error.

Comment (Contributor):
I'm not sure about this function-scope lock. Sure, getStream can be called by multiple threads, but I don't think thread safety should be the responsibility of getStream. I suppose AMDGPUStreamManager.getResource needs to be the one to do it.

Comment (Contributor):
I have several comments about this function:

  • The resource managers should already be thread safe; they acquire a std::mutex when retrieving/releasing resources. E.g., GenericDeviceResourceManagerTy::getResourcesImpl.

  • The scope of this mutex seems too coarse-grained for the objective of this PR. My understanding is that you want to protect the set/unset of the queue in an async info object. But the StreamMutex here is placed in the device object. Thus, you are actually limiting, apparently unnecessarily, the concurrency of threads that process different async infos from the same device. Wouldn't it make more sense to move it to the async info instead?

  • If the previous point is correct, can't you use the same AllocationsMutex instead (after renaming it)?
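A rough sketch of the last two points, with the mutex moved into the async info (the types and `acquireStream` here are illustrative stand-ins, not the real plugin API):

```cpp
#include <mutex>

// Stand-in for __tgt_async_info with the renamed mutex guarding Queue as well.
struct AsyncInfoSketch {
  std::mutex Mutex; // per-async-info, not per-device
  void *Queue = nullptr;
};

static int DummyStream = 0;
// Stand-in for AMDGPUStreamManager.getResource (itself already thread safe).
static void *acquireStream() { return &DummyStream; }

// Locking per async info: threads working on different async infos from the
// same device no longer serialize against each other, but two threads racing
// on the same async info still create only one stream.
void *getStream(AsyncInfoSketch &AI) {
  std::lock_guard<std::mutex> Guard(AI.Mutex);
  if (!AI.Queue)
    AI.Queue = acquireStream();
  return AI.Queue;
}
```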

    // Get the stream (if any) from the async info.
    Stream = AsyncInfoWrapper.getQueueAs<AMDGPUStreamTy *>();
    if (!Stream) {
@@ -2291,7 +2292,8 @@ struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
  }

  /// Synchronize current thread with the pending operations on the async info.
-  Error synchronizeImpl(__tgt_async_info &AsyncInfo) override {
+  Error synchronizeImpl(__tgt_async_info &AsyncInfo,
+                        bool RemoveQueue) override {
    AMDGPUStreamTy *Stream =
        reinterpret_cast<AMDGPUStreamTy *>(AsyncInfo.Queue);
    assert(Stream && "Invalid stream");
@@ -2302,8 +2304,11 @@ struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
    // Once the stream is synchronized, return it to stream pool and reset
    // AsyncInfo. This is to make sure the synchronization only works for its
    // own tasks.
Comment (Contributor):
Please update the comment as appropriate.

-    AsyncInfo.Queue = nullptr;
-    return AMDGPUStreamManager.returnResource(Stream);
+    if (RemoveQueue) {
Comment (Contributor):
Why do we now need a conditional for this? It's supposed to consume it.

Comment (Contributor Author):
Liboffload contains this:

Error olWaitQueue_impl(ol_queue_handle_t Queue) {
  // Host plugin doesn't have a queue set so it's not safe to call synchronize
  // on it, but we have nothing to synchronize in that situation anyway.
  if (Queue->AsyncInfo->Queue) {
    if (auto Err = Queue->Device->Device->synchronize(Queue->AsyncInfo, false))
      return Err;
  }


  // Recreate the stream resource so the queue can be reused
  // TODO: Would be easier for the synchronization to (optionally) not release
  // it to begin with.
  if (auto Res = Queue->Device->Device->initAsyncInfo(&Queue->AsyncInfo))
    return Res;

  return Error::success();
}

This has to be done atomically so that, for example, we don't try to synchronise an absent queue. I could add a mutex to ol_queue_impl_t, but I figured it'd be better to just implement what the comment says. Specifically, we avoid dropping the AsyncInfo just to immediately recreate it right after.

Comment (Contributor):
I thought the whole point of the resource managers we used was to make acquiring / releasing resources cheap. @kevinsala was the one to implement this originally so I'll see if he knows the proper approach here.

Comment (Contributor Author):
Sure, but we still have the race condition. Tweaking the interface here allows us to have the lock cover a smaller section of code.

+      AsyncInfo.Queue = nullptr;
+      return AMDGPUStreamManager.returnResource(Stream);
+    }
+    return Plugin::success();
  }

  /// Query for the completion of the pending operations on the async info.
@@ -3013,6 +3018,9 @@ struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
  /// True if the system is configured with XNACK-Enabled.
  /// False otherwise.
  bool IsXnackEnabled = false;
+
+  /// Mutex to guard getting/setting the stream.
+  std::mutex StreamMutex;
};

Error AMDGPUDeviceImageTy::loadExecutable(const AMDGPUDeviceTy &Device) {
6 changes: 4 additions & 2 deletions offload/plugins-nextgen/common/include/PluginInterface.h
@@ -104,6 +104,7 @@ struct AsyncInfoWrapperTy {
  /// Register \p Ptr as an associated allocation that is freed after
  /// finalization.
  void freeAllocationAfterSynchronization(void *Ptr) {
+    std::lock_guard<std::mutex> AllocationGuard{AsyncInfoPtr->AllocationsMutex};
    AsyncInfoPtr->AssociatedAllocations.push_back(Ptr);
  }

@@ -772,8 +773,9 @@ struct GenericDeviceTy : public DeviceAllocatorTy {

  /// Synchronize the current thread with the pending operations on the
  /// __tgt_async_info structure.
-  Error synchronize(__tgt_async_info *AsyncInfo);
-  virtual Error synchronizeImpl(__tgt_async_info &AsyncInfo) = 0;
+  Error synchronize(__tgt_async_info *AsyncInfo, bool RemoveQueue = true);
Comment (Contributor):
Probably ReleaseQueue is more descriptive of what it's doing.

+  virtual Error synchronizeImpl(__tgt_async_info &AsyncInfo,
+                                bool RemoveQueue) = 0;

  /// Invokes any global constructors on the device if present and is required
  /// by the target.
7 changes: 5 additions & 2 deletions offload/plugins-nextgen/common/src/PluginInterface.cpp
@@ -1329,12 +1329,15 @@ Error PinnedAllocationMapTy::unlockUnmappedHostBuffer(void *HstPtr) {
  return eraseEntry(*Entry);
}

-Error GenericDeviceTy::synchronize(__tgt_async_info *AsyncInfo) {
+Error GenericDeviceTy::synchronize(__tgt_async_info *AsyncInfo,
+                                   bool RemoveQueue) {
+  std::lock_guard<std::mutex> AllocationGuard{AsyncInfo->AllocationsMutex};
Comment (Contributor):
Please use the syntax std::lock_guard<std::mutex> AllocationGuard(...); (i.e., with parentheses), as in other occurrences in the plugins.

Comment (Contributor):
I understand you need the lock covering the synchronize + delete of allocations to avoid deleting allocations that correspond to other kernel launches not yet synchronized (issued by other threads), right? In other words, to avoid this case:

Thread1
----------------------------------------------------------------- Time ---->
Add allocation                     -->                     Launch kernel

Thread2
----------------------------------------------------------------- Time ---->
                     Synchronize --> Delete allocations


  if (!AsyncInfo || !AsyncInfo->Queue)
    return Plugin::error(ErrorCode::INVALID_ARGUMENT,
                         "invalid async info queue");

-  if (auto Err = synchronizeImpl(*AsyncInfo))
+  if (auto Err = synchronizeImpl(*AsyncInfo, RemoveQueue))
    return Err;

  for (auto *Ptr : AsyncInfo->AssociatedAllocations)
Comment (Contributor):
I'm worried about all this code (line 1343 to 1346) being inside the lock. Delete operations may take significant time. What about creating a llvm::SmallVector Ptrs (with reasonable static size) in the stack, transferring all the allocation pointers from AssociatedAllocations to the temporary vector Ptrs, and then, outside the lock, deleting the allocations that are present in Ptrs?

Something like this pseudocode:

void synchronize(...) {
    SmallVector<void *, 10> Ptrs;
    {
        std::lock_guard<...> AllocationGuard(...);
        synchronizeImpl(AsyncInfo, ...);

        Ptrs = move_elements(AsyncInfo->AssociatedAllocations);
    }

    for (Ptr : Ptrs)
        dataDelete(Ptr, ...);
}
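A self-contained version of that pseudocode (the stubs stand in for synchronizeImpl and dataDelete, and std::vector for llvm::SmallVector):

```cpp
#include <mutex>
#include <utility>
#include <vector>

// Stand-in for __tgt_async_info.
struct AsyncInfoSketch {
  std::mutex AllocationsMutex;
  std::vector<void *> AssociatedAllocations;
};

static int SyncCount = 0, DeleteCount = 0;
static void synchronizeImplStub(AsyncInfoSketch &) { ++SyncCount; }
static void dataDeleteStub(void *) { ++DeleteCount; }

// Hold the lock only for the synchronize + list steal; the potentially slow
// deletions run outside it, so other threads can register new allocations.
void synchronize(AsyncInfoSketch &AI) {
  std::vector<void *> Ptrs;
  {
    std::lock_guard<std::mutex> Guard(AI.AllocationsMutex);
    synchronizeImplStub(AI);
    Ptrs = std::move(AI.AssociatedAllocations);
    AI.AssociatedAllocations.clear(); // leave the moved-from vector empty
  }
  for (void *Ptr : Ptrs)
    dataDeleteStub(Ptr);
}
```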

15 changes: 11 additions & 4 deletions offload/plugins-nextgen/cuda/src/rtl.cpp
@@ -522,6 +522,7 @@ struct CUDADeviceTy : public GenericDeviceTy {

  /// Get the stream of the asynchronous info structure or get a new one.
  Error getStream(AsyncInfoWrapperTy &AsyncInfoWrapper, CUstream &Stream) {
+    std::lock_guard<std::mutex> StreamLock{StreamMutex};
    // Get the stream (if any) from the async info.
    Stream = AsyncInfoWrapper.getQueueAs<CUstream>();
    if (!Stream) {
@@ -642,17 +643,20 @@ struct CUDADeviceTy : public GenericDeviceTy {
  }

  /// Synchronize current thread with the pending operations on the async info.
-  Error synchronizeImpl(__tgt_async_info &AsyncInfo) override {
+  Error synchronizeImpl(__tgt_async_info &AsyncInfo,
+                        bool RemoveQueue) override {
    CUstream Stream = reinterpret_cast<CUstream>(AsyncInfo.Queue);
    CUresult Res;
    Res = cuStreamSynchronize(Stream);

    // Once the stream is synchronized, return it to stream pool and reset
    // AsyncInfo. This is to make sure the synchronization only works for its
    // own tasks.
-    AsyncInfo.Queue = nullptr;
-    if (auto Err = CUDAStreamManager.returnResource(Stream))
-      return Err;
+    if (RemoveQueue) {
+      AsyncInfo.Queue = nullptr;
Comment (Contributor):
When does the queue get unset/released for liboffload queues?

+      if (auto Err = CUDAStreamManager.returnResource(Stream))
+        return Err;
+    }

    return Plugin::check(Res, "error in cuStreamSynchronize: %s");
  }
@@ -1281,6 +1285,9 @@ struct CUDADeviceTy : public GenericDeviceTy {
  /// The maximum number of warps that can be resident on all the SMs
  /// simultaneously.
  uint32_t HardwareParallelism = 0;
+
+  /// Mutex to guard getting/setting the stream.
+  std::mutex StreamMutex;
};

Error CUDAKernelTy::launchImpl(GenericDeviceTy &GenericDevice,
3 changes: 2 additions & 1 deletion offload/plugins-nextgen/host/src/rtl.cpp
@@ -297,7 +297,8 @@ struct GenELF64DeviceTy : public GenericDeviceTy {

  /// All functions are already synchronous. No need to do anything on this
  /// synchronization function.
-  Error synchronizeImpl(__tgt_async_info &AsyncInfo) override {
+  Error synchronizeImpl(__tgt_async_info &AsyncInfo,
+                        bool RemoveQueue) override {
    return Plugin::success();
  }

18 changes: 18 additions & 0 deletions offload/unittests/OffloadAPI/common/Fixtures.hpp
@@ -9,6 +9,7 @@
#include <OffloadAPI.h>
#include <OffloadPrint.hpp>
#include <gtest/gtest.h>
+#include <thread>

#include "Environment.hpp"

@@ -57,6 +58,23 @@ inline std::string SanitizeString(const std::string &Str) {
  return NewStr;
}

+template <typename Fn> inline void threadify(Fn body) {
+  std::vector<std::thread> Threads;
+  for (size_t I = 0; I < 20; I++) {
+    Threads.emplace_back(
+        [&body](size_t I) {
+          std::string ScopeMsg{"Thread #"};
+          ScopeMsg.append(std::to_string(I));
+          SCOPED_TRACE(ScopeMsg);
+          body(I);
+        },
+        I);
+  }
+  for (auto &T : Threads) {
+    T.join();
+  }
+}

struct OffloadTest : ::testing::Test {
ol_device_handle_t Host = TestEnvironment::getHostDevice();
};
23 changes: 23 additions & 0 deletions offload/unittests/OffloadAPI/kernel/olLaunchKernel.cpp
@@ -104,6 +104,29 @@ TEST_P(olLaunchKernelFooTest, Success) {
  ASSERT_SUCCESS(olMemFree(Mem));
}

+TEST_P(olLaunchKernelFooTest, SuccessThreaded) {
Comment (Contributor Author):
I'd love to be able to add an OFFLOAD_TEST_THREADED_P macro so that you'd get threaded and non-threaded tests "for free" without copy-pasting the test body. But I can't think of a good way of actually implementing that with gtest, anyone have any ideas?
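One possible shape, sketched here without gtest so it stands alone: define the body once and emit both variants from a macro. threadify mirrors the helper added in this PR's Fixtures.hpp; the THREADED_TEST macro and everything else is hypothetical (with real gtest, the two wrappers would be TEST_P bodies on the fixture).

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Mirrors the threadify() helper added in this PR's Fixtures.hpp.
template <typename Fn> void threadify(Fn Body) {
  std::vector<std::thread> Threads;
  for (size_t I = 0; I < 20; I++)
    Threads.emplace_back(Body, I);
  for (auto &T : Threads)
    T.join();
}

// Hypothetical macro: declares the body once, then emits a plain variant and
// a 20-thread variant that share it.
#define THREADED_TEST(Name)                                                    \
  static void Name##Body(size_t ThreadId);                                     \
  static void Name() { Name##Body(0); }                                        \
  static void Name##Threaded() { threadify(Name##Body); }                      \
  static void Name##Body(size_t ThreadId)

THREADED_TEST(LaunchSmoke) {
  // ...test body, parameterized by the thread index if it needs it...
  assert(ThreadId < 20);
}
```

Calling LaunchSmoke() runs the body once; LaunchSmokeThreaded() runs it across 20 threads, so both tests come from a single body definition.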

+  threadify([&](size_t) {
+    void *Mem;
+    ASSERT_SUCCESS(olMemAlloc(Device, OL_ALLOC_TYPE_MANAGED,
+                              LaunchArgs.GroupSize.x * sizeof(uint32_t), &Mem));
+    struct {
+      void *Mem;
+    } Args{Mem};
+
+    ASSERT_SUCCESS(olLaunchKernel(Queue, Device, Kernel, &Args, sizeof(Args),
+                                  &LaunchArgs, nullptr));
+
+    ASSERT_SUCCESS(olWaitQueue(Queue));
+
+    uint32_t *Data = (uint32_t *)Mem;
+    for (uint32_t i = 0; i < 64; i++) {
+      ASSERT_EQ(Data[i], i);
+    }
+
+    ASSERT_SUCCESS(olMemFree(Mem));
+  });
+}

TEST_P(olLaunchKernelNoArgsTest, Success) {
ASSERT_SUCCESS(
olLaunchKernel(Queue, Device, Kernel, nullptr, 0, &LaunchArgs, nullptr));