[Offload] Make olLaunchKernel test thread safe #149497

base: main
```diff
@@ -487,16 +487,10 @@ Error olWaitQueue_impl(ol_queue_handle_t Queue) {
   // Host plugin doesn't have a queue set so it's not safe to call synchronize
   // on it, but we have nothing to synchronize in that situation anyway.
   if (Queue->AsyncInfo->Queue) {
-    if (auto Err = Queue->Device->Device->synchronize(Queue->AsyncInfo))
+    if (auto Err = Queue->Device->Device->synchronize(Queue->AsyncInfo, false))
       return Err;
   }

-  // Recreate the stream resource so the queue can be reused
-  // TODO: Would be easier for the synchronization to (optionally) not release
-  // it to begin with.
-  if (auto Res = Queue->Device->Device->initAsyncInfo(&Queue->AsyncInfo))
-    return Res;
-
   return Error::success();
 }
```

**Review comment:** Please indicate with a comment what's the …

**Review comment:** This code assumes other threads will not release the queue from that async info, right?
```diff
@@ -2227,6 +2227,7 @@ struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
   /// Get the stream of the asynchronous info structure or get a new one.
   Error getStream(AsyncInfoWrapperTy &AsyncInfoWrapper,
                   AMDGPUStreamTy *&Stream) {
+    std::lock_guard<std::mutex> StreamLock{StreamMutex};
     // Get the stream (if any) from the async info.
     Stream = AsyncInfoWrapper.getQueueAs<AMDGPUStreamTy *>();
     if (!Stream) {
```

**Review comment:** Do we only need this when we create a new one?

**Review comment:** Multiple threads can call …

**Review comment:** I'm not sure about this function scope lock. Sure, …

**Review comment:** I have several comments about this function: …
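The "do we only need this when we create a new one?" question points at a standard narrowing of the lock: an atomic fast-path read, with the mutex taken only on the creation path and a re-check under the lock. The sketch below is illustrative only; `StreamHolder` and its `int *` stream are stand-ins, not the plugin's types.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Illustrative stand-in for narrowing getStream's lock to the slow path:
// lock-free read when the stream already exists, mutex only for creation,
// and a re-check under the lock so two racing threads don't both create one.
struct StreamHolder {
  std::mutex Mutex;
  std::atomic<int *> Stream{nullptr};

  int *getStream() {
    // Fast path: stream already exists, no lock needed.
    if (int *S = Stream.load(std::memory_order_acquire))
      return S;
    std::lock_guard<std::mutex> Guard(Mutex);
    // Re-check: another thread may have created it while we waited.
    if (!Stream.load(std::memory_order_relaxed))
      Stream.store(new int(0), std::memory_order_release);
    return Stream.load(std::memory_order_relaxed);
  }
};
```

The trade-off against the function-scope lock in the diff is that this version leaves concurrent users of an already-created stream unserialized, which is exactly what the follow-up comments debate.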
```diff
@@ -2291,7 +2292,8 @@ struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
   }

   /// Synchronize current thread with the pending operations on the async info.
-  Error synchronizeImpl(__tgt_async_info &AsyncInfo) override {
+  Error synchronizeImpl(__tgt_async_info &AsyncInfo,
+                        bool RemoveQueue) override {
     AMDGPUStreamTy *Stream =
         reinterpret_cast<AMDGPUStreamTy *>(AsyncInfo.Queue);
     assert(Stream && "Invalid stream");

@@ -2302,8 +2304,11 @@ struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
     // Once the stream is synchronized, return it to stream pool and reset
     // AsyncInfo. This is to make sure the synchronization only works for its
     // own tasks.
-    AsyncInfo.Queue = nullptr;
-    return AMDGPUStreamManager.returnResource(Stream);
+    if (RemoveQueue) {
+      AsyncInfo.Queue = nullptr;
+      return AMDGPUStreamManager.returnResource(Stream);
+    }
+    return Plugin::success();
   }

   /// Query for the completion of the pending operations on the async info.
```

**Review comment:** Please update the comment as appropriate.

**Review comment:** Why do we now need a conditional for this? It's supposed to consume it.

**Review comment:** Liboffload contains this:

```cpp
Error olWaitQueue_impl(ol_queue_handle_t Queue) {
  // Host plugin doesn't have a queue set so it's not safe to call synchronize
  // on it, but we have nothing to synchronize in that situation anyway.
  if (Queue->AsyncInfo->Queue) {
    if (auto Err = Queue->Device->Device->synchronize(Queue->AsyncInfo, false))
      return Err;
  }

  // Recreate the stream resource so the queue can be reused
  // TODO: Would be easier for the synchronization to (optionally) not release
  // it to begin with.
  if (auto Res = Queue->Device->Device->initAsyncInfo(&Queue->AsyncInfo))
    return Res;

  return Error::success();
}
```

This has to be done atomically so that, for example, we don't try to synchronise an absent queue. I could add a mutex to …

**Review comment:** I thought the whole point of the resource managers we used was to make acquiring / releasing resources cheap. @kevinsala was the one to implement this originally so I'll see if he knows the proper approach here.

**Review comment:** Sure, but we still have the race condition. Tweaking the interface here allows us to have the lock cover a smaller section of code.
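The atomicity concern in the thread above is a check-then-act race: the `Queue->AsyncInfo->Queue` null test and the subsequent synchronize must sit in one critical section, or another thread can release the queue in between. A minimal illustration of the safe shape follows; the types and names (`InfoSketch`, `waitQueue`) are hypothetical, not liboffload's API.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical async-info holder whose Queue another thread may release.
struct InfoSketch {
  std::mutex Mutex;
  int *Queue = nullptr;
};

// The null check and the use of Queue share one critical section, so no
// other thread can release the queue between the two (no TOCTOU window).
bool waitQueue(InfoSketch &Info) {
  std::lock_guard<std::mutex> Guard(Info.Mutex);
  if (!Info.Queue)
    return true; // nothing to synchronize
  // ... the real code would synchronize on the queue here, still locked ...
  return *Info.Queue == 0;
}

void releaseQueue(InfoSketch &Info) {
  std::lock_guard<std::mutex> Guard(Info.Mutex);
  delete Info.Queue;
  Info.Queue = nullptr;
}
```

The `RemoveQueue` parameter in the diff serves the same goal from the other side: by not releasing the queue during synchronization, the caller's lock can cover a smaller region.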
```diff
@@ -3013,6 +3018,9 @@ struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
   /// True is the system is configured with XNACK-Enabled.
   /// False otherwise.
   bool IsXnackEnabled = false;
+
+  /// Mutex to guard getting/setting the stream
+  std::mutex StreamMutex;
 };

 Error AMDGPUDeviceImageTy::loadExecutable(const AMDGPUDeviceTy &Device) {
```
```diff
@@ -104,6 +104,7 @@ struct AsyncInfoWrapperTy {
   /// Register \p Ptr as an associated allocation that is freed after
   /// finalization.
   void freeAllocationAfterSynchronization(void *Ptr) {
+    std::lock_guard<std::mutex> AllocationGuard{AsyncInfoPtr->AllocationsMutex};
     AsyncInfoPtr->AssociatedAllocations.push_back(Ptr);
   }

@@ -772,8 +773,9 @@ struct GenericDeviceTy : public DeviceAllocatorTy {
   /// Synchronize the current thread with the pending operations on the
   /// __tgt_async_info structure.
-  Error synchronize(__tgt_async_info *AsyncInfo);
-  virtual Error synchronizeImpl(__tgt_async_info &AsyncInfo) = 0;
+  Error synchronize(__tgt_async_info *AsyncInfo, bool RemoveQueue = true);
+
+  virtual Error synchronizeImpl(__tgt_async_info &AsyncInfo,
+                                bool RemoveQueue) = 0;

   /// Invokes any global constructors on the device if present and is required
   /// by the target.
```

**Review comment:** Probably …
```diff
@@ -1329,12 +1329,15 @@ Error PinnedAllocationMapTy::unlockUnmappedHostBuffer(void *HstPtr) {
   return eraseEntry(*Entry);
 }

-Error GenericDeviceTy::synchronize(__tgt_async_info *AsyncInfo) {
+Error GenericDeviceTy::synchronize(__tgt_async_info *AsyncInfo,
+                                   bool RemoveQueue) {
+  std::lock_guard<std::mutex> AllocationGuard{AsyncInfo->AllocationsMutex};
+
   if (!AsyncInfo || !AsyncInfo->Queue)
     return Plugin::error(ErrorCode::INVALID_ARGUMENT,
                          "invalid async info queue");

-  if (auto Err = synchronizeImpl(*AsyncInfo))
+  if (auto Err = synchronizeImpl(*AsyncInfo, RemoveQueue))
     return Err;

   for (auto *Ptr : AsyncInfo->AssociatedAllocations)
```

**Review comment:** Please use syntax …

**Review comment:** I understand you need the lock covering the synchronize + delete of allocations to avoid deleting allocations that correspond to other kernel launches not yet synchronized (issued by other threads), right? In other words, to avoid this case: …

**Review comment:** I'm worried about all this code (line 1343 to 1346) being inside the lock. Delete operations may take significant time. What about creating a … Something like this pseudocode:

```cpp
void synchronize(...) {
  SmallVector<void *, 10> Ptrs;
  {
    std::lock_guard<...> AllocationGuard(...);
    synchronizeImpl(AsyncInfo, ...);
    Ptrs = move_elements(AsyncInfo->AssociatedAllocations);
  }
  for (Ptr : Ptrs)
    dataDelete(Ptr, ...);
}
```
```diff
@@ -522,6 +522,7 @@ struct CUDADeviceTy : public GenericDeviceTy {
   /// Get the stream of the asynchronous info structure or get a new one.
   Error getStream(AsyncInfoWrapperTy &AsyncInfoWrapper, CUstream &Stream) {
+    std::lock_guard<std::mutex> StreamLock{StreamMutex};
     // Get the stream (if any) from the async info.
     Stream = AsyncInfoWrapper.getQueueAs<CUstream>();
     if (!Stream) {

@@ -642,17 +643,20 @@ struct CUDADeviceTy : public GenericDeviceTy {
   }

   /// Synchronize current thread with the pending operations on the async info.
-  Error synchronizeImpl(__tgt_async_info &AsyncInfo) override {
+  Error synchronizeImpl(__tgt_async_info &AsyncInfo,
+                        bool RemoveQueue) override {
     CUstream Stream = reinterpret_cast<CUstream>(AsyncInfo.Queue);
     CUresult Res;
     Res = cuStreamSynchronize(Stream);

     // Once the stream is synchronized, return it to stream pool and reset
     // AsyncInfo. This is to make sure the synchronization only works for its
     // own tasks.
-    AsyncInfo.Queue = nullptr;
-    if (auto Err = CUDAStreamManager.returnResource(Stream))
-      return Err;
+    if (RemoveQueue) {
+      AsyncInfo.Queue = nullptr;
+
+      if (auto Err = CUDAStreamManager.returnResource(Stream))
+        return Err;
+    }

     return Plugin::check(Res, "error in cuStreamSynchronize: %s");
   }

@@ -1281,6 +1285,9 @@ struct CUDADeviceTy : public GenericDeviceTy {
   /// The maximum number of warps that can be resident on all the SMs
   /// simultaneously.
   uint32_t HardwareParallelism = 0;
+
+  /// Mutex to guard getting/setting the stream
+  std::mutex StreamMutex;
 };

 Error CUDAKernelTy::launchImpl(GenericDeviceTy &GenericDevice,
```

**Review comment:** When does the queue get unset/released for liboffload queues?
```diff
@@ -104,6 +104,29 @@ TEST_P(olLaunchKernelFooTest, Success) {
   ASSERT_SUCCESS(olMemFree(Mem));
 }

+TEST_P(olLaunchKernelFooTest, SuccessThreaded) {
+  threadify([&](size_t) {
+    void *Mem;
+    ASSERT_SUCCESS(olMemAlloc(Device, OL_ALLOC_TYPE_MANAGED,
+                              LaunchArgs.GroupSize.x * sizeof(uint32_t), &Mem));
+    struct {
+      void *Mem;
+    } Args{Mem};
+
+    ASSERT_SUCCESS(olLaunchKernel(Queue, Device, Kernel, &Args, sizeof(Args),
+                                  &LaunchArgs, nullptr));
+
+    ASSERT_SUCCESS(olWaitQueue(Queue));
+
+    uint32_t *Data = (uint32_t *)Mem;
+    for (uint32_t i = 0; i < 64; i++) {
+      ASSERT_EQ(Data[i], i);
+    }
+
+    ASSERT_SUCCESS(olMemFree(Mem));
+  });
+}
+
 TEST_P(olLaunchKernelNoArgsTest, Success) {
   ASSERT_SUCCESS(
       olLaunchKernel(Queue, Device, Kernel, nullptr, 0, &LaunchArgs, nullptr));
```

**Review comment:** I'd love to be able to add an …
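`threadify` comes from the offload unit-test helpers and its definition isn't shown in this diff. A minimal stand-in with the shape the test relies on (a callback taking the thread index, run concurrently on several threads and joined before returning) might look like this; the thread count is an arbitrary assumption.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Minimal stand-in for a threadify-style helper: run Callback on N threads,
// handing each its index, and join them all before returning. Illustrative
// only; the real helper in the offload test suite may differ.
template <typename Fn>
void threadify(Fn &&Callback, size_t NumThreads = 8) {
  std::vector<std::thread> Threads;
  Threads.reserve(NumThreads);
  for (size_t I = 0; I < NumThreads; ++I)
    Threads.emplace_back([&Callback, I] { Callback(I); });
  for (std::thread &T : Threads)
    T.join();
}
```

Because every thread in the test shares one `Queue`, the helper exercises exactly the concurrent `olLaunchKernel`/`olWaitQueue` paths that the new locks are meant to protect.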
**Review comment:** Would it make sense to construct `__tgt_async_info` with or without mutex logic? I understand this mutex is only required for the liboffload use-case, not for libomptarget. Having the mutex here doesn't seem like a problem, but maybe we could have a constant boolean field indicating if operations with this async info require mutex protection or not.
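The suggestion in the last comment, a constant flag chosen at construction that decides whether operations on the async info take the mutex, could look roughly like the sketch below. `AsyncInfoSketch` and `RequiresLocking` are hypothetical names, not part of `__tgt_async_info`.

```cpp
#include <cassert>
#include <mutex>
#include <optional>

// Hypothetical sketch: a flag fixed at construction decides whether
// operations on this async info pay for the mutex (a threaded liboffload
// caller would pass true, a single-threaded libomptarget caller false).
struct AsyncInfoSketch {
  const bool RequiresLocking;
  std::mutex Mutex;
  int PendingOps = 0;

  explicit AsyncInfoSketch(bool Threaded) : RequiresLocking(Threaded) {}

  void addOp() {
    // Take the lock only when concurrent use is actually possible.
    std::optional<std::lock_guard<std::mutex>> Guard;
    if (RequiresLocking)
      Guard.emplace(Mutex);
    ++PendingOps;
  }
};
```

This keeps the uncontended libomptarget path free of locking overhead while letting liboffload opt in, which is the trade-off the comment raises.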