[Offload] Implement `olShutDown` #144055

RossBrunton · 2025-06-13T10:39:23Z

olShutDown was not properly calling deinit on the platforms, resulting in random segfaults on AMD devices.

As part of this, olInit and olShutDown now alloc and free the platform list and alloc info list rather than it being static. This allows olShutDown to be called within a destructor of a static object (like the tests do) without having to worry about destructor ordering.

This flagged up some memory issues, which have been fixed in #143873 . Please ignore the first three commits of this MR.

llvmbot · 2025-06-13T10:39:58Z

@llvm/pr-subscribers-offload

@llvm/pr-subscribers-backend-amdgpu

Author: Ross Brunton (RossBrunton)

Changes

olShutDown was not properly calling deinit on the platforms, resulting in random segfaults on AMD devices.

The spec has been updated to remove references to olInit/olShutDown doing reference counting. Implementations may wrap liboffload with their own reference counting if required. This matches the behaviour of handle types which also don't use reference counting.

This flagged up some memory issues, which have been fixed in #143873 . Please ignore the first three commits of this MR.

Full diff: https://github.com/llvm/llvm-project/pull/144055.diff

7 Files Affected:

(modified) offload/liboffload/API/Common.td (+3-4)
(modified) offload/liboffload/API/Program.td (+3-1)
(modified) offload/liboffload/src/OffloadImpl.cpp (+18-3)
(modified) offload/plugins-nextgen/amdgpu/src/rtl.cpp (+7-13)
(modified) offload/plugins-nextgen/common/include/PluginInterface.h (+4)
(modified) offload/plugins-nextgen/common/src/PluginInterface.cpp (+41-34)
(modified) offload/plugins-nextgen/cuda/src/rtl.cpp (+13-14)

diff --git a/offload/liboffload/API/Common.td b/offload/liboffload/API/Common.td
index 7674da0438c29..e611dd9105251 100644
--- a/offload/liboffload/API/Common.td
+++ b/offload/liboffload/API/Common.td
@@ -152,8 +152,7 @@ def : Function {
   let name = "olInit";
   let desc = "Perform initialization of the Offload library and plugins";
   let details = [
-    "This must be the first API call made by a user of the Offload library",
-    "Each call will increment an internal reference count that is decremented by `olShutDown`"
+    "This must be the first API call made by a user of the Offload library"
   ];
   let params = [];
   let returns = [];
@@ -163,8 +162,8 @@ def : Function {
   let name = "olShutDown";
   let desc = "Release the resources in use by Offload";
   let details = [
-    "This decrements an internal reference count. When this reaches 0, all resources will be released",
-    "Subsequent API calls made after this are not valid"
+    "All resources owned by the Offload library and plugins will be released",
+    "Subsequent API calls made after calling `olShutDown` are undefined behavior"
   ];
   let params = [];
   let returns = [];
diff --git a/offload/liboffload/API/Program.td b/offload/liboffload/API/Program.td
index 8c88fe6e21e6a..0476fa1f7c27a 100644
--- a/offload/liboffload/API/Program.td
+++ b/offload/liboffload/API/Program.td
@@ -13,7 +13,9 @@
 def : Function {
     let name = "olCreateProgram";
     let desc = "Create a program for the device from the binary image pointed to by `ProgData`.";
-    let details = [];
+    let details = [
+        "The provided `ProgData` will be copied and need not outlive the returned handle",
+    ];
     let params = [
         Param<"ol_device_handle_t", "Device", "handle of the device", PARAM_IN>,
         Param<"const void*", "ProgData", "pointer to the program binary data", PARAM_IN>,
diff --git a/offload/liboffload/src/OffloadImpl.cpp b/offload/liboffload/src/OffloadImpl.cpp
index d2b331905ab77..ae2975e3ec3c5 100644
--- a/offload/liboffload/src/OffloadImpl.cpp
+++ b/offload/liboffload/src/OffloadImpl.cpp
@@ -168,15 +168,27 @@ void initPlugins() {
       !std::getenv("OFFLOAD_DISABLE_VALIDATION");
 }
 
-// TODO: We can properly reference count here and manage the resources in a more
-// clever way
 Error olInit_impl() {
   static std::once_flag InitFlag;
   std::call_once(InitFlag, initPlugins);
 
   return Error::success();
 }
-Error olShutDown_impl() { return Error::success(); }
+
+Error olShutDown_impl() {
+  llvm::Error Result = Error::success();
+
+  for (auto &P : Platforms()) {
+    // Host plugin is nullptr and has no deinit
+    if (!P.Plugin)
+      continue;
+
+    if (auto Res = P.Plugin->deinit())
+      Result = llvm::joinErrors(std::move(Result), std::move(Res));
+  }
+
+  return Result;
+}
 
 Error olGetPlatformInfoImplDetail(ol_platform_handle_t Platform,
                                   ol_platform_info_t PropName, size_t PropSize,
@@ -465,6 +477,9 @@ Error olCreateProgram_impl(ol_device_handle_t Device, const void *ProgData,
 }
 
 Error olDestroyProgram_impl(ol_program_handle_t Program) {
+  if (auto Err = Program->Image->getDevice().unloadBinary(Program->Image))
+    return Err;
+
   return olDestroy(Program);
 }
 
diff --git a/offload/plugins-nextgen/amdgpu/src/rtl.cpp b/offload/plugins-nextgen/amdgpu/src/rtl.cpp
index e4c32713e2c15..5f80a97d02cb6 100644
--- a/offload/plugins-nextgen/amdgpu/src/rtl.cpp
+++ b/offload/plugins-nextgen/amdgpu/src/rtl.cpp
@@ -2023,6 +2023,13 @@ struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
     return Plugin::success();
   }
 
+  Error unloadBinaryImpl(DeviceImageTy *Image) override {
+    AMDGPUDeviceImageTy &AMDImage = static_cast<AMDGPUDeviceImageTy &>(*Image);
+
+    // Unload the executable of the image.
+    return AMDImage.unloadExecutable();
+  }
+
   /// Deinitialize the device and release its resources.
   Error deinitImpl() override {
     // Deinitialize the stream and event pools.
@@ -2035,19 +2042,6 @@ struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
     if (auto Err = AMDGPUSignalManager.deinit())
       return Err;
 
-    // Close modules if necessary.
-    if (!LoadedImages.empty()) {
-      // Each image has its own module.
-      for (DeviceImageTy *Image : LoadedImages) {
-        AMDGPUDeviceImageTy &AMDImage =
-            static_cast<AMDGPUDeviceImageTy &>(*Image);
-
-        // Unload the executable of the image.
-        if (auto Err = AMDImage.unloadExecutable())
-          return Err;
-      }
-    }
-
     // Invalidate agent reference.
     Agent = {0};
 
diff --git a/offload/plugins-nextgen/common/include/PluginInterface.h b/offload/plugins-nextgen/common/include/PluginInterface.h
index d2437908a0a6f..7af61074bb322 100644
--- a/offload/plugins-nextgen/common/include/PluginInterface.h
+++ b/offload/plugins-nextgen/common/include/PluginInterface.h
@@ -712,6 +712,10 @@ struct GenericDeviceTy : public DeviceAllocatorTy {
   virtual Expected<DeviceImageTy *>
   loadBinaryImpl(const __tgt_device_image *TgtImage, int32_t ImageId) = 0;
 
+  /// Unload a previously loaded Image from the device
+  Error unloadBinary(DeviceImageTy *Image);
+  virtual Error unloadBinaryImpl(DeviceImageTy *Image) = 0;
+
   /// Setup the device environment if needed. Notice this setup may not be run
   /// on some plugins. By default, it will be executed, but plugins can change
   /// this behavior by overriding the shouldSetupDeviceEnvironment function.
diff --git a/offload/plugins-nextgen/common/src/PluginInterface.cpp b/offload/plugins-nextgen/common/src/PluginInterface.cpp
index f9a6b3c1f4324..bb503c1ff7a54 100644
--- a/offload/plugins-nextgen/common/src/PluginInterface.cpp
+++ b/offload/plugins-nextgen/common/src/PluginInterface.cpp
@@ -821,26 +821,52 @@ Error GenericDeviceTy::init(GenericPluginTy &Plugin) {
   return Plugin::success();
 }
 
-Error GenericDeviceTy::deinit(GenericPluginTy &Plugin) {
-  for (DeviceImageTy *Image : LoadedImages)
-    if (auto Err = callGlobalDestructors(Plugin, *Image))
-      return Err;
+Error GenericDeviceTy::unloadBinary(DeviceImageTy *Image) {
+  if (auto Err = callGlobalDestructors(Plugin, *Image))
+    return Err;
 
   if (OMPX_DebugKind.get() & uint32_t(DeviceDebugKind::AllocationTracker)) {
     GenericGlobalHandlerTy &GHandler = Plugin.getGlobalHandler();
-    for (auto *Image : LoadedImages) {
-      DeviceMemoryPoolTrackingTy ImageDeviceMemoryPoolTracking = {0, 0, ~0U, 0};
-      GlobalTy TrackerGlobal("__omp_rtl_device_memory_pool_tracker",
-                             sizeof(DeviceMemoryPoolTrackingTy),
-                             &ImageDeviceMemoryPoolTracking);
-      if (auto Err =
-              GHandler.readGlobalFromDevice(*this, *Image, TrackerGlobal)) {
-        consumeError(std::move(Err));
-        continue;
-      }
-      DeviceMemoryPoolTracking.combine(ImageDeviceMemoryPoolTracking);
+    DeviceMemoryPoolTrackingTy ImageDeviceMemoryPoolTracking = {0, 0, ~0U, 0};
+    GlobalTy TrackerGlobal("__omp_rtl_device_memory_pool_tracker",
+                           sizeof(DeviceMemoryPoolTrackingTy),
+                           &ImageDeviceMemoryPoolTracking);
+    if (auto Err =
+            GHandler.readGlobalFromDevice(*this, *Image, TrackerGlobal)) {
+      consumeError(std::move(Err));
     }
+    DeviceMemoryPoolTracking.combine(ImageDeviceMemoryPoolTracking);
+  }
+
+  GenericGlobalHandlerTy &Handler = Plugin.getGlobalHandler();
+  auto ProfOrErr = Handler.readProfilingGlobals(*this, *Image);
+  if (!ProfOrErr)
+    return ProfOrErr.takeError();
+
+  if (!ProfOrErr->empty()) {
+    // Dump out profdata
+    if ((OMPX_DebugKind.get() & uint32_t(DeviceDebugKind::PGODump)) ==
+        uint32_t(DeviceDebugKind::PGODump))
+      ProfOrErr->dump();
+
+    // Write data to profiling file
+    if (auto Err = ProfOrErr->write())
+      return Err;
+  }
+
+  LoadedImages.erase(
+      std::find(LoadedImages.begin(), LoadedImages.end(), Image));
 
+  return unloadBinaryImpl(Image);
+}
+
+Error GenericDeviceTy::deinit(GenericPluginTy &Plugin) {
+  while (!LoadedImages.empty()) {
+    if (auto Err = unloadBinary(LoadedImages.back()))
+      return Err;
+  }
+
+  if (OMPX_DebugKind.get() & uint32_t(DeviceDebugKind::AllocationTracker)) {
     // TODO: Write this by default into a file.
     printf("\n\n|-----------------------\n"
            "| Device memory tracker:\n"
@@ -856,25 +882,6 @@ Error GenericDeviceTy::deinit(GenericPluginTy &Plugin) {
            DeviceMemoryPoolTracking.AllocationMax);
   }
 
-  for (auto *Image : LoadedImages) {
-    GenericGlobalHandlerTy &Handler = Plugin.getGlobalHandler();
-    auto ProfOrErr = Handler.readProfilingGlobals(*this, *Image);
-    if (!ProfOrErr)
-      return ProfOrErr.takeError();
-
-    if (ProfOrErr->empty())
-      continue;
-
-    // Dump out profdata
-    if ((OMPX_DebugKind.get() & uint32_t(DeviceDebugKind::PGODump)) ==
-        uint32_t(DeviceDebugKind::PGODump))
-      ProfOrErr->dump();
-
-    // Write data to profiling file
-    if (auto Err = ProfOrErr->write())
-      return Err;
-  }
-
   // Delete the memory manager before deinitializing the device. Otherwise,
   // we may delete device allocations after the device is deinitialized.
   if (MemoryManager)
diff --git a/offload/plugins-nextgen/cuda/src/rtl.cpp b/offload/plugins-nextgen/cuda/src/rtl.cpp
index 44ccfc47a21c9..eb313d0a4f093 100644
--- a/offload/plugins-nextgen/cuda/src/rtl.cpp
+++ b/offload/plugins-nextgen/cuda/src/rtl.cpp
@@ -358,6 +358,19 @@ struct CUDADeviceTy : public GenericDeviceTy {
     return Plugin::success();
   }
 
+  Error unloadBinaryImpl(DeviceImageTy *Image) override {
+    assert(Context && "Invalid CUDA context");
+
+    // Each image has its own module.
+    CUDADeviceImageTy &CUDAImage = static_cast<CUDADeviceImageTy &>(*Image);
+
+    // Unload the module of the image.
+    if (auto Err = CUDAImage.unloadModule())
+      return Err;
+
+    return Plugin::success();
+  }
+
   /// Deinitialize the device and release its resources.
   Error deinitImpl() override {
     if (Context) {
@@ -372,20 +385,6 @@ struct CUDADeviceTy : public GenericDeviceTy {
     if (auto Err = CUDAEventManager.deinit())
       return Err;
 
-    // Close modules if necessary.
-    if (!LoadedImages.empty()) {
-      assert(Context && "Invalid CUDA context");
-
-      // Each image has its own module.
-      for (DeviceImageTy *Image : LoadedImages) {
-        CUDADeviceImageTy &CUDAImage = static_cast<CUDADeviceImageTy &>(*Image);
-
-        // Unload the module of the image.
-        if (auto Err = CUDAImage.unloadModule())
-          return Err;
-      }
-    }
-
     if (Context) {
       CUresult Res = cuDevicePrimaryCtxRelease(Device);
       if (auto Err =

jhuber6 · 2025-06-13T12:58:25Z

Initialization like this should always be reference counted. Otherwise the first thread to call olShutDown will destroy the context for everyone.

RossBrunton · 2025-06-13T13:07:18Z

@jhuber6 Ah, okay I'll add that then. What about calling olInit after olShutDown? Should that reinitialise everything?

jhuber6 · 2025-06-13T13:41:49Z

@jhuber6 Ah, okay I'll add that then. What about calling olInit after olShutDown? Should that reinitialise everything?

Yes, see https://github.com/llvm/llvm-project/blob/main/offload/libomptarget/OffloadRTL.cpp for how OpenMP does it. It's tough to do this without mutexes because you need to prevent other threads from making progress until you're done initializing stuff.

github-actions · 2025-06-16T09:07:54Z

✅ With the latest revision this PR passed the C/C++ code formatter.

jhuber6 · 2025-06-16T12:47:46Z

offload/liboffload/API/Common.td

-    "All resources owned by the Offload library and plugins will be released",
-    "Subsequent API calls made after calling `olShutDown` are undefined behavior"
+    "This decrements an internal reference count. When this reaches 0, all resources will be released",
+    "Subsequent API calls to methods other than `olInit` made after resources are released are undefined behavior"


Why is this undefined behavior? It just initializes everything again.

These are for functions other than olInit. That is, trying to use the offload API without initialising it first causes UB.

We need a separate error category for that, the other runtimes have errors for using API calls while uninitialized. Definitely not UB. We can't just declare everything we're too lazy to check as undefined behavior.

jhuber6 · 2025-06-16T12:48:30Z

offload/liboffload/src/OffloadImpl.cpp

+static std::atomic_int &GlobalRefCount() {
+  static std::atomic_int Ref{0};
+  return Ref;
 }


What's the benefit of this over a global variable?

Mainly for consistency with the (old version) of Platforms() and allocInfoMap(). In theory it means that Ref isn't initialised until its first use, but in practice initialising an atomic_int probably isn't that expensive.

Does LLVM have a policy for how to handle global state like this?

It's a glorified global variable that now needs a mutex guard, no thanks.

jhuber6 · 2025-06-16T12:49:02Z

offload/liboffload/src/OffloadImpl.cpp

@@ -169,13 +173,16 @@ void initPlugins() {
 }

 Error olInit_impl() {


Where's the mutex?

I thought that the reference counting would be sufficient, but thinking about it more, yes a mutex is required. Thanks, it's been added.

jhuber6 · 2025-06-16T13:15:30Z

offload/liboffload/src/OffloadImpl.cpp

+  // initialization, we need to ensure that other threads are blocked until it
+  // is completed - hence this mutex.
+  static std::mutex Init{};
+  std::lock_guard<std::mutex> Guard{Init};


Forgot to submit the comment, you need to use the same mutex when you uninitialize. The global ref count doesn't even need to be atomic since it's mutex guarded. Refer to what I linked earlier.

Ah right, because we could call init while it is being torn down. I've just copied the logic from libomptarget.

arsenm · 2025-06-16T13:17:15Z

offload/liboffload/src/OffloadImpl.cpp

+  PlatformList = new PlatformVecT();
+  AllocInfoMap = new AllocInfoMapT();


unique_ptr?

Yeah, I'm not really a fan of these weird singleton things at all. This is supposed to be a parallel library but nothing is set up to enable that from what I can see.

unique_ptr doesn't work here because of the destruction order of statics. olShutDown requires the platforms list to be valid (so it can loop through them and deinit them). However if you call olShutDown in the destructor of a static (e.g. https://github.com/llvm/llvm-project/blob/main/offload/unittests/OffloadAPI/common/Environment.cpp#L20 ), then the unique_ptr might get freed before olShutDown is ran.

Yeah that's a common problem with these libraries when you start mixing global initializers. We had to do similar hacks in the plugins. There's nothing stopping you from calling ptr.release instead of delete. Also, this is basically the global shared state as far as I can tell. That should be a single type instead of scattered around and put in random Meyer's singletons.

jhuber6 · 2025-06-16T13:32:20Z

offload/liboffload/API/Program.td

+    let details = [
+        "The provided `ProgData` will be copied and need not outlive the returned handle",
+    ];


Ideally we should be able to return a proper error if someone uses the interface while it was shut down.

jhuber6 · 2025-06-16T13:32:45Z

offload/liboffload/src/OffloadImpl.cpp

+    // Host plugin is nullptr and has no deinit
+    if (!P.Plugin)
+      continue;


We need to handle this more cleanly in the future.

jhuber6 · 2025-06-16T13:33:28Z

offload/liboffload/src/OffloadImpl.cpp

+    if (auto Res = P.Plugin->deinit())
+      Result = llvm::joinErrors(std::move(Result), std::move(Res));


If we have multiple plugins active this will potentially drop a previous error and hit an assertion.

This allows removal of a specific Image from a Device, rather than requiring all image data to outlive the device they were created for. This is required for `ol_program_handle_t`s, which now specify the lifetime of the buffer used to create the program.

`olShutDown` was not properly calling deinit on the platforms, resulting in random segfaults on AMD devices. The spec has been updated to remove references to `olInit`/`olShutDown` doing reference counting. Implementations may wrap liboffload with their own reference counting if required. This matches the behaviour of handle types which also don't use reference counting.

llvmbot added backend:AMDGPU offload labels Jun 13, 2025

RossBrunton marked this pull request as draft June 13, 2025 14:27

RossBrunton marked this pull request as ready for review June 16, 2025 09:05

RossBrunton changed the title ~~[Offload] Implement olShutDown and remove global reference counting~~ [Offload] Implement olShutDown Jun 16, 2025

jhuber6 reviewed Jun 16, 2025

View reviewed changes

arsenm reviewed Jun 16, 2025

View reviewed changes

jhuber6 reviewed Jun 16, 2025

View reviewed changes

RossBrunton added 8 commits June 19, 2025 10:58

Fix cuda

0c30b9e

Review feedback

9afacf3

Bring back reference counting

39dafaf

Clang-format

444fbce

Add mutex guard

cf242de

Better mutexes

7d1ad55

RossBrunton force-pushed the teardown branch from 650e1c0 to 7d1ad55 Compare June 19, 2025 09:58

RossBrunton marked this pull request as draft June 19, 2025 09:59

		@@ -169,13 +173,16 @@ void initPlugins() {
		}

		Error olInit_impl() {

		PlatformList = new PlatformVecT();
		AllocInfoMap = new AllocInfoMapT();

		if (auto Res = P.Plugin->deinit())
		Result = llvm::joinErrors(std::move(Result), std::move(Res));

[Offload] Implement olShutDown #144055

Are you sure you want to change the base?

[Offload] Implement olShutDown #144055

Uh oh!

Conversation

RossBrunton commented Jun 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

llvmbot commented Jun 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jhuber6 commented Jun 13, 2025

Uh oh!

RossBrunton commented Jun 13, 2025

Uh oh!

jhuber6 commented Jun 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Jun 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jhuber6 Jun 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

[Offload] Implement `olShutDown` #144055

[Offload] Implement `olShutDown` #144055

RossBrunton commented Jun 13, 2025 •

edited

Loading

llvmbot commented Jun 13, 2025 •

edited

Loading

jhuber6 commented Jun 13, 2025 •

edited

Loading

github-actions bot commented Jun 16, 2025 •

edited

Loading

jhuber6 Jun 16, 2025 •

edited

Loading