[SYCL][UR][CUDA] wrong order of the ~ur_device_handle_t_() destructor and a user-app static buffer destructor #17450

Open
ldorau opened this issue Mar 13, 2025 · 3 comments · May be fixed by #17571
Labels
bug Something isn't working cuda CUDA back-end unified-runtime

Comments

ldorau (Contributor) commented Mar 13, 2025

Describe the bug

Wrong order of the ~ur_device_handle_t_() destructor and a user-app static buffer destructor.

The ur_device_handle_t_::~ur_device_handle_t_() destructor of a CUDA device is incorrectly called before the sycl::buffer destructor of a user-app static buffer. As a result, the CUDA device is destroyed before memory allocated from it is freed, and the test segfaults; see this excerpt from the log of the sycl/test-e2e/Regression/static-buffer-dtor.cpp test:

UR ---> ~ur_device_handle_t_() ---> cuDevicePrimaryCtxRelease(0) // <--- destroying the CUDA device
UR <--- ~ur_device_handle_t_() <--- cuDevicePrimaryCtxRelease(0)
   ---> urEventWait
   <--- urEventWait(.numEvents = 1, .phEventWaitList = 0x7fffee576128 {0x39f39d80}) -> UR_RESULT_SUCCESS;
   ---> urMemRelease // <--- freeing memory from the already destroyed CUDA device
[PID:459547 TID:459547 DEBUG UMF] umfFree: calling umfPoolFree(hPool = 0x7f2df7287268, ptr = 0x7f2dd4200200)
[PID:459547 TID:459547 DEBUG UMF] umfMemoryTrackerRemove: memory region removed: tracker=0x7f2df7295068, pool=0x7f2df7287268, ptr=0x7f2dd4200200, size=256
[PID:459547 TID:459547 DEBUG UMF] cu_memory_provider_free: cu_memory_provider_free(0x7f2dd4200200, 256)
[PID:459547 TID:459547 ERROR UMF] set_context: cuCtxGetCurrent() failed (cu_result = 4)
[PID:459547 TID:459547 ERROR UMF] cu2umf_result: CUDA driver has been deinitialized
[PID:459547 TID:459547 ERROR UMF] cu_memory_provider_free: Failed to set CUDA context, ret = 7
[PID:459547 TID:459547 ERROR UMF] trackingFree: upstream provider failed to free the memory
[PID:459547 TID:459547 DEBUG UMF] umfMemoryTrackerAdd: memory region is added, tracker=0x7f2df7295068, pool=0x7f2df7287268, ptr=0x7f2dd4200200, size=256
   <--- urMemRelease(.hMem = 0x39678070) -> UR_RESULT_SUCCESS;

See: #17411 (comment)
Ref: #17411

To reproduce

  1. Include a code snippet that is as short as possible: the sycl/test-e2e/Regression/static-buffer-dtor.cpp SYCL test.

  2. Specify the command which should be used to compile the program:

$ build/bin/clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda sycl/test-e2e/Regression/static-buffer-dtor.cpp -Xarch_device -fsanitize=address -DMALLOC_DEVICE -O0 -g -o static_buffer_dtor

  3. Specify the command which should be used to launch the program:

$ ONEAPI_DEVICE_SELECTOR="cuda:gpu" SYCL_UR_TRACE=2 UMF_LOG="level:debug;flush:debug;output:stderr;pid:yes" ./static_buffer_dtor

  4. Indicate what is wrong and what was expected:

The sycl/test-e2e/Regression/static-buffer-dtor.cpp test segfaults on PR #17468.

See the reproduction in CI: https://github.com/intel/llvm/actions/runs/13855853921/job/38773587493?pr=17468

Environment

$ sycl-ls --verbose
Warning: ONEAPI_DEVICE_SELECTOR environment variable is set to cuda:gpu.
To see the correct device id, please unset ONEAPI_DEVICE_SELECTOR.

[ext_oneapi_cuda:gpu:0] NVIDIA CUDA BACKEND, NVIDIA GeForce RTX 3060 8.6 [CUDA 12.6]

Platforms: 1
Platform [#1]:
    Version  : CUDA 12.6
    Name     : NVIDIA CUDA BACKEND
    Vendor   : NVIDIA Corporation
    Devices  : 1
        Device [#0]:
        Type       : gpu
        Version    : 8.6
        Name       : NVIDIA GeForce RTX 3060
        Vendor     : NVIDIA Corporation
        Driver     : CUDA 12.6
        Aspects    : gpu fp16 fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations usm_system_allocations ext_intel_pci_address usm_atomic_host_allocations usm_atomic_shared_allocations atomic64 ext_intel_device_info_uuid ext_oneapi_native_assert ext_oneapi_bfloat16_math_functions ext_intel_free_memory ext_intel_device_id ext_intel_memory_clock_rate ext_intel_memory_bus_width ext_oneapi_bindless_images ext_oneapi_bindless_images_shared_usm ext_oneapi_bindless_images_2d_usm ext_oneapi_interop_memory_import ext_oneapi_interop_semaphore_import ext_oneapi_mipmap ext_oneapi_mipmap_anisotropy ext_oneapi_mipmap_level_reference ext_oneapi_non_uniform_groups
ur_print: Images are not fully supported by the CUDA BE, their support is disabled by default. Their partial support can be activated by setting SYCL_PI_CUDA_ENABLE_IMAGE_SUPPORT environment variable at runtime.
        info::device::sub_group_sizes: 32
default_selector()      : gpu, NVIDIA CUDA BACKEND, NVIDIA GeForce RTX 3060 8.6 [CUDA 12.6]
accelerator_selector()  : No device of requested type available. Please chec...
cpu_selector()          : No device of requested type available. Please chec...
gpu_selector()          : gpu, NVIDIA CUDA BACKEND, NVIDIA GeForce RTX 3060 8.6 [CUDA 12.6]
custom_selector(gpu)    : gpu, NVIDIA CUDA BACKEND, NVIDIA GeForce RTX 3060 8.6 [CUDA 12.6]
custom_selector(cpu)    : No device of requested type available. Please chec...
custom_selector(acc)    : No device of requested type available. Please chec...

Additional context

This bug can happen if SYCL calls urAdapterRelease() before the user-app static sycl::buffer is destroyed.
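For context, here is a minimal sketch of the pattern that triggers the ordering problem (this is not the actual sycl/test-e2e/Regression/static-buffer-dtor.cpp source, just an illustrative stand-in): a function-local static sycl::buffer whose destructor, and therefore the urMemRelease it triggers, runs during static destruction after main() returns.

```cpp
#include <sycl/sycl.hpp>

void run() {
  // Function-local static: constructed on first call, destroyed during
  // static destruction after main() returns.
  static sycl::buffer<int, 1> buf{sycl::range<1>{64}};

  sycl::queue q{sycl::gpu_selector_v};
  q.submit([&](sycl::handler &cgh) {
    sycl::accessor acc{buf, cgh, sycl::write_only};
    cgh.parallel_for(sycl::range<1>{64}, [=](sycl::id<1> i) { acc[i] = 42; });
  });
  q.wait();
}

int main() {
  run();
  return 0;
}
// ~buffer() for `buf` runs during static destruction. If the CUDA adapter's own
// globals (including the ur_device_handle_t_) are torn down first, the buffer's
// urMemRelease hits an already-deinitialized CUDA driver, as in the log above.
```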

ldorau (Contributor, Author) commented Mar 14, 2025

See: #17411 (comment)

ldorau (Contributor, Author) commented Mar 14, 2025

ldorau (Contributor, Author) commented Mar 14, 2025

DPC++ version: the build from PR #17468.

aarongreig added a commit to aarongreig/intel-llvm that referenced this issue Mar 21, 2025

In the CUDA adapter the adapter struct itself is currently an extern
global defined in adapter.cpp. This means fully tearing down the adapter
is subject to the same destructor ordering as all other static and
global variables: first constructed, last destroyed. This presents a
problem because an application can declare a static SYCL object such as
a buffer right at the top, before doing anything else, which results in
the SYCL object being destroyed after the CUDA adapter struct.

The UR spec doesn't put the onus on users to keep their parent object
lifetimes in order, i.e. there is no statement about "the context you
use to create a ur_mem_handle_t must not be released until after the
mem_handle". It's assumed (by omission rather than explicitly) that
adapters will have their objects keep a reference to any parent objects
alive for the duration of their own lifetime.

This change moves ownership of the CUDA adapter struct into a global
shared_ptr, which allows child objects of the adapter to keep their own
references to it alive past the point where its initial definition goes
out of scope. It also adjusts how some other objects track parent-object
references so that the destructors correctly cascade back to the top:
the mem handle releases its context, which releases its adapter, which
releases the platform and devices, and so on.

Fixes intel#17450
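
Below is a minimal sketch of the ownership pattern the commit describes, using hypothetical, simplified stand-ins for the adapter's handle types (the member names and structure are not the actual adapter code from the PR): child handles hold a shared_ptr to their parent, so the adapter outlives every object created from it regardless of static destruction order.

```cpp
#include <memory>

struct ur_adapter_handle_t_ {
  ~ur_adapter_handle_t_() { /* final teardown, e.g. releasing the primary context */ }
};

// Global owner: destroying this shared_ptr during static teardown only frees
// the adapter if no child object still holds a reference to it.
std::shared_ptr<ur_adapter_handle_t_> GlobalAdapter =
    std::make_shared<ur_adapter_handle_t_>();

struct ur_context_handle_t_ {
  std::shared_ptr<ur_adapter_handle_t_> Adapter; // keeps the adapter alive
};

struct ur_mem_handle_t_ {
  std::shared_ptr<ur_context_handle_t_> Context; // keeps the context (and adapter) alive
};

// A user-app static sycl::buffer maps onto a ur_mem_handle_t_ whose destructor
// may run after GlobalAdapter's own destructor; the chained shared_ptrs make
// teardown cascade mem -> context -> adapter, so the adapter is released last.
```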