[SYCL][UR][CUDA] wrong order of the ~ur_device_handle_t_() destructor and a user-app static buffer destructor #17450

Open
ldorau opened this issue Mar 13, 2025 · 3 comments · May be fixed by #17571
Labels
bug Something isn't working cuda CUDA back-end unified-runtime

Comments

ldorau (Contributor) commented Mar 13, 2025

Describe the bug

Wrong order of the ~ur_device_handle_t_() destructor and a user-app static buffer destructor.

The ur_device_handle_t_::~ur_device_handle_t_() destructor of a CUDA device is incorrectly called before the sycl::buffer destructor of a user-app static buffer. As a result, the CUDA device is destroyed before memory allocated from it is freed, and the test segfaults; see this excerpt from the log of the sycl/test-e2e/Regression/static-buffer-dtor.cpp test:

UR ---> ~ur_device_handle_t_() ---> cuDevicePrimaryCtxRelease(0) // <--- destroying the CUDA device
UR <--- ~ur_device_handle_t_() <--- cuDevicePrimaryCtxRelease(0)
   ---> urEventWait
   <--- urEventWait(.numEvents = 1, .phEventWaitList = 0x7fffee576128 {0x39f39d80}) -> UR_RESULT_SUCCESS;
   ---> urMemRelease // <--- freeing memory from the already destroyed CUDA device
[PID:459547 TID:459547 DEBUG UMF] umfFree: calling umfPoolFree(hPool = 0x7f2df7287268, ptr = 0x7f2dd4200200)
[PID:459547 TID:459547 DEBUG UMF] umfMemoryTrackerRemove: memory region removed: tracker=0x7f2df7295068, pool=0x7f2df7287268, ptr=0x7f2dd4200200, size=256
[PID:459547 TID:459547 DEBUG UMF] cu_memory_provider_free: cu_memory_provider_free(0x7f2dd4200200, 256)
[PID:459547 TID:459547 ERROR UMF] set_context: cuCtxGetCurrent() failed (cu_result = 4)
[PID:459547 TID:459547 ERROR UMF] cu2umf_result: CUDA driver has been deinitialized
[PID:459547 TID:459547 ERROR UMF] cu_memory_provider_free: Failed to set CUDA context, ret = 7
[PID:459547 TID:459547 ERROR UMF] trackingFree: upstream provider failed to free the memory
[PID:459547 TID:459547 DEBUG UMF] umfMemoryTrackerAdd: memory region is added, tracker=0x7f2df7295068, pool=0x7f2df7287268, ptr=0x7f2dd4200200, size=256
   <--- urMemRelease(.hMem = 0x39678070) -> UR_RESULT_SUCCESS;

See: #17411 (comment)
Ref: #17411

To reproduce

  1. Include a code snippet that is as short as possible: the sycl/test-e2e/Regression/static-buffer-dtor.cpp SYCL test.

  2. Specify the command which should be used to compile the program:

$ build/bin/clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda sycl/test-e2e/Regression/static-buffer-dtor.cpp -Xarch_device -fsanitize=address -DMALLOC_DEVICE -O0 -g -o static_buffer_dtor

  3. Specify the command which should be used to launch the program:

$ ONEAPI_DEVICE_SELECTOR="cuda:gpu" SYCL_UR_TRACE=2 UMF_LOG="level:debug;flush:debug;output:stderr;pid:yes" ./static_buffer_dtor

  4. Indicate what is wrong and what was expected:

The sycl/test-e2e/Regression/static-buffer-dtor.cpp test segfaults on PR #17468.

See the reproduction in CI: https://github.com/intel/llvm/actions/runs/13855853921/job/38773587493?pr=17468

Environment

$ sycl-ls --verbose
Warning: ONEAPI_DEVICE_SELECTOR environment variable is set to cuda:gpu.
To see the correct device id, please unset ONEAPI_DEVICE_SELECTOR.

[ext_oneapi_cuda:gpu:0] NVIDIA CUDA BACKEND, NVIDIA GeForce RTX 3060 8.6 [CUDA 12.6]

Platforms: 1
Platform [#1]:
    Version  : CUDA 12.6
    Name     : NVIDIA CUDA BACKEND
    Vendor   : NVIDIA Corporation
    Devices  : 1
        Device [#0]:
        Type       : gpu
        Version    : 8.6
        Name       : NVIDIA GeForce RTX 3060
        Vendor     : NVIDIA Corporation
        Driver     : CUDA 12.6
        Aspects    : gpu fp16 fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations usm_system_allocations ext_intel_pci_address usm_atomic_host_allocations usm_atomic_shared_allocations atomic64 ext_intel_device_info_uuid ext_oneapi_native_assert ext_oneapi_bfloat16_math_functions ext_intel_free_memory ext_intel_device_id ext_intel_memory_clock_rate ext_intel_memory_bus_width ext_oneapi_bindless_images ext_oneapi_bindless_images_shared_usm ext_oneapi_bindless_images_2d_usm ext_oneapi_interop_memory_import ext_oneapi_interop_semaphore_import ext_oneapi_mipmap ext_oneapi_mipmap_anisotropy ext_oneapi_mipmap_level_reference ext_oneapi_non_uniform_groups
ur_print: Images are not fully supported by the CUDA BE, their support is disabled by default. Their partial support can be activated by setting SYCL_PI_CUDA_ENABLE_IMAGE_SUPPORT environment variable at runtime.
        info::device::sub_group_sizes: 32
default_selector()      : gpu, NVIDIA CUDA BACKEND, NVIDIA GeForce RTX 3060 8.6 [CUDA 12.6]
accelerator_selector()  : No device of requested type available. Please chec...
cpu_selector()          : No device of requested type available. Please chec...
gpu_selector()          : gpu, NVIDIA CUDA BACKEND, NVIDIA GeForce RTX 3060 8.6 [CUDA 12.6]
custom_selector(gpu)    : gpu, NVIDIA CUDA BACKEND, NVIDIA GeForce RTX 3060 8.6 [CUDA 12.6]
custom_selector(cpu)    : No device of requested type available. Please chec...
custom_selector(acc)    : No device of requested type available. Please chec...

Additional context

This bug can happen if SYCL calls urAdapterRelease() before the user-app static sycl::buffer is destroyed.
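For context, here is a minimal sketch of the pattern that triggers the ordering problem (this is not the actual sycl/test-e2e/Regression/static-buffer-dtor.cpp source, just an illustrative stand-in): a function-local static sycl::buffer whose destructor, and therefore the urMemRelease it triggers, runs during static destruction after main() returns.

```cpp
#include <sycl/sycl.hpp>

void run() {
  // Function-local static: constructed on first call, destroyed during
  // static destruction after main() returns.
  static sycl::buffer<int, 1> buf{sycl::range<1>{64}};

  sycl::queue q{sycl::gpu_selector_v};
  q.submit([&](sycl::handler &cgh) {
    sycl::accessor acc{buf, cgh, sycl::write_only};
    cgh.parallel_for(sycl::range<1>{64}, [=](sycl::id<1> i) { acc[i] = 42; });
  });
  q.wait();
}

int main() {
  run();
  return 0;
}
// ~buffer() for `buf` runs during static destruction. If the CUDA adapter's own
// globals (including the ur_device_handle_t_) are torn down first, the buffer's
// urMemRelease hits an already-deinitialized CUDA driver, as in the log above.
```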

ldorau (Contributor, Author) commented Mar 14, 2025

See: #17411 (comment)

ldorau (Contributor, Author) commented Mar 14, 2025

ldorau (Contributor, Author) commented Mar 14, 2025

DPC++ version: the build from PR #17468.

aarongreig added a commit to aarongreig/intel-llvm that referenced this issue Mar 21, 2025

In the CUDA adapter the adapter struct itself is currently an extern
global defined in adapter.cpp. This means fully tearing down the adapter
is subject to the same destructor ordering as all other static and
global variables: first constructed, last destroyed. This presents a
problem because an application can declare a static SYCL object such as
a buffer right at the top, before doing anything else, which results in
the SYCL object being destroyed after the CUDA adapter struct.

The UR spec doesn't put the onus on users to keep their parent object
lifetimes in order, i.e. there is no statement about "the context you
use to create a ur_mem_handle_t must not be released until after the
mem_handle". It's assumed (by omission rather than explicitly) that
adapters will have their objects keep a reference to any parent objects
alive for the duration of their own lifetime.

This change moves ownership of the CUDA adapter struct into a global
shared_ptr, which allows child objects of the adapter to keep their own
references to it alive past the point where its initial definition goes
out of scope. It also adjusts how some other objects track parent-object
references so that the destructors correctly cascade back to the top:
the mem handle releases its context, which releases its adapter, which
releases the platform and devices, and so on.

Fixes intel#17450
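
Below is a minimal sketch of the ownership pattern the commit describes, using hypothetical, simplified stand-ins for the adapter's handle types (the member names and structure are not the actual adapter code from the PR): child handles hold a shared_ptr to their parent, so the adapter outlives every object created from it regardless of static destruction order.

```cpp
#include <memory>

struct ur_adapter_handle_t_ {
  ~ur_adapter_handle_t_() { /* final teardown, e.g. releasing the primary context */ }
};

// Global owner: destroying this shared_ptr during static teardown only frees
// the adapter if no child object still holds a reference to it.
std::shared_ptr<ur_adapter_handle_t_> GlobalAdapter =
    std::make_shared<ur_adapter_handle_t_>();

struct ur_context_handle_t_ {
  std::shared_ptr<ur_adapter_handle_t_> Adapter; // keeps the adapter alive
};

struct ur_mem_handle_t_ {
  std::shared_ptr<ur_context_handle_t_> Context; // keeps the context (and adapter) alive
};

// A user-app static sycl::buffer maps onto a ur_mem_handle_t_ whose destructor
// may run after GlobalAdapter's own destructor; the chained shared_ptrs make
// teardown cascade mem -> context -> adapter, so the adapter is released last.
```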