[SYCL][UR][L0 v2] Fix issues with event lifetime #18962


Open · wants to merge 2 commits into sycl from v2_lifetime_fixes

Conversation

@igchor (Member) commented Jun 12, 2025

When releasing the executionEvent in the commandBuffer, we also need to retain the queue. Otherwise, event->release() in commandBuffer's destructor might attempt to return the event to an event pool that has already been destroyed.
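
A minimal sketch of the intended ownership (type and member names here are illustrative, not the adapter's exact ones):

```cpp
// Sketch only: names are hypothetical, but the shape matches the description.
// The command buffer holds a reference on the queue, so the queue's event
// pool is guaranteed to still exist when executionEvent is returned to it.
struct command_buffer_t {
  ur_queue_handle_t hQueue;                   // retained for the buffer's lifetime
  ur_event_handle_t executionEvent = nullptr;

  explicit command_buffer_t(ur_queue_handle_t q) : hQueue(q) {
    urQueueRetain(hQueue);                    // keep queue (and its event pool) alive
  }
  ~command_buffer_t() {
    if (executionEvent)
      executionEvent->release();              // safe: the pool still exists
    urQueueRelease(hQueue);                   // pool may be torn down only now
  }
};
```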

Also, make sure eventPool is destroyed after commandListManager, to avoid issues with events owned by the commandListManager, and move the context retain/release from commandListManager to queue/command_buffer.
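
Since C++ destroys non-static data members in reverse declaration order, "destroyed after" translates into "declared before"; a sketch with hypothetical member names:

```cpp
// Hypothetical layout; real field types/names in the adapter may differ.
// Members are destroyed bottom-up, so commandListManager (and any events it
// still owns) goes away first, and eventPool is torn down last.
struct queue_t {
  event_pool eventPool;                    // declared first => destroyed last
  command_list_manager commandListManager; // declared last => destroyed first
};
```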

The commandListManager implementation marked the move constructor and move assignment operator as defaulted, which was wrong: a correct implementation would have to account for context and device retain/release calls (a defaulted move copies the raw handles without an extra retain, so both the moved-from and moved-to objects would release them). To avoid this, move the logic to queue/commandBuffer, which can have their move constructors/assignment operators deleted.
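
The resulting pattern might look like this (hypothetical names; the retain/release pairing lives in the constructor/destructor, and deleting the move operations makes it impossible for a defaulted move to break the pairing):

```cpp
// Sketch of the ownership pattern (names hypothetical). A defaulted move would
// copy hContext without an extra retain, so both destructors would release it;
// deleting the move operations rules that out.
struct queue_t {
  ur_context_handle_t hContext;

  explicit queue_t(ur_context_handle_t ctx) : hContext(ctx) {
    urContextRetain(hContext);
  }
  ~queue_t() { urContextRelease(hContext); }

  queue_t(queue_t &&) = delete;
  queue_t &operator=(queue_t &&) = delete;
};
```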

@pbalcer (Contributor) commented Jun 13, 2025

> When releasing the executionEvent in the commandBuffer, we also need to retain the queue. Otherwise, event->release() in commandBuffer's destructor might attempt to return the event to an event pool that has already been destroyed.

Event pools are released to the context when the queue is released. So this would mean that the command buffer outlives the context itself, not just the queue.

@pbalcer (Contributor) left a comment

lgtm, but please add a test that releases a command buffer and queue before using the command buffer for the last time.
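
Such a test might look roughly like this (a sketch, not the exact CTS code: the entry points follow the UR experimental command-buffer API from memory and may differ, and ASSERT_SUCCESS is the CTS helper macro):

```cpp
// Sketch of the requested regression test: release the queue before the
// command buffer's last use, then make sure synchronization still works.
ur_exp_command_buffer_handle_t cmdBuf;
ASSERT_SUCCESS(urCommandBufferCreateExp(context, device, &desc, &cmdBuf));
// ... append commands to cmdBuf ...
ASSERT_SUCCESS(urCommandBufferFinalizeExp(cmdBuf));

ur_event_handle_t event;
ASSERT_SUCCESS(urEnqueueCommandBufferExp(queue, cmdBuf, 0, nullptr, &event));

// Drop the queue first: the fix must keep whatever the execution event needs
// alive until the event itself is released.
ASSERT_SUCCESS(urQueueRelease(queue));

ASSERT_SUCCESS(urEventWait(1, &event));
ASSERT_SUCCESS(urEventRelease(event));
ASSERT_SUCCESS(urCommandBufferReleaseExp(cmdBuf));
```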

@igchor (Member, Author) commented Jun 13, 2025

> > When releasing the executionEvent in the commandBuffer, we also need to retain the queue. Otherwise, event->release() in commandBuffer's destructor might attempt to return the event to an event pool that has already been destroyed.

> Event pools are released to the context when the queue is released. So this would mean that the command buffer outlives the context itself, not just the queue.

Yes, and we could actually solve this problem by retaining the context in the command buffer rather than the queue. However, there is a potential problem with the execution event: if the queue associated with that event is destroyed and we later want to synchronize on the execution event, we might crash, because (as far as I know) counter-based events rely on the associated command list to keep some of their state, and the command list is potentially destroyed/reused as soon as the queue is released. I did solve this problem by retaining the queue in the command buffer, but now that I think about it, it is a more general problem: what if someone wants to pass an event as a dependency to some operation after the associated queue is destroyed?
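
As a call sequence, the general hazard looks like this (illustrative; error handling omitted, dst/src/size assumed to be a valid USM allocation):

```cpp
ur_event_handle_t ev = nullptr;
// any enqueue that produces a counter-based event, e.g. a USM copy
urEnqueueUSMMemcpy(queue, /*blocking=*/false, dst, src, size, 0, nullptr, &ev);

urQueueRelease(queue); // last reference: command lists may be destroyed/reused

urEventWait(1, &ev);   // may crash: the event's state lived in the command list
```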

I think we might actually need to start retaining the queue when allocating events (which we have tried to avoid until now). One alternative I'm considering is to put some logic in the event pool. For example, we could create another RAII wrapper that internally stores a cache_borrowed_event_pool* and, if eventPool->events.size() != eventPool->freelist.size(), delays returning the event pool to the context and destroying the queue until all events are released (by adding a callback to eventPool::free).
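
Something along these lines (a sketch; cache_borrowed_event_pool, events, and freelist are the names from above, while setFreeCallback stands in for the hypothetical hook on eventPool::free):

```cpp
#include <functional>

// Sketch of the deferred-teardown idea; setFreeCallback is hypothetical.
struct event_pool_guard {
  cache_borrowed_event_pool *pool;
  std::function<void()> onAllReleased; // returns pool to context, releases queue

  ~event_pool_guard() {
    if (pool->events.size() == pool->freelist.size()) {
      onAllReleased(); // every event is back in the freelist: tear down now
    } else {
      // events still in flight: run teardown from the last eventPool::free
      pool->setFreeCallback(std::move(onAllReleased));
    }
  }
};
```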

@igchor igchor requested a review from a team as a code owner June 13, 2025 16:57
@pbalcer (Contributor) commented Jun 16, 2025

:(

> I think we might actually need to start retaining the queue when allocating events (which we have tried to avoid until now). One alternative I'm considering is to put some logic in the event pool. For example, we could create another RAII wrapper that internally stores a cache_borrowed_event_pool* and, if eventPool->events.size() != eventPool->freelist.size(), delays returning the event pool to the context and destroying the queue until all events are released (by adding a callback to eventPool::free).

I think this might be the best option. I'll think on it.

But let's do that as a separate PR.

@igchor force-pushed the v2_lifetime_fixes branch from 6d6a4c7 to c08ea6f on June 16, 2025
@igchor (Member, Author) commented Jun 16, 2025

> :(

> > I think we might actually need to start retaining the queue when allocating events (which we have tried to avoid until now). One alternative I'm considering is to put some logic in the event pool. For example, we could create another RAII wrapper that internally stores a cache_borrowed_event_pool* and, if eventPool->events.size() != eventPool->freelist.size(), delays returning the event pool to the context and destroying the queue until all events are released (by adding a callback to eventPool::free).

> I think this might be the best option. I'll think on it.

> But let's do that as a separate PR.

Yes, for now I think we can just merge the fix with context retain/release; this is enough for all the current tests to pass (with an out-of-order queue).

@igchor (Member, Author) commented Jun 16, 2025

@intel/sycl-graphs-reviewers could you please take a look?
