[EPIC] Use work stealing in all relevant CUB algorithms #3871
Accumulating Algorithms
Retain accumulated partial result across tiles: At the "EOL" of a thread block, the accumulated state (e.g., histogram, partial reduction) can be retained and used for processing the next tile of work. This approach reduces global memory communication overhead and aligns with the existing
Relevant Algorithms:
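To illustrate retaining an accumulator across tiles, here is a minimal CPU-side sketch in standard C++. It is only a model, not the CCCL or CUB API: each `std::thread` stands in for a thread block, and a shared atomic tile counter stands in for the hardware work-stealing mechanism. The function name `work_stealing_sum` and its parameters are hypothetical.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// CPU model only (assumption): a std::thread plays the role of a thread block,
// and an atomic counter plays the role of the work-stealing hardware.
long long work_stealing_sum(const std::vector<int>& data,
                            std::size_t tile_size,
                            unsigned num_workers)
{
  assert(tile_size > 0 && num_workers > 0);
  const std::size_t num_tiles = (data.size() + tile_size - 1) / tile_size;
  std::atomic<std::size_t> next_tile{0};
  std::vector<long long> partials(num_workers, 0);

  auto worker = [&](unsigned id) {
    long long acc = 0; // retained across every tile this worker processes
    for (;;) {
      // "Steal" the next unprocessed tile.
      const std::size_t tile = next_tile.fetch_add(1, std::memory_order_relaxed);
      if (tile >= num_tiles) {
        break;
      }
      const std::size_t begin = tile * tile_size;
      const std::size_t end   = std::min(begin + tile_size, data.size());
      for (std::size_t i = begin; i < end; ++i) {
        acc += data[i];
      }
    }
    // One write per worker instead of one write per tile.
    partials[id] = acc;
  };

  std::vector<std::thread> threads;
  for (unsigned id = 0; id < num_workers; ++id) {
    threads.emplace_back(worker, id);
  }
  for (auto& t : threads) {
    t.join();
  }
  return std::accumulate(partials.begin(), partials.end(), 0LL);
}
```

Calling `work_stealing_sum(values, 64, 4)` sums `values` with four workers. The point of the sketch is that the number of partial results written to shared memory scales with the number of workers rather than with the number of tiles.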
Algorithms Involving Binary Search
Many algorithms require a binary search to assign work to thread blocks. For example, in
Relevant Algorithms:
Decoupled Look-Back-Based Algorithms
Through work stealing we can ensure that work on a subsequent tile is scheduled in a timely manner. A thread block can carry the result of its previous tile's inclusive prefix and integrate it once the look-back has reached that tile.
Relevant Algorithms:
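The idea of carrying the previous tile's inclusive prefix into the look-back can be sketched with a CPU-side model in standard C++. This is an illustration under assumptions, not CUB's implementation: threads stand in for thread blocks, an atomic counter stands in for work stealing, and the names `TileState` and `work_stealing_inclusive_scan` are hypothetical.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Per-tile descriptor of the CPU model. status: 0 = nothing published,
// 1 = tile aggregate published, 2 = inclusive prefix published.
struct TileState {
  std::atomic<int> status{0};
  long long aggregate = 0;
  long long inclusive = 0;
};

std::vector<long long> work_stealing_inclusive_scan(const std::vector<int>& data,
                                                    std::size_t tile_size,
                                                    unsigned num_workers)
{
  assert(tile_size > 0 && num_workers > 0);
  const std::size_t num_tiles = (data.size() + tile_size - 1) / tile_size;
  std::vector<long long> out(data.size());
  std::vector<TileState> states(num_tiles);
  std::atomic<std::size_t> next_tile{0};

  auto worker = [&] {
    const std::size_t none = static_cast<std::size_t>(-1);
    std::size_t carried_tile = none;  // last tile this worker finished
    long long carried_inclusive = 0;  // its inclusive prefix, kept locally
    for (;;) {
      const std::size_t tile = next_tile.fetch_add(1, std::memory_order_relaxed);
      if (tile >= num_tiles) {
        break;
      }
      const std::size_t begin = tile * tile_size;
      const std::size_t end   = std::min(begin + tile_size, data.size());

      long long aggregate = 0;
      for (std::size_t i = begin; i < end; ++i) {
        aggregate += data[i];
      }
      states[tile].aggregate = aggregate;
      states[tile].status.store(1, std::memory_order_release);

      // Look back over predecessors, newest first.
      long long exclusive = 0;
      for (std::size_t p = tile; p-- > 0;) {
        if (p == carried_tile) {
          // Reached our own previous tile: integrate the carried prefix
          // without polling or reading its descriptor.
          exclusive += carried_inclusive;
          break;
        }
        int s;
        while ((s = states[p].status.load(std::memory_order_acquire)) == 0) {
          std::this_thread::yield();
        }
        if (s == 2) {
          exclusive += states[p].inclusive;
          break;
        }
        exclusive += states[p].aggregate;
      }

      const long long inclusive = exclusive + aggregate;
      states[tile].inclusive = inclusive;
      states[tile].status.store(2, std::memory_order_release);

      long long running = exclusive;
      for (std::size_t i = begin; i < end; ++i) {
        running += data[i];
        out[i] = running;
      }
      carried_tile      = tile;
      carried_inclusive = inclusive;
    }
  };

  std::vector<std::thread> threads;
  for (unsigned id = 0; id < num_workers; ++id) {
    threads.emplace_back(worker);
  }
  for (auto& t : threads) {
    t.join();
  }
  return out;
}
```

Because tiles are handed out in increasing order and every worker publishes its aggregate before it waits on a predecessor, the look-back in this model cannot deadlock. When a worker processes consecutive tiles, its look-back ends after a single step.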
Early Termination
Once a thread block has found the element being searched for, it can "work steal" subsequent tiles while retaining the state that the item was already found, essentially cancelling the work of those subsequent tiles.
Relevant Algorithms:
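A hedged CPU-side sketch of early termination in standard C++ follows. As a simplification that is not taken from the issue, the "already found" state is kept in a shared atomic rather than inside the stealing thread block, and threads plus an atomic tile counter stand in for thread blocks and hardware work stealing. The name `work_stealing_find` is hypothetical.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Returns the index of the first occurrence of `needle`, or data.size() if
// it is absent. CPU model only (assumption): not the CUB or CCCL API.
std::size_t work_stealing_find(const std::vector<int>& data,
                               int needle,
                               std::size_t tile_size,
                               unsigned num_workers)
{
  assert(tile_size > 0 && num_workers > 0);
  const std::size_t n         = data.size();
  const std::size_t num_tiles = (n + tile_size - 1) / tile_size;
  std::atomic<std::size_t> next_tile{0};
  std::atomic<std::size_t> first_match{n}; // n means "not found yet"

  auto worker = [&] {
    for (;;) {
      // Tiles are handed out in increasing order, so once any match is known,
      // every tile that has not been claimed yet lies after it and can be
      // cancelled.
      if (first_match.load(std::memory_order_acquire) < n) {
        break;
      }
      const std::size_t tile = next_tile.fetch_add(1, std::memory_order_relaxed);
      if (tile >= num_tiles) {
        break;
      }
      const std::size_t begin = tile * tile_size;
      const std::size_t end   = std::min(begin + tile_size, n);
      for (std::size_t i = begin; i < end; ++i) {
        if (data[i] == needle) {
          // Atomic minimum: keep the smallest matching index seen so far.
          std::size_t cur = first_match.load(std::memory_order_relaxed);
          while (i < cur &&
                 !first_match.compare_exchange_weak(cur, i, std::memory_order_release,
                                                    std::memory_order_relaxed)) {
          }
          break;
        }
      }
    }
  };

  std::vector<std::thread> threads;
  for (unsigned id = 0; id < num_workers; ++id) {
    threads.emplace_back(worker);
  }
  for (auto& t : threads) {
    t.join();
  }
  return first_match.load();
}
```

Tiles that were already claimed before the match became visible are still scanned to completion, which is what keeps the result equal to the first occurrence rather than to any occurrence.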
Other Considerations
For algorithms currently following a single-CTA-per-segment model, work stealing is not immediately applicable. However, if we transition to a more load-balanced approach, similar optimizations to those described for
Relevant Algorithms:
That's an awesome analysis @elstehle !!! Thank you so much!
Yeah this is a great job @elstehle
Work stealing is a newly exposed feature in CCCL (via #3671) which allows a thread block to steal work from other not-yet-launched thread blocks. This feature is also known as cluster launch control or UGETNEXTWORKID.
Work stealing allows load balancing and early termination at virtually no overhead, which should be generally beneficial. Since Blackwell, this feature enjoys hardware acceleration.
We should evaluate where we can use this feature in CUB and then add it wherever it shows a benefit.