Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[EPIC] Use work stealing in all relevant CUB algorithms #3871

Open
17 tasks
bernhardmgruber opened this issue Feb 20, 2025 · 3 comments
Open
17 tasks

[EPIC] Use work stealing in all relevant CUB algorithms #3871

bernhardmgruber opened this issue Feb 20, 2025 · 3 comments

Comments

@bernhardmgruber
Copy link
Contributor

Work stealing is a newly exposed feature in CCCL (via #3671) which allows a thread block to steal work from other not-yet-launched thread blocks. This feature is also known as cluster launch control or UGETNEXTWORKID.

Work stealing allows load balancing and early termination at virtually no overhead, which should be generally beneficial. Since Blackwell, this feature enjoyes hardware acceleration.

We should evaluate where we can use this feature in CUB and then add it wherever it shows a benefit.

  • DeviceAdjacentDifference
  • DeviceCopy
  • DeviceFor
  • DeviceHistogram
  • DeviceMemcpy
  • DeviceMergeSort
  • DeviceMerge
  • DevicePartition
  • DeviceRadixSort
  • DeviceReduce
  • DeviceRunLengthEncode
  • DeviceScan
  • DeviceSegmentedRadixSort
  • DeviceSegmentedReduce
  • DeviceSegmentedSort
  • DeviceSelect
  • DeviceTransform
@elstehle
Copy link
Collaborator

elstehle commented Feb 22, 2025

Accumulating Algorithms

Retain accumulated partial result across tiles: At the "EOL" of a thread block, the accumulated state (e.g., histogram, partial reduction) can be retained and used for processing the next tile of work. This approach reduces global memory communication overhead and aligns with the existing GridEvenShare strategy in DeviceReduce. The current GridEvenShare can lead to load imbalance, particularly when workload distribution is irregular. Introducing a work-stealing mechanism would help mitigate this potential imbalance by dynamically redistributing work among underutilized thread blocks.

Relevant Algorithms:

  • DeviceHistogram
  • DeviceReduce
  • DeviceSegmentedReduce

Algorithms Involving Binary Search

Many algorithms require a binary search to assign work to thread blocks. For example, in DeviceCopy and DeviceMemcpy, each thread block determines its assigned range via a binary search. Since the lower bound of the search range is non-decreasing across tiles, we can propagate this information forward, effectively reducing the search range for successor tiles. This optimization can also be applied to merging operations, where we use the MergePath algorithm: by communicating the lower bounds of the left-hand-side (lhs) and right-hand-side (rhs) merge paths across tiles, we can reduce redundant computation.

Relevant Algorithms:

  • DeviceCopy
  • DeviceMemcpy
  • DeviceMergeSort
  • DeviceMerge

Decoupled Look-Back-Based Algorithms

Through work stealing we can ensure that work on a subsequent tile is scheduled in a timely manner. A thread block can carry the result of its previous tile's inclusive prefix and integrate it once the look-back has reached that tile.

Relevant Algorithms:

  • DevicePartition
  • DeviceRunLengthEncode
  • DeviceScan
  • DeviceSelect

Early termination

Once a thread block has found the element being searched for, we can "work steal" subsequent tiles retaining the state that the item was already found. Essentially, cancelling work of subsequent tiles.

Relevant Algorithms:

Other Considerations

For algorithms currently following a single-CTA-per-segment model, work stealing is not immediately applicable. However, if we transition to a more load-balanced approach, similar optimizations to those described for DeviceCopy and DeviceMemcpy may become viable.

Relevant Algorithms:

  • DeviceSegmentedRadixSort
  • DeviceSegmentedSort

@bernhardmgruber
Copy link
Contributor Author

That's an awesome analysis @elstehle !!! Thank you so much!

@miscco
Copy link
Collaborator

miscco commented Feb 24, 2025

Yeah this is a great job @elstehle

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Todo
Development

No branches or pull requests

3 participants