
[FEA]: work-stealing improvements meta-issue #3870

Open
1 of 12 tasks
gonzalobg opened this issue Feb 20, 2025 · 0 comments
Labels
feature request New feature or request.

gonzalobg commented Feb 20, 2025

Is this a duplicate?

Area

libcu++

Is your feature request related to a problem? Please describe.

#3671 proposes an MVP for work-stealing.

The following features did not make it into the MVP, but can be evaluated later and, if deemed worthwhile, pursued:

  • Cluster-level work stealing: see Tune cub::DeviceTransform for Blackwell #3543.
  • Explore whether two different APIs (for_each_canceled_block/_cluster) are required, or whether a single API suffices (e.g., for_each_canceled_block could detect a cluster, switch to multi-cast, and cancel the entire cluster).
  • The API currently doesn't provide any memory-ordering guarantees, but the sm100 implementation does provide them. We should consider guaranteeing memory ordering by default, and providing a cuda::memory_order argument that allows opting out if doing so improves performance.
  • Performance fine-tuning: add benchmarks for these APIs, and use them to tune the generated code.
  • Evaluate replacing the inline asm inside the API with cuda::ptx. It would make sense to have a benchmark first to verify the impact.
  • Add a better example to the documentation in which thread blocks have a prologue and an epilogue, showing how to handle those cases (e.g., a histogram).
    • Add these examples as tests.
    • Do a blog-post afterwards?
  • Control over leader thread and leader block for easier integration into warp-specialized kernels.
  • Stateless API for easier integration into kernels with tight resource constraints (e.g., a cuda::for_each_cancelled_state type that enables programmers to control internal barrier storage and lifetime, similar to cuda::pipeline).
    • Currently, this API only implements a 1-stage pipeline that uses 8 bytes of shared memory per block, which is deemed cheap enough for an "easy mode" in which the API manages this storage internally. If we support wider pipelining, the value of a stateless API increases.
  • Wider pipelining, e.g., support for pipelines with more than one stage.
    • For example, via an int NStages = 1 template parameter in the APIs and in the API state.

Describe the solution you'd like

See above.

Describe alternatives you've considered

See above.

Additional context

No response

@gonzalobg gonzalobg added the feature request New feature or request. label Feb 20, 2025
@github-project-automation github-project-automation bot moved this to Todo in CCCL Feb 20, 2025
@gonzalobg gonzalobg changed the title [FEA]: work_stealing improvements meta-issue [FEA]: work-stealing improvements meta-issue Feb 20, 2025