Is this a duplicate?
Area
libcu++
Is your feature request related to a problem? Please describe.
#3671 proposes an MVP for work-stealing. The following features did not make the MVP but can be evaluated and, if deemed worthwhile, pursued later:
Explore whether two different APIs, for_each_canceled_block and for_each_canceled_cluster, are required, or whether a single API suffices (e.g., for_each_canceled_block could detect a cluster, switch to multi-cast, and cancel the entire cluster).
The API currently doesn't provide any memory-ordering guarantees, but the sm100 implementation does provide them. We should consider guaranteeing memory ordering by default and, if opting out can improve performance, providing a cuda::memory_order argument that allows it.
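One possible shape for such an opt-out is sketched below. Everything here is an assumption, not the current API: the overload, the default ordering, and the exact signature of for_each_canceled_block are all hypothetical.

```cuda
// Hypothetical overload (not part of the current API): memory ordering is
// guaranteed by default, and a cuda::memory_order argument allows opting out.
#include <cuda/atomic>  // memory_order (exact header/namespace assumed)

namespace cuda {
template <class UnaryFunction>
__device__ void for_each_canceled_block(
    UnaryFunction f,
    memory_order order = memory_order_acq_rel);  // relaxed = opt out
} // namespace cuda

// Usage sketch: a caller that performs its own synchronization opts out:
//   cuda::for_each_canceled_block(work, cuda::memory_order_relaxed);
```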
Performance fine-tuning: add benchmarks for these APIs and use them to tune the generated code.
Evaluate replacing the inline asm inside the API with cuda::ptx. It would make sense to have a benchmark first to verify the impact.
Add a better example to the documentation in which thread blocks have a prologue and an epilogue, showing how to handle those cases (e.g., a histogram).
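As a sketch of what such a documentation example could look like (the for_each_canceled_block name and signature follow the MVP proposal in #3671 and are assumptions here):

```cuda
// Sketch only: a shared-memory histogram whose prologue (zeroing bins) and
// epilogue (flushing bins to global memory) run once per physical thread
// block, while the work-stealing loop may process several canceled blocks.
__global__ void histogram_u8(const unsigned char* in, int n, unsigned int* out) {
  __shared__ unsigned int bins[256];

  // Prologue: run once per physical block, not once per stolen block.
  for (int i = threadIdx.x; i < 256; i += blockDim.x) bins[i] = 0;
  __syncthreads();

  // Body: may execute for the block's own index and any stolen indices.
  cuda::for_each_canceled_block([&](dim3 block_idx) {
    int idx = block_idx.x * blockDim.x + threadIdx.x;
    if (idx < n) atomicAdd(&bins[in[idx]], 1u);
  });

  // Epilogue: run once per physical block.
  __syncthreads();
  for (int i = threadIdx.x; i < 256; i += blockDim.x) {
    if (bins[i] != 0) atomicAdd(&out[i], bins[i]);
  }
}
```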
Add these examples as tests.
Write a blog post afterwards?
Control over the leader thread and leader block, for easier integration into warp-specialized kernels. There are __for_each_canceled_xxx APIs that support these, but they are non-public and non-stable for now.
Stateless API for easier integration into kernels with tight resource constraints (e.g., a cuda::for_each_cancelled_state type that enables programmers to control internal barrier storage and lifetime, similar to cuda::pipeline).
Currently, this API only implements a 1-stage pipeline that uses 8 bytes of shared memory per block, which is deemed cheap enough to warrant an "easy mode" in which the API manages this storage internally. If we support wider pipelining, a stateless API becomes more worthwhile.
Wider pipelining, i.e., support for pipelines with more than one stage, e.g., via an int NStages = 1 template parameter on the APIs and the API state.
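The stateless and multi-stage variants above could, for instance, take the following shape. Every declaration below is hypothetical; nothing here exists in libcu++ today.

```cuda
// Hypothetical declarations only, sketching the ideas discussed above.
namespace cuda {

// Wider pipelining via a template parameter; NStages = 1 matches the current
// behavior (a 1-stage pipeline using 8 bytes of shared memory per block).
template <int NStages = 1, class UnaryFunction>
__device__ void for_each_canceled_block(UnaryFunction f);

// Stateless variant: the caller controls the storage and lifetime of the
// internal barrier/result state, similar in spirit to cuda::pipeline with
// cuda::pipeline_shared_state.
template <int NStages = 1>
struct for_each_cancelled_state;  // user-provided (e.g., shared-memory) storage

template <int NStages, class UnaryFunction>
__device__ void for_each_canceled_block(for_each_cancelled_state<NStages>& state,
                                        UnaryFunction f);

} // namespace cuda
```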
Describe the solution you'd like
See above.
Describe alternatives you've considered
See above.
Additional context
No response