CCCL C: Avoid passing structs by value in C APIs #3952

Open
leofang opened this issue Feb 27, 2025 · 2 comments
Labels
CCCL-C: For all items related to the C library
cuda.parallel: For all items related to the cuda.parallel Python module

Comments

@leofang
Member

leofang commented Feb 27, 2025

Currently the C APIs take structs like cccl_iterator_t and cccl_op_t by value. It would be better for both performance and codegen reasons to take pointers to these structs instead. The change will also need to propagate to Python (cuda.parallel).
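To make the proposal concrete, here is a minimal C sketch of the two calling styles. `example_iterator_t` and both functions are hypothetical stand-ins: the fields are made up for illustration and do not mirror the real `cccl_iterator_t` or any actual CCCL entry point.

```c
#include <stddef.h>

/* Hypothetical descriptor struct, standing in for something like
   cccl_iterator_t. The fields are illustrative assumptions only. */
typedef struct example_iterator_t {
    size_t size;
    size_t alignment;
    void *state;
    const char *name;
} example_iterator_t;

/* Current style: the whole struct is copied into the callee's parameter. */
size_t example_size_by_value(example_iterator_t it)
{
    return it.size;
}

/* Proposed style: only an address crosses the API boundary.
   `const` documents that the callee does not modify the caller's struct. */
size_t example_size_by_pointer(const example_iterator_t *it)
{
    return it->size;
}
```

A caller would write `example_size_by_value(it)` today and `example_size_by_pointer(&it)` after the change; a binding layer only has to hand over a pointer in the second form.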

@leofang added the CCCL-C and cuda.parallel labels Feb 27, 2025
@github-project-automation bot moved this to Todo in CCCL Feb 27, 2025
@leofang changed the title from "Avoid passing structs by value in C APIs" to "CCCL C: Avoid passing structs by value in C APIs" Feb 27, 2025
@bernhardmgruber
Contributor

Please verify the claim with a benchmark. Passing small structs by value has become cheaper over the decades: the parameter cannot alias other memory and may be passed in registers, which allows more optimizations.
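The aliasing point can be shown with a small C sketch. `pair_t` and both functions are hypothetical and exist only for this illustration. In the pointer version the compiler must assume `out` may point into `*p`, so it has to reload `p->a` after the store; in the by-value version the parameter is a private copy that the store cannot touch.

```c
typedef struct pair_t {
    int a;
    int b;
} pair_t;

/* By pointer: `out` may alias `p->a`, so `p->a` is re-read after the store
   and the result depends on whether the caller passed overlapping pointers. */
int sum_twice_ptr(const pair_t *p, int *out)
{
    *out = p->a + p->b;
    return *out + p->a;
}

/* By value: `p` is a local copy, so the store through `out` cannot change
   `p.a`, and the compiler is free to keep it in a register. */
int sum_twice_val(pair_t p, int *out)
{
    *out = p.a + p.b;
    return *out + p.a;
}
```

With `pair_t p = {1, 2}`, calling `sum_twice_ptr(&p, &p.a)` returns 6 because the store overwrites `p.a` before it is read again, while `sum_twice_val(p, &p.a)` returns 4 because the callee still sees the original `a`.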

@leofang
Member Author

leofang commented Feb 27, 2025

Perf for now is only a potential concern that someone can help verify later (or maybe @tpn already knows! 🤩). We mainly use the C library from Python, and my understanding is that a by-value call involves two copies of the same struct:

  1. Preparing the struct data in Python
  2. Preparing an FFI call (this copy is alloca so on the stack)

So if we want to microbenchmark this, we need to do it through Python/FFI, not just by writing C code.
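As a rough model of the second copy listed above, the C sketch below stages one argument into a call-frame buffer the way an FFI layer might. This is a simplified assumption for illustration, not how ctypes or libffi is actually implemented; `example_op_t`, `stage_by_value`, and `stage_by_pointer` are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor struct; fields are illustrative assumptions. */
typedef struct example_op_t {
    int kind;
    size_t size;
    size_t alignment;
    void *state;
} example_op_t;

/* By-value argument: the full struct is copied into the frame.
   Returns the number of bytes staged. */
size_t stage_by_value(unsigned char *frame, const example_op_t *src)
{
    memcpy(frame, src, sizeof *src);
    return sizeof *src;
}

/* By-pointer argument: only the address is copied into the frame.
   Returns the number of bytes staged. */
size_t stage_by_pointer(unsigned char *frame, const example_op_t *src)
{
    memcpy(frame, &src, sizeof src);
    return sizeof src;
}
```

In this model the by-value path moves `sizeof(example_op_t)` bytes per argument and the by-pointer path moves `sizeof(void *)` bytes. Whether that difference is measurable in practice is exactly what a Python/FFI benchmark would have to show.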

But perf is not the main issue I want to address 😅 We maintain an internal Python binding generator that can eventually be used to remove the ctypes hack in cuda.parallel (#3854). As far as codegen is concerned, it is a lot cleaner to pass arguments by pointer instead of by value. Saving an extra copy is only a by-product.

Projects
Status: Todo
Development

No branches or pull requests

2 participants