Add support to ndarray for DLPack version 1 #1175
Conversation
wjakob
left a comment
Hi @hpkfft,
this looks great, here is a first batch of comments from me. I feel like this change also needs some documentation.
If I have a project using nb::ndarray, what do I need to do to benefit from the new interfaces? Can I opt out? What are the implications for compatibility? These questions are relevant both for code that accepts DLPack-capable objects and for code that returns them.
Thanks!
Is this still a draft PR?
include/nanobind/ndarray.h (outdated)

```diff
 enum class dtype_code : uint8_t {
-    Int = 0, UInt = 1, Float = 2, Bfloat = 4, Complex = 5, Bool = 6
+    Int = 0, UInt = 1, Float = 2, Bfloat = 4, Complex = 5, Bool = 6,
+    Float8_e3m4 = 7, Float8_e4m3 = 8, Float8_e4m3b11fnuz = 9,
```
Minor: I would prefer the letters to be uppercase. e.g. Float8_E4M3.
Nothing. No. Only goodness.

When nanobind imports a DLPack-capable object, it first tries to call the object's `__dlpack__()` method requesting a versioned (DLPack 1.x) capsule, and falls back to the unversioned protocol if the producer does not understand the request (sketched after the list below). In the case of a versioned capsule, a flag bit can be set to indicate that the tensor is read-only. Nanobind honors this and creates a read-only nd-array.

On export, it depends on the framework:

- TensorFlow is unchanged: an unversioned capsule is passed to its DLPack import function.
- NumPy is unchanged: it first makes a new wrapper object providing the buffer interface, as before.
- PyTorch, JAX, and CuPy: nanobind creates a new versioned capsule and passes it to the framework's DLPack import function.
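Roughly, the import-side fallback looks like this (a sketch only, not nanobind's actual code; the helper name `import_capsule` is made up):

```cpp
#include <nanobind/nanobind.h>

namespace nb = nanobind;
using namespace nb::literals;

// Hypothetical helper illustrating the fallback: request a versioned
// (DLPack >= 1.0) capsule first, and retry with the legacy protocol if the
// producer does not accept the 'max_version' keyword argument.
nb::object import_capsule(nb::handle producer) {
    nb::object dlpack = producer.attr("__dlpack__");
    try {
        // New-style producers return a "dltensor_versioned" capsule
        // wrapping a DLManagedTensorVersioned.
        return dlpack("max_version"_a = nb::make_tuple(1, 0));
    } catch (nb::python_error &) {
        // Older producers raise TypeError on the unknown keyword; the
        // second call yields a legacy "dltensor" capsule instead.
        return dlpack();
    }
}
```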
Beautiful, thank you for this clarification. I guess there could be a performance cost when we try to import a tensor from an older framework that doesn't support versioned capsules (due to calling `__dlpack__()` multiple times), correct? But I suppose the impact of that should diminish over time.
One more potential optimization opportunity: do you think it would be possible to use the static object table to reduce all of these costly API calls and string comparisons to a few pointer comparisons? (This is from the function that checks whether an object is an ndarray.)

```cpp
PyObject *name = nb_type_name((PyObject *) tp);
check(name, "Could not obtain type name! (1)");

const char *tp_name = PyUnicode_AsUTF8AndSize(name, nullptr);
check(tp_name, "Could not obtain type name! (2)");

bool result =
    // PyTorch
    strcmp(tp_name, "torch.Tensor") == 0 ||
    // XLA
    strcmp(tp_name, "jaxlib.xla_extension.ArrayImpl") == 0 ||
    // Tensorflow
    strcmp(tp_name, "tensorflow.python.framework.ops.EagerTensor") == 0 ||
    // Cupy
    strcmp(tp_name, "cupy.ndarray") == 0;
```
Yes; if the framework doesn't support versioned capsules, `__dlpack__()` ends up being called twice.
Yes.
I don't think it would help. The problem is that the pointer comparison is only a fast path in front of a full string comparison. This short-circuiting is good for keyword argument names, since the pointer comparison is cheap and should be expected to succeed: keyword argument names used across API boundaries ought to be interned by both sides (in order to support this optimization). [but see footnote 1]

Now, consider the type-name comparisons in the function above. If the result is going to be false (the common case), then the pointer compare will be false as well, and we'll have to do either the `strcmp` or a full unicode comparison anyway.

The frameworks should implement this efficiently, and then it will be fast.

[footnote 1] The current (and past) release of NumPy does not intern these keyword argument names.
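The pattern in question, for concreteness (a generic sketch, not nanobind code; `str_equal` is a made-up name):

```cpp
#include <Python.h>

// Generic fast path: two interned strings with equal contents are the same
// object, so a pointer test settles most comparisons; otherwise fall back
// to a full character-by-character comparison.
static bool str_equal(PyObject *a, PyObject *b) {
    if (a == b)
        return true;                          // interned on both sides: cheap hit
    return PyUnicode_Compare(a, b) == 0;      // full comparison (assumes unicode inputs)
}
```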
My assumption was that Python type construction interns type and module names, so that pointer equality is legal.
That doesn't seem to be the case. Using Python 3.11.2 and adding a small diagnostic to the code, I see that the type and module name strings are not interned when running the tests.
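Such a diagnostic could look roughly like this (a sketch against the CPython C API; `report_interning` is a made-up name):

```cpp
#include <Python.h>
#include <cstdio>

// Hypothetical diagnostic: print whether a type's __name__ and __module__
// strings are interned. PyUnicode_CHECK_INTERNED() returns nonzero for
// interned strings (CPython-specific, not part of the limited API).
static void report_interning(PyTypeObject *tp) {
    PyObject *name = PyObject_GetAttrString((PyObject *) tp, "__name__");
    PyObject *mod  = PyObject_GetAttrString((PyObject *) tp, "__module__");
    if (name && mod && PyUnicode_Check(name) && PyUnicode_Check(mod))
        printf("%s.%s: name interned=%d, module interned=%d\n",
               PyUnicode_AsUTF8(mod), PyUnicode_AsUTF8(name),
               (int) PyUnicode_CHECK_INTERNED(name),
               (int) PyUnicode_CHECK_INTERNED(mod));
    Py_XDECREF(name);
    Py_XDECREF(mod);
}
```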
wjakob
left a comment
Hi @hpkfft,
sorry about the delay, here is some more minor feedback about this PR.
```cpp
X &operator=(const X &) = delete;
```

```diff
-#define NB_MOD_STATE_SIZE 80
+#define NB_MOD_STATE_SIZE 96
```
Does the module state size have any impact on the nanobind ABI? (I am thinking no since it will be per-module).
I am also thinking it's OK to increase the module state size, which is per-module, without increasing the nanobind ABI. Anyway, by luck, the ABI number was just increased. :)
```cpp
o = module_::import_(pkg_name)
        .attr(static_pyobjects[pyobj_name::from_dlpack_str])(o);
#else
PyObject* pkg_mod = module_import(pkg_name);
```
A related comment, as this PR is performance-focused: I was surprised to learn in separate discussions here on the tracker that `PyImport_ImportModule` seems to be a serious performance hog. Would it make sense to cache the NumPy module import for faster to-NumPy conversion? (And similarly for other frameworks.)
Yes, I saw that. And yes, I think the perf test `create_array_np` in the opening comment of this PR can be made about 100 ns faster. My hypothesis is that making our own `PyDict` would not be faster than `sys.modules`; the speedup would come from using a C++ cache instead.

I am thinking this could be done by adding an 8-entry module cache, using the per-module state for storage. The key could be a `void*`, so it could be a C++ `const char*` (literal strings are placed in the `.rodata` section of the ELF shared library, so their addresses are stable). That makes 128 B of storage: 8 key pointers and 8 module `PyObject*`. Linear search with pointer-equality comparison is probably faster than hashing.

We'd have to think about locking. Multiple readers are OK, so maybe `std::shared_mutex`.

I think this should be a separate PR. It seems like it can/should be a feature of `detail::module_import`, and no changes to `nb_ndarray.cpp` would be needed. (A rough sketch of the idea is shown below.)
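A minimal sketch of such a cache, assuming it lives in the per-module state (the names `module_cache` and `module_import_cached` are made up, and reference counting/error handling are simplified):

```cpp
#include <Python.h>
#include <mutex>
#include <shared_mutex>

// Hypothetical 8-entry cache: 8 key pointers + 8 module pointers = 128 bytes
// on a 64-bit platform. Keys are the addresses of string literals, so a
// linear scan with pointer equality replaces hashing and string comparison.
struct module_cache {
    const char *keys[8] = {};
    PyObject   *mods[8] = {};   // the cache holds one reference per entry
    std::shared_mutex mutex;
};

// 'key' must be the same string literal on every call so that its address
// is stable. Returns a new reference, or nullptr if the import failed.
inline PyObject *module_import_cached(module_cache &c, const char *key) {
    {   // fast path: multiple readers may scan the table concurrently
        std::shared_lock lock(c.mutex);
        for (int i = 0; i < 8; ++i) {
            if (c.keys[i] == key) {
                PyObject *m = c.mods[i];
                Py_INCREF(m);
                return m;
            }
        }
    }

    PyObject *mod = PyImport_ImportModule(key);   // slow path
    if (!mod)
        return nullptr;

    std::unique_lock lock(c.mutex);
    for (int i = 0; i < 8; ++i) {
        if (c.keys[i] == key)           // another thread cached it meanwhile
            break;
        if (c.keys[i] == nullptr) {     // free slot: store an extra reference
            Py_INCREF(mod);
            c.mods[i] = mod;
            c.keys[i] = key;            // publish the key last
            break;
        }
    }
    return mod;                         // if the table was full, caching is simply skipped
}
```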
Given that Python 3.14 supports subinterpreters in Python code (previously, subinterpreters were only available through the C API), maybe you want nanobind to support them? The "What's new in Python 3.14" page says, "Regarding extension modules, work is in progress to update some PyPI projects, as well as tools like Cython, pybind11, nanobind, and PyO3." I mention this because maybe that should be done first, so that the design of a module cache can accommodate the restrictions. We might have to add a pointer to type_data to replace static_pyobjects....
I haven't given it much thought, but probably my advice/preference would be to announce (in the changelog, etc.) that the upcoming release of nanobind will be the last one to support Python 3.8.
Then, after the release, remove 3.8 support.
Then support subinterpreters.
Then add a C++ cache for imports.
Maybe nanobind::ndarray_traits<T> can be removed (deprecated in #742)
This commit adds support for the struct ``DLManagedTensorVersioned`` as defined by DLPack version 1. It also adds the ndarray framework ``nb::array_api``, which returns an object that provides the buffer interface as well as the two DLPack methods ``__dlpack__()`` and ``__dlpack_device__()``.
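For example, a binding might use the new framework tag like this (a sketch only; the module and function names are made up):

```cpp
#include <nanobind/nanobind.h>
#include <nanobind/ndarray.h>

namespace nb = nanobind;

NB_MODULE(demo_ext, m) {
    // Returns an nb::array_api object: consumers can use either the buffer
    // interface (e.g. numpy.asarray) or __dlpack__()/__dlpack_device__()
    // (e.g. numpy.from_dlpack, torch.from_dlpack).
    m.def("make_data", []() {
        float *data = new float[8] { 0, 1, 2, 3, 4, 5, 6, 7 };
        nb::capsule owner(data, [](void *p) noexcept { delete[] (float *) p; });
        return nb::ndarray<nb::array_api, float, nb::shape<2, 4>>(
            data, { 2, 4 }, owner);
    });
}
```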
Great, thank you for incorporating the feedback. I will merge this as-is.
(I am also on board with this lightweight caching design.) Is there any strong reason to get rid of Python 3.8? If it's all the same, I would like to keep it around for another year or so.
Thanks for that initial feedback.

Maybe not. Probably not. I suppose suggesting to drop support comes from my feeling that I don't know what I don't know, and from thinking about subinterpreters and Petr's work in that area.
This PR adds support for DLPack version 1 and adds the ndarray framework `nb::array_api`, which returns an object that provides the buffer interface and has the two DLPack methods `__dlpack__()` and `__dlpack_device__()`.

Given the following:

I measure performance as follows:

using Python 3.14.