
Conversation

@hpkfft
Contributor

@hpkfft hpkfft commented Sep 29, 2025

This PR adds support for DLPack version 1 and adds the ndarray framework nb::arrayapi, which returns an object that provides the buffer interface and has the two DLPack methods __dlpack__() and __dlpack_device__().

Given the following:

#include <cstddef>

#include <nanobind/nanobind.h>
#include <nanobind/ndarray.h>

namespace nb = nanobind;

using array_t    = nb::ndarray<float, nb::ndim<1>, nb::c_contig>;
using array_np_t = nb::ndarray<float, nb::ndim<1>, nb::c_contig, nb::numpy>;

// Fill an existing one-dimensional float array with ones.
void init_array(const array_t& a) {
    const std::size_t n = a.shape(0);
    float* ptr = a.data();
    for (std::size_t i = 0; i < n; ++i) ptr[i] = 1.0f;
}

// Allocate a float array of length n and return it as a NumPy array that
// owns the memory via a capsule deleter.
array_np_t create_array_np(std::size_t n) {
    float* ptr = new float[n];
    nb::capsule deleter(ptr, [](void* p) noexcept { delete[] (float*) p; });
    return array_np_t(ptr, {n}, std::move(deleter));
}

NB_MODULE(my_extension, m) {
    m.doc() = "nanobind my_extension module";
    m.def("init_array",      &init_array,      "Initialize array.");
    m.def("create_array_np", &create_array_np, "Create NumPy array.");
}

I measure performance as follows:

| test               | old    | new    | ratio |
|--------------------|--------|--------|-------|
| init_array(array)  | 435 ns | 278 ns | 1.56  |
| init_array(numpy)  | 160 ns | 111 ns | 1.44  |
| create_array_np    | 565 ns | 450 ns | 1.25  |

using Python 3.14 and

python3 -m timeit -n 10000000 -r 10 -s "import array, my_extension as me; a = array.array('f', [1,2,3,4,5,6,7,8])" "me.init_array(a)"

python3 -m timeit -n 10000000 -r 10 -s "import numpy as np, my_extension as me; a = np.zeros(8, dtype=np.float32)" "me.init_array(a)"

python3 -m timeit -n 1000000 -r 10 -s "import numpy as np, my_extension as me;" "me.create_array_np(8)"

Owner

@wjakob wjakob left a comment

Hi @hpkfft,

this looks great; here is a first batch of comments from me. I feel like this change also needs some documentation.

If I have a project using nb::ndarray, what do I need to do to benefit from the new interfaces? Can I opt out? What are the implications for compatibility? These questions are relevant both for code that accepts DLPack-capable objects and for code that returns them.

Thanks!

@wjakob
Owner

wjakob commented Oct 17, 2025

Is this still a draft PR?

 enum class dtype_code : uint8_t {
-    Int = 0, UInt = 1, Float = 2, Bfloat = 4, Complex = 5, Bool = 6
+    Int = 0, UInt = 1, Float = 2, Bfloat = 4, Complex = 5, Bool = 6,
+    Float8_e3m4 = 7, Float8_e4m3 = 8, Float8_e4m3b11fnuz = 9,
Owner

Minor: I would prefer the letters to be uppercase, e.g. Float8_E4M3.

@hpkfft
Contributor Author

hpkfft commented Oct 17, 2025

If I have a project using nb::ndarray, what do I need to benefit from the new interfaces? Can I opt out? What are the implications on compatibility?

Nothing. No. Only goodness.

When nanobind imports a DLPack-capable object, it first tries to call the object's __dlpack__() method with the keyword argument "max_version" set to (1, 1), indicating that nanobind can accept a versioned tensor. (The minor version is irrelevant.) The object can return either the old unversioned tensor or a versioned tensor; either way, nanobind performs the import. If the object cannot accept the kwarg at all (i.e., raises TypeError), nanobind calls __dlpack__() without any kwargs and imports the unversioned tensor. (In theory that result could be versioned, which would be a bug in their code, but in practice an object that doesn't even know about "max_version" doesn't know about versioned tensors either.)
If the object is not DLPack-capable, nanobind tries to import using the buffer protocol.
If that doesn't work, nanobind tries to call to_dlpack(obj) on the framework to get an unversioned capsule. [This is very obsolete, but the code was there, so might as well keep it.]

In the case of a versioned capsule, a flag bit can be set to indicate that the tensor is read-only. Nanobind honors this and creates a read-only ndarray.
In the case of an unversioned capsule, nanobind assumes it's writable. As before, it would be the user's responsibility to know if that's not the case and to refrain from actually writing to it.
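
Schematically, the import negotiation looks like this (a rough sketch using the plain C API, with a made-up helper name and abbreviated error handling; not the PR's actual code):

    #include <Python.h>

    /* Returns a new reference to the capsule produced by __dlpack__(), or
       nullptr so the caller can continue with the buffer-protocol and
       to_dlpack() fallbacks. */
    PyObject *import_dlpack_capsule(PyObject *obj) {
        PyObject *method = PyObject_GetAttrString(obj, "__dlpack__");
        if (!method) {                    // not DLPack-capable
            PyErr_Clear();
            return nullptr;
        }

        PyObject *args   = PyTuple_New(0);
        PyObject *kwargs = Py_BuildValue("{s:(ii)}", "max_version", 1, 1);

        /* First attempt: announce that a versioned tensor is acceptable.
           The producer may still return an unversioned capsule. */
        PyObject *capsule = PyObject_Call(method, args, kwargs);
        if (!capsule && PyErr_ExceptionMatches(PyExc_TypeError)) {
            /* The producer predates the "max_version" kwarg: retry without
               kwargs and expect an unversioned capsule. */
            PyErr_Clear();
            capsule = PyObject_Call(method, args, nullptr);
        }

        Py_DECREF(kwargs);
        Py_DECREF(args);
        Py_DECREF(method);
        return capsule;
    }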

On export, it depends on the framework.

no_framework is unchanged. It continues to return an unversioned capsule for backward compatibility.

TensorFlow is unchanged. An unversioned capsule is passed to tensorflow.experimental.dlpack.from_dlpack(). Their online docs show that's the thing to do.

arrayapi is new. It returns an object of type nanobind.nb_ndarray, which supports both the buffer protocol and the DLPack __dlpack__() and __dlpack_device__() methods. The __dlpack__() method accepts and honors the keyword argument "max_version" and returns a versioned tensor if and only if the value, a tuple[int, int], has a first component (i.e., major version) of at least 1. (If the value is None, or the keyword argument is missing, that is equivalent to passing a maximum major version of 0.)

NumPy is unchanged. It first makes a new nanobind.nb_ndarray and then passes it to NumPy, which imports it using the buffer protocol. I did not see a performance improvement in changing to DLPack. Also, numpy.array() supports a "copy" keyword argument, so if a copy is needed, it's done in the same call without having to subsequently call a copy() or clone() function.

memview is unchanged. It uses the buffer protocol on a new nanobind.nb_ndarray object.

PyTorch, JAX, and CuPy: nanobind creates a new nanobind.nb_ndarray object and then passes that to the framework's from_dlpack() function. That's not different per se, but these frameworks can now call our __dlpack__() with a maximum major version of 1 (and any minor version) and get a versioned tensor in return. They can also pass a maximum major version of 0 and get an unversioned tensor, as before. Or, pass max_version=None, or omit the keyword argument, and get an unversioned tensor, as before.
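
For example, create_array_np from the opening comment could be retargeted at the new framework (assuming the framework tag is spelled nb::array_api; the function name here is just for illustration, and the includes and nb alias from that snippet are reused):

    using array_dl_t = nb::ndarray<float, nb::ndim<1>, nb::c_contig, nb::array_api>;

    // Same allocation pattern as create_array_np, but the returned Python
    // object is a nanobind.nb_ndarray exposing the buffer protocol plus
    // __dlpack__() and __dlpack_device__().
    array_dl_t create_array_dl(std::size_t n) {
        float* ptr = new float[n];
        nb::capsule deleter(ptr, [](void* p) noexcept { delete[] (float*) p; });
        return array_dl_t(ptr, {n}, std::move(deleter));
    }

A DLPack consumer such as numpy.from_dlpack() or torch.from_dlpack() should then be able to import the result without a copy.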

@wjakob
Owner

wjakob commented Oct 17, 2025

Beautiful, thank you for this clarification. I guess there could be a performance cost when we try to import a tensor from an older framework that doesn't support versioned capsules (due to calling dlpack multiple times), correct? But I suppose the impact of that should diminish over time.

@wjakob
Owner

wjakob commented Oct 17, 2025

One more potential optimization opportunity. Do you think that it would be possible to use the static object table to reduce all of these costly API calls and string comparisons to a few pointer comparisons? (this is from the function that checks if an object is an ndarray).

    PyObject *name = nb_type_name((PyObject *) tp);
    check(name, "Could not obtain type name! (1)");

    const char *tp_name = PyUnicode_AsUTF8AndSize(name, nullptr);
    check(tp_name, "Could not obtain type name! (2)");

    bool result =
        // PyTorch
        strcmp(tp_name, "torch.Tensor") == 0 ||
        // XLA
        strcmp(tp_name, "jaxlib.xla_extension.ArrayImpl") == 0 ||
        // Tensorflow
        strcmp(tp_name, "tensorflow.python.framework.ops.EagerTensor") == 0 ||
        // Cupy
        strcmp(tp_name, "cupy.ndarray") == 0;

@hpkfft hpkfft marked this pull request as ready for review October 18, 2025 04:50
@hpkfft
Contributor Author

hpkfft commented Oct 18, 2025

I guess there could be a performance cost when we try to import a tensor from an older framework that doesn't support versioned capsules (due to calling dlpack multiple times), correct?

Yes, if __dlpack__(max_version=(1, 1)) fails and then __dlpack__() succeeds, we spend some time on the first call, an extra call that nanobind does not currently make. But that's unavoidable.
The max_version kwarg was added in Python array API standard v2023.12.
Note that a framework could trivially add support for max_version by simply accepting it as a kwarg and then ignoring it.
It's not required to return a versioned tensor when the caller asks for one.
It's always OK to return an unversioned tensor.
It is prohibited to return a versioned tensor unless the max_version is (1, 0) or greater.
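
That producer-side rule boils down to a tiny check; an illustrative sketch (plain C API, made-up helper name, argument validation omitted):

    #include <Python.h>

    /* Given the object passed for "max_version" (or nullptr if the kwarg was
       absent), decide whether a versioned capsule may be returned.  A missing
       value or None is treated as a maximum major version of 0, so only the
       unversioned layout is allowed in that case. */
    bool may_return_versioned(PyObject *max_version) {
        if (!max_version || max_version == Py_None)
            return false;
        PyObject *major = PyTuple_GetItem(max_version, 0);   // borrowed reference
        return major != nullptr && PyLong_AsLong(major) >= 1;
    }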

But I suppose the impact of that should diminish over time.

Yes.

Do you think that it would be possible to use the static object table to reduce all of these costly API calls and string comparisons to a few pointer comparisons [in ndarray_check]?

I don't think it would help.

The problem is that the pointer comparison name == something only succeeds if both name and something are the same object, which can be achieved if they have both been interned. We can intern something, but we can't intern name, which is whatever was set as the type name of the object. In other words, if the pointer comparison succeeds, then we know the strings are equal since they are the same object. But even if they are not the same object, they may still be the same UTF8 string.
In nb_ndarray_dlpack(), there is now the following code to check whether key is UTF8 string "max_version":

    if (key == static_pyobjects[pyobj_name::max_version_str] ||
        PyObject_RichCompareBool(key, static_pyobjects[pyobj_name::max_version_str], Py_EQ) == 1) {

This short-circuiting is good, since the pointer comparison is cheap and should be expected to succeed, because keyword argument names used across API boundaries ought to be interned by both sides (in order to support this optimization). [but see footnote 1]
If there are multiple kwnames, each key should be pointer compared to all supported names before doing any RichCompares. Hopefully, all keys pointer compare equal to some expected name and there's no need to do any RichCompares.
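
As a sketch of that ordering (made-up helper name; illustrative only, not what the PR does today):

    #include <Python.h>

    /* Compare a keyword name against every supported (interned) name by
       pointer first, and only fall back to string comparison when no pointer
       matches.  Returns the index of the matching name, or -1 if none. */
    Py_ssize_t match_kwarg(PyObject *key, PyObject *const expected[], Py_ssize_t n) {
        for (Py_ssize_t i = 0; i < n; ++i)
            if (key == expected[i])                 // cheap: both sides interned
                return i;
        for (Py_ssize_t i = 0; i < n; ++i)
            if (PyObject_RichCompareBool(key, expected[i], Py_EQ) == 1)
                return i;                           // equal strings, distinct objects
        return -1;
    }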

Now, consider ndarray_check.
If the result will be true, the common cases (PyObject either has attribute __dlpack__, or supports the buffer protocol, or is a PyCapsule) are all tested first, and the function returns before reaching the existing string comparisons.

If the result will be false, then the pointer compare will be false, and we'll have to do either strcmp or PyObject_RichCompareBool anyway to be sure the strings are not the same UTF-8 strings (despite being different PyObjects). (And the former, as used now, seems likely to be faster than the latter.)

The frameworks should implement __dlpack__() from Python array API standard v2021.12.
Then the test we have now,

    if (PyObject_HasAttr(o, static_pyobjects[pyobj_name::dunder_dlpack_str]) ||
        PyObject_CheckBuffer(o))
        return true;

will be fast.

[footnote 1] The current (and past) release of NumPy does not intern "dl_device", "copy", or "max_version", so nanobind does the RichCompare, which succeeds. This is fixed in the development version by numpy/numpy#29875. So, nanobind will be a bit faster with the next release of NumPy.

@wjakob
Owner

wjakob commented Oct 18, 2025

I don't think it would help.

My assumption was that the Python type construction would intern type and module names so that pointer equality is legal.

@hpkfft
Contributor Author

hpkfft commented Oct 18, 2025

That doesn't seem to be the case. Using Python 3.11.2 and adding the following to ndarray_check:

    PyObject* tpfoe = 
      PyUnicode_InternFromString("tensorflow.python.framework.ops.EagerTensor");
    printf("name %s tpfoe\n", (name == tpfoe) ? "is" : "is not");
    int sc = strcmp(tp_name, "tensorflow.python.framework.ops.EagerTensor");
    printf("name %s tpfoe\n", (sc == 0) ? "==" : "!=");

I get

name is not tpfoe
name == tpfoe

when running

python3 -m pytest --capture=tee-sys tests/test_tensorflow.py

@hpkfft hpkfft force-pushed the dlpack-v1 branch 2 times, most recently from a7d3550 to f36bba4 Compare October 19, 2025 01:20
Owner

@wjakob wjakob left a comment

Hi @hpkfft,

sorry about the delay, here is some more minor feedback about this PR.

X &operator=(const X &) = delete;

-#define NB_MOD_STATE_SIZE 80
+#define NB_MOD_STATE_SIZE 96
Owner

Does the module state size have any impact on the nanobind ABI? (I am thinking no since it will be per-module).

Contributor Author

I am also thinking it's OK to increase the module state size, which is per-module, without bumping the nanobind ABI version. Anyway, by luck, the ABI number was just increased. :)

    o = module_::import_(pkg_name)
            .attr(static_pyobjects[pyobj_name::from_dlpack_str])(o);
#else
    PyObject* pkg_mod = module_import(pkg_name);
Owner

A related comment, as this PR is performance-focused: I was surprised in separate discussions here on the tracker that PyImport_ImportModule seems to be a serious performance hog. Would it make sense to cache the numpy module import for faster to-numpy conversion? (And similarly for other frameworks.)

Contributor Author

Yes, I saw that. Yes, I think the perf test create_array_np in the opening comment to this PR can be made about 100ns faster. My hypothesis is that making our own PyDict would not be faster than sys.modules. The speedup would be possible by using a C++ cache. I am thinking this could be done by adding an 8-entry module cache and using the per-module state for storage. The key could be void*, so it could be a C++ const char*. (Literal strings are placed in the .rodata section of the ELF shared library.) So, 128B of storage: 8 key pointers and 8 module PyObject*. Linear search and pointer-equality-compare is probably faster than hashing.

Have to think about locking. Multiple readers are OK. Maybe std::shared_mutex.

I think this should be a separate PR.
It seems like it can/should be a feature of detail::module_import and no changes to nb_ndarray.cpp would be needed.
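
For concreteness, here is a rough sketch of the cache I have in mind (not part of this PR; the storage and locking details are only illustrative, and subinterpreter/free-threading questions are ignored):

    #include <Python.h>
    #include <cstddef>
    #include <mutex>          // std::unique_lock
    #include <shared_mutex>   // std::shared_mutex, std::shared_lock

    struct module_cache {
        static constexpr std::size_t Size = 8;
        const char *keys[Size] = {};   // literal package names, compared by address
        PyObject   *mods[Size] = {};   // strong references to the imported modules
        std::shared_mutex mutex;

        // 'pkg_name' must be a string literal so that repeated callers pass
        // the same pointer and the scan can rely on pointer equality.
        PyObject *import(const char *pkg_name) {
            {   // fast path: shared lock, linear scan, pointer comparison
                std::shared_lock<std::shared_mutex> guard(mutex);
                for (std::size_t i = 0; i < Size && keys[i]; ++i)
                    if (keys[i] == pkg_name)
                        return mods[i];
            }
            PyObject *mod = PyImport_ImportModule(pkg_name);
            if (!mod)
                return nullptr;
            std::unique_lock<std::shared_mutex> guard(mutex);
            for (std::size_t i = 0; i < Size; ++i) {
                if (keys[i] == pkg_name) {   // another thread beat us to it
                    Py_DECREF(mod);
                    return mods[i];
                }
                if (!keys[i]) {              // first free slot
                    mods[i] = mod;
                    keys[i] = pkg_name;
                    return mod;
                }
            }
            return mod;                      // cache full: behave like an uncached import
        }
    };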

Given that Python 3.14 supports subinterpreters from Python code (previously, subinterpreters were only available through the C API), maybe you want nanobind to support them? The "What's new in Python 3.14" page says, "Regarding extension modules, work is in progress to update some PyPI projects, as well as tools like Cython, pybind11, nanobind, and PyO3." I mention this because maybe that should be done first so that the design of a module cache can accommodate the restrictions. Might have to add a pointer to type_data to replace static_pyobjects....

I haven't given it much thought, but probably my advice/preference would be to announce (in the changelog, etc.) that the upcoming release of nanobind will be the last one to support Python 3.8.
Then, after the release, remove 3.8 support.
Then support subinterpreters.
Then add a C++ cache for imports.

Maybe nanobind::ndarray_traits<T> can be removed (deprecated in #742).

@wjakob wjakob force-pushed the master branch 3 times, most recently from 4d71d9a to 238b695 Compare October 27, 2025 19:38
This commit adds support for the struct ``DLManagedTensorVersioned``
as defined by DLPack version 1.  It also adds the ndarray framework
``nb::array_api``, which returns an object that provides the buffer
interface and provides the two DLPack methods ``__dlpack__()`` and
``__dlpack_device__()``.
@wjakob
Owner

wjakob commented Oct 30, 2025

Great, thank you for incorporating the feedback. I will merge this as-is.

@wjakob wjakob merged commit babec16 into wjakob:master Oct 30, 2025
31 checks passed
@wjakob
Owner

wjakob commented Oct 30, 2025

My hypothesis is that making our own PyDict would not be faster than sys.modules. The speedup would be possible by using a C++ cache. I am thinking this could be done by adding an 8-entry module cache and using the per-module state for storage. The key could be void*, so it could be a C++ const char*. (Literal strings are placed in the .rodata section of the ELF shared library.) So, 128B of storage: 8 key pointers and 8 module PyObject*. Linear search and pointer-equality-compare is probably faster than hashing.

(I am also onboard with this lightweight caching design.)

Is there any strong reason to get rid of Python 3.8? If it's all the same, I would like to keep it around for another year or so.

@hpkfft
Contributor Author

hpkfft commented Oct 30, 2025

(I am also onboard with this lightweight caching design.)

Thanks for that initial feedback.

Is there any strong reason to get rid of Python 3.8?

Maybe not. Probably not. I suppose my suggestion to drop support comes from the feeling that I don't know what I don't know, combined with thinking about subinterpreters and Petr's PyModExport for Python 3.15.
