Add fast_type_map, use it authoritatively for local types and as a hint for global types #5842
base: master
Conversation
// REVIEW: do we need to add a fancy hash for pointers or is the
// possible identity hash function from the standard library (e.g.,
// libstdc++) sufficient?
note that the Linux box I used for benchmarking has libstdc++.
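For reference, if the identity hash ever turns out to cluster badly, a multiplicative mixing hash for pointers is a small drop-in. The sketch below is hypothetical, not code from this PR; `ptr_hash` and the golden-ratio constant are illustrative choices.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical alternative to libstdc++'s identity std::hash<const void*>.
// Heap pointers are aligned, so their low bits are always zero; a
// Fibonacci-style multiplicative mix (odd 64-bit golden-ratio constant)
// spreads entropy across the word. Because the constant is odd, the
// mapping is a bijection on 64-bit values, so distinct pointers always
// hash to distinct values.
struct ptr_hash {
    std::size_t operator()(const void *p) const noexcept {
        auto v = reinterpret_cast<std::uintptr_t>(p);
        return static_cast<std::size_t>(v * UINT64_C(0x9E3779B97F4A7C15));
    }
};
```

Note that libstdc++'s `std::unordered_map` uses prime bucket counts, which already defuses most identity-hash clustering; this matters more for power-of-two open-addressing tables.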
Force-pushed 7e11f11 to cc0b053. Commit message:

nanobind has a similar two-level lookup strategy, added and explained by wjakob/nanobind@b515b1f. In this PR I've ported this approach to pybind11.

To avoid an ABI break, I've kept the fast maps in the `local_internals`. I think this should be safe because any particular module should see its `local_internals` reset at least as often as the global `internals`, and misses in the fast "hint" map for global types fall back to the global `internals`.

Performance seems to have improved. Using my patched fork of pybind11_benchmark (https://github.com/swolchok/pybind11_benchmark/tree/benchmark-updates, specifically commit hash b6613d12607104d547b1c10a8145d1b3e9937266), I run bench.py and observe the MyInt case. Each time, I do 3 runs and report all 3:

master, Mac: 75.9, 76.9, 75.3 nsec/loop
this PR, Mac: 73.8, 73.8, 73.6 nsec/loop
master, Linux box: 188, 187, 188 nsec/loop
this PR, Linux box: 164, 165, 164 nsec/loop

Note that the "real" percentage improvement is larger than implied by the above because master does not yet include pybind#5824.
tests/test_embed/external_module.cpp
Outdated
py::detail::get_local_internals_pp_manager().unref();

// now we unref the static global singleton internals
py::detail::get_local_internals_pp_manager().unref();
is this intended to unref the non-local internals? looks redundant as written
whoops, butchered my copy/paste/modify job
inline detail::type_info *get_local_type_info(const std::type_info &tp,
                                              const local_internals &local_internals) {
    const auto &locals = local_internals.registered_types_cpp;
    auto it = locals.find(tp);
Not changed in this PR but I think this function needs a with_internals lock too. Multiple threads could run bindings from the same DSO simultaneously. local_internals doesn't have its own lock and is generally protected by the global internals lock.
You could optimize by making the versions with a local_internals parameter assume internals is already locked. Then get_type_info() could query both internals under a single lock instead of acquiring, releasing, and re-acquiring.
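The single-lock idea can be sketched generically (all names here, such as `guarded_registry`, are invented for illustration; pybind11's real `with_internals` machinery is different):

```cpp
#include <mutex>
#include <typeindex>
#include <unordered_map>

// Hypothetical sketch of the suggestion: take the (global) internals lock
// once and probe both the local and the global type map under it, rather
// than acquiring, releasing, and re-acquiring the lock per lookup.
struct type_info_stub {};

struct guarded_registry {
    std::mutex internals_mutex; // stands in for the global internals lock
    std::unordered_map<std::type_index, type_info_stub *> local_types;
    std::unordered_map<std::type_index, type_info_stub *> global_types;

    type_info_stub *find(const std::type_index &tp) {
        std::lock_guard<std::mutex> lock(internals_mutex);
        // Both lookups happen under this single lock acquisition.
        if (auto it = local_types.find(tp); it != local_types.end()) {
            return it->second;
        }
        if (auto it = global_types.find(tp); it != global_types.end()) {
            return it->second;
        }
        return nullptr;
    }
};
```

The design point is that `local_internals` has no lock of its own, so any function taking a `local_internals` parameter can simply document "caller holds the internals lock" and skip re-locking.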
This seems clearly out of scope for this PR.
(Also, doesn't the GIL preclude this sort of thing when present?)
> This seems clearly out of scope for this PR.

I figured since you were reworking these functions anyway, I would point out an easy-looking opportunity to improve their thread-safety while you're here. I agree the PR doesn't make the current behavior worse, and I wasn't trying to enforce a wider scope.

> (Also, doesn't the GIL preclude this sort of thing when present?)

Yes, but pybind11 supports free-threading mode.
> local_internals doesn't have its own lock and is generally protected by the global internals lock.

function_record_ptr_from_PyObject also accesses local_internals and isn't called under the global internals lock.
include/pybind11/detail/class.h
Outdated
} else {
    internals.registered_types_cpp.erase(tindex);
    local_internals.global_registered_types_cpp_fast.erase(tinfo->cpptype);
This will only erase the fast map entry for the DSO where the type was bound. Other DSOs will preserve their cached entries in the fast map, which will now dangle.

Solving that with an ABI break is pretty easy: make a linked list of all the fast maps and have modifications to the slow map invalidate the corresponding entries in the fast maps. (Or, since you're taking an ABI break, just put the fast map in the global internals.)

Solving it without an ABI break will be difficult. The best way I can think of is to accompany each entry you add to the fast map with a keep_alive that will remove it when the targeted type object is being destroyed. That's pretty heavyweight, though.
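Under a hypothetical ABI break, the cross-DSO invalidation scheme could look roughly like this (all names are invented for illustration, and a vector stands in for the linked list of per-DSO fast maps):

```cpp
#include <typeindex>
#include <unordered_map>
#include <vector>

// Sketch of the idea: the authoritative slow map knows about every
// per-DSO fast map, so erasing a type can evict the stale cached entry
// from each of them instead of leaving dangling pointers behind.
struct fast_map {
    std::unordered_map<const std::type_info *, void *> cache;
};

struct slow_map {
    std::unordered_map<std::type_index, void *> types;
    std::vector<fast_map *> all_fast_maps; // one per DSO, registered at load

    void erase_type(const std::type_info &ti) {
        types.erase(std::type_index(ti));
        // Invalidate every DSO's cached entry for this type.
        for (fast_map *fm : all_fast_maps) {
            fm->cache.erase(&ti);
        }
    }
};
```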
Thanks for finding the hole in this strategy! I wonder if there's a way to accommodate developing things in anticipation of a future ABI break without just doubling CI costs.
FYI, wjakob/nanobind#1140 incidentally contains some rework of the nanobind fast type map in case you want to incorporate ideas from that. One fast map per thread (to avoid lock contention) + caching negative lookups. Note that caching negative lookups requires also remembering which type_info pointers you've seen for a given C++ type, so you can invalidate the cache when a new type is bound. (AFAICT, nothing you're doing here precludes getting fancier in that direction later.)
Description
nanobind has a similar two-level lookup strategy, added and explained by
wjakob/nanobind@b515b1f. In brief, fast_type_map should be faster because std::type_index identity is based on the type's name, whereas fast_type_map uses pointer equality.
In this PR I've ported this approach to pybind11.
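The two-level strategy can be illustrated with a minimal, self-contained sketch (this mirrors the idea only; `two_level_registry` and its members are not pybind11's actual data structures):

```cpp
#include <typeindex>
#include <unordered_map>

// Illustration of the two-level lookup: the fast map compares raw
// std::type_info pointers (a single pointer compare); on a miss we fall
// back to the std::type_index-keyed map, whose key comparison can degrade
// to a string compare on the mangled name when the same type's type_info
// object is duplicated across DSOs.
struct two_level_registry {
    std::unordered_map<const std::type_info *, void *> fast; // per-DSO hint
    std::unordered_map<std::type_index, void *> slow;        // authoritative

    void *lookup(const std::type_info &ti) {
        if (auto it = fast.find(&ti); it != fast.end()) {
            return it->second; // hit: no name comparison happened
        }
        auto it = slow.find(std::type_index(ti));
        if (it == slow.end()) {
            return nullptr;
        }
        fast.emplace(&ti, it->second); // cache the result for next time
        return it->second;
    }
};
```

For local types the fast map can be authoritative (the DSO bound them itself), while for global types it can only be a hint, since other DSOs may mutate the global registry.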
Performance seems to have improved. Using my patched fork of pybind11_benchmark
(https://github.com/swolchok/pybind11_benchmark/tree/benchmark-updates, specifically commit hash b6613d12607104d547b1c10a8145d1b3e9937266), I run bench.py and observe the MyInt case. Each time, I do 3 runs and just report all 3.
master, Mac: 75.9, 76.9, 75.3 nsec/loop
this PR, Mac: 73.8, 73.8, 73.6 nsec/loop
master, Linux box: 188, 187, 188 nsec/loop
this PR, Linux box: 164, 165, 164 nsec/loop
Note that the "real" percentage improvement is larger than implied by the above because master does not yet include #5824.
Suggested changelog entry: