
Conversation

@b-pass
Collaborator

@b-pass b-pass commented Dec 26, 2025

Description

Change both the Python->C++ and the C++->Python function calls to use the "vectorcall" calling protocol.

This makes function calls faster, at the minor cost of having to handle the arguments slightly differently.
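For context, here is a minimal sketch (not this PR's code; the helper names are invented) contrasting the classic tuple-based protocol with vectorcall in the CPython C API:

```cpp
#include <Python.h>

// Classic protocol: positional arguments must be packed into a freshly
// allocated tuple before the call.
PyObject *call_classic(PyObject *callable, PyObject *arg) {
    PyObject *args = PyTuple_Pack(1, arg);  // allocates a temporary tuple
    if (args == nullptr) {
        return nullptr;
    }
    PyObject *result = PyObject_Call(callable, args, nullptr);
    Py_DECREF(args);
    return result;
}

// Vectorcall protocol (public API since CPython 3.9): arguments are passed as
// a plain C array of PyObject*, so no intermediate tuple is allocated.
PyObject *call_vectorcall(PyObject *callable, PyObject *arg) {
    PyObject *argv[] = {arg};
    return PyObject_Vectorcall(callable, argv, 1, /*kwnames=*/nullptr);
}
```

Keyword arguments travel in the same flat array, with their names in a separate kwnames tuple, which is the "slightly different" argument handling mentioned above.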

Suggested changelog entry:

  • Improved performance of function calls between Python and C++ by switching to the "vectorcall" calling protocol.

@b-pass b-pass marked this pull request as draft December 26, 2025 13:25
@b-pass
Collaborator Author

b-pass commented Dec 26, 2025

Small benchmark results for short function calls: this PR provides a modest improvement over current master... perf

@b-pass b-pass requested a review from oremanj December 26, 2025 16:52
@b-pass b-pass marked this pull request as ready for review December 26, 2025 16:52
@rwgk
Collaborator

rwgk commented Dec 27, 2025

@b-pass I merged master, which had no conflicts.

My idea was that maybe Cursor can figure out the root cause for these two CI failures:

2025-12-26T23:29:10.4917170Z  E       ModuleNotFoundError: No module named 'mod_per_interpreter_gil'
...
2025-12-26T23:29:12.4779770Z  E       ModuleNotFoundError: No module named 'mod_shared_interpreter_gil'

I'll wait for this CI run to finish, to see if these errors appear again.

When testing locally with Python 3.14.2, default and freethreaded, I saw this error (with both builds):

>       m.array_reshape2(b)
E       ValueError: array_reshape2: input array total size is not a squared integer

EDIT: Please see #5948 (comment) for the resolution.

I'm attaching the full build & test logs, JIC.

When I switched back to master, everything else being equal, this error didn't appear. So it's definitely somehow connected to this PR.

pybind11_gcc_v3.14_df793163d58_default_tests_log_2025-12-26+193154.txt

pybind11_gcc_v3.14_df793163d58_freethreaded_log_2025-12-26+193608.txt

The `mod_per_interpreter_gil`, `mod_shared_interpreter_gil`, and
`mod_per_interpreter_gil_with_singleton` modules were being built
but not installed into the wheel when using scikit-build-core
(SKBUILD=true). This caused iOS (and potentially Android) CIBW
tests to fail with ModuleNotFoundError.

Root cause analysis:
- The main test targets have install() commands (line 531)
- The PYBIND11_MULTIPLE_INTERPRETERS_TEST_MODULES were missing
  equivalent install() commands
- For regular CMake builds, this wasn't a problem because
  LIBRARY_OUTPUT_DIRECTORY places the modules next to pybind11_tests
- For wheel builds, only targets with explicit install() commands
  are included in the wheel

This issue was latent until commit fee2527 changed the test imports
from `pytest.importorskip()` (graceful skip) to direct `import`
statements (hard failure), which exposed the missing modules.

Failing tests:
- test_multiple_interpreters.py::test_independent_subinterpreters
- test_multiple_interpreters.py::test_dependent_subinterpreters

Error: ModuleNotFoundError: No module named 'mod_per_interpreter_gil'
@rwgk
Collaborator

rwgk commented Dec 27, 2025

Quick update: I think Cursor found the root cause for the CIBW / iOS failures, and a fix.

I'm still looking into why we're not seeing the m.array_reshape2(b) errors here.

Add numpy==2.4.0 requirement for Python 3.14 (both default and
free-threaded builds). NumPy 2.4.0 is the first version to provide
official PyPI wheels for Python 3.14:

- numpy-2.4.0-cp314-cp314-manylinux_2_27_x86_64...whl (default)
- numpy-2.4.0-cp314-cp314t-manylinux_2_27_x86_64...whl (free-threaded)

Previously, CI was skipping all numpy-dependent tests for Python 3.14
because PIP_ONLY_BINARY was set and no wheels were available:

  SKIPPED [...] test_numpy_array.py:8: could not import 'numpy':
  No module named 'numpy'

With this change, the full numpy test suite will run on Python 3.14,
providing better test coverage for the newest Python version.

Note: Using exact pin (==2.4.0) rather than compatible release (~=2.4.0)
to ensure reproducible CI results with the first known-working version.
@rwgk rwgk requested a review from henryiii as a code owner December 27, 2025 05:00
@rwgk
Collaborator

rwgk commented Dec 27, 2025

I pushed commit ca72d6c here because we only have very limited GHA resources. Depending on how it goes, it might be best to merge that commit in a separate PR. Let's see.

Ideally commit 6ed6d5a should also be in a separate PR, but same issue with the GHA resources.

@rwgk
Collaborator

rwgk commented Dec 27, 2025

NumPy 2.3.5 vs 2.4.0 Observations

The Issue

When testing this PR locally with Python 3.14 + NumPy 2.3.5, I observed a test failure:

>       m.array_reshape2(b)
E       ValueError: array_reshape2: input array total size is not a squared integer

test_numpy_array.py::test_array_resize FAILED

This occurred when calling array_reshape2 on a transposed view of a resized 64-element array (which should have worked since 64 = 8²).

What We Know For Sure

  1. The error appeared with NumPy 2.3.5 (released 2025-11-16) on Python 3.14
  2. The error is gone with NumPy 2.4.0 (released 2025-12-20) on Python 3.14
  3. This is NOT a bug in the vectorcall PR — the same test passes on master with NumPy 2.3.5, but that's because CI was skipping numpy tests for Python 3.14 entirely (no numpy entry in tests/requirements.txt for 3.14)
  4. NumPy 2.4.0 wheels for Python 3.14 are now available on PyPI (both default cp314 and free-threaded cp314t)

What's Undetermined

The exact bug fix in NumPy between 2.3.5 and 2.4.0 that resolved this issue. I searched the NumPy 2.4.0 release notes but couldn't find a specific fix that clearly matches our observation. The release notes mention "many expired deprecations and bug fixes" without listing all of them individually.

Resolution

Added numpy==2.4.0 to tests/requirements.txt for Python 3.14+ (commit ca72d6c). This:

  • Enables numpy tests on Python 3.14 in CI (previously skipped)
  • Avoids the NumPy 2.3.5 bug

Add `-v` to the pytest command in tests/pyproject.toml to help
diagnose hanging tests in CIBW jobs (particularly iOS).

This will show each test name as it runs, making it easier to
identify which specific test is hanging.
@Skylion007
Collaborator

Random design question, why does the collector store PyObject pointers directly vs py::handle?

It's a good bit faster at the cost of this one scary reinterpret_cast.
With 6 inline elements, the struct is 144 bytes (not 128 bytes as the comment said).
@b-pass
Collaborator Author

b-pass commented Dec 28, 2025

> The push_back impl isn't quite right.

Fixed.

> Needs a static assert in the heapvector? And maybe iarray?

Not sure what you mean... elaborate?

> Random design question, why does the collector store PyObject pointers directly vs py::handle?

It was to avoid having to reinterpret_cast the C++ type to the C type in the vectorcall call... But I've consolidated both the PyObject* vector and the temp object vector into one and cast it in this latest version.
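For illustration only, here is a rough sketch (not the PR's code; `call_with_object_array` and its surroundings are made up) of what such a consolidated approach can look like, with the layout static_assert asked about above acting as the guard for the cast:

```cpp
#include <pybind11/pybind11.h>

#include <vector>

namespace py = pybind11;

// The cast below is only defensible if py::object is nothing more than a single
// PyObject* under the hood; this static_assert is the kind of guard discussed above.
static_assert(sizeof(py::object) == sizeof(PyObject *),
              "py::object must be pointer-sized for the vectorcall cast");

PyObject *call_with_object_array(py::handle callable, std::vector<py::object> &args) {
    // Reuse the py::object storage directly as the vectorcall argument array,
    // avoiding a second vector of raw PyObject* (public API since CPython 3.9).
    auto *argv = reinterpret_cast<PyObject *const *>(args.data());
    return PyObject_Vectorcall(callable.ptr(), argv, args.size(), /*kwnames=*/nullptr);
}
```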

@Skylion007
Collaborator

> Random design question, why does the collector store PyObject pointers directly vs py::handle?
>
> It was to avoid having to reinterpret_cast the C++ type to the C type in the vectorcall call... But I've consolidated both the PyObject* vector and the temp object vector into one and cast it in this latest version.

Avoiding the reinterpret_cast is a valid reason, just curious.

Collaborator

@Skylion007 Skylion007 left a comment


I'm a bit confused why a bunch of list and tuple objects are being stored as generic objects. We don't do anything crazy to them, and they all inherit from py::object. The only difference is maybe the implicit boolean checks, but those can easily be worked around by unwrapping the ptr at any time and checking against nullptr.

if (PyList_Append(names_list.ptr(), name.release().ptr()) < 0) {
    throw error_already_set();
}
m_args.emplace_back(a.value);
Collaborator


Suggested change:
- m_args.emplace_back(a.value);
+ m_args.emplace_back(std::move(a.value));

@b-pass
Collaborator Author

b-pass commented Dec 28, 2025

> I'm a bit confused why a bunch of list and tuple objects are being stored as generic objects. We don't do anything crazy to them, and they all inherit from py::object. The only difference is maybe the implicit boolean checks, but those can easily be worked around by unwrapping the ptr at any time and checking against nullptr.

tuple and list don't have a constructor that allows them to be constructed as nullptr, whereas object's default constructor is a nullptr. Typing these properly would require an extra allocation. IMO we can't afford the extra allocation here (in probably the hottest code path in all of pybind11). I also don't think it would be a good idea (or at least it needs a lot of debate) to change pybind11's tuple/list for this case.

It's just a bit inconvenient in this class ... I can certainly add some explanation in a comment about it.

EDIT: Turns out there is a way: `auto null_list = reinterpret_steal<list>(handle());`
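For illustration, a tiny sketch of that trick (not the PR's code; `null_list_demo` is a made-up name):

```cpp
#include <pybind11/pybind11.h>

namespace py = pybind11;

void null_list_demo() {
    // py::list() would allocate an empty Python list; stealing from an empty
    // handle instead yields a typed but null list without touching the interpreter.
    auto null_list = py::reinterpret_steal<py::list>(py::handle());
    if (!null_list) {
        // Lazily create the real list only when it is first needed.
        null_list = py::list();
    }
    null_list.append(42);
}
```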

b-pass and others added 4 commits December 28, 2025 18:04
They can be null if you "steal" a null handle.
…icit ownership

Introduce `ref_small_vector` to manage PyObject* references in `unpacking_collector`,
replacing the previous `small_vector<object>` approach.

Primary goals:

1. **Maintainability**: The previous implementation relied on
   `sizeof(object) == sizeof(PyObject*)` and used a reinterpret_cast to
   pass the object array to vectorcall. This coupling to py::object's
   internal layout could break if someone adds a debug field or other
   member to py::handle/py::object in the future.

2. **Readability**: The new `push_back_steal()` vs `push_back_borrow()`
   API makes reference counting intent explicit at each call site,
   rather than relying on implicit py::object semantics.

3. **Intuitive code**: Storing `PyObject*` directly and passing it to
   `_PyObject_Vectorcall` without casts is straightforward and matches
   what the C API expects. No "scary" reinterpret_cast needed.

Additional benefits:
- `PyObject*` is trivially copyable, simplifying vector operations
- Batch decref in destructor (tight loop vs N individual object destructors)
- Self-documenting ownership semantics

Design consideration: We considered folding the ref-counting functionality
directly into `small_vector` via template specialization for `PyObject*`.
We decided against this because:
- It would give `small_vector<PyObject*, N>` a different interface than the
  generic `small_vector<T, N>` (steal/borrow vs push_back)
- Someone might want a non-ref-counting `small_vector<PyObject*, N>`
- The specialization behavior could surprise users expecting uniform semantics

A separate `ref_small_vector` type makes the ref-counting behavior explicit
and self-documenting, while keeping `small_vector` generic and predictable.
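To make the steal/borrow API concrete, here is a condensed, hypothetical sketch (not the actual `ref_small_vector`; the inline small-buffer storage is omitted for brevity):

```cpp
#include <Python.h>

#include <cstddef>
#include <vector>

// Condensed illustration of the steal/borrow API described above; the real
// ref_small_vector also has inline (small-buffer) storage, which is omitted here.
class ref_vector_sketch {
public:
    ref_vector_sketch() = default;
    ref_vector_sketch(const ref_vector_sketch &) = delete;
    ref_vector_sketch &operator=(const ref_vector_sketch &) = delete;

    // Take ownership of an existing reference: no incref now, decref on destruction.
    void push_back_steal(PyObject *p) { m_ptrs.push_back(p); }

    // Keep a borrowed (non-null) reference alive: incref now, decref on destruction.
    void push_back_borrow(PyObject *p) {
        Py_INCREF(p);
        m_ptrs.push_back(p);
    }

    PyObject *const *data() const { return m_ptrs.data(); }
    std::size_t size() const { return m_ptrs.size(); }

    // Batch decref in one tight loop instead of N individual object destructors.
    ~ref_vector_sketch() {
        for (PyObject *p : m_ptrs) {
            Py_XDECREF(p);  // X variant tolerates any nullptr entries
        }
    }

private:
    std::vector<PyObject *> m_ptrs;  // trivially copyable elements, no casts needed
};
```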
@rwgk
Collaborator

rwgk commented Dec 29, 2025

@b-pass commit d828e7d is meant to be a suggestion, although already fully developed. Please let me know what you think.

@b-pass
Collaborator Author

b-pass commented Dec 29, 2025

> Please let me know what you think.

I like it. It's less code than I thought it would be.

if (m_names) {
    nargs -= m_names.size();
}
PyObject *result = _PyObject_Vectorcall(
Collaborator


Slight nit: this could have been a py::handle.

/// destructor, and avoids the need for reinterpret_cast when passing to vectorcall.
template <std::size_t InlineSize>
class ref_small_vector {
public:
Collaborator


Ugh, I don't like this, but I don't see a better way of doing it due to layout compatibility issues with py::object. Every other way of doing this requires a copy.

Collaborator

@Skylion007 Skylion007 left a comment


Not sure if you thought of this, but if you make push_back only take in objects, you can handle the reinterpret_steal and reinterpret_borrow using only move and release mechanics.

}

/// Add a pointer, taking ownership (no incref, will decref on destruction)
void push_back_steal(PyObject *p) { m_ptrs.push_back(p); }
Collaborator


Suggested change:
- void push_back_steal(PyObject *p) { m_ptrs.push_back(p); }
+ void push_back_steal(object &&o) { m_ptrs.push_back(o.release().ptr()); }

detail::make_caster<T>::cast(std::forward<T>(x), policy, {}));
if (!o) {
void process(list & /*names_list*/, T &&x) {
handle h = detail::make_caster<T>::cast(std::forward<T>(x), policy, {});
Collaborator


Converting this to a handle here makes me a bit uncomfortable. You could keep it as an object and just use the move-from ctor; the only downside is you'll have one extra call to Py_XDECREF on a nullptr.

@Skylion007
Collaborator

I like how you have it done now, I just wonder if it could be cleaned up any more with .release().ptr() semantics.

}
m_kwargs[a.name] = std::move(a.value);
names_list.append(std::move(name));
m_args.push_back_borrow(a.value.ptr());
Collaborator

@Skylion007 Skylion007 Dec 30, 2025


Yeah, both of these could just be a push_back(&&o) with release-ptr syntax, and then you don't have to duplicate the logic; even though the implementation stores PyObject*, the input is still a py::object. The one downside is the extra xdecref call on the nullptr left in the moved-from py::object, but I doubt that null check will be a bottleneck.

Suggested change:
- m_args.push_back_borrow(a.value.ptr());
+ m_args.push_back_steal(reinterpret_borrow<object>(a.value));

redblackbst added a commit to redblackbst/pybind11 that referenced this pull request Jan 5, 2026
@rwgk
Collaborator

rwgk commented Jan 6, 2026

@b-pass I'm in favor of merging based on @Skylion007 and my reviews.

We could then work on the v3.0.2 release. (I could jump-start that.)

@Skylion007
Collaborator

Might be enough for a minor version bump to 3.1, no?

@b-pass b-pass merged commit 10f8708 into pybind:master Jan 6, 2026
151 of 152 checks passed
@github-actions github-actions bot added the needs changelog Possibly needs a changelog entry label Jan 6, 2026
@rwgk
Collaborator

rwgk commented Jan 6, 2026

> Might be enough for a minor version bump to 3.1, no?

Not sure, maybe let's look at the changelog PR first?

There are two PRs in a holding pattern; for those we need to bump the minor version for sure (IMO):
