Speedup AdvancedSubtensor1 and AdvancedIncSubtensor1 in C backend #1346
Conversation
Force-pushed from c0045c3 to c211405
Codecov Report
❌ Your patch status has failed because the patch coverage (87.65%) is below the target coverage (100.00%). You can increase the patch coverage or adjust the target coverage.
Additional details and impacted files:
@@ Coverage Diff @@
## main #1346 +/- ##
=======================================
Coverage 82.01% 82.02%
=======================================
Files 207 207
Lines 49250 49302 +52
Branches 8734 8748 +14
=======================================
+ Hits 40394 40438 +44
- Misses 6692 6697 +5
- Partials 2164 2167 +3
Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.
Comments suppressed due to low confidence (3)
tests/tensor/test_subtensor.py:1277
- The generator expression used for 'inc_var_static_shape' should be converted to a tuple (e.g. tuple(...)) to ensure it produces a concrete shape tuple.
inc_var_static_shape = (1 if dim_length == 1 else None for dim_length in inc_shape)
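For reference, a minimal standalone sketch of the distinction the comment points at (the inc_shape value here is made up, not taken from the test): a generator expression is a lazy, single-use iterable, while tuple(...) gives a concrete shape tuple.

```python
inc_shape = (1, 5, 1)  # example value for illustration only

# A generator expression is lazy and can only be consumed once:
lazy_shape = (1 if dim_length == 1 else None for dim_length in inc_shape)

# Wrapping the same expression in tuple(...) yields a concrete static shape:
inc_var_static_shape = tuple(1 if dim_length == 1 else None for dim_length in inc_shape)
print(inc_var_static_shape)  # (1, None, 1)
```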
pytensor/link/pytorch/dispatch/subtensor.py:112
- The _check_runtime_broadcasting method expects four arguments (including the node), so the node argument should be passed to ensure correct runtime checking.
if isinstance(op, AdvancedIncSubtensor1): op._check_runtime_broadcasting(x, y, indices)
pytensor/link/jax/dispatch/subtensor.py:70
- Similar to the PyTorch dispatch, the node argument is missing when calling _check_runtime_broadcasting. Ensure that the node is passed as the first parameter after self.
if isinstance(op, AdvancedIncSubtensor1): op._check_runtime_broadcasting(x, y, indices)
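A hedged sketch of the fix both comments describe, assuming the four-argument signature they state (node as the first argument after self); the helper name below is hypothetical and this is not the PR's actual dispatch code:

```python
from pytensor.tensor.subtensor import AdvancedIncSubtensor1


def _incsubtensor_dispatch_sketch(op, node, x, y, indices):
    # Hypothetical helper: pass `node` along with the runtime values,
    # matching the signature described in the review comments.
    if isinstance(op, AdvancedIncSubtensor1):
        op._check_runtime_broadcasting(node, x, y, indices)
```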
Useless AI strikes again!
It was actually correct. I'm surprised no tests failed; I guess we are not really covering the dispatch of these in JAX/PyTorch because the Op is only introduced during rewrites.
Force-pushed from c211405 to 38f9036
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
pytensor/tensor/subtensor.py:2247
- The condition validating index values in _idx_may_be_invalid is non-obvious; please verify that it correctly handles negative indices and reflects the intended bounds check.
return not (min_idx >= 0 or min_idx >= -shape0) and (max_idx < 0 or max_idx < shape0)
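For reference, a minimal standalone sketch of the bounds rule such a check is meant to encode (not the PR's exact code): an index i is valid for a dimension of length n iff -n <= i < n, so the indices may be out of bounds when min_idx < -n or max_idx >= n.

```python
def indices_may_be_invalid(min_idx: int, max_idx: int, shape0: int) -> bool:
    # Valid index range for a dimension of length shape0 is [-shape0, shape0 - 1].
    return min_idx < -shape0 or max_idx >= shape0


assert not indices_may_be_invalid(min_idx=-3, max_idx=2, shape0=3)  # all in bounds
assert indices_may_be_invalid(min_idx=-4, max_idx=2, shape0=3)      # -4 is out of bounds
assert indices_may_be_invalid(min_idx=0, max_idx=3, shape0=3)       # 3 is out of bounds
```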
Force-pushed from eef5a69 to 83506d9
There were some wrongly promised types in LUFactor/Pivots that showed up in this PR @jessegrabowski (first commit).
The requested green check, with some questions/comments.
As usual, my ability to contribute on these C-code PRs is limited, but I tried my best.
@@ -83,7 +83,7 @@ def cholesky(a):
 @numba_funcify.register(PivotToPermutations)
 def pivot_to_permutation(op, node, **kwargs):
     inverse = op.inverse
-    dtype = node.inputs[0].dtype
+    dtype = node.outputs[0].dtype
What a dumb mistake, I wonder which idiot wrote this code
pytensor/tensor/basic.py (outdated)
@@ -1659,6 +1659,11 @@ def c_code(self, node, name, inp, out, sub):
         o_static_shape = node.outputs[0].type.shape
         v_ndim = len(v_static_shape)
         o_ndim = len(o_static_shape)
+        is_zero = (
+            all(node.inputs[0].type.broadcastable)
Will this catch a scalar zero?
Yes, all(empty) is True; a scalar's broadcastable pattern is the empty tuple.
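A quick standalone check of that claim (pt.scalar is used here just for illustration):

```python
import pytensor.tensor as pt

x = pt.scalar("x")                # 0-d tensor
print(x.type.broadcastable)       # () -- a scalar has an empty broadcastable pattern
print(all(x.type.broadcastable))  # True, since all() of an empty iterable is True
```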
@@ -604,7 +604,7 @@ def make_node(self, pivots):

     def perform(self, node, inputs, outputs):
         [pivots] = inputs
-        p_inv = np.arange(len(pivots), dtype=pivots.dtype)
+        p_inv = np.arange(len(pivots), dtype="int64")
We always have to be careful about single vs. double precision for floats (use floatX), but not for ints -- why is that?
float32 was very much geared towards GPU concerns; integers are not as problematic [citation needed]
@@ -639,7 +639,7 @@ def make_node(self, A):
         )

         LU = matrix(shape=A.type.shape, dtype=A.type.dtype)
-        pivots = vector(shape=(A.type.shape[0],), dtype="int64")
+        pivots = vector(shape=(A.type.shape[0],), dtype="int32")
Like why 64 above but 32 here?
This is what the scipy function returns. The reason I went with int64 above is that np.argsort returns that regardless of the input, so I kept the outputs the same type regardless of return_inverse (or whatever the property on that other Op is).
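Both dtype claims are easy to check standalone (assuming a 64-bit platform and a standard 32-bit-integer LAPACK build):

```python
import numpy as np
from scipy.linalg import lu_factor

# np.argsort returns the platform index type regardless of the input dtype:
pivots = np.array([2, 0, 1], dtype=np.int32)
print(np.argsort(pivots).dtype)  # int64 on a 64-bit platform

# scipy's LU factorization reports pivots as 32-bit integers with standard LAPACK:
lu, piv = lu_factor(np.random.default_rng(0).random((3, 3)))
print(piv.dtype)  # int32
```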
Force-pushed from 83506d9 to 78f1cf8
Also add checks for runtime broadcast
Force-pushed from 78f1cf8 to a0abe86
These are some of the biggest drags in the C backend. This PR makes some tweaks that increase performance substantially.
AdvancedSubtensor1 benchmark (before/after plots)
AdvancedIncSubtensor1 benchmark (before/after plots)
I added a long-missing check for runtime broadcasting to the Python/C/PyTorch implementations (Numba would require a bit more code changes), which moves towards #1348.
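As a rough sketch of the kind of case such a check targets (a hypothetical example, not code from the PR): incrementing x[idx] with a values vector whose runtime length is 1 would previously broadcast silently; with a runtime-broadcast check it is expected to raise instead.

```python
import numpy as np
import pytensor
import pytensor.tensor as pt

x = pt.vector("x")
y = pt.vector("y")          # length only known at runtime
idx = pt.lvector("idx")

f = pytensor.function([x, y, idx], pt.inc_subtensor(x[idx], y))

f(np.zeros(5), np.ones(3), np.array([0, 2, 4], dtype="int64"))    # shapes match: fine
# f(np.zeros(5), np.ones(1), np.array([0, 2, 4], dtype="int64"))  # length-1 y: expected to error now
```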
Provides a restricted case of #1325
Alloc of zeros is also about twice as fast now, which is benchmarked indirectly in the AdvancedIncSubtensor1 tests.
📚 Documentation preview 📚: https://pytensor--1346.org.readthedocs.build/en/1346/