Faster Sherman Morrison (dger -> dgemm), 15% to 60% speedup on ARM, 5% to 25% speedup on x86 by jberg5 · Pull Request #10 · nanograv/fastshermanmorrison

jberg5 · 2026-04-26T16:28:36Z

1.15 to 1.6x speedup for _solve_2D2 on ARM, 1.05 to 1.25x for x86.

Repeating the same factorization from nanograv/enterprise#445.

Same math as before which allows us to switch from BLAS level 2 (dger_) to level 3 (dgemm_) (copy/pasted from my other PR):

$$ \sum_{i=1}^{E} \beta_i ; \mathbf{z}_i , \mathbf{x}_i^T = \sum_{i=1}^{E} \left(\sqrt{\beta_i};\mathbf{z}_i\right)\left(\sqrt{\beta_i};\mathbf{x}_i\right)^T = V^T W $$

where we define:

$$V \in \mathbb{R}^{E \times N_\text{basis}}, \quad V_{i,:} = \sqrt{\beta_i};\mathbf{z}_i$$

$$W \in \mathbb{R}^{E \times N_\text{basis}}, \quad W_{i,:} = \sqrt{\beta_i};\mathbf{x}_i$$

Same flops, but fewer writes to ZNZ, and switching from a bandwidth constrained BLAS level 2 kernel to a compute constrained level 3 kernel gives us some solid speedups on ARM (dgemm hits the Apple Matrix Co-Processor; dger does not). x86 picture is more complicated - some hardware configurations show 25% speedups, particularly newer Xeon chips with MKL installed, whereas other configurations show more modest gains.

Benchmarked memory usage and I see a roughly 12mb RSS increase at NG15 scale. No leaks.

ARM Results

config	N	m	ep	n_J	platform	before_ms	after_ms	speedup
B1855+09	4000	150	17	236	apple_m3_accelerate	1.835	1.128	1.63x
B1937+21	10000	180	27	371	apple_m3_accelerate	4.934	3.241	1.52x

x86 Results

config	N	m	ep	n_J	platform	before_ms	after_ms	speedup
B1855+09	4000	150	17	236	cascade_ob	4.119	3.313	1.24x
B1855+09	4000	150	17	236	cascade_mkl	4.655	3.925	1.19x
B1855+09	4000	150	17	236	emerald_ob	12.200	11.698	1.04x
B1855+09	4000	150	17	236	emerald_mkl	3.669	2.907	1.26x
B1855+09	4000	150	17	236	zen3_ob	5.529	4.823	1.15x
B1937+21	10000	180	27	371	cascade_ob	13.031	11.568	1.13x
B1937+21	10000	180	27	371	cascade_mkl	18.531	16.737	1.11x
B1937+21	10000	180	27	371	emerald_ob	41.676	40.523	1.03x
B1937+21	10000	180	27	371	emerald_mkl	11.942	10.240	1.17x
B1937+21	10000	180	27	371	zen3_ob	17.876	16.039	1.11x

Hardware:

cascade_ob = GCP c2-standard-8 (Intel Xeon Cascade Lake @ 3.10 GHz) + OpenBLAS
cascade_mkl = GCP c2-standard-8 + MKL
emerald_ob = GCP c4-standard-8 (Intel Xeon Platinum 8581C/Emerald) + OpenBLAS
emerald_mkl = GCP c4-standard-8 + MKL
zen3_ob = GCP c2d-standard-8 (AMD EPYC 7B13 / Zen 3) + OpenBLAS

All 8 existing tests pass.

jberg5 · 2026-04-26T17:23:18Z

As a followup, I noticed that https://github.com/nanograv/enterprise/blob/6335ff7dc4d05495912bf200a465d4ebac0c1d41/enterprise/signals/signal_base.py#L994 means we're always dealing with X==Z, so we could use symmetric dsyrk for an even bigger speedup. Have benchmarked and we're looking at an additional 50-80% faster on x86 platforms.

I'll leave that as a separate PR because it basically stacks on this one.

codecov-commenter · 2026-04-27T06:46:28Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 91.76%. Comparing base (7845cea) to head (9e83ebe).

Additional details and impacted files

@@           Coverage Diff           @@
##             main      #10   +/-   ##
=======================================
  Coverage   91.76%   91.76%           
=======================================
  Files           2        2           
  Lines         255      255           
=======================================
  Hits          234      234           
  Misses         21       21

Flag	Coverage Δ
unittests	`91.76% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7845cea...9e83ebe. Read the comment docs.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

vhaasteren · 2026-04-27T06:49:24Z

Hi @jberg5,

It is really great you are making these tweaks, thank you!

The other day we found that 2D ZNZ calculation was actually faster using the new non-cholesky matrix square root. Note how the new fastshermanmorrison has _cholsqrtsolve_D2 and _sqrtsolve_D2. We never went that route because it's slower, but with the updated _sqrtsolve_D2 going through here the ZNZ can then be constructed with a single dgemm call.

Could you check?

Thanks!!

jberg5 · 2026-04-30T13:20:04Z

Ooh nice one @vhaasteren - I didn't see your recent changes, but it looks like your approach would be a lot faster than what I have in this PR. Here's a summary of the median runtimes with n_toa=50000, n_basis=320, E=2940, es=17(thanks Claude for making a nice table):

  ┌─────────────────────────────────┬─────────────────────────┬─────────────────────────────┬──────────────────────┐              
  │            Platform             │ _solve_2D2 (main, dger) │ _solve_2D2 (this PR, dgemm) │ sqrt + dgemm         │
  ├─────────────────────────────────┼─────────────────────────┼─────────────────────────────┼──────────────────────┤              
  │ Apple M3 Pro (Accelerate / AMX) │                  121 ms │                       77 ms │                47 ms │
  ├─────────────────────────────────┼─────────────────────────┼─────────────────────────────┼──────────────────────┤
  │ x86 Cascade Lake (OpenBLAS)     │                  317 ms │                      294 ms │               200 ms │              
  └─────────────────────────────────┴─────────────────────────┴─────────────────────────────┴──────────────────────┘

Writing a symmetric variant of this PR that uses dsyrk is a different story in some respects: smaller configurations are faster with that approach compared to _sqrtsolve_D2, but I see _sqrtsolve_D2 consistently faster at larger scales. Summary:

  Speedup vs main _solve_2D2 (dger), at NG15-yr scale (50000 × 320, E=2940, es=17)                                                
  
  ┌──────────────────────────────────────────────────────────────┬───────┬───────┐                                                
  │                           Approach                           │  ARM  │  x86  │
  ├──────────────────────────────────────────────────────────────┼───────┼───────┤
  │ this PR _solve_2D2 (dgemm batching alone)                    │ 1.26× │ 1.09× │
  ├──────────────────────────────────────────────────────────────┼───────┼───────┤
  │ this PR + dsyrk (dgemm batching + dsyrk for symmetric paths) │ 2.14× │ 1.64× │                                                
  ├──────────────────────────────────────────────────────────────┼───────┼───────┤                                                
  │ sqrt + dgemm (vhaasteren PR #8 + Python dispatch)            │ 2.24× │ 1.68× │                                                
  └──────────────────────────────────────────────────────────────┴───────┴───────┘

and

                                                                                                                                  
  ┌───────────────────────────────────────┬──────────┬────────────────────┬───────────────────────┬──────────────┬───────────┐    
  │  Config (n_toa × n_basis, E epochs,   │ Platform │  main _solve_2D2   │  this PR _solve_2D2   │  this PR +   │  sqrt +   │
  │              epoch_size)              │          │       (dger)       │        (dgemm)        │    dsyrk     │   dgemm   │    
  ├───────────────────────────────────────┼──────────┼────────────────────┼───────────────────────┼──────────────┼───────────┤
  │ very small (1000 × 100, E=100, es=10) │ ARM      │            0.47 ms │               0.34 ms │      0.19 ms │   0.36 ms │
  ├───────────────────────────────────────┼──────────┼────────────────────┼───────────────────────┼──────────────┼───────────┤
  │ very small (1000 × 100, E=100, es=10) │ x86      │            1.01 ms │               1.35 ms │      1.32 ms │   1.32 ms │    
  ├───────────────────────────────────────┼──────────┼────────────────────┼───────────────────────┼──────────────┼───────────┤
  │ B1855+09-like (4005 × 151, E=235,     │ ARM      │            3.76 ms │               2.93 ms │      1.54 ms │   2.12 ms │    
  │ es=17)                                │          │                    │                       │              │           │
  ├───────────────────────────────────────┼──────────┼────────────────────┼───────────────────────┼──────────────┼───────────┤    
  │ B1855+09-like (4005 × 151, E=235,     │ x86      │           10.46 ms │              10.72 ms │      8.23 ms │   6.22 ms │    
  │ es=17)                                │          │                    │                       │              │           │
  ├───────────────────────────────────────┼──────────┼────────────────────┼───────────────────────┼──────────────┼───────────┤    
  │ B1937+21-like (9730 × 178, E=369,     │ ARM      │           14.83 ms │              13.21 ms │      6.66 ms │   8.66 ms │    
  │ es=26)                                │          │                    │                       │              │           │
  ├───────────────────────────────────────┼──────────┼────────────────────┼───────────────────────┼──────────────┼───────────┤    
  │ B1937+21-like (9730 × 178, E=369,     │ x86      │           29.51 ms │              27.25 ms │     14.63 ms │  19.65 ms │    
  │ es=26)                                │          │                    │                       │              │           │
  ├───────────────────────────────────────┼──────────┼────────────────────┼───────────────────────┼──────────────┼───────────┤    
  │ J1909-3744-like (10259 × 166, E=307,  │ ARM      │           13.60 ms │              12.36 ms │      6.08 ms │   8.78 ms │    
  │ es=33)                                │          │                    │                       │              │           │
  ├───────────────────────────────────────┼──────────┼────────────────────┼───────────────────────┼──────────────┼───────────┤    
  │ J1909-3744-like (10259 × 166, E=307,  │ x86      │           29.31 ms │              28.34 ms │     19.67 ms │  20.13 ms │    
  │ es=33)                                │          │                    │                       │              │           │    
  ├───────────────────────────────────────┼──────────┼────────────────────┼───────────────────────┼──────────────┼───────────┤    
  │ larger synthetic (15000 × 200, E=880, │ ARM      │           33.42 ms │              26.37 ms │     15.84 ms │  17.27 ms │    
  │  es=17)                               │          │                    │                       │              │           │    
  ├───────────────────────────────────────┼──────────┼────────────────────┼───────────────────────┼──────────────┼───────────┤
  │ larger synthetic (15000 × 200, E=880, │ x86      │           52.68 ms │              45.58 ms │     26.43 ms │  36.39 ms │    
  │  es=17)                               │          │                    │                       │              │           │    
  ├───────────────────────────────────────┼──────────┼────────────────────┼───────────────────────┼──────────────┼───────────┤
  │ NG15-yr scale (50000 × 320, E=2940,   │ ARM      │           192.6 ms │              152.9 ms │      89.9 ms │   86.5 ms │    
  │ es=17)                                │          │                    │                       │              │           │    
  ├───────────────────────────────────────┼──────────┼────────────────────┼───────────────────────┼──────────────┼───────────┤
  │ NG15-yr scale (50000 × 320, E=2940,   │ x86      │           316.9 ms │              291.0 ms │     193.4 ms │  199.2 ms │    
  │ es=17)                                │          │                    │                       │              │           │    
  └───────────────────────────────────────┴──────────┴────────────────────┴───────────────────────┴──────────────┴───────────┘

jberg5 · 2026-04-30T13:22:18Z

Anyway, I think both approaches get pretty close, but #8 is already ready to go and no new code to review :) So let me know what you'd like me to do here - I'm happy to just close this PR.

jberg5 added 2 commits April 25, 2026 21:56

switch from dger to dgemm for performance

9f1203e

clean up comments

9e83ebe

jberg5 mentioned this pull request Apr 26, 2026

Perf: 2-4x speedup for ShermanMorrison _solve_2D2 nanograv/enterprise#445

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Faster Sherman Morrison (dger -> dgemm), 15% to 60% speedup on ARM, 5% to 25% speedup on x86#10

Faster Sherman Morrison (dger -> dgemm), 15% to 60% speedup on ARM, 5% to 25% speedup on x86#10
jberg5 wants to merge 2 commits into
nanograv:mainfrom
jberg5:faster-sherman-morrison

jberg5 commented Apr 26, 2026 •

edited

Loading

Uh oh!

jberg5 commented Apr 26, 2026

Uh oh!

codecov-commenter commented Apr 27, 2026 •

edited

Loading

Uh oh!

vhaasteren commented Apr 27, 2026

Uh oh!

jberg5 commented Apr 30, 2026

Uh oh!

jberg5 commented Apr 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

jberg5 commented Apr 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jberg5 commented Apr 26, 2026

Uh oh!

codecov-commenter commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

vhaasteren commented Apr 27, 2026

Uh oh!

jberg5 commented Apr 30, 2026

Uh oh!

jberg5 commented Apr 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jberg5 commented Apr 26, 2026 •

edited

Loading

codecov-commenter commented Apr 27, 2026 •

edited

Loading