Skip to content

Faster Sherman Morrison (dger -> dgemm), 15% to 60% speedup on ARM, 5% to 25% speedup on x86#10

Open
jberg5 wants to merge 2 commits into
nanograv:mainfrom
jberg5:faster-sherman-morrison
Open

Faster Sherman Morrison (dger -> dgemm), 15% to 60% speedup on ARM, 5% to 25% speedup on x86#10
jberg5 wants to merge 2 commits into
nanograv:mainfrom
jberg5:faster-sherman-morrison

Conversation

@jberg5
Copy link
Copy Markdown

@jberg5 jberg5 commented Apr 26, 2026

1.15 to 1.6x speedup for _solve_2D2 on ARM, 1.05 to 1.25x for x86.

Repeating the same factorization from nanograv/enterprise#445.

Same math as before which allows us to switch from BLAS level 2 (dger_) to level 3 (dgemm_) (copy/pasted from my other PR):

$$ \sum_{i=1}^{E} \beta_i ; \mathbf{z}_i , \mathbf{x}_i^T = \sum_{i=1}^{E} \left(\sqrt{\beta_i};\mathbf{z}_i\right)\left(\sqrt{\beta_i};\mathbf{x}_i\right)^T = V^T W $$

where we define:

$$V \in \mathbb{R}^{E \times N_\text{basis}}, \quad V_{i,:} = \sqrt{\beta_i};\mathbf{z}_i$$

$$W \in \mathbb{R}^{E \times N_\text{basis}}, \quad W_{i,:} = \sqrt{\beta_i};\mathbf{x}_i$$

Same flops, but fewer writes to ZNZ, and switching from a bandwidth constrained BLAS level 2 kernel to a compute constrained level 3 kernel gives us some solid speedups on ARM (dgemm hits the Apple Matrix Co-Processor; dger does not). x86 picture is more complicated - some hardware configurations show 25% speedups, particularly newer Xeon chips with MKL installed, whereas other configurations show more modest gains.

Benchmarked memory usage and I see a roughly 12mb RSS increase at NG15 scale. No leaks.

ARM Results

config N m ep n_J platform before_ms after_ms speedup
B1855+09 4000 150 17 236 apple_m3_accelerate 1.835 1.128 1.63x
B1937+21 10000 180 27 371 apple_m3_accelerate 4.934 3.241 1.52x

x86 Results

config N m ep n_J platform before_ms after_ms speedup
B1855+09 4000 150 17 236 cascade_ob 4.119 3.313 1.24x
B1855+09 4000 150 17 236 cascade_mkl 4.655 3.925 1.19x
B1855+09 4000 150 17 236 emerald_ob 12.200 11.698 1.04x
B1855+09 4000 150 17 236 emerald_mkl 3.669 2.907 1.26x
B1855+09 4000 150 17 236 zen3_ob 5.529 4.823 1.15x
B1937+21 10000 180 27 371 cascade_ob 13.031 11.568 1.13x
B1937+21 10000 180 27 371 cascade_mkl 18.531 16.737 1.11x
B1937+21 10000 180 27 371 emerald_ob 41.676 40.523 1.03x
B1937+21 10000 180 27 371 emerald_mkl 11.942 10.240 1.17x
B1937+21 10000 180 27 371 zen3_ob 17.876 16.039 1.11x

Hardware:

  • cascade_ob = GCP c2-standard-8 (Intel Xeon Cascade Lake @ 3.10 GHz) + OpenBLAS
  • cascade_mkl = GCP c2-standard-8 + MKL
  • emerald_ob = GCP c4-standard-8 (Intel Xeon Platinum 8581C/Emerald) + OpenBLAS
  • emerald_mkl = GCP c4-standard-8 + MKL
  • zen3_ob = GCP c2d-standard-8 (AMD EPYC 7B13 / Zen 3) + OpenBLAS

All 8 existing tests pass.

@jberg5
Copy link
Copy Markdown
Author

jberg5 commented Apr 26, 2026

As a followup, I noticed that https://github.com/nanograv/enterprise/blob/6335ff7dc4d05495912bf200a465d4ebac0c1d41/enterprise/signals/signal_base.py#L994 means we're always dealing with X==Z, so we could use symmetric dsyrk for an even bigger speedup. Have benchmarked and we're looking at an additional 50-80% faster on x86 platforms.

I'll leave that as a separate PR because it basically stacks on this one.

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 91.76%. Comparing base (7845cea) to head (9e83ebe).

Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##             main      #10   +/-   ##
=======================================
  Coverage   91.76%   91.76%           
=======================================
  Files           2        2           
  Lines         255      255           
=======================================
  Hits          234      234           
  Misses         21       21           
Flag Coverage Δ
unittests 91.76% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7845cea...9e83ebe. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@vhaasteren
Copy link
Copy Markdown
Member

Hi @jberg5,

It is really great you are making these tweaks, thank you!

The other day we found that 2D ZNZ calculation was actually faster using the new non-cholesky matrix square root. Note how the new fastshermanmorrison has _cholsqrtsolve_D2 and _sqrtsolve_D2. We never went that route because it's slower, but with the updated _sqrtsolve_D2 going through here the ZNZ can then be constructed with a single dgemm call.

Could you check?

Thanks!!

@jberg5
Copy link
Copy Markdown
Author

jberg5 commented Apr 30, 2026

Ooh nice one @vhaasteren - I didn't see your recent changes, but it looks like your approach would be a lot faster than what I have in this PR. Here's a summary of the median runtimes with n_toa=50000, n_basis=320, E=2940, es=17(thanks Claude for making a nice table):

  ┌─────────────────────────────────┬─────────────────────────┬─────────────────────────────┬──────────────────────┐              
  │            Platform             │ _solve_2D2 (main, dger) │ _solve_2D2 (this PR, dgemm) │ sqrt + dgemm         │
  ├─────────────────────────────────┼─────────────────────────┼─────────────────────────────┼──────────────────────┤              
  │ Apple M3 Pro (Accelerate / AMX) │                  121 ms │                       77 ms │                47 ms │
  ├─────────────────────────────────┼─────────────────────────┼─────────────────────────────┼──────────────────────┤
  │ x86 Cascade Lake (OpenBLAS)     │                  317 ms │                      294 ms │               200 ms │              
  └─────────────────────────────────┴─────────────────────────┴─────────────────────────────┴──────────────────────┘  

Writing a symmetric variant of this PR that uses dsyrk is a different story in some respects: smaller configurations are faster with that approach compared to _sqrtsolve_D2, but I see _sqrtsolve_D2 consistently faster at larger scales. Summary:

  Speedup vs main _solve_2D2 (dger), at NG15-yr scale (50000 × 320, E=2940, es=17)                                                
  
  ┌──────────────────────────────────────────────────────────────┬───────┬───────┐                                                
  │                           Approach                           │  ARM  │  x86  │
  ├──────────────────────────────────────────────────────────────┼───────┼───────┤
  │ this PR _solve_2D2 (dgemm batching alone)                    │ 1.26× │ 1.09× │
  ├──────────────────────────────────────────────────────────────┼───────┼───────┤
  │ this PR + dsyrk (dgemm batching + dsyrk for symmetric paths) │ 2.14× │ 1.64× │                                                
  ├──────────────────────────────────────────────────────────────┼───────┼───────┤                                                
  │ sqrt + dgemm (vhaasteren PR #8 + Python dispatch)            │ 2.24× │ 1.68× │                                                
  └──────────────────────────────────────────────────────────────┴───────┴───────┘ 

and

                                                                                                                                  
  ┌───────────────────────────────────────┬──────────┬────────────────────┬───────────────────────┬──────────────┬───────────┐    
  │  Config (n_toa × n_basis, E epochs,   │ Platform │  main _solve_2D2   │  this PR _solve_2D2   │  this PR +   │  sqrt +   │
  │              epoch_size)              │          │       (dger)       │        (dgemm)        │    dsyrk     │   dgemm   │    
  ├───────────────────────────────────────┼──────────┼────────────────────┼───────────────────────┼──────────────┼───────────┤
  │ very small (1000 × 100, E=100, es=10) │ ARM      │            0.47 ms │               0.34 ms │      0.19 ms │   0.36 ms │
  ├───────────────────────────────────────┼──────────┼────────────────────┼───────────────────────┼──────────────┼───────────┤
  │ very small (1000 × 100, E=100, es=10) │ x86      │            1.01 ms │               1.35 ms │      1.32 ms │   1.32 ms │    
  ├───────────────────────────────────────┼──────────┼────────────────────┼───────────────────────┼──────────────┼───────────┤
  │ B1855+09-like (4005 × 151, E=235,     │ ARM      │            3.76 ms │               2.93 ms │      1.54 ms │   2.12 ms │    
  │ es=17)                                │          │                    │                       │              │           │
  ├───────────────────────────────────────┼──────────┼────────────────────┼───────────────────────┼──────────────┼───────────┤    
  │ B1855+09-like (4005 × 151, E=235,     │ x86      │           10.46 ms │              10.72 ms │      8.23 ms │   6.22 ms │    
  │ es=17)                                │          │                    │                       │              │           │
  ├───────────────────────────────────────┼──────────┼────────────────────┼───────────────────────┼──────────────┼───────────┤    
  │ B1937+21-like (9730 × 178, E=369,     │ ARM      │           14.83 ms │              13.21 ms │      6.66 ms │   8.66 ms │    
  │ es=26)                                │          │                    │                       │              │           │
  ├───────────────────────────────────────┼──────────┼────────────────────┼───────────────────────┼──────────────┼───────────┤    
  │ B1937+21-like (9730 × 178, E=369,     │ x86      │           29.51 ms │              27.25 ms │     14.63 ms │  19.65 ms │    
  │ es=26)                                │          │                    │                       │              │           │
  ├───────────────────────────────────────┼──────────┼────────────────────┼───────────────────────┼──────────────┼───────────┤    
  │ J1909-3744-like (10259 × 166, E=307,  │ ARM      │           13.60 ms │              12.36 ms │      6.08 ms │   8.78 ms │    
  │ es=33)                                │          │                    │                       │              │           │
  ├───────────────────────────────────────┼──────────┼────────────────────┼───────────────────────┼──────────────┼───────────┤    
  │ J1909-3744-like (10259 × 166, E=307,  │ x86      │           29.31 ms │              28.34 ms │     19.67 ms │  20.13 ms │    
  │ es=33)                                │          │                    │                       │              │           │    
  ├───────────────────────────────────────┼──────────┼────────────────────┼───────────────────────┼──────────────┼───────────┤    
  │ larger synthetic (15000 × 200, E=880, │ ARM      │           33.42 ms │              26.37 ms │     15.84 ms │  17.27 ms │    
  │  es=17)                               │          │                    │                       │              │           │    
  ├───────────────────────────────────────┼──────────┼────────────────────┼───────────────────────┼──────────────┼───────────┤
  │ larger synthetic (15000 × 200, E=880, │ x86      │           52.68 ms │              45.58 ms │     26.43 ms │  36.39 ms │    
  │  es=17)                               │          │                    │                       │              │           │    
  ├───────────────────────────────────────┼──────────┼────────────────────┼───────────────────────┼──────────────┼───────────┤
  │ NG15-yr scale (50000 × 320, E=2940,   │ ARM      │           192.6 ms │              152.9 ms │      89.9 ms │   86.5 ms │    
  │ es=17)                                │          │                    │                       │              │           │    
  ├───────────────────────────────────────┼──────────┼────────────────────┼───────────────────────┼──────────────┼───────────┤
  │ NG15-yr scale (50000 × 320, E=2940,   │ x86      │           316.9 ms │              291.0 ms │     193.4 ms │  199.2 ms │    
  │ es=17)                                │          │                    │                       │              │           │    
  └───────────────────────────────────────┴──────────┴────────────────────┴───────────────────────┴──────────────┴───────────┘

@jberg5
Copy link
Copy Markdown
Author

jberg5 commented Apr 30, 2026

Anyway, I think both approaches get pretty close, but #8 is already ready to go and no new code to review :) So let me know what you'd like me to do here - I'm happy to just close this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants