Faster Sherman Morrison (dger -> dgemm), 15% to 60% speedup on ARM, 5% to 25% speedup on x86 #10

jberg5 wants to merge 2 commits
Conversation
As a follow-up, I noticed that https://github.com/nanograv/enterprise/blob/6335ff7dc4d05495912bf200a465d4ebac0c1d41/enterprise/signals/signal_base.py#L994 means we're always dealing with ... I'll leave that as a separate PR because it basically stacks on this one.
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@ Coverage Diff @@
##             main      #10   +/-   ##
=======================================
  Coverage   91.76%   91.76%
=======================================
  Files           2        2
  Lines         255      255
=======================================
  Hits          234      234
  Misses         21       21
```

Flags with carried forward coverage won't be shown.
Hi @jberg5,

It is really great that you are making these tweaks, thank you! The other day we found that the 2D ZNZ calculation was actually faster using the new non-Cholesky matrix square root. Note how the new fastshermanmorrison has ... Could you check? Thanks!!
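For readers following along, here is a minimal sketch of the square-root idea mentioned above (my own toy illustration, not the actual fastshermanmorrison code; `Z`, `Ninv_sqrt`, and `znz_via_sqrt` are hypothetical names): if you have any factor S with S.T @ S = N^{-1}, the whole Z.T @ N^{-1} @ Z product collapses into one level-3 multiply.

```python
import numpy as np

def znz_via_sqrt(Z, Ninv_sqrt):
    """Compute Z^T N^{-1} Z as (S Z)^T (S Z), where S^T S = N^{-1}.

    The product reduces to a single symmetric rank-k update,
    i.e. one level-3 BLAS call instead of many level-2 updates.
    """
    W = Ninv_sqrt @ Z   # S Z
    return W.T @ W      # one GEMM/SYRK-style call

# Tiny correctness check against the direct computation
rng = np.random.default_rng(0)
Z = rng.standard_normal((6, 3))
S = rng.standard_normal((6, 6))   # stand-in for some square root of N^{-1}
Ninv = S.T @ S
assert np.allclose(znz_via_sqrt(Z, S), Z.T @ Ninv @ Z)
```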
Ooh, nice one @vhaasteren - I didn't see your recent changes, but it looks like your approach would be a lot faster than what I have in this PR. Here's a summary of the median runtimes with n_toa=50000, n_basis=320, E=2940, es=17 (thanks Claude for making a nice table): ... Writing a symmetric variant of this PR that uses ... and ...
Anyway, I think both approaches get pretty close, but #8 is already ready to go, with no new code to review :) So let me know what you'd like me to do here - I'm happy to just close this PR.
1.15 to 1.6x speedup for `_solve_2D2` on ARM, 1.05 to 1.25x for x86. Repeating the same factorization from nanograv/enterprise#445.
Same math as before, which allows us to switch from BLAS level 2 (`dger_`) to level 3 (`dgemm_`) (copy/pasted from my other PR): ... where we define: ...
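The dger -> dgemm factorization can be sketched in a few lines (a toy illustration of the technique, not the PR's actual code; `A` and `beta` are my stand-ins for the projected per-epoch vectors and Sherman-Morrison weights): instead of accumulating one rank-1 update per epoch with `dger`, stack the vectors as columns and apply them all at once as A @ diag(beta) @ A.T with a single `dgemm`.

```python
import numpy as np
from scipy.linalg.blas import dger, dgemm

rng = np.random.default_rng(1)
n_basis, n_epoch = 8, 5
# Columns a_j play the role of the projected per-epoch vectors (stand-in)
A = np.asfortranarray(rng.standard_normal((n_basis, n_epoch)))
beta = rng.standard_normal(n_epoch)   # per-epoch weights (stand-in)

# BLAS level 2: one rank-1 update per epoch
ZNZ_l2 = np.zeros((n_basis, n_basis), order="F")
for j in range(n_epoch):
    ZNZ_l2 = dger(beta[j], A[:, j], A[:, j], a=ZNZ_l2, overwrite_a=1)

# BLAS level 3: same flops, one call computing A @ diag(beta) @ A.T
ZNZ_l3 = dgemm(1.0, A * beta, A, trans_b=True)

assert np.allclose(ZNZ_l2, ZNZ_l3)
```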
Same flops, but fewer writes to `ZNZ`, and switching from a bandwidth-constrained BLAS level 2 kernel to a compute-constrained level 3 kernel gives us some solid speedups on ARM (dgemm hits the Apple Matrix Co-Processor; dger does not). The x86 picture is more complicated: some hardware configurations show 25% speedups, particularly newer Xeon chips with MKL installed, whereas other configurations show more modest gains.

Benchmarked memory usage, and I see a roughly 12 MB RSS increase at NG15 scale. No leaks.
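If you want to see the bandwidth-vs-compute gap on your own machine, a rough timing harness looks like this (a sketch under my own assumptions, not the PR's benchmark; the problem sizes mirror the n_basis=320, E=2940 figures quoted above):

```python
import time
import numpy as np
from scipy.linalg.blas import dger, dgemm

def bench(fn, reps=5):
    fn()  # warm-up
    t0 = time.perf_counter()
    for _ in range(reps):
        fn()
    return (time.perf_counter() - t0) / reps

n_basis, n_epoch = 320, 2940  # roughly the scale quoted above
A = np.asfortranarray(np.random.default_rng(2).standard_normal((n_basis, n_epoch)))
beta = np.random.default_rng(3).standard_normal(n_epoch)

def level2():
    # One bandwidth-bound rank-1 update per epoch
    out = np.zeros((n_basis, n_basis), order="F")
    for j in range(n_epoch):
        dger(beta[j], A[:, j], A[:, j], a=out, overwrite_a=1)
    return out

def level3():
    # Single compute-bound dgemm: A @ diag(beta) @ A.T
    return dgemm(1.0, A * beta, A, trans_b=True)

assert np.allclose(level2(), level3())
print(f"dger loop: {bench(level2)*1e3:.1f} ms, dgemm: {bench(level3)*1e3:.1f} ms")
```

Absolute numbers will vary with the BLAS backend (OpenBLAS, MKL, Accelerate), which is presumably why the x86 results are so configuration-dependent.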
ARM Results
x86 Results
Hardware:
All 8 existing tests pass.