
Perf: Use typed memoryviews in sqrtsolve (10 to 20% speedup)#11

Open
jberg5 wants to merge 2 commits into nanograv:main from jberg5:typed-memoryviews

Conversation


@jberg5 jberg5 commented May 1, 2026

Typed memoryviews are a nifty Cython feature that lets you pinky-promise the compiler that your arrays have a particular layout in memory, unlocking vectorization: https://cython.readthedocs.io/en/latest/src/userguide/memoryviews.html#typed-memoryviews.

We can achieve a 10-20% speedup this way (hardware dependent, as always), which should stack on #8.

Also made a related change to pull the np.empty()s out of the loop, and just allocate a reusable workspace that's big enough once.
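In NumPy terms, the row-scaling step plus the hoisted workspace looks roughly like this (a pure-NumPy sketch of the pattern; the function name and workspace handling here are illustrative, not the actual API in this repo):

```python
import numpy as np

def sqrtsolve_scaled(X, Nvec, workspace=None):
    # Scale each row of X by 1/sqrt(Nvec[i]), writing into a reusable
    # workspace instead of calling np.empty() on every invocation.
    n, l = X.shape
    if workspace is None or workspace.shape[0] < n or workspace.shape[1] < l:
        # Allocate once, big enough; later calls with same-or-smaller
        # shapes reuse the buffer.
        workspace = np.empty((n, l))
    Wix = workspace[:n, :l]
    np.divide(X, np.sqrt(Nvec)[:, None], out=Wix)
    return Wix, workspace

# Usage: allocate the workspace on the first call, then reuse it.
X = np.arange(12, dtype=float).reshape(4, 3) + 1.0
Nvec = np.array([4.0, 9.0, 16.0, 25.0])
Wix, ws = sqrtsolve_scaled(X, Nvec)
Wix2, ws2 = sqrtsolve_scaled(X, Nvec, workspace=ws)
assert ws2 is ws  # second call allocated nothing
```

The point of the hoist is just to move the allocation cost out of the hot loop; the numerical result is unchanged.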

I asked Claude to walk through the change step by step, which is great because I am otherwise not fluent in NEON assembly:

  .pyx source (the change)                                                                                    
                                                                                                              
   cdef cnp.ndarray[cnp.double_t, ndim=2] Wix = X.copy()
  +cdef double[:, ::1] Wix_v = Wix     # C-contig memoryview alias
                                                                                                              
   for i in range(n):                                                                                       
       t = sqrt(Nvec[i])                                                                                      
       for j in range(l):                                                                                     
  -        Wix[i, j] /= t
  +        Wix_v[i, j] /= t                                                                                   
                                                                                                            
  Generated C (inner-loop body, what the compiler actually sees)                                              
                                                                                                            
  Before — Wix[i, j] access via cnp.ndarray typing:                                                           
  *__Pyx_BufPtrStrided2d(double *, buf, i, strides[0], j, strides[1]) /= t;                                 
  //                                                       ^^^^^^^^^^                                         
  //                       runtime stride loaded from struct -- compiler can't                                
  //                       prove unit-stride, rejects vectorization           
                                                                                                              
  After — Wix_v[i, j] access via double[:, ::1] memoryview:                                                   
  *((double *)((double *)(data + i*strides[0])) + j) /= t;                                                    
  //                                            ^^^                                                           
  //                       j added to a `double*` in element units -- inner                                   
  //                       stride is sizeof(double) at compile time                                           
                                                                                                              
  Clang -Rpass=loop-vectorize remarks                                                                         
                                                                                                              
  Before:  codegen_probe.c:18923: remark: loop not vectorized                                                 
  After:   codegen_probe.c:19187: remark: vectorized loop (vectorization width: 2,                            
                                                            interleaved count: 4)                             
                                                                                                              
  ARM NEON assembly (Apple clang 17, -O2 -fno-wrapv)                                                          
                                                                                                              
  Before — scalar inner loop, one double per iteration:                                                       
  ldr     d1, [x15]           ; load 1 double                                                               
  fdiv    d1, d1, d0          ; scalar divide                                                                 
  str     d1, [x15]           ; store 1 double                                                                
  add     x15, x15, x26       ; advance by strides[1] (runtime-loaded)
  subs    x16, x16, #1                                                                                        
  b.ne    LBB180_32                                                                                           
                   
  After — 4×-interleaved NEON, 8 doubles per iteration:                                                       
  ldp     q2, q3, [x1, #-32]  ; load 4 doubles (2 NEON regs of 2 doubles each)                                
  ldp     q4, q5, [x1]        ; load 4 more                                   
  fdiv.2d v2, v2, v1          ; SIMD divide, 2 doubles                                                        
  fdiv.2d v3, v3, v1          ; SIMD divide, 2 doubles                                                        
  fdiv.2d v4, v4, v1          ; SIMD divide, 2 doubles                                                        
  fdiv.2d v5, v5, v1          ; SIMD divide, 2 doubles                                                        
  stp     q2, q3, [x1, #-32]  ; store 4 doubles                                                             
  stp     q4, q5, [x1], #64   ; store 4 more, advance by 64 bytes                                             
  subs    x0, x0, #8                                                                                        
  b.ne    LBB186_86                  
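The `::1` in the memoryview declaration is what encodes the unit inner stride at compile time, and it is only a valid promise because `X.copy()` returns a C-contiguous array; Cython raises a `ValueError` at the memoryview cast if the buffer isn't actually contiguous. NumPy exposes the same layout information from Python (a small illustration, independent of the PR's code):

```python
import numpy as np

# X.copy() gives a C-contiguous buffer, so a double[:, ::1]
# memoryview cast would succeed on it.
X = np.ones((4, 3))
assert X.flags['C_CONTIGUOUS']

# A transpose is a view with a non-unit inner stride: the same cast
# would fail on it, because the ::1 promise would be false.
Xt = X.T
assert not Xt.flags['C_CONTIGUOUS']
assert Xt.strides[1] != Xt.itemsize  # inner stride is not one element

# np.ascontiguousarray makes a compact copy when one is needed.
Xc = np.ascontiguousarray(Xt)
assert Xc.flags['C_CONTIGUOUS']
```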

Some numbers (kernel-level speedup on cython_block_sqrtsolve_rank1)
Normally I try to minimize noisy neighbors on my MacBook, but today I had other things running and didn't bother shutting them down, so take the M3 results with a grain of salt.
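For reference, best-of-N wall-clock timing is one common way to dampen that kind of noise, since the minimum over repeats is the measurement least contaminated by other processes (a minimal sketch; not necessarily the harness used for these numbers):

```python
from time import perf_counter

def bench_ms(fn, *args, repeats=20):
    # Best-of-N timing: run fn(*args) several times and keep the
    # fastest run, reported in milliseconds.
    best = float("inf")
    for _ in range(repeats):
        t0 = perf_counter()
        fn(*args)
        best = min(best, perf_counter() - t0)
    return best * 1e3

# Example: time a stand-in workload.
elapsed_ms = bench_ms(sorted, list(range(100_000)))
```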

Apple M3 Pro / Accelerate / clang 17

| Config | Shape (n_toa × n_basis, E) | orig (ms) | patched (ms) | speedup |
| --- | --- | --- | --- | --- |
| B1855+09-like | 4005 × 151, E=235 | 0.996 | 0.888 | 1.12× |
| B1937+21-like | 9730 × 178, E=369 | 3.742 | 3.338 | 1.12× |
| J1909-3744-like | 10259 × 166, E=307 | 3.641 | 3.367 | 1.08× |
| large | 15000 × 200, E=880 | 6.357 | 5.899 | 1.08× |
| very large | 30000 × 260, E=1765 | 16.817 | 15.857 | 1.06× |
| NG15-yr-like | 50000 × 320, E=2940 | 31.059 | 30.233 | 1.03× |

Intel Cascade Lake / OpenBLAS / gcc 12.2 (GCP c2-standard-4)

| Config | Shape | orig (ms) | patched (ms) | speedup |
| --- | --- | --- | --- | --- |
| B1855+09-like | 4005 × 151, E=235 | 2.508 | 2.162 | 1.16× |
| B1937+21-like | 9730 × 178, E=369 | 9.932 | 9.442 | 1.05× |
| J1909-3744-like | 10259 × 166, E=307 | 9.620 | 9.055 | 1.06× |
| large | 15000 × 200, E=880 | 17.781 | 16.358 | 1.09× |
| very large | 30000 × 260, E=1765 | 58.048 | 54.993 | 1.06× |
| NG15-yr-like | 50000 × 320, E=2940 | 116.808 | 110.798 | 1.05× |

Intel Emerald Rapids (Xeon Platinum 8581C) / OpenBLAS / gcc 12.2 (GCP c4-standard-8)

| Config | Shape | orig (ms) | patched (ms) | speedup |
| --- | --- | --- | --- | --- |
| B1855+09-like | 4005 × 151, E=235 | 2.242 | 1.767 | 1.27× |
| B1937+21-like | 9730 × 178, E=369 | 6.002 | 5.098 | 1.18× |
| J1909-3744-like | 10259 × 166, E=307 | 5.856 | 4.983 | 1.18× |
| large | 15000 × 200, E=880 | 10.777 | 8.651 | 1.25× |
| very large | 30000 × 260, E=1765 | 36.219 | 30.859 | 1.17× |
| NG15-yr-like | 50000 × 320, E=2940 | 97.615 | 86.843 | 1.12× |

AMD EPYC 7B13 (Zen3) / OpenBLAS / gcc 12.2 (GCP c2d-standard-8)

| Config | Shape | orig (ms) | patched (ms) | speedup |
| --- | --- | --- | --- | --- |
| B1855+09-like | 4005 × 151, E=235 | 2.030 | 1.599 | 1.27× |
| B1937+21-like | 9730 × 178, E=369 | 5.540 | 4.778 | 1.16× |
| J1909-3744-like | 10259 × 166, E=307 | 5.372 | 4.600 | 1.17× |
| large | 15000 × 200, E=880 | 12.381 | 10.507 | 1.18× |
| very large | 30000 × 260, E=1765 | 38.954 | 35.945 | 1.08× |
| NG15-yr-like | 50000 × 320, E=2940 | 77.945 | 67.325 | 1.16× |

Confirmed the patched results are bitwise identical to the originals.
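One way to check bitwise identity is to compare the arrays' raw bytes, which is stricter than `==` since it distinguishes `-0.0` from `0.0` and compares NaN payloads exactly (a sketch; the actual check used here may differ):

```python
import numpy as np

def bitwise_identical(a, b):
    # Compare shape plus the raw buffer bytes: any difference in the
    # IEEE-754 bit patterns, even one ULP or a sign-of-zero flip, fails.
    a, b = np.asarray(a), np.asarray(b)
    return a.shape == b.shape and a.tobytes() == b.tobytes()

x = np.array([1.0, -0.0, 2.5])
assert bitwise_identical(x, x.copy())
# -0.0 == 0.0 numerically, but the bit patterns differ:
assert not bitwise_identical(x, np.array([1.0, 0.0, 2.5]))
```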
