
Perf: Use typed memoryviews in sqrtsolve (10 to 20% speedup)#11

Open
jberg5 wants to merge 2 commits into nanograv:main from jberg5:typed-memoryviews

Conversation


@jberg5 jberg5 commented May 1, 2026

Typed memoryviews are a nifty Cython feature that lets you pinky-promise the compiler that your arrays have a particular layout in memory, unlocking vectorization: https://cython.readthedocs.io/en/latest/src/userguide/memoryviews.html#typed-memoryviews.

We can achieve a 10-20% speedup this way (hardware dependent, as always), which should stack on #8.

Also made a related change to pull the np.empty()s out of the loop, and just allocate a reusable workspace that's big enough once.
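In NumPy terms, the row-scaling step plus the hoisted workspace looks roughly like this (a pure-NumPy sketch of the pattern; the function name and workspace handling here are illustrative, not the actual API in this repo):

```python
import numpy as np

def sqrtsolve_scaled(X, Nvec, workspace=None):
    # Scale each row of X by 1/sqrt(Nvec[i]), writing into a reusable
    # workspace instead of calling np.empty() on every invocation.
    n, l = X.shape
    if workspace is None or workspace.shape[0] < n or workspace.shape[1] < l:
        # Allocate once, big enough; later calls with same-or-smaller
        # shapes reuse the buffer.
        workspace = np.empty((n, l))
    Wix = workspace[:n, :l]
    np.divide(X, np.sqrt(Nvec)[:, None], out=Wix)
    return Wix, workspace

# Usage: allocate the workspace on the first call, then reuse it.
X = np.arange(12, dtype=float).reshape(4, 3) + 1.0
Nvec = np.array([4.0, 9.0, 16.0, 25.0])
Wix, ws = sqrtsolve_scaled(X, Nvec)
Wix2, ws2 = sqrtsolve_scaled(X, Nvec, workspace=ws)
assert ws2 is ws  # second call allocated nothing
```

The point of the hoist is just to move the allocation cost out of the hot loop; the numerical result is unchanged.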

I asked Claude to walk through the change step by step, which is great because I am otherwise not fluent in NEON assembly:

  .pyx source (the change)                                                                                    
                                                                                                              
   cdef cnp.ndarray[cnp.double_t, ndim=2] Wix = X.copy()
  +cdef double[:, ::1] Wix_v = Wix     # C-contig memoryview alias
                                                                                                              
   for i in range(n):                                                                                       
       t = sqrt(Nvec[i])                                                                                      
       for j in range(l):                                                                                     
  -        Wix[i, j] /= t
  +        Wix_v[i, j] /= t                                                                                   
                                                                                                            
  Generated C (inner-loop body, what the compiler actually sees)                                              
                                                                                                            
  Before — Wix[i, j] access via cnp.ndarray typing:                                                           
  *__Pyx_BufPtrStrided2d(double *, buf, i, strides[0], j, strides[1]) /= t;                                 
  //                                                       ^^^^^^^^^^                                         
  //                       runtime stride loaded from struct -- compiler can't                                
  //                       prove unit-stride, rejects vectorization           
                                                                                                              
  After — Wix_v[i, j] access via double[:, ::1] memoryview:                                                   
  *((double *)((double *)(data + i*strides[0])) + j) /= t;                                                    
  //                                            ^^^                                                           
  //                       j added to a `double*` in element units -- inner                                   
  //                       stride is sizeof(double) at compile time                                           
                                                                                                              
  Clang -Rpass=loop-vectorize remarks                                                                         
                                                                                                              
  Before:  codegen_probe.c:18923: remark: loop not vectorized                                                 
  After:   codegen_probe.c:19187: remark: vectorized loop (vectorization width: 2,                            
                                                            interleaved count: 4)                             
                                                                                                              
  ARM NEON assembly (Apple clang 17, -O2 -fno-wrapv)                                                          
                                                                                                              
  Before — scalar inner loop, one double per iteration:                                                       
  ldr     d1, [x15]           ; load 1 double                                                               
  fdiv    d1, d1, d0          ; scalar divide                                                                 
  str     d1, [x15]           ; store 1 double                                                                
  add     x15, x15, x26       ; advance by strides[1] (runtime-loaded)
  subs    x16, x16, #1                                                                                        
  b.ne    LBB180_32                                                                                           
                   
  After — 4×-interleaved NEON, 8 doubles per iteration:                                                       
  ldp     q2, q3, [x1, #-32]  ; load 4 doubles (2 NEON regs of 2 doubles each)                                
  ldp     q4, q5, [x1]        ; load 4 more                                   
  fdiv.2d v2, v2, v1          ; SIMD divide, 2 doubles                                                        
  fdiv.2d v3, v3, v1          ; SIMD divide, 2 doubles                                                        
  fdiv.2d v4, v4, v1          ; SIMD divide, 2 doubles                                                        
  fdiv.2d v5, v5, v1          ; SIMD divide, 2 doubles                                                        
  stp     q2, q3, [x1, #-32]  ; store 4 doubles                                                             
  stp     q4, q5, [x1], #64   ; store 4 more, advance by 64 bytes                                             
  subs    x0, x0, #8                                                                                        
  b.ne    LBB186_86                  
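The `::1` in the memoryview declaration is what encodes the unit inner stride at compile time, and it is only a valid promise because `X.copy()` returns a C-contiguous array; Cython raises a `ValueError` at the memoryview cast if the buffer isn't actually contiguous. NumPy exposes the same layout information from Python (a small illustration, independent of the PR's code):

```python
import numpy as np

# X.copy() gives a C-contiguous buffer, so a double[:, ::1]
# memoryview cast would succeed on it.
X = np.ones((4, 3))
assert X.flags['C_CONTIGUOUS']

# A transpose is a view with a non-unit inner stride: the same cast
# would fail on it, because the ::1 promise would be false.
Xt = X.T
assert not Xt.flags['C_CONTIGUOUS']
assert Xt.strides[1] != Xt.itemsize  # inner stride is not one element

# np.ascontiguousarray makes a compact copy when one is needed.
Xc = np.ascontiguousarray(Xt)
assert Xc.flags['C_CONTIGUOUS']
```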

Some numbers (kernel-level speedup on cython_block_sqrtsolve_rank1)
Normally I try to minimize noisy neighbors on my MacBook, but today I had other things running and didn't bother shutting them down, so take the M3 results with a grain of salt.
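For reference, best-of-N wall-clock timing is one common way to dampen that kind of noise, since the minimum over repeats is the measurement least contaminated by other processes (a minimal sketch; not necessarily the harness used for these numbers):

```python
from time import perf_counter

def bench_ms(fn, *args, repeats=20):
    # Best-of-N timing: run fn(*args) several times and keep the
    # fastest run, reported in milliseconds.
    best = float("inf")
    for _ in range(repeats):
        t0 = perf_counter()
        fn(*args)
        best = min(best, perf_counter() - t0)
    return best * 1e3

# Example: time a stand-in workload.
elapsed_ms = bench_ms(sorted, list(range(100_000)))
```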

Apple M3 Pro / Accelerate / clang 17

| Config | Shape (n_toa × n_basis, E) | orig (ms) | patched (ms) | speedup |
| --- | --- | --- | --- | --- |
| B1855+09-like | 4005 × 151, E=235 | 0.996 | 0.888 | 1.12× |
| B1937+21-like | 9730 × 178, E=369 | 3.742 | 3.338 | 1.12× |
| J1909-3744-like | 10259 × 166, E=307 | 3.641 | 3.367 | 1.08× |
| large | 15000 × 200, E=880 | 6.357 | 5.899 | 1.08× |
| very large | 30000 × 260, E=1765 | 16.817 | 15.857 | 1.06× |
| NG15-yr-like | 50000 × 320, E=2940 | 31.059 | 30.233 | 1.03× |

Intel Cascade Lake / OpenBLAS / gcc 12.2 (GCP c2-standard-4)

| Config | Shape | orig (ms) | patched (ms) | speedup |
| --- | --- | --- | --- | --- |
| B1855+09-like | 4005 × 151, E=235 | 2.508 | 2.162 | 1.16× |
| B1937+21-like | 9730 × 178, E=369 | 9.932 | 9.442 | 1.05× |
| J1909-3744-like | 10259 × 166, E=307 | 9.620 | 9.055 | 1.06× |
| large | 15000 × 200, E=880 | 17.781 | 16.358 | 1.09× |
| very large | 30000 × 260, E=1765 | 58.048 | 54.993 | 1.06× |
| NG15-yr-like | 50000 × 320, E=2940 | 116.808 | 110.798 | 1.05× |

Intel Emerald Rapids (Xeon Platinum 8581C) / OpenBLAS / gcc 12.2 (GCP c4-standard-8)

| Config | Shape | orig (ms) | patched (ms) | speedup |
| --- | --- | --- | --- | --- |
| B1855+09-like | 4005 × 151, E=235 | 2.242 | 1.767 | 1.27× |
| B1937+21-like | 9730 × 178, E=369 | 6.002 | 5.098 | 1.18× |
| J1909-3744-like | 10259 × 166, E=307 | 5.856 | 4.983 | 1.18× |
| large | 15000 × 200, E=880 | 10.777 | 8.651 | 1.25× |
| very large | 30000 × 260, E=1765 | 36.219 | 30.859 | 1.17× |
| NG15-yr-like | 50000 × 320, E=2940 | 97.615 | 86.843 | 1.12× |

AMD EPYC 7B13 (Zen3) / OpenBLAS / gcc 12.2 (GCP c2d-standard-8)

| Config | Shape | orig (ms) | patched (ms) | speedup |
| --- | --- | --- | --- | --- |
| B1855+09-like | 4005 × 151, E=235 | 2.030 | 1.599 | 1.27× |
| B1937+21-like | 9730 × 178, E=369 | 5.540 | 4.778 | 1.16× |
| J1909-3744-like | 10259 × 166, E=307 | 5.372 | 4.600 | 1.17× |
| large | 15000 × 200, E=880 | 12.381 | 10.507 | 1.18× |
| very large | 30000 × 260, E=1765 | 38.954 | 35.945 | 1.08× |
| NG15-yr-like | 50000 × 320, E=2940 | 77.945 | 67.325 | 1.16× |

Confirmed the patched results are bitwise identical to the originals.
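One way to check bitwise identity is to compare the arrays' raw bytes, which is stricter than `==` since it distinguishes `-0.0` from `0.0` and compares NaN payloads exactly (a sketch; the actual check used here may differ):

```python
import numpy as np

def bitwise_identical(a, b):
    # Compare shape plus the raw buffer bytes: any difference in the
    # IEEE-754 bit patterns, even one ULP or a sign-of-zero flip, fails.
    a, b = np.asarray(a), np.asarray(b)
    return a.shape == b.shape and a.tobytes() == b.tobytes()

x = np.array([1.0, -0.0, 2.5])
assert bitwise_identical(x, x.copy())
# -0.0 == 0.0 numerically, but the bit patterns differ:
assert not bitwise_identical(x, np.array([1.0, 0.0, 2.5]))
```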
