Conversation

@NimaSarajpoor (Collaborator) commented Oct 5, 2025

We may end up using different algorithms for calculating the sliding dot product for arrays of different lengths. However, it would be great to see if one algorithm, like pyfftw_sdp, could outperform MATLAB's FFTW-based sliding dot product. It seems that pyfftw_sdp is slightly outperformed by MATLAB's fftw_sdp when the length of the array is < 2^8.

After checking the performance of (R)FFT and I(R)FFT in #19 and having some discussion in pyFFTW/pyFFTW#425 (comment), I decided to see if I could boost the performance of pyfftw_sdp.
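
For context, every variant compared in this PR computes the same quantity. A minimal NumPy sketch of the FFT-based sliding dot product (the pyfftw variants compute the same thing, with pyFFTW plans in place of np.fft):

import numpy as np

def sliding_dot_product_ref(Q, T):
    m, n = len(Q), len(T)
    Q_padded = np.zeros(n)
    Q_padded[:m] = Q[::-1]  # reverse the query and zero-pad to len(T)
    # Pointwise multiplication in the frequency domain == convolution in time
    Z = np.fft.irfft(np.fft.rfft(T) * np.fft.rfft(Q_padded), n=n)
    return Z[m - 1 : n]  # the n - m + 1 sliding dot products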

@NimaSarajpoor (Collaborator, Author) commented Oct 5, 2025

Tests are passing! Now, let's look at the performance. The data is saved in timing.csv.

# file: timing.sh

rm -rf sdp/__pycache__
./timing.py -niter 100 -pmin 2 -pmax 20 pyfftw challenger > timing.csv
rm -rf sdp/__pycache__
[plot: running time of challenger_sdp relative to pyfftw_sdp across len(T)]

We can see some improvement, particularly for cases where len(T) < 2^9.

@NimaSarajpoor (Collaborator, Author) commented Oct 5, 2025

Now, let's check the performance on MATLAB Online. I've checked the following cases:

  • MATLAB fftw_sdp (Single Thread)
  • MATLAB fftw_sdp (Auto Threads)
  • MATLAB fftw_sdp (Max Threads, 512)
  • pyfftw_sdp (Single Thread)
  • challenger_sdp (Single Thread)

We will use the first case, i.e., MATLAB fftw_sdp (Single Thread), as the baseline, and compare the performance of the others against it.

The code for MATLAB's fftw_sdp is provided below:

% ref: DAMP supporting materials

function [Z] = SLIDING_DOT_PRODUCT_MASS_V2(Q, T)
    m = length(Q);
    n = length(T);

    % Reverse the query and append zeros
    Q = Q(end:-1:1);
    Q(m+1:n) = 0;

    % The main trick for getting the dot products
    % in O(n log n) time
    F_T = fft(T);
    F_Q = fft(Q);
    F_Z = F_T.*F_Q;
    Z = ifft(F_Z);
    Z = Z(m:n);

end

Each of the images below is for one len(T). The blue curve shows the performance of the existing pyfftw_sdp, and the red curve shows the performance of challenger_sdp.

[plots: one per len(T); pyfftw_sdp (blue) and challenger_sdp (red) relative to MATLAB fftw_sdp (Single Thread)]

@seanlaw (Contributor) commented Oct 5, 2025

@NimaSarajpoor Can you please provide some conclusions/observations based on the outputs above?

@NimaSarajpoor (Collaborator, Author) commented Oct 5, 2025

The following observations are based on the results I obtained for T with length in {2^2, 2^3, ..., 2^20}. The proposed (single-threaded) challenger_sdp outperforms MATLAB's FFT-based sdp, both single-threaded and multi-threaded. So, it should be safe to replace the existing pyfftw_sdp module with the one proposed in challenger_sdp.py. Then, when we implement an algorithm in Python (e.g., DAMP) and compare its performance with the one implemented in MATLAB, we know there will be less chance of taking a performance hit because of the sdp part.

I think there are still two items that we need to check:

  • How does the proposed algorithm perform for longer arrays, where len(T) > 2^20?
  • If we pre-compute the FFT plans for arrays of different lengths, should we keep the object or save the wisdom (and import it later)? In the latter case, we need to see how much overhead the "import" step adds.

@seanlaw
What do you think?

@seanlaw (Contributor) commented Oct 6, 2025

How does the proposed algorithm perform for longer arrays, where len(T) > 2^20?

Yes, let's see the results for len(T) > 2^20.

If we pre-compute the FFT plans for arrays of different lengths, should we keep the object or save the wisdom (and import it later)? In the latter case, we need to see how much overhead the "import" step adds.

Do we know if the saved wisdom is portable/transferable across different architectures/OS (i.e., is it simply a file that gets read)? I'd assume that the object is not portable/transferable. Given that we might use a Class, can we add an ability to cache the wisdom/object (in a Dict) after it has been computed the first time at different lengths? We can also add a class method to precompute the wisdoms once (and possibly save them) AND add another method that allows the user to load wisdom into the class if they've precomputed themselves.

So, creating the wisdom at runtime might be slow but probably still fine if the wisdom is reused many times. Then, the user can choose to precompute the wisdom themselves BEFORE they call SDP for a range of len(T). Technically, this will take the same amount of time as the first sentence in this paragraph but, in both cases, we should give the user the ability to save the wisdom/object and load it back later.
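
A minimal sketch of the caching interface described above (the class and method names are hypothetical placeholders, not an actual implementation):

import pickle

import numpy as np
import pyfftw

class PlanCache:
    def __init__(self):
        self._plans = {}  # n -> (rfft_obj, irfft_obj)

    def get(self, n):
        # Build (and cache) the plans for length n on first use
        if n not in self._plans:
            rfft_obj = pyfftw.builders.rfft(
                pyfftw.empty_aligned(n, dtype=np.float64)
            )
            irfft_obj = pyfftw.builders.irfft(
                pyfftw.empty_aligned(1 + n // 2, dtype=np.complex128), n=n
            )
            self._plans[n] = (rfft_obj, irfft_obj)
        return self._plans[n]

    def save_wisdom(self, path):
        # pyfftw.export_wisdom() returns a tuple of bytes, so it pickles cleanly
        with open(path, 'wb') as f:
            pickle.dump(pyfftw.export_wisdom(), f)

    def load_wisdom(self, path):
        with open(path, 'rb') as f:
            pyfftw.import_wisdom(pickle.load(f))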

My guess is that importing is negligible compared to the time it takes to compute the wisdom and/or to complete DAMP 🤷‍♂️

Does that help?

@NimaSarajpoor (Collaborator, Author) commented:

How does the proposed algorithm perform for longer arrays, where len(T) > 2^20?

Yes, let's see the results for len(T) > 2^20.

As shown below, in most cases, the speed ratio for challenger_sdp is >= 1. For longer arrays, the difference is negligible, but that's okay: the main goal here is to see if the performance for shorter arrays can be improved without losing the performance for longer arrays. So, I think we are good here.

./timing.py -pmin 2 -pmax 25 pyfftw challenger > timing.csv

[plot: challenger_sdp speed ratio relative to pyfftw_sdp, len(T) up to 2^25]

Do we know if the saved wisdom is portable/transferable across different architectures/OS (i.e., is it simply a file that gets read)?

The exported wisdom cannot be used on a different architecture. I checked this by exporting wisdom in Colab (Linux), saving it to a file, and then trying to use it on my macOS machine.

Exporting wisdom in Colab (Linux):

import pickle

import numpy as np
import pyfftw

input_data = pyfftw.empty_aligned(64, dtype=np.float64)
output_data = pyfftw.empty_aligned(1 + 64//2, dtype=np.complex128)

fft_object = pyfftw.FFTW(
    input_data,
    output_data,
    direction='FFTW_FORWARD',
)

wisdom = pyfftw.export_wisdom()
with open('./fft_wisdom.dat', 'wb') as f:
    pickle.dump(wisdom, f)

And then trying to use the wisdom on my macOS machine:

import pickle

import numpy as np
import pyfftw

with open('./fft_wisdom.dat', 'rb') as f:
    wisdom = pickle.load(f)
pyfftw.import_wisdom(wisdom)

input_data = pyfftw.empty_aligned(64, dtype=np.float64)
output_data = pyfftw.empty_aligned(1 + 64//2, dtype=np.complex128)
fft_object = pyfftw.FFTW(
    input_data,
    output_data,
    direction='FFTW_FORWARD',
    flags=('FFTW_WISDOM_ONLY',)  # Raises ERROR if wisdom does not exist
)

Given that we might use a Class, can we add an ability to cache the wisdom/object (in a Dict) after it has been computed the first time at different lengths? We can also add a class method to precompute the wisdoms once (and possibly save them) AND add another method that allows the user to load wisdom into the class if they've precomputed themselves.

Looking at the pyfftw-based sdp, we basically need to store two wisdoms/objects per len(T) == 2^k. So, for 2^2 to 2^25, we need 24 keys, and each key can be mapped to a tuple like (RFFT wisdom, IRFFT wisdom).

So, creating the wisdom at runtime might be slow but probably still fine if the wisdom is reused many times. Then, the user can choose to precompute the wisdom themselves BEFORE they call SDP for a range of len(T). Technically, this will take the same amount of time as the first sentence in this paragraph but, in both cases, we should give the user the ability to save the wisdom/object and load it back later.

Makes sense.

My guess is that importing is negligible compared to the time it takes to compute the wisdom and/or to complete DAMP

TBD

@seanlaw (Contributor) commented Oct 7, 2025

As shown below, in most cases, challenger_sdp is >= 1. For longer arrays, the difference is negligible. That's okay. The main goal here is to see if the performance for shorter arrays can be improved without losing the performance for longer arrays. So, I think we are good here.

Awesome! One thing that I worry about is the installation of FFTW. I know it's not REALLY our problem, but we should probably point people to the best way to get it installed since it doesn't come with pyfftw.

@NimaSarajpoor (Collaborator, Author) commented Oct 19, 2025

WIP

I know it's not REALLY our problem, but we should probably point people to the best way to get it installed since it doesn't come with pyfftw.

[Note to self] In MATLAB and Colab, FFTW is already available. This can be checked by running apt search libfftw3.

My guess is that importing is negligible compared to the time it takes to compute the wisdom and/or to complete DAMP

TBD

I have been thinking about what our baseline should be here. If I am interested in performing FFT on arrays with different lengths, then I am basically interested in the following baseline:

import time

import numpy as np
import pyfftw

p_min, p_max = 2, 15
p_range = range(p_min, p_max + 1)

timing = np.full(len(p_range), -1.0, dtype=np.float64)
for i, p in enumerate(p_range):
    T = np.random.rand(2 ** p).astype(np.float64)
    
    start = time.perf_counter()  # START
    n = len(T) 
    input_arr = pyfftw.empty_aligned(n, dtype=np.float64)
    rfft_obj = pyfftw.builders.rfft(
        input_arr,
        overwrite_input=True,
        avoid_copy=True,
        n=n,
        threads=1,
    )
    input_arr[:] = T
    rfft_obj.execute()
    stop = time.perf_counter()  # STOP
    
    timing[i] = stop - start

There is no optimization here... it simply creates the object and performs the transform. Now, I am trying to check two things:

(1) What does the performance look like if I export wisdom in advance and use it later?
So, I can simply add wisdom_dict[p] = pyfftw.export_wisdom() at the end of the script above to save the wisdom for the different lengths, and then save the dict via pickle.

with open('wisdom_dict.pkl', 'wb') as f:
    pickle.dump(wisdom_dict, f)

Then, read it in a new process. The new process looks like the baseline above, with the difference that it reads the wisdom from the saved pickle file and imports it.

import pickle
import time

import numpy as np
import pyfftw

# load wisdom
with open('wisdom_files/wisdom_dict.pkl', 'rb') as f:
    wisdom_dict = pickle.load(f)

p_min, p_max = 2, 15
p_range = range(p_min, p_max + 1)

timing = np.full(len(p_range), -1.0, dtype=np.float64)
for i, p in enumerate(p_range):
    T = np.random.rand(2 ** p).astype(np.float64)
    
    start = time.perf_counter()  # START
    pyfftw.import_wisdom(wisdom_dict[p])  # IMPORT WISDOM
    n = len(T) 
    input_arr = pyfftw.empty_aligned(n, dtype=np.float64)
    rfft_obj = pyfftw.builders.rfft(
        input_arr,
        overwrite_input=True,
        avoid_copy=True,
        n=n,
        threads=1,
    )
    input_arr[:] = T
    rfft_obj.execute()
    stop = time.perf_counter()  # STOP
    
    timing[i] = stop - start

AFAIU, in contrast to pyfftw.FFTW, we cannot pass the FFTW_WISDOM_ONLY flag to pyfftw.builders.rfft to make sure it does not re-plan. But we can understand this by timing the script above and plotting its performance w.r.t. the baseline. It should also help us understand the impact of pyfftw.import_wisdom(...) on the running time.
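
(For reference, a sketch of the check that the lower-level interface allows; assuming FFTW_WISDOM_ONLY behaves as documented, plan construction fails when no wisdom is available:)

import numpy as np
import pyfftw

n = 2 ** 10
a = pyfftw.empty_aligned(n, dtype=np.float64)
b = pyfftw.empty_aligned(1 + n // 2, dtype=np.complex128)
try:
    # Succeeds only if wisdom for this exact transform has been imported
    fft_obj = pyfftw.FFTW(a, b, flags=('FFTW_WISDOM_ONLY',))
    print('wisdom was available and reused')
except RuntimeError:
    print('no wisdom for this shape; FFTW would have to re-plan')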

[plot: running time with imported wisdom, relative to the baseline]

The running time when importing wisdom is considerable for smaller p values. (Why? Shouldn't it improve the performance because some of the "planning" computation is bypassed?) The answer is that the computation of twiddle factors happens here again: "WISDOM" only covers planning, i.e., it basically tells FFTW which algorithm it should use for a particular array.

(2) What if I store the rfft object? In this case, I create the objects first and store them in a dictionary. Then, I use those objects in the for-loop.

p_min, p_max = 2, 15
p_range = range(p_min, p_max + 1)

rfft_dict = {}
for p in p_range:
    T = np.random.rand(2 ** p).astype(np.float64)
   
    n = len(T) 
    input_arr = pyfftw.empty_aligned(n, dtype=np.float64)
    rfft_obj = pyfftw.builders.rfft(
        input_arr,
        overwrite_input=True,
        avoid_copy=True,
        n=n,
        threads=1,
    )
    input_arr[:] = T
    rfft_obj.execute()
    rfft_dict[p] = rfft_obj
    

timing = np.full(len(p_range), -1.0, dtype=np.float64)
for i, p in enumerate(p_range):
    T = np.random.rand(2 ** p).astype(np.float64)
    
    start = time.perf_counter()
    rfft_obj = rfft_dict[p]
    # NOTE: write into the plan's own input array; a freshly created
    # empty_aligned array would not be connected to the stored rfft_obj
    rfft_obj.input_array[:] = T
    rfft_obj.execute()
    stop = time.perf_counter()
    
    timing[i] = stop - start

# `itr` here is the repetition index of an outer loop (not shown)
np.save(f'numpy_files/PreComputeObj_timing_{itr}.npy', timing)

[plot: running time with pre-computed rfft objects, relative to the baseline]

@seanlaw (Contributor) commented Oct 21, 2025

What if I store the rfft object? In this case, I create the objects first and store them in a dictionary. Then, I use those objects in the for-loop.

That speedup from storing the rfft object is pretty insane!

@NimaSarajpoor (Collaborator, Author) commented:

That speedup from storing the rfft object is pretty insane!

Yeah... that was really interesting. (Probably, for shorter arrays, the overhead of some checks is considerable when pyfftw's rfft class is instantiated?)


So far, the goal was to understand (1) the impact of WISDOM, and (2) the impact of storing the pyfftw objects in advance. Now we can compare against MATLAB's fft. This time, our baseline is (single-threaded) MATLAB's fft. For pyFFTW, I am considering three cases:

  • pyfftw-naive (creating rfft_obj when needed)
  • pyfftw-performant (creating rfft_obj in advance and storing it in a dict)
  • pyfftw-performant-optimized (same as above, but it uses re-usable input arrays and calls execute())

[plot: the three pyFFTW cases relative to single-threaded MATLAB fft]

As observed, storing the objects in a dict first and calling them later speeds up the process.
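
For reference, a sketch of the difference between the "performant" and "performant-optimized" calling conventions (rfft_obj here is a cached plan object; the variable names are illustrative):

# "performant": call the cached plan object; pyFFTW copies/validates the input
result = rfft_obj(T)

# "performant-optimized": write into the plan's own input array and call the
# raw execute(), skipping the Python-level copying and checks
rfft_obj.input_array[:] = T
rfft_obj.execute()
result = rfft_obj.output_array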

codes:
matlab_fft_timing.m.zip
python_code.zip

@NimaSarajpoor (Collaborator, Author) commented:

I(R)FFT

  • pyfftw-naive (creating irfft_obj when needed)
  • pyfftw-performant (creating irfft_obj in advance and storing it in a dict)
  • pyfftw-performant-optimized (same as above, but it uses re-usable input arrays and calls execute())
  • pyfftw-performant-optimized-njit (same as above, but uses njit to update the array)

[plot: the four pyFFTW IRFFT cases relative to single-threaded MATLAB ifft]

codes:
matlab_ifft_timing.m.zip
python_irfft_timing_files.zip

For now, I am not too worried about the performance of arrays with length <= 2^5, as we should be good when the length is >= 2^5. And if we look at this DAMP webpage, we can see that most of the window sizes are >= 2^5.

@seanlaw (Contributor) commented Oct 25, 2025

For now, I am not too worried about the performance of arrays with length <= 2^5

For small lengths (length <= 2^16), njit might still be the fastest (at least it was according to this plot). We should check this at the end.

Btw, I don't mean to go around in circles! I want to find the simplest solution.

@NimaSarajpoor (Collaborator, Author) commented Oct 29, 2025

For small lengths (length <= 2^16), njit might still be the fastest (at least it was according to #16 (comment)).

I took another look at that plot... I noticed that the spike at the beginning is for the len(Q)==len(T) case. In that special case, we can simply do np.dot(Q, T) and we should be good (this is included in my code below; it is 100-200 times faster). I think this plot is better. It shows that njit gets slow for len(Q) >= 2^7.


I have shared the code below so it is not lost for now. It needs to be re-designed to make it cleaner, as it currently has stand-alone functions to store the rfft/irfft objects. As discussed before, I can/should try to include that in the original sliding_dot_product class. (Maybe introduce a boolean attribute like keep_obj to store the rfft/irfft objects, and then turn it off.)

import time

import numpy as np
import pyfftw


class SLIDING_DOT_PRODUCT:
    # https://stackoverflow.com/a/30615425/2955541
    def __init__(self):
        self.m = 0
        self.n = 0
        self.threads = 1
        self.rfft_obj = None
        self.irfft_obj = None


    def __call__(self, Q, T):
        if Q.shape[0] != self.m or T.shape[0] != self.n:
            self.m = Q.shape[0]
            self.n = T.shape[0]
            self.shape = pyfftw.next_fast_len(self.n)

            self.rfft_input = pyfftw.empty_aligned(self.shape, dtype=np.float64)
            self.rfft_obj = pyfftw.builders.rfft(
                self.rfft_input,
                overwrite_input=True,
                avoid_copy=True,
                n=self.shape,
                threads=self.threads,
            )

            self.irfft_input = pyfftw.empty_aligned(
                1 + self.shape // 2, dtype=np.complex128
            )
            self.irfft_obj = pyfftw.builders.irfft(
                self.irfft_input,
                overwrite_input=True,
                avoid_copy=True,
                n=self.shape,
                threads=self.threads,
            )

        self.rfft_input[: self.m] = Q[::-1] / self.shape  # Reverse/flip Q; pre-scale since execute() skips the IDFT normalization
        self.rfft_input[self.m :] = 0.0
        self.rfft_obj.execute()
        self.irfft_input[:] = self.rfft_obj.output_array

        self.rfft_input[: self.n] = T
        self.rfft_input[self.n :] = 0.0
        self.rfft_obj.execute()
        self.irfft_input *= self.rfft_obj.output_array

        # irfft_input is ready
        self.irfft_obj.execute()

        return self.irfft_obj.output_array[self.m - 1 : self.n]


_sliding_dot_product = SLIDING_DOT_PRODUCT()


def setup(Q, T):
    _sliding_dot_product(Q, T)
    return


def sliding_dot_product(Q, T):
    return _sliding_dot_product(Q, T)
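

# Example usage (a sketch; Q_demo/T_demo are illustrative names, not part of
# the module above): slide a query of length 8 across a T of length 64
Q_demo = np.random.rand(8)
T_demo = np.random.rand(64)
out = sliding_dot_product(Q_demo, T_demo)
assert out.shape[0] == len(T_demo) - len(Q_demo) + 1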
################

def get_rfft_dict(p_min=2, p_max=10, threads=1):
    rfft_collection = {}
    for p in range(p_min, p_max + 1):
        n = 2 ** p
        rfft_obj = pyfftw.builders.rfft(
            pyfftw.empty_aligned(n, dtype=np.float64), 
            threads=threads
        )
        rfft_collection[n] = rfft_obj

    return rfft_collection
    

def get_irfft_dict(p_min=2, p_max=10, threads=1):
    irfft_collection = {}
    for p in range(p_min, p_max + 1):
        n = 2 ** p
        irfft_obj = pyfftw.builders.irfft(
            pyfftw.empty_aligned(1 + n//2, dtype=np.complex128),
            threads=threads
        )
        irfft_collection[n] = irfft_obj

    return irfft_collection
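
# Example (a sketch; the demo_ names are illustrative): grab the cached plan
# for n = 2^10 and transform a random array by calling the plan object
demo_rfft_dict = get_rfft_dict(p_min=2, p_max=10)
demo_out = demo_rfft_dict[2 ** 10](np.random.rand(2 ** 10))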


def sliding_dot_product_main(
    Q, T, rfft_dict, irfft_dict, real_arr, complex_arr
):
    """
    Q : np.array
    T : np.array
    rfft_dict : dict, keys are 2^2, 2^3, ..., 2^pmax, and values are rfft objects
    irfft_dict : dict, keys are 2^2, 2^3, ..., 2^pmax, and values are irfft objects
    rfft_arr : np.array, an empty array with shape 2^pmax, created by pyfftw.empty_aligned
    irfft_arr : np.array, an empty array with shape 2^pmax, created by pyfftw.empty_aligned
    """
    m = len(Q)
    n = len(T)
    
    rfft_obj = rfft_dict.get(n, None)
    irfft_obj = irfft_dict.get(n, None)
    
    if n == m:
        out = np.dot(Q, T)
    elif irfft_obj is None or rfft_obj is None:
        out = sliding_dot_product(Q, T)
    else:
        Fshape = 1 + n // 2
         
        rfft_obj(T)
        complex_arr[:Fshape] = rfft_obj.output_array
        
        real_arr[:m] = Q[::-1]
        real_arr[m:n] = 0 
        rfft_obj(real_arr[:n])
        complex_arr[:Fshape] *= rfft_obj.output_array
        
        irfft_obj(complex_arr[:Fshape])
        
        out = irfft_obj.output_array[m - 1 : n]
   
    return out

##########

# timing

def timing_func(n_iter):
    p_min, p_max = 2, 24

    rfft_dict = get_rfft_dict(p_min=p_min, p_max=p_max, threads=1)
    irfft_dict = get_irfft_dict(p_min=p_min, p_max=p_max, threads=1)
    
    real_arr = pyfftw.empty_aligned(2 ** p_max, dtype=np.float64)
    complex_arr = pyfftw.empty_aligned(2 ** p_max, dtype=np.complex128)
    
    timing_arr = np.full((p_max, n_iter, p_max), -1.0, dtype=np.float64)
    for q in range(p_min, p_max + 1):
        print('q: ', q)
        Q = np.random.rand(2 ** q)
        for itr in range(n_iter):
            for p in range(q, p_max + 1):
                n = 2 ** p
                T = np.random.rand(n)

                start = time.perf_counter()
                sliding_dot_product_main(Q, T, rfft_dict, irfft_dict, real_arr, complex_arr)
                stop = time.perf_counter()

                timing_arr[q-1, itr, p-1] = stop - start

    return timing_arr


print(pyfftw.__version__)
print(np.__version__)

n_iter = 10
timing = timing_func(n_iter)
np.save('pyfftw_sdp_performant.npy', timing)

For MATLAB, we have this:

function [z] = MASS_SDP(y, x)
    % x is the data, y is the query
    m = length(y);
    n = length(x);
    y = y(end:-1:1); % Reverse the query
    y(m+1:n) = 0; % Append zeros
    X = fft(x);
    Y = fft(y);
    Z = X.*Y;
    z = ifft(Z);
    z = z(m:n);
end

function timing = sdp_timing_func()
    n_iter = 10;
    p_min = 2;
    p_max = 24;
    
    % Initialize timing array with -1
    timing = -1.0 * ones(p_max, n_iter, p_max);
    
    for q = p_min:p_max
        disp(q)
        Q = rand(1, 2^q);
        for itr = 1:n_iter
            for p = q:p_max
                T = rand(1, 2^p);
                
                start_time = tic;  % Start timer
                MASS_SDP(Q, T);
                elapsed = toc(start_time);  % Stop timer
                
                timing(q, itr, p) = elapsed; 
            end
        end
    end
end

In the following plot, red means Python's SDP outperforms MATLAB's SDP.

[heatmap over (len(Q), len(T)): red = Python's SDP faster than MATLAB's]

If we do not store the rfft/irfft objects and just create them when needed, then we get the following plot, where blue means MATLAB's SDP outperforms Python's.

[heatmap over (len(Q), len(T)): blue = MATLAB's SDP faster than Python's]

Note: I need to wait a few days until I can work on MATLAB Online again to double-check the performance of the original code shared above. In the meantime, I will work on cleaning up the code.

@NimaSarajpoor (Collaborator, Author) commented:

(1) I added timing_OneShot.py, which is similar to timing.py, except that it avoids making two consecutive calls to sliding_dot_product(Q, T) for inputs T of the same length.

(2) I added challenger2_sdp.py, which allows the user to store rfft/irfft objects with the help of a new method, create_reusable_objects. It then uses those objects, when possible, for computing the sliding dot product of Q and T. A test function was added.


Based on the plot shared in my previous comment, I can focus on len(Q) <= len(T) <= 2^16.

# in timing.sh

./timing_OneShot.py -niter 100 -pmin 2 -pmax 16 -pdiff 100 pyfftw njit challenger2 > timing.csv

[plot: pyfftw vs. njit vs. challenger2 running times for len(Q) <= len(T) <= 2^16]

Observations:
(1) As observed in our previous explorations, njit is faster for len(Q) <= 2^6, len(T) <= 2^16
(2) challenger2.py is faster than pyfftw_sdp since it creates the rfft/irfft objects first and re-uses them


@seanlaw
Do you think we should go with the following steps to finalize our work and close this PR?
(1) Review timing_OneShot.py and add it to repo (either here or in a separate PR)
(2) Compare the performance of challenger2.py with MATLAB's SDP one more time.
(3) Compare the performance of challenger2.py with pyfftw_sdp.py from timing.py perspective
(4) Review challenger2.py. If design is not important, we can add it to repo for now.

Or.... after step 2, once we know that we are good for the SDP part, we can just leave this PR as is and then I can resume my work on DAMP.

@seanlaw (Contributor) commented Nov 1, 2025

@NimaSarajpoor Maybe it's because I haven't looked at SDP in a long time, but I'm finding it hard to follow all of the code changes/files that you are mentioning, so maybe you can help guide my brain. It feels like I'm jumping all over the place, which is probably a bad sign that we are trying to do too many things simultaneously.

In the following plot, red means Python's SDP outperforms MATLAB's SDP.
If we do not store the rfft/irfft objects and just create them when needed, then we get the following plot, where blue means MATLAB's SDP outperforms Python's.

So, things are much faster when we store the rfft/irfft objects? Help me understand why we would ever NOT store the objects? Maybe if we are only calling SDP once? Having said that, it almost feels like the same problem as numba compilation time. So, maybe we just think of the storing of objects as a one-time upfront "fake compilation" cost and not complicate the code by allowing the user to choose whether or not to store the objects. That is, unless it takes a crazy amount of time to store the objects?

(1) I added timing_OneShot.py, which is similar to timing.py, except that it avoids making two consecutive calls to sliding_dot_product(Q, T) for inputs T of the same length.

It's not clear what this is or why we need it. Did we discuss this in the past? Is this an edge case of some sort that differs from what we've already been doing? I just don't like seeing one-offs.

Observations:
(1) As observed in our previous explorations, njit is faster for len(Q) <= 2^6, len(T) <= 2^16
(2) challenger2.py is faster than pyfftw_sdp since it creates the rfft/irfft objects first and re-uses them

In many cases, I think that since you've done the work and are intimately aware of the details, you know what to do next. Additionally, you've done a better job in summarizing the observations. However, you are still missing the "and so what?" after stating the observations (e.g., "so based on this observation, the conclusion might be that we should go with njit for short lengths and challenger2 for longer lengths" or, if not, explain what information is missing)
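
For instance, in code form, that kind of conclusion might look like the following hypothetical dispatch sketch (the function names and thresholds are illustrative and would need to be confirmed against the timing data):

import numpy as np

def sliding_dot_product_dispatch(Q, T):
    if len(Q) == len(T):
        return np.dot(Q, T)  # trivial special case noted earlier
    elif len(Q) <= 2 ** 6 and len(T) <= 2 ** 16:
        return njit_sdp(Q, T)  # hypothetical njit-based kernel
    else:
        return challenger2_sdp(Q, T)  # pyFFTW version with cached plan objects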

I think what's missing is that the goals of your experiments aren't clear to the reader (they might be clear to you) and so it comes across as mixing and matching multiple experiments which is confusing (and I may be adding to the confusion with my comments).

Do you think we should go with the following steps to finalize our work and close this PR?

Similar to the above, I'm missing the common thread/story that walks us through everything. I know that we've branched a lot and some things didn't work and so can you clearly describe all of the things that were tried and succeeded? First, start with describing your ONE CLEAR GOAL and then everything else in your experimentation should be focused on whether or not you've achieved that goal.

Perhaps what we need is to break things down into smaller, digestible PRs?

Do you think we should go with the following steps to finalize our work and close this PR?
(1) Review timing_OneShot.py and add it to repo (either here or in a separate PR)
(2) Compare the performance of challenger2.py with MATLAB's SDP one more time.
(3) Compare the performance of challenger2.py with pyfftw_sdp.py from timing.py perspective
(4) Review challenger2.py. If design is not important, we can add it to repo for now.
Or.... after step 2, once we know that we are good for the SDP part, we can just leave this PR as is and then I can resume my work on DAMP.

So, right now, I don't feel like I have a clear enough picture or understanding to answer this question. In the past, I would've also attempted to write a bunch of code to better understand the problem but I have not done that in this case so I rely on you to shave away the noise and to communicate exactly what I should (and shouldn't) focus on. This means that you'll need to take a step back and look at the broader/bigger picture and only focus on the essential experiment(s) that helped provide clarity to reach your final goal.
