Skip to content

Conversation

@NimaSarajpoor
Copy link
Collaborator

@NimaSarajpoor NimaSarajpoor commented Oct 3, 2023

This PR is a replacement for PR #872.

This PR is to address #606.


I will start with implementing the original algorithm, as proposed in the paper, as close as possible. Then, I will add notes on how to optimize the code. I will consider the MATLAB code too.


Notes

  • In contrast to the original paper, the MATLAB code currently shows the following logic to compute the initial chunksize for backward processing: 2 ^ next_pow2(8 * m), pay attention to 8.

  • As a follow up to previous point: what happens if m is already a "power of two" number?

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@NimaSarajpoor
Copy link
Collaborator Author

@seanlaw
For now, I have only implemented the backward processing algorithm as proposed in the DAMP paper. I have commented on the parts that can be changed to enhance the code.

@codecov
Copy link

codecov bot commented Oct 3, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (857063c) 98.93% compared to head (5c110a1) 98.93%.
Report is 3 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #920   +/-   ##
=======================================
  Coverage   98.93%   98.93%           
=======================================
  Files          84       84           
  Lines       14292    14292           
=======================================
  Hits        14140    14140           
  Misses        152      152           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -0,0 +1,378 @@
{
Copy link
Collaborator Author

@NimaSarajpoor NimaSarajpoor Oct 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line #24.        For the given index i, segmentize the array np.arange(i) into 

Is there a better way to describe the task of this function? How about:

"Given the index i, divide the array np.arange(i) into chunks. Ensure the last chunk's size is chunksize_init, with the preceding chunks doubling in size. The output consists of (start, stop) indices of each chunk."

Also, we may add:

The left-most chunk always start at 0.


Reply via ReviewNB

Copy link
Contributor

@seanlaw seanlaw Oct 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function seems weird. It's not clear why you need to end with the last element being chunksize_init

Something feels backwards here. The intent is hard to follow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, do we actually need to call the function multiple times? Or is it possible to call the function once and then shift the indices somehow?

Copy link
Collaborator Author

@NimaSarajpoor NimaSarajpoor Oct 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something feels backwards here

Well, it is backward (I mean... the process itself is backward). but, I trust your lens! Please allow me to think more and see if I can come up with something that is more elegant.

Also, do we actually need to call the function multiple times? Or is it possible to call the function once and then shift the indices somehow?

I wanted to do that but I decided to start simple. What we can do is to keep shifting the start, stop indices of all chunks by one in each iteration. So, for instance, if I get the chunks for the subsequences in T[:idx]. Then, I can find the chunks for T[:(idx+1)] by just shifting start/stop indices of chunks by one to the right. We just need to keep track of the left-most chunk though as it can become the power two of a number at some point. In such case, the number of chunks will be increased by one.

Copy link
Collaborator Author

@NimaSarajpoor NimaSarajpoor Oct 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[FYR...before I forget]
Another approach is to ALWAYS OFFSET FROM THE QUERY INDEX idx by s, 2s, 4s, 8s, and so on, where s is power of two. The good thing is that the difference between any two consecutive offset is STILL a power of two. This may result in cleaner code. IIRC, I think I saw something similar in the MATLAB code before. Going to check that as well.

Copy link
Collaborator Author

@NimaSarajpoor NimaSarajpoor Oct 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The MATLAB code uses the following lines for updating the start / stop of each chunk:

# Query is at index `i`, i.e. `query = T(i:i+SubsequenceLength-1)`
# X is the chunk size, initially set to `2 ^ nextpow2(8 * SubsequenceLength)`

X_start = i - X + 1 + (expansion_num * SubsequenceLength)
X_end = i - (X / 2) + (expansion_num * SubsequenceLength)

# and then
approximate_distance = min( real(MASS_V2(T(X_start:X_end), query))); 
X = X * 2
expansion_num = expansion_num  + 1

The term (expansion_num * SubsequenceLength) is to take into account the elements of the last subsequence in the new chunk. To keep the length of chunk untouched, the X_start has the same shift.


Note 1:
According to the paper / MATLAB code, the length of chunk, T in core.mass(Q, T), (NOT the number of subsequences) should be a power-of-two.

Note 2:
The reason behind considering the length power-of-two for chunk is to speed up the MASS algorithm (according to the authors)

So, I did a quick check to see how much having the length power-of-two affects the performance.

seed = 0
np.random.seed(seed)

T = np.random.rand(10_000_000)
m = 50

T, M_T, Σ_T, T_subseq_isconstant = stumpy.core.preprocess(T, m)

query_idx = 600000
t_lst = []
for stop in range(2 ** 20 - 1000 - 1, 2 ** 20 + 1000 + 1):
    time_start = time.time()

    QT = core.sliding_dot_product(
        T[query_idx : query_idx + m], 
        T[start : stop],
    )
    
    D = core._mass(
        T[query_idx : query_idx + m],
        T[start : stop],
        QT=QT,
        μ_Q=M_T[query_idx],
        σ_Q=Σ_T[query_idx],
        M_T=M_T[start : stop - m + 1],
        Σ_T=Σ_T[start : stop - m + 1],
        Q_subseq_isconstant=T_subseq_isconstant[query_idx],
        T_subseq_isconstant=T_subseq_isconstant[start : stop - m + 1],
    )

    time_end = time.time()
    t_lst.append(time_end - time_start)

And, the I plot it:

indices = np.arange(2 ** 20 - 1000 - 1, 2 ** 20 + 1000 + 1)
indices = indices[2:]
t_lst = t_lst[2:]

idx = np.flatnonzero(indices == 2 ** 20)[0]

plt.figure(figsize=(20, 5))
plt.scatter(indices[idx], t_lst[idx], color='r', label='chunksize = 2 ** 20')
plt.plot(indices[idx-200 : idx+200], t_lst[idx-200:idx+200]) 
plt.ylabel('running time')
plt.legend()
plt.show()
image

Well, it seems the running time for the chunk size 2 ** 20 is low (not necessarily the lowest) but it should be okay. To remain faithful to the paper, I am going to consider the length power-of-two for each chunk size.

@NimaSarajpoor
Copy link
Collaborator Author

@seanlaw
I have provided a few comments. Please let me know what you think.

@@ -0,0 +1,569 @@
{
Copy link
Collaborator Author

@NimaSarajpoor NimaSarajpoor Nov 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line #28.        while start < chunk_stop:

Is this code readable? I tried to vectorize it but couldn't figure out "how". If the number of subsequences in each chunk were supposed to be a power of two, I think I would be able to vectorize it. However, according to the paper, the size of the chunk itself should be a power of two. In other words, the number of subsequences is like... 2 ** num - m + 1

Since I couldn't find out a cleaner solution, I am going with naive_get_range_damp for now.


Reply via ReviewNB

Copy link
Contributor

@seanlaw seanlaw Nov 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I need to see more examples of what the inputs/outputs are suppose to be in order to understand what is expected. Right now, I'm very confused. Can you give me some concrete examples and any gotchas (i.e., when things aren't perfectly square)?

@@ -0,0 +1,569 @@
{
Copy link
Collaborator Author

@NimaSarajpoor NimaSarajpoor Nov 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line #67.            PL[start : stop - m + 1] = np.minimum(PL[start : stop - m + 1], D)

Previously, we had the variable pruned , which was a boolean vector (of length `len(T)-m+1`), where pruned[i] is True if the i-th subsequence is pruned. And, instead of line above (i.e. line #67), we had:

mask = np.argwhere(D < bsf) + start

pruned[mask] = True

Recall that the paper's original algorithm does not compute PL[i] if pruned[i] is True. It just skips it. In such case, the original algorithm set PL[i] as follows:

PL[i] = PL[i-1]

which makes it difficult to find the correct index of discord. The MATALB code does a hack instead as follows:

PL[i] = PL[i-1] - 0.000001

and this does not seem to be a clean solution.

So, instead, I am updating the (approximate) matrix profile PL by using the computed distance profile D . This helps us to avoid the hack, and I think It should not be computationally more expensive compared to np.argwhere(D < bfs)



Reply via ReviewNB

@@ -0,0 +1,569 @@
{
Copy link
Collaborator Author

@NimaSarajpoor NimaSarajpoor Nov 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lines 60-68 updates the chunks_range by shifting it by one to the right in each iteration. I feel the code is not that clean! Still trying to figure out if there is a better way for updating the chunks range. Any thoughts?


Reply via ReviewNB

@NimaSarajpoor
Copy link
Collaborator Author

NimaSarajpoor commented Nov 19, 2023

@seanlaw

Would you please check out the comments and let me know what you think? In particular, I need your support to figure out:
(1) If there is a better / cleaner way to get the chunks range (see function naive_get_range_damp)
(2) If there is a better way to update the chunks range in each iteration in the main function DAMP

Note [Regarding exclusion zone]:
Also, as discussed, I am considering excl_zone (instead of the hardcoded value m) to make the result align with what one obtains using the stumpy.stump

@seanlaw
Copy link
Contributor

seanlaw commented Nov 20, 2023

@NimaSarajpoor Let's start with these comments first before we dive into DAMP.

@NimaSarajpoor
Copy link
Collaborator Author

NimaSarajpoor commented Nov 26, 2023

A few updates:

(1) The proposed get_damp_range function is added to the notebook.

(2) The running time of DAMP algorithm increases drastically when m becomes larger. See the new example 3 in the notebook. The running time of naive DAMP (stumpy-based DAMP) is ~5 sec, and the running time of DAMP algorithm is ~55 sec. I think it is mostly related to the computation of sliding dot product using core.sliding_dot_product. If we use the NUMBA JIT-compiled function core._ sliding_dot_product instead, the running time becomes ~41 sec. And, if we parallelize core._sliding_dot_product (by passing parallel=True to the njit decorator and using prange), the running time becomes ~12 sec. Still, this is longer than the stumpy-based naive DAMP.

  • Since I do not want to do any premature optimization, I am avoiding making any changes.
  • It might be worth it to check the paper and see if the authors compare DAMP with parallelized STOMP algorithm or not. Is DAMP really better than stumpy.stump in a batch case?

(3) An experimental top-k DAMP code is provided on the supporting webpage. I tried to make some changes on our current DAMP to reflect top-k code. I did a few tests and I think it works (i.e. the results are the same as the top-k discords from naive DAMP). Btw, I did not push these changes to this PR.

How it works:
In short, we just set bsf (best-so-far score) to the k-th discord score, and not the top-1. In contrast to the stumpy-based naive DAMP, we need to set k in the beginning of the algorithm.

(4) Is only ONE discord interesting?
I am not sure... but I FEEL that it should be useful in an online case (i.e. streaming data) when a user wants to see if the current subsequence (containing the latest data) is an anomaly or not when it is compared with a set of usual patterns (e.g. see GOLDEN DAMP in the paper)

@NimaSarajpoor
Copy link
Collaborator Author

NimaSarajpoor commented Dec 1, 2023

didn't realize that STUMPY was that performant even for n = 100_000!
This is exciting to see!

Indeed exciting!

For a CPU that was released in Q2 of 2018 has only 6 cores/threads, 22.3 seconds seems really fast for 8 million data points. I'd be curious if we could reproduce these results using their Matlab code or our DAMP code.

According to this video, the split_index seems to be 2500, and the window size is 80. However, according to this, the window size seems to be set to 94. The data itself is not a 1D array. It is a 2D array (a matrix), with two rows, and each row has ~10M data points (not 8 million)

But is this really scalable? After you've investigated, it might be good to check with Eamonn and his team.

I will get to this after the investigation.


Before I forget, I am going to share a couple of observations. We may discuss/ignore them later after trying out the case you suggested above.

(1) Regarding the trillion-experiment data:
The data and the code is provided at the bottom of this page:
https://sites.google.com/view/discord-aware-matrix-profile/dataset#h.dwws21dk9mej

The data is a txt file, containing values, each in one line. The value returned by wc -l command is 118-540-512, which is around 100M (so not trillion?)

The data is about ~3GB. The maximum size that can be uploaded to MATLAB online server is 256 MB. So, I think I need to split it to several files, each with maximum size of 256MB, and then reconstruct the full time series on MATLAB after uploading the files there.

(2) I think the performance can be affected by the location of discord. So, in addition to the running time, it is worth it to report the location of discord as well.

@seanlaw
Copy link
Contributor

seanlaw commented Dec 1, 2023

I am going to share a couple of observations

Excellent. Please let me know if there is anything I can do to help support you!

@NimaSarajpoor
Copy link
Collaborator Author

NimaSarajpoor commented Dec 2, 2023

According to this video, the split_index seems to be 2500, and the window size is 80. However, according to this, the window size seems to be set to 94. The data itself is not a 1D array. It is a 2D array (a matrix), with two rows, and each row has ~10M data points (not 8 million)

I couldn't find the time series with 8M data points. The mit_long_term_ecg14046.mat data provided for Fig. 12 on the supporting webpage has two rows, each with 10M data points. In the following cases, I considered the first row of this data, and I just considered the first 8M data points. The split_idx is set to 2500. Both Python and MATLAB codes were run on MATLAB online server. Thank you @seanlaw for setting this up!.

Case 1-1:

m=80 Python DAMP MATLAB DAMP 2.0
discord loc 973932 973933
discord dist 10.433 10.4333
running time [sec] 282.23 1044.47

Case 1-2:

m=94 Python DAMP MATLAB DAMP 2.0
discord loc 3711 3712
discord dist 11.254 11.254
running time [sec] 428 1162

Note 1: Indexing in MATLAB starts at 1.
Note 2: The MATLAB code uses DAMP 2.0, which is slightly different:

  • In the beginning of the algorithm, the program computes the full distance profile for the first ~ 16 * m subsequences located AFTER split_idx (with a mix of forward_process). And then, start using the backward-process / forward-process for the remaining test points (i.e. the points that are located AFTER split_idx + 16 * m)
  • Also, the initial chunksize in the backward process of DAMP 2.0 is the minimum power of two that is $\ge 8m$

So, if I have a data that has some repetitive patterns every, say, 2 * m period, then approx. 6 * m subsequences are being processed in the first chunk (almost) unnecessarily(?). However, it is possible to just early abandon the neighbours of a subsequences after considering just the first 2 * m. So, I am trying to understand the logic behind 8m vs m. I think it all comes to this point:

Is the running time of MASS(Q, T1 U T2) (roughly) equal to MASS(Q, T1) + MASS(Q, T2)?

If yes... then we may even avoid doubling the chunksize in each iteration of backward process, and just go backward using the same chunksize!

Next step
I can remove that first part of DAMP 2.0 that considers full distance profile for ~ 16 * m, and then change 8m to m for the initial chunksize, or change m to 8m in our python DAMP, and compare again...

[Update]
Okay... I modified the MATLAB code to avoid computing full distance profile for the first 16*m subsequences located AFTER split_idx. I also changed the initial chunksize from next_pow2(8m) to next_pow2(m).

Now, the computing time of MATLAB in Case 1-1 becomes 152 seconds.
And, the computing time of MATLAB in Caes 2-2 becomes 292 seconds.

[Update 2]
Adding njit decorator with fastmath results in reducing the running time of Python DAMP to 44 seconds for m=80 and to 72 seconds for m=94 case!
(Note: I had to use core._sliding_dot_product instead of core.sliding_dot_product to allow the functions backward / forward process work with njit decorator)


I think it is important to test the performance when m is large. So, I tested it for m=500.

m=500 MATLAB DAMP 2.0 MATLAB DAMP Python DAMP with njit
discord loc 3691 3691 3690
discord dist 28.216 28.216 28.216
running time [sec] TBD 1196 427

@seanlaw
Copy link
Contributor

seanlaw commented Dec 3, 2023

@NimaSarajpoor Thanks for running the comparison. From what I can tell, the Python implementation seems to be around the same performance as Matlab (maybe a little faster)? What should be the next steps? Do we have questions that we should pose to Eamonn (especially with regards to reproducibility)?

@NimaSarajpoor
Copy link
Collaborator Author

NimaSarajpoor commented Dec 8, 2023

@seanlaw
I tried to use stumpi-like concept (i.e. shifting QT and updating it as we consider new data points in T)

https://github.com/TDAmeritrade/stumpy/blob/21840d773470effd8bc3cd46c2ba1499f7f50bc1/stumpy/stumpi.py#L426

However, I think:
(1) it can take more time to implement (as one needs to carefully track the slice of T that is processed and bypass QT computation without re-computing chunks range) ,
and
(2) I have been doing pre-mature optimization!

So, for now, I just added @nji decorators.


From what I can tell, the Python implementation seems to be around the same performance as Matlab (maybe a little faster)?

Correct. I just added @njit decorator with fastmath flag (but no parallelism) and its performance becomes faster than MATLAB DAMP and MATLAB DAMP 2.0.

What should be the next steps?

(1) I think we can start reproducing a couple of figures of the paper.
(2) We can think about when it makes sense to use DAMP instead of stump (because stump uses parallel computing).
(3) Add param k (for top-k), and enhance the code accordingly.

Do we have questions that we should pose to Eamonn (especially with regards to reproducibility)?

I can think of three questions:

  • Where is that 8M data? Because, the provided dataset has two time series each with 10M data points.
  • Has DAMP been compared with other options like stumpi, or stump (which uses parallel computing) ?
  • Most examples provided on the supporting webpage except two have small m. I was wondering if the performance is tested for large m particularly when there is less periodic behaviour in the data.

Regarding the last item (i.e. performance of DAMP when window size is large)
To do a quick comparison, I considered the first 100k datapoints provided here.

Then:

# python
data = np.array(loadmat('./DAMP_data/mit_long_term_ecg14046.mat')['mit_long_term_ecg_14046'][0]).astype(np.float64)

l = 100000 # length of data
split_idx = 2500
m in range(50, 1001, 50)
image

@seanlaw
Copy link
Contributor

seanlaw commented Dec 8, 2023

(1) I think we can start reproducing a couple of figures of the paper.
(2) We can think about when it makes sense to use DAMP instead of stump (because stump uses parallel computing).
(3) Add param k (for top-k), and enhance the code accordingly.

Cool! I think reproducing the figures/performance found in the paper is key.

Where is that 8M data? Because, the provided dataset has two time series each with 10M data points.

We should definitely get some clarification for this.

Has DAMP been compared with other options like stumpi, or stump (which uses parallel computing) ?

Note that Eamonn likely has no idea what stumpi or stump is since we've renamed the original functions. Instead, refer to stompi and stomp/mpx

Most examples provided on the supporting webpage except two have small m. I was wondering if the performance is tested for large m particularly when there is less periodic behaviour in the data.

Based on your test of this, DAMP seems pretty bad as window size increases. Can you confirm that this is also true when using the MATLAB code? If "yes", then we should definitely ask Eamonn and his students.

@NimaSarajpoor
Copy link
Collaborator Author

NimaSarajpoor commented Dec 14, 2023

For sliding dot product between a query, Q, and a time series T, I tried MATLAB (fft) and Python (fft / njit). The computing time for these three cases are shown below.

image

Notice that Python with njit seems to be dependent on the window size. However, that is not the case for Python fft (with SciPy). However, to decorate the backward / forward functions with njit decorator (to speed up computation), the sliding dot product cannot take advantage of scipy fft as Numba does not support that!

Most examples provided on the supporting webpage except two have small m. I was wondering if the performance is tested for large m particularly when there is less periodic behaviour in the data.

Based on your test of this, DAMP seems pretty bad as window size increases. Can you confirm that this is also true when using the MATLAB code?

Now that we have checked the performance of sliding dot product, I will check the MATLAB DAMP performance considering different window size as you suggested. Will provide an update.

@seanlaw
What do you think? Please let me know if you have any suggestion.

@seanlaw
Copy link
Contributor

seanlaw commented Dec 14, 2023

Notice that Python with njit seems to be dependent on the window size. However, that is not the case for Python fft (with SciPy). However, to decorate the backward / forward functions with njit decorator (to speed up computation), the sliding dot product cannot take advantage of scipy fft as Numba does not support that!

Yikes, our njit version is pretty bad. When you say "scipy fft", are you referring to scipy.signal.convolve? Note that convolve requires executing an FFT 3 times so please make sure that you are doing the same thing everywhere and not simply a single "FFT". Additionally, I vaguely remember reading scipy.signal.convolve will attempt to use heuristics to choose the fastest algorithm (based on the length of the array). I believe that a different algo is used when the length is over 500 in length and there may be an even better algorithm for longer time series. I think it would be worthwhile testing longer time series.

Can you please share the code that you used?

I think it might be worthwhile posting any comparisons to both the numba discourse channel to see if numba can get closer (either by dot or by fft/convolve) and also post to the scipy Github issues.

Update

It appears that numba now supports np.convolve according to a quick search? This might be relevant to compare

Note, at some point, we may want to implement our own parallel convolve (FFT) function from scratch without relying on numpy or scipy, which is likely using a much slower np.dot underneath the hood.

@NimaSarajpoor
Copy link
Collaborator Author

NimaSarajpoor commented Dec 14, 2023

are you referring to scipy.signal.convolve? Note that convolve requires executing an FFT 3 times so please make sure that you are doing the same thing everywhere and not simply a single "FFT"

Yes... Below, I am showing what I meant by each case :

  • MATLAB fft --> The part of MATLAB MASS function that computes the sliding dot product with help of fft
  • Python scipy fft --> core.sliding_dot_product, which uses scipy.signal.convolve
  • Python njit --> core._sliding_dot_product, which uses njit on top of np.dot

Btw, I think it makes sense that the computing time in the last case depends on the window size m as it simply uses np.dot. So, this computation has $\mathcal{O}({nm})$ time complexity. However, computing the sliding dot product via fft trick is said to have $\mathcal{O}({n\log{n}})$ time complexity.

Can you please share the code that you used?

All the following three cases were run on the MATLAB online server.

Case: [MATLAB] sliding dot product from MASS algorithm

# MATLAB 

t_lst = [];

idx = 1_000_000;
T = load("mit_long_term_ecg14046.mat");
T = T.mit_long_term_ecg_14046(1, 1:idx);
n = length(T);

m_values = 99:1:1000;
for m = m_values
    Q = T(1:m);
   
    t = tic; 
    Q = Q(end:-1:1); %Reverse the query
    Q(m+1:n) = 0;  %aappend zeros

    X = fft(T);
    Y = fft(Q);
    Z = X.*Y;
    z = ifft(Z);
    
    t_lst(end+1) = toc(t);
end

# later,  the slice `z(m:n)` is considered to compute the distance profile

Case: [Python] core.sliding_dot_product(Q, T), which uses scipy.signal.convolve

from stumpy import core

idx = 1_000_000
T = np.array(loadmat('./mit_long_term_ecg14046.mat')['mit_long_term_ecg_14046'][0]).astype(np.float64)[:idx]

t_lst = []
for m in range(99, 1000):
    Q = T[:m]
    t1 = time.time()
    core.sliding_dot_product(Q, T)
    t2 = time.time()
    t_lst.append(t2 - t1)

Case: [Python] core._sliding_dot_product(Q, T), which uses njit decorator on top of np.dot

from stumpy import core

idx = 1_000_000
T = np.array(loadmat('./mit_long_term_ecg14046.mat')['mit_long_term_ecg_14046'][0]).astype(np.float64)[:idx]

t_lst = []
for m in range(99, 1000):
    Q = T[:m]
    t1 = time.time()
    core._sliding_dot_product(Q, T)
    t2 = time.time()
    t_lst.append(t2 - t1)

I vaguely remember reading scipy.signal.convolve will attempt to use heuristics to choose the fastest algorithm (based on the length of the array).

Right. I read the doc. It has a parameter called method that is set to auto by default, which automatically chooses between fft and another method.

I think it would be worthwhile testing longer time series.

Will try it out. Note that this will affect the case that uses scipy.signal.convolve. Still, we cannot decorate it with njit (unless we use np.convolve with Numba as you suggested), which seems to have a considerable impact on reducing the running time of DAMP. I will try to check the performance of DAMP without njit decorator and using core.sliding_dot_product. I GUESS that its running time should not depend on m but it cannot outperform the case where DAMP's private functions are decorated with njit

It appears that numba now supports np.convolve according to a quick search? This might be relevant to compare

Interesting! I will see if I can easily/safely replace scipy convolve with numpy convolve, and then compare again.

Note, at some point, we may want to implement our own parallel convolve (FFT) function from scratch without relying on numpy or scipy, which is likely using a much slower np.dot underneath the hood.

I understand. In fact, I already tried rocket-fft, which supports Numba for fft, and then I tried to do what MATLAB code does in python using np.fft. However, I did not notice any improvement in the performance. I did not put much time into it though.

I think it might be worthwhile posting any comparisons to both the numba discourse channel to see if numba can get closer (either by dot or by fft/convolve) and also post to the scipy Github issues.

@seanlaw
... get closer to the MATLAB's performance? Will do so 👍

@seanlaw
Copy link
Contributor

seanlaw commented Dec 14, 2023

I was curious as to why Matlab's FFT might be faster and came across this post.

@NimaSarajpoor
Copy link
Collaborator Author

I was curious as to why Matlab's FFT might be faster and came across this post.

Also: https://www.reddit.com/r/Julia/comments/5wel0j/fft_speed_vs_matlab/

@seanlaw
Copy link
Contributor

seanlaw commented Dec 14, 2023

Also: https://www.reddit.com/r/Julia/comments/5wel0j/fft_speed_vs_matlab/

In 2017, I experimented with FFTW and PyFFTW but hadn't noticed any significant performance difference while noting that they were much harder to use (due to its need to make "plans"). I understand that FFTW uses a ton of tricks and optimizations underneath the hood that could help make DAMP faster but, after a quick search, it looks like FFTW's license (GPL) is incompatible with STUMPY's (BSD3) license. At some point, we may just need to accept our performance-dependency trade off.

Maybe there's a nice way to detect whether pyfftw+FFTW3 are installed and, if so, hand off the calculations. Otherwise, default to scipy.signal.convolve. Or simply set a config.FFTW = True flag to force STUMPY to monkey patch itself or something like that. Regardless, we should create a new Github issue.

@NimaSarajpoor
Copy link
Collaborator Author

At some point, we may just need to accept our performance-dependency trade off.

I understand. The best option is to see if we can improve the performance with the current dependencies. If not, then...

Maybe there's a nice way to detect whether pyfftw+FFTW3 are installed and, if so, hand off the calculations. Otherwise, default to scipy.signal.convolve. Or simply set a config.FFTW = True flag to force STUMPY to monkey patch itself or something like that.


Regardless, we should create a new Github issue.

It makes sense. I created the issue #938 to continue our conversation regarding the performance of sliding dot product.

@NimaSarajpoor
Copy link
Collaborator Author

NimaSarajpoor commented Dec 18, 2023

We recently noticed that Python DAMP's running time increases as window size, m, gets larger and larger. To investigate MATLAB DAMP regarding same matter, we first started with comparing MATLAB MASS (with fft trick) and PYTHON _MASS. We noticed that the performance of stumpy.core.sliding_dot_product, which uses scipy.signal.convolve (fft trick), seems to be independent of window size. This is what we want as MATLAB MASS function shows similar behaviour. However, this function cannot be njit-decorated. An alternative function, stumpy.core.sliding_dot_product is njit-decorated; however, it uses np.dot for the sliding dot product. This can results in having high computing time when m is large (see the following experiments).

(1) Comparing the "sliding dot product" functions
#920 (comment)

(2) Comparing Python Naive (i.e. with STUMP using multiple threads) vs Python DAMP (using one thread but nit-decorated with fast math. We used core._sliding_dot_product)
#920 (comment)

Based on your test of this, DAMP seems pretty bad as window size increases. Can you confirm that this is also true when using the MATLAB code? If "yes", then we should definitely ask Eamonn and his students.

Recall that the current version of MATLAB DAMP provided on the supporting webpage has some differences compared to the original algo proposed in the paper. To this end, I am going to show the result for both versions. To reproduce the following result, here is the code:

# MATLAB

running_time_lst_comp = [];
discord_lst_comp = [];

m_values = 50:50:1000;

idx = 100000;
T = load("mit_long_term_ecg14046.mat");
T = T.mit_long_term_ecg_14046(1, 1:idx);
location_to_start_processing = 2500;
for m = m_values
    t = tic;
    
    [discord_score,position] = DAMP(T, m, location_to_start_processing,'lookahead', m);
    
    running_time_lst_comp(end+1) = toc(t);
    discord_lst_comp(end+1) = position;
end

(1) MATLAB DAMP Original
where, the initial chunksize is next_pow2(m).

The following figure shows the running time of DAMP original for different window sizes. Group of adjacent datapoints with the same color has the same next_pow2(m) value, which is the initial chunksize (in moving backward) and the lookahead size (for forward process).

image

One reason for having ascending trend for group of window sizes with similar next_pow2(m) can be the fact that the initial chunk will have less and less subsequences as m gets larger and larger. For instance, for both m=300 and m=500, the value of next_pow2(...) is 512. In the former case (i.e. m=300), we have 213 subsequences in our first initial chunk size. However, in the latter case (i.e. m=500), we have 13 subsequences. So, maybe it is more efficient to consider more subsequences per chunk (?)... let's see the next case!

(2) MATLAB DAMP v2
where, the initial chunksize is next_pow2(16*m).

image

Compared to the previous case, we can see a slightly larger SLOPE in the running time for a group of adjacent data points. One reason can be: "For the same window size between the two cases, i.e. DAMP original and DAMP v2, DAMP v2 needs to compare the query subsequence agains more subsequences as each chunk is larger in this version of DAMP.

Now, let's get back to our question.

DAMP seems pretty bad as window size increases. Can you confirm that this is also true when using the MATLAB code?

According to my experiments, I think it is true.


Let's compare MATLAB DAMP with Python DAMP (njit version), for different window sizes, by running the latter on the MATLAB online services.

image

Now, let's compare again but with larger window size, [3500, 4000, 4500]

image

We can now see the impact of higher time complexity of sliding dot product in Python.


Finally, let's just focus on MATLAB DAMP and consider a wide range of values np.arange(250, 4500+250, 250) for window size m to better see the impact of window size on the performance MATLAB DAMP. The split index is set to 20_000

Group of adjacent data points with similar color are associated with window sizes that have the same next_pow2(m) value.

image

As an aside, I should mention that the MATLAB code has the following block that gives warning and suggests a window size when the provided value by user is either too small or too large

# MATLAB

if SubsequenceLength <= 10 || SubsequenceLength > 1000
        [autocor,lags] = xcorr(T,3000,'coeff');
        [~,ReferenceSubsequenceLength] = findpeaks(autocor(3010:4000),lags(3010:4000),'SortStr','descend','NPeaks',1);
        ReferenceSubsequenceLength(isempty(ReferenceSubsequenceLength))=1000;
        ReferenceSubsequenceLength = floor(ReferenceSubsequenceLength);

        disp("WARNING: ");
        disp("The subsequence length you set may be too large or too small");
        disp(strcat("For the current input time series, we recommend setting the subsequence length to ",num2str(ReferenceSubsequenceLength)));
        fprintf("------------------------------------------\n\n");
    end
  • Why 1000 as the upper bound for SubsequenceLength?
  • Why 3000 as the maximum number of lags?

What does the above results mean for us?

(1) The performance depends on size of m, even in MATLAB DAMP

(2) The sliding dot product without fft trick has the time complexity O(nm). We can see the high running time of Python DAMP when m is large

@seanlaw
Copy link
Contributor

seanlaw commented Dec 18, 2023

Thanks @NimaSarajpoor

@Kotrix
Copy link

Kotrix commented Jan 18, 2024

Thanks for the good work @NimaSarajpoor

Do you plan to add anything else or the DAMP is ready to roll?

One comment:

The data is a txt file, containing values, each in one line. The value returned by wc -l command is 118-540-512, which is around 100M (so not trillion?)

From slides here: https://docs.google.com/presentation/d/1tQQfKuKrOa3j5-WJ9peoYfZGOnJbBcl9/edit#slide=id.p31

we count a trial successful if the top-1 discord is found in the first 100,000 datapoint (created by [e]), rather than from the appended ninety-nine million nine hundred thousand datapoints. We conducted ten thousand such trials, and found no false positives.
100 million times 10 thousand is 1 trillion

@NimaSarajpoor
Copy link
Collaborator Author

NimaSarajpoor commented Jan 19, 2024

@Kotrix

Do you plan to add anything else or the DAMP is ready to roll?

Not ready yet. We are currently focusing on implementing numba-friendly Fast Fourier Transform (FFT), which is an important component of DAMP (see #938 and #939 to track the progress for FFT).

One comment:

The data is a txt file, containing values, each in one line. The value returned by wc -l command is 118-540-512, which is around 100M (so not trillion?)

From slides here: https://docs.google.com/presentation/d/1tQQfKuKrOa3j5-WJ9peoYfZGOnJbBcl9/edit#slide=id.p31

we count a trial successful if the top-1 discord is found in the first 100,000 datapoint (created by [e]), rather than from the appended ninety-nine million nine hundred thousand datapoints. We conducted ten thousand such trials, and found no false positives.
100 million times 10 thousand is 1 trillion

Ahhh.... good catch! Thanks! 😄

@seanlaw
Copy link
Contributor

seanlaw commented Aug 8, 2025

@NimaSarajpoor I've been wondering if I've overthought this and that maybe we could/should attempt to reproduce DAMP using pyFFTW and see how fast it is. Then, if it is indeed fast (with pyFFTW), then we make it an optional dependency (i.e., give the user a warning that they aren't using pyFFTW with DAMP and they should consider installing it for better performance).

@NimaSarajpoor
Copy link
Collaborator Author

@NimaSarajpoor I've been wondering if I've overthought this and that maybe we could/should attempt to reproduce DAMP using pyFFTW and see how fast it is. Then, if it is indeed fast (with pyFFTW), then we make it an optional dependency (i.e., give the user a warning that they aren't using pyFFTW with DAMP and they should consider installing it for better performance)

Given the complexity of the FFT and the fact that STUMPY's main focus is Matrix Profile, this sounds a good idea! As a side... this is my first comment after a long pause. Hopefully it will continue 🤞

@NimaSarajpoor
Copy link
Collaborator Author

NimaSarajpoor commented Aug 20, 2025

[Update]
After running the examples of the notebook, I noticed that pyfftw-based sliding dot product might not be able to improve the performance alone, and allowing the sliding_dot_product function to switch between algorithms should help. I recall that there was a discussion around having a one-time process that detects the best sliding-dot-product function for different lengths of Q and T. Going to add that piece.

@seanlaw
Copy link
Contributor

seanlaw commented Aug 20, 2025

I don't know if it was for any Q and T but I think the "planning" and "wisdom" is specific to a given Q and T pair. So, if you repeatedly perform FFT on the same length Q and T, then you can benefit from the existing "plan". However, if your lengths vary then you will likely need a plan for all of the different lengths. I don't know this for a fact but that is what I vaguely remember so please confirm.

@NimaSarajpoor
Copy link
Collaborator Author

NimaSarajpoor commented Sep 5, 2025

but I think the "planning" and "wisdom" is specific to a given Q and T pair. So, if you repeatedly perform FFT on the same length Q and T, then you can benefit from the existing "plan". However, if your lengths vary then you will likely need a plan for all of the different lengths.

I did some research, and this is correct. Let's focus on one array for now. For the array arr, the following code should build a plan:

import pyfftw
obj = pyfftw.builders.rfft(np.empty(len(arr)), ...)

and then, to calculate RFFT:

RFFT_arr = obj(arr)

IIUC, as long as I keep that obj, I should be able to re-use the plan for faster computation of FFT for any array with length len(arr). And if the length of array changes, then my obj is going to be new, and cannot access that previous plan anymore. I added the following code as "warm up" to DAMP. It creates RFFT / IRFFT plans for arrays with lengths 2^1, 2^2, ..., 2^20, and save them in dictionaries.

rfft_objects = {}  # to save RFFT plans
irfft_objects = {}  # to save IRFFT plans

for p in range(1, 21):
    n  = 2 ** p
    rfft_objects[n] = pyfftw.builders.rfft(
        np.empty(n), overwrite_input=True, n=n, threads=1
    )
    irfft_objects[n] = pyfftw.builders.irfft(
        rfft_objects[n].output_array, overwrite_input=True, n=n, threads=1
    )

For a given query Q in DAMP, we need to perform sliding_dot_product(Q, T) on many vary-in-length Ts (chunks of data). The length of first chunk is next_pow2(len(Q)), and the second chunk is 2 * next_pow2(len(Q)), and so on. The length of the last chunk may not be a power of two. For now, I am finding the next_pow2 of that last chunk so that I can leverage the existing plans for that chunk too. So, I've added the following code to DAMP:

# pre-compute plans
rfft_objects = {}
irfft_objects = {}
for p in range(1, 21):
    n  = 2 ** p
    rfft_objects[n] = pyfftw.builders.rfft(
        np.empty(n), overwrite_input=True, n=n, threads=1
    )
    irfft_objects[n] = pyfftw.builders.irfft(
        rfft_objects[n].output_array, overwrite_input=True, n=n, threads=1
    )

# re-usable arrays 
out = np.empty(2 ** 20, dtype=np.complex128)  # re-usable array to store output of sliding dot product

t1 = time.time()
_DAMP(..., rfft_objects, irfft_objects, out)
t2 = time.time()

And then, _DAMP calls the sliding-dot-product function:

def next_pow2(val):
    return int(math.pow(2, math.ceil(math.log2(val))))


def pyfftw_sliding_dot_product(Q, T, rfft_objects, irfft_objects, out):
    m = len(Q)
    n = len(T)
    
    s = next_pow2(len(T))
    n_out = s // 2 + 1
    
    Q_pad = np.empty(s, dtype=np.float64)
    Q_pad[:m] = Q[::-1]
    Q_pad[m:] = 0.0
    
    T_pad = np.empty(s, dtype=np.float64)
    T_pad[:n] = T
    T_pad[n:] = 0.0
    
    rfft_plan = rfft_objects[s]
    out[:n_out] = rfft_plan(Q_pad)
    out[:n_out] = np.multiply(out[:n_out], rfft_plan(T_pad))
     
    irfft_plan =  irfft_objects[s]
    return irfft_plan(out[:n_out]).real[m-1:n]


def sliding_dot_product(Q, T, rfft_objects, irfft_objects, out):
    if len(Q) <= 64 or len(Q) == len(T):
        return njit_sliding_dot_product(Q, T)
    else:
        return pyfftw_sliding_dot_product(Q, T, rfft_objects, irfft_objects, out)

For the example 3 of the notebook, I can see a performance boost. However, the performance is just close to the performance of the naive DAMP. Am I missing something here? As a side, note that functions are no longer njit-decorated.

@seanlaw
Copy link
Contributor

seanlaw commented Sep 5, 2025

However, the performance is just close to the performance of the naive DAMP. Am I missing something here? As a side, note that functions are no longer njit-decorated.

So, the short answer is "I don't know". However, maybe this is a good opportunity to take a step back and

  1. Directly compare the FFT sliding dot product performance between Matlab FFTW and pyFFTW. Focusing on that will help us understand whether it is an pyFFTW performance issue (e.g., Wisdom)
  2. If performance from step 1 is the same then we should compare the Matlab DAMP implementation to your new pyFFTW DAMP implementation to identify which functions are slow
  3. Maybe Example 3 is too small/short of an example? Also, if the performance is close to naive DAMP then that sounds like a 5x improvement compared with your previous implementation (looking at the notebook results):
Screenshot 2025-09-05 at 2 18 27 PM

Does that help @NimaSarajpoor?

@NimaSarajpoor
Copy link
Collaborator Author

NimaSarajpoor commented Sep 6, 2025

@seanlaw

  1. Directly compare the FFT sliding dot product performance between Matlab FFTW and pyFFTW. Focusing on that will help us understand whether it is an pyFFTW performance issue
  2. If performance from step 1 is the same then we should compare the Matlab DAMP implementation to your new pyFFTW DAMP implementation to identify which functions are slow

Good point! Will try to limit MATLAB FFTW to single thread so that I can do apple-to-apple comparison with pyFFTW with one thread. Then, I will remove that limit in MATLAB, and in pyFFTW, I will play with the number of threads.

  1. Also, if the performance is close to naive DAMP then that sounds like a 5x improvement compared with your previous implementation (looking at the notebook results)

Correct! that was the good part of the story that I did not share. That boost in performance indicates that for larger cases, FFT-based sliding-dot-product performs better. But...I was hoping to see a much better performance than naive DAMP. Maybe it is because of what you mentioned....

Maybe Example 3 is too small/short of an example?

...that the example is not large enough, and naive DAMP still performs better thanks to the parallel mechanism in stumpy.stump.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants