Speed up DAMP #904
-
Maybe I'm not getting your point, but I'm wondering if you are conflating two things here:
The best place to use multiple threads is when you have processes (or slices) that are completely independent of each other and do not rely on the (final or intermediate) outputs of the other threads. In your case, it appears that there is a global/shared …
-
Item (2) occurs after each thread finishes its local DAMP pass, each on its own, without needing the best-so-far distance from the other threads. Let's say I have two threads and my time series has length 100M. According to the original DAMP, we start from the beginning and update the best-so-far discord distance as we move toward the end of the time series. This computation happens in one thread. However, while we are going through the first half (i.e. the first 50M data points), we can, in parallel, compute a best-so-far distance (bsf) on the second half of the time series [this procedure does not depend on the best-so-far distance of the first half]. The bsf computed in the second thread is just an approximation: it is an upper bound on the discord distance of any potential discord located in that slice. At the end, when the computation in both threads finishes, I can compare the bsf of the first half [which is exact] with the bsf of the second half [which is approximate, an upper bound]. If the former is still larger, I can skip the next 50M data points.
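If I understand the two-thread scheme correctly, it could be sketched roughly as follows. All names here are mine: `left_nn_dist` and `best_so_far` are brute-force, non-z-normalized stand-ins for DAMP's backward processing, there is no forward pruning, the first `m` subsequences of the second half are skipped for simplicity (they have no in-slice left neighbor and would need the exact pass anyway), and the pure-Python loops hold the GIL, so this only illustrates the logic rather than a real speedup:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def left_nn_dist(T, i, m, lo):
    """Distance from T[i:i+m] to its nearest non-overlapping left
    neighbor whose start index is >= lo (brute force, for clarity)."""
    q = T[i : i + m]
    return min(np.linalg.norm(q - T[j : j + m]) for j in range(lo, i - m + 1))

def best_so_far(T, m, start, stop, lo):
    """Max left-NN distance over subsequences starting in [start, stop),
    restricting left neighbors to start indices >= lo."""
    return max(left_nn_dist(T, i, m, lo) for i in range(start, stop))

rng = np.random.default_rng(0)
T = rng.standard_normal(400)
n, m = len(T), 10
half = n // 2

with ThreadPoolExecutor(max_workers=2) as ex:
    # Thread 1: exact pass over the first half, all left neighbors allowed.
    f1 = ex.submit(best_so_far, T, m, m, half, 0)
    # Thread 2: approximate pass over the second half, with left neighbors
    # limited to the second half itself -> an upper bound on the true
    # (full-neighborhood) discord distance of that slice.
    f2 = ex.submit(best_so_far, T, m, half + m, n - m + 1, half)
    bsf_exact, bsf_ub = f1.result(), f2.result()

if bsf_exact >= bsf_ub:
    print("second half can be skipped entirely")
else:
    print("second half must be re-examined with its full left neighborhood")
```

The upper-bound property holds because restricting the left-neighbor set can only increase a subsequence's nearest-neighbor distance; adding the neighbors back can only shrink it.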
-
[If you are not familiar with DAMP, first see #606]
According to the paper, the DAMP algorithm is designed to perform early abandoning, which can speed up discord discovery. Recently, I have been thinking about whether it is possible to take advantage of parallelism as well. I have an idea but have not tested it yet. To avoid premature optimization, I prefer to just explain the idea here for our future reference.
Let's say I have a time series `T` with length 1250, and the first 250 data points are considered the training part. Also, let's say I am looking for the top discord with length `m` in `T[250:]`. The existing DAMP algorithm starts at the subsequence `T[250:250+m]` and does backward processing to see whether its distance to its left 1-nearest-neighbor is greater than the best-so-far discord distance or not. It also performs forward processing to prune some of the forthcoming subsequences. It then jumps to the next not-yet-pruned subsequence, repeats the backward/forward processing, and so on.

Let's see if we can use more than one thread to expedite the process. Let's say I have four threads. Therefore:
For each thread, we perform DAMP locally: thread 1 on `T[250:500]`, thread 2 on `T[500:750]`, thread 3 on `T[750:1000]`, and thread 4 on `T[1000:1250]`, where each thread only considers left neighbors that lie within its own slice.
Let's consider the second thread, which explores subsequences in `T[500:750]`. Note that for each subsequence in this slice, all subsequences in `T[250:500]` are also its left neighbors, but they were not considered in the discord-discovery process of the second thread. There is an interesting point here: the local best-so-far distance `d_j` found by the second thread is an upper bound on the true discord distance of the subsequences in the slice `T[500:750]`, because adding more left neighbors can only shrink a subsequence's nearest-neighbor distance. So, we can divide `T`, perform DAMP locally on one slice per thread, and then go from thread 1 to thread 4 to see whether we can skip a whole slice. Let's say the best-so-far distance according to thread 1 is 10, i.e. `d_i = 10`. If `d_j < d_i` (e.g. `d_j == 9`), then there is no need to explore `T[500:750]` at all. If `d_j > d_i`, then we should explore `T[500:750]`, this time also considering the left neighbors in `T[250:500]`.