Another question on multidimensional series pattern matching #686

PatrickKudo · 2022-09-26T18:07:29Z

Hello, I was reading this conversation as well as the fast pattern matching tutorial and I think I have a basic understanding of how to use the MASS algorithm. However, I am wondering if there's a more effective approach to what I'm trying to do.

Basically, assume I have a multidimensional dataframe df with 10 variables and I want to find a match of the 1st time series (T1) using the other 9 time series. I set up a subsequence of T1 as Q_df, then iterate through and calculate distance profiles for each column. Then I find the lowest distance profile value's corresponding index, and write it to an array.

```
import numpy as np
import pandas as pd
import stumpy

# df is the DataFrame described above: 'T1' is the first column, followed by the 9 candidates.
# Query: 24 consecutive intervals of T1 (the start index 25 was chosen arbitrarily)
Q_df = df['T1'][25:49]

# For every column other than T1, compute the distance profile against the query
# and record the start index of its best (lowest-distance) match
dp_df = pd.DataFrame()
idx_start_array = []
for column in df.columns[1:]:
    dp_df[column] = stumpy.mass(Q_df, df[column])
    idx_col = np.argmin(dp_df[column])
    idx_start_array.append(idx_col)
```

For picking the best candidate matches, I was thinking I would compare values from idx_start_array against the start index of Q_df (arbitrarily started with 25), and whichever is closest to that index is the best match. But I was wondering if it would be better to find a match by comparing the z-norm distance profiles for all column variables. I also considered using stumpy.match, but I am still trying to understand it, especially the no-threshold application involving the stumpy.config.STUMPY_EXCL_ZONE_DENOM setting.

Is my approach reasonable or is there a way to make stumpy.match work somehow for my data? Thanks!

-Patrick

seanlaw · 2022-09-26T20:35:40Z · Maintainer

@PatrickKudo Thanks for the question. First off, when you want to put your code inside of a code-block in a comment, rather than using single backticks at the start/end of each line, you can use three consecutive backticks at the start and at the end of the code block like:

```
import numpy as np
import stumpy

for i in range(10):
   print(i)
```

Now, on to your question.

> But I was wondering if it would be better to find a match by comparing the z-norm distance profiles for all column variables.

So, it depends on your goal. Right now, you are comparing the query subsequence with each column separately and each column will return a set of "top matches". It is completely possible that Q_df is really well matched with subsequences within, say, df['T1'] but not so close to the other columns in T (i.e., the matches are poor but still within some threshold). To ensure that everything is compared on equal footing (i.e., using the same distance threshold across all columns), you can first:

1. Append an extra row via df.loc[len(df)] = np.full(len(df.columns), np.nan)
2. Concatenate all of the columns together via concat_df = pd.DataFrame(pd.concat([df[col] for col in df]), columns=['Tall'])
3. Compute the (globally) best matches from this concatenated time series via stumpy.mass(Q_df, concat_df['Tall'])

Of course, you'll have to do some simple bookkeeping to keep track of what the "new" indices mean relative to the original un-concatenated time series but, hopefully, you get the point. Note that we add an np.nan row so that no matches will come from the overlap between columns.
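
The concatenation idea above can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact recipe from the comment: it works on NumPy arrays rather than the DataFrame, it assumes every column has the same length n, it leaves T1 itself out of the haystack, and it uses a naive znorm_distance_profile helper as a NumPy-only stand-in for stumpy.mass so that the block runs without STUMPY installed. The helper names concat_with_nan_gaps and to_column_and_offset are hypothetical, as is the random toy data.

```python
import numpy as np


def znorm_distance_profile(Q, T):
    """Naive z-normalized distance profile (NumPy-only stand-in for stumpy.mass).

    Windows that contain NaN (or that are constant) get a distance of np.inf.
    """
    Q = np.asarray(Q, dtype=float)
    T = np.asarray(T, dtype=float)
    m = len(Q)
    Qz = (Q - Q.mean()) / Q.std()
    windows = np.lib.stride_tricks.sliding_window_view(T, m)
    mu = windows.mean(axis=1, keepdims=True)
    sigma = windows.std(axis=1, keepdims=True)
    with np.errstate(invalid="ignore", divide="ignore"):
        D = np.linalg.norm((windows - mu) / sigma - Qz, axis=1)
    D[~np.isfinite(D)] = np.inf
    return D


def concat_with_nan_gaps(columns):
    """Stack equal-length 1-D series end to end, appending one NaN to each."""
    padded = [np.append(np.asarray(c, dtype=float), np.nan) for c in columns]
    return np.concatenate(padded)


def to_column_and_offset(global_idx, n):
    """Map an index in the concatenated series to (column number, start index).

    Each column occupies n + 1 slots: n values plus the NaN separator.
    """
    return divmod(int(global_idx), n + 1)


# Toy data (assumption): 3 candidate columns of length 200, query length 24
rng = np.random.default_rng(0)
n, m = 200, 24
candidates = [rng.normal(size=n) for _ in range(3)]

# The query is a scaled and shifted copy of a known window in column 1
Q = 2.0 * candidates[1][50:50 + m] + 5.0

T_all = concat_with_nan_gaps(candidates)
D = znorm_distance_profile(Q, T_all)

best = np.argmin(D)
col, start = to_column_and_offset(best, n)
print(col, start)
```

With STUMPY available, replacing znorm_distance_profile(Q, T_all) with stumpy.mass(Q, T_all) should be the drop-in call described above. The bookkeeping stays the same because every column occupies n + 1 slots in T_all, and any window that straddles two columns contains the NaN separator and therefore cannot be returned as a match.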

1 reply

PatrickKudo Sep 29, 2022
Author

Hi Sean, thanks for the quick response! Sorry about the code formatting.

I realized that calculating the difference between Q_z_norm and nn_z_norm was meaningless for trying to compare how close the target time series Q_df is with a candidate time series, and that I should just calculate the Euclidean distance for that measure. It does look like I can use MASS to find candidate time series which match up to the subsequence, and then sort by minimum Euclidean distance to determine the closest time series to the target subsequence.
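
To make the distinction in the paragraph above concrete: z-normalization removes each subsequence's mean and scale, so a z-normalized distance compares shape only, while a plain Euclidean distance also sees level and amplitude. The sketch below uses invented toy values and a hypothetical znorm helper; it is only meant to show why a second ranking pass on raw values can separate candidates that look identical after z-normalization.

```python
import numpy as np


def znorm(x):
    """Z-normalize a 1-D array (zero mean, unit standard deviation)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()


Q = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
same_shape = 10.0 * Q + 100.0  # same shape, very different level and amplitude
near_copy = Q + 0.1            # almost identical raw values

# Z-normalized distances: both candidates are (numerically) perfect matches
d_z_shape = np.linalg.norm(znorm(Q) - znorm(same_shape))
d_z_copy = np.linalg.norm(znorm(Q) - znorm(near_copy))

# Raw Euclidean distances: only the near copy is close
d_raw_shape = np.linalg.norm(Q - same_shape)
d_raw_copy = np.linalg.norm(Q - near_copy)

print(d_z_shape, d_z_copy)
print(d_raw_shape, d_raw_copy)
```

STUMPY's documentation lists a normalize parameter on stumpy.mass for the non-normalized case, which may avoid the separate ranking step; check it against your installed version before relying on it.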

I was thinking stumpy.match might be what I wanted to use, but looking at the API documentation it seems like it relies on additional threshold parameters to establish matches, which I'm not sure how useful it'd be for my application as I don't want to get false positives, if that is even a possibility.

I think your approach of concatenating all the candidate time series into one is interesting. Is searching through 1 giant time series more efficient than looping through a dataframe column-by-column? For my application, the total intervals I'd be searching through at a given time would be on the order of 10^6, with 10^3 target subsequences.

seanlaw · 2022-09-29T18:19:17Z · Maintainer

> I was thinking stumpy.match might be what I wanted to use, but looking at the API documentation it seems like it relies on additional threshold parameters to establish matches, which I'm not sure how useful it'd be for my application as I don't want to get false positives, if that is even a possibility.

When you say "false positives", it sounds like you are looking for exact matches (i.e., you are looking for subsequences in a longer time series that match your query subsequence exactly)?

> For my application, the total intervals I'd be searching through at a given time would be on the order of 10^6, with 10^3 target subsequences.

For 10^6, this is just long enough that you may start seeing a difference in computing time. I think what matters is: how many columns do you have in total, and how many different target subsequences will you be querying for?

2 replies

PatrickKudo Oct 6, 2022
Author

Hello @seanlaw, thank you again for the response. I am searching through ~2,000 columns with 1 target subsequence (with maybe ~10,000 other potential targets). FWIW, I was trying out my approach more with stumpy.mass and the computation time wasn't too bad (~500 s), considering there is inefficient code (dataframe concatenation) in my for loop that I need to address.

From the results I'm seeing now, stumpy.mass is working as intended, but I think I need to further separate the haystack of candidate sources (metered energy) before calling it. What I think I'm seeing is that the query subsequence I'm targeting is most often (as might be expected with energy) a conserved motif that exists in most of the candidate sources, which is great to know, but I'm after a near-exact match. I tried just doing a Euclidean distance calculation between the query subsequence and each candidate subsequence to further "rank" which pairs of subsequences are closest (distance_base, path = fastdtw(Q_df, df[column].astype(float), dist=euclidean)), but this only seems to work well when I limit how many candidate series are searched, based on domain knowledge of which regions of the energy system the target subsequence is from. So I am thinking some sort of hierarchical time series clustering may be needed.

seanlaw Oct 6, 2022
Maintainer

> but I'm after a near-exact match

Maybe I'm not understanding your goal. Calling stumpy.mass between a query, Q, and a longer time series, T, will return ALL pairwise distances between Q and each subsequence of length len(Q) found in T. This is the so-called "distance profile", and with it you can search not only for the one nearest neighbor but also for the second, third, fourth, and so on, up to the (len(T) - len(Q) + 1)-th nearest neighbor. As you traverse to the next nearest neighbor, the subsequence at that index would presumably be less and less of an exact match. Note that after you've identified your top nearest neighbor (e.g., located at T[idx_0 : idx_0 + m]), you can "blank out" that subsequence by setting the distance to T[idx_0 : idx_0 + m] = np.inf and then look for the next subsequence.
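
A sketch of the "blank out and look again" loop described above, operating on a distance profile D that has already been computed (for example by stumpy.mass). Two assumptions here are not in the original comment: the masking is applied to the distance profile rather than to T, and it covers every start index whose window would overlap the chosen match (m - 1 positions on either side). top_k_matches is a hypothetical helper name and the numbers in D are made up.

```python
import numpy as np


def top_k_matches(D, m, k):
    """Return up to k (start index, distance) pairs from a distance profile.

    After each pick, all start indices whose length-m window would overlap
    the picked window are set to np.inf so they cannot be picked again.
    """
    D = np.array(D, dtype=float)  # work on a copy; the caller's profile is untouched
    picks = []
    for _ in range(k):
        idx = int(np.argmin(D))
        if not np.isfinite(D[idx]):
            break  # nothing left to pick
        picks.append((idx, float(D[idx])))
        lo = max(0, idx - m + 1)
        D[lo : idx + m] = np.inf
    return picks


# Toy distance profile (assumption) for a query of length m = 2
D = [5.0, 1.0, 1.2, 4.0, 0.5, 3.0, 2.0, 6.0]
matches = top_k_matches(D, m=2, k=3)
print(matches)
```

Ranking matches this way needs no distance threshold: the caller decides how many neighbors to inspect, and the distances in the returned pairs show how quickly the match quality falls off.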
