Anomaly Detection on multivariate (redundant) sensor data #992
Replies: 2 comments 1 reply
-
@alex-stefitz Welcome to the STUMPY community and thank you for your kind words. I was wondering if it would be possible to share a 14-day sample of your multidimensional data (probably a CSV with 14 columns that we can read in).

From what I understand, you are trying to find a situation where one (or a few) sensors don't behave/look like the majority of the other sensors within a given day. Does that sound correct? So, in your first plot, you are trying to identify the orange line and, in the second plot, the blue line? And in the third plot, there are no anomalies in any of the sensors, but there is some variation that is "acceptable"?

Without fully understanding your problem, I don't think that computing the 1D matrix profile on each 1D time series will be helpful, since it has no knowledge of the other time series. I also don't think that
-
Thank you for your quick response! The time series currently has about 9 months (≈ 9 × 30 × 96 = 25,920) of observations and 10 columns. This will most likely be extended to 1.5 years, and I will get another one, also 1.5 years long, with ~25 sensors.

Unfortunately, I can't share the original data I'm working on, but with the help of the PV-Live dataset I quickly created some data which is really close to the one I'm working with. I attached three tables: the clean data, the data including anomalies, and a table describing the anomalies. (For my experiment setup, I work with artificially added errors; that's why I can provide the clean file now, but generally my problem is an unsupervised one without ground truth.)

_pv-artificial-irr_clean.csv

And yes, I think your understanding is correct. Of course it would also be nice if the approach recognized behaviour which is generally impossible (e.g. significantly negative values, no drop to 0 in the night, ...), but my main focus is the comparison of sensors among each other, so if all show the same behaviour (no matter what it is), it could be considered correct and not an anomaly.

And yes, your interpretations of the plots are also correct! In the first two plots, I want to detect the orange and the blue sensor. In the third plot, there is no anomaly, but you can nicely see different weather situations (clear sky on the first day, some clouds on the second) as well as the difference between the two orientations.

Thank you for your help and have a nice day/evening!
-
Hello,
First of all thanks for this cool tool and the active support, it looks really promising!
I have the following scenario: I'm writing my master's thesis on anomaly detection in redundant irradiance sensor data. I have data from about 15 irradiance sensors at a specific location. They face two different orientations, so the expectation is that the ones facing the same direction measure more or less the same (small deviations are totally fine due to local shading or just measurement inaccuracies), while the ones facing different directions still behave similarly, but there can be a shift in when they reach their peak.
Generally, the data should follow quite a strict schema: always ~0 during the night, rising in the morning, a peak around noon, then falling again until sunset. However, due to all sorts of possible weather phenomena, basically any behaviour during the day is possible (a clear-sky day; sunny until noon, then a thunderstorm and clouds; fast-moving clouds and therefore high fluctuation; a completely bad day with less than 10% of the theoretically possible irradiance...), and this is alright as long as all sensors agree. Weather is random, so there are not too many recurring patterns to expect.
The data set has some known errors: while the whole data set is multivariate, the errors usually are not. It is possible that two (or more) errors exist at the same time, but they should be considered independent. The errors can show up in different ways; the most common are that the broken sensor (= one variable)
The following pictures show data with errors (first two pictures) and different variations of correct data.
The task is to detect the errors as well as possible. I created an algorithm which is able to detect the errors quite well using pairwise regression, but I need a comparison technique and would love to use the Matrix Profile for that, since I like that it has proven to perform well in many cases even though it is generally quite a simple approach.
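To illustrate what I mean by pairwise regression, here is a stripped-down toy sketch (this is illustrative only, not my actual algorithm; the function name and threshold-free scoring are made up for this example):

```python
import numpy as np

def pairwise_residual_scores(X):
    """Score each sensor (column of X) by how badly least-squares line fits
    against every other sensor explain it; a sensor that disagrees with the
    majority of its peers gets a high average residual. Illustrative only."""
    n, d = X.shape
    scores = np.zeros(d)
    for i in range(d):
        resid = []
        for j in range(d):
            if i == j:
                continue
            # Least-squares fit: X[:, i] ≈ a * X[:, j] + b
            a, b = np.polyfit(X[:, j], X[:, i], 1)
            resid.append(np.mean(np.abs(X[:, i] - (a * X[:, j] + b))))
        scores[i] = np.mean(resid)
    return scores

# Toy example: four sensors tracking the same daily "bump", sensor 2 broken.
rng = np.random.default_rng(0)
base = np.sin(np.linspace(0, np.pi, 96))
X = np.stack([base + 0.05 * rng.standard_normal(96) for _ in range(4)], axis=1)
X[:, 2] = rng.random(96)  # sensor 2 outputs noise unrelated to the others
scores = pairwise_residual_scores(X)
```

The broken sensor's residuals are high against every peer, while a healthy sensor's residuals are high only against the broken one, so averaging over peers separates them.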
I have spent quite some time on MP in the last weeks, but unfortunately I was not able to get it working. The problem is that applying MP to one sensor does not seem to work: when doing this, I get really high values for days with interesting weather phenomena (which makes sense, since this behaviour has not been seen before), but not even all days with very obvious errors (e.g. a random walk) get high MP values.
(I set m to 96, since that's the number of observations per day, and added a NaN between consecutive days so that only whole days are compared with each other. I saw in one of the unofficial tutorials that it does not change much to have a continuous graph.)
So then I tried to expand it to the multivariate case, following the paper Matrix Profile XXVIII. However, this also does not seem to work, since I still mainly find special days (which makes sense, since this behaviour can be seen in all sensors).
So I don't really know how to proceed. I have spent quite some time on MP, adapted my library for it, and would really love to use it, but I don't know how to adapt it to my case. To summarize: I need to detect when one variable behaves differently than the others, but it's totally fine to have weird behaviour if all sensors show it.
If you have any questions, feel free to ask. I would be really happy for your support!
Alex