Time Series Data Screening and Error Detection Methods


Data Screening Methods

The auto_screen.py source code defines the following data screening methods; illustrative sketches of each test follow the list:

  • dip_test(ts, low, dip):
    Flags dip anomalies: points where the difference between consecutive values exceeds dip and the value falls below the threshold low. [1]

  • repeat_test(ts, max_repeat, lower_limit=None, upper_limit=None):
    Flags values that repeat more than max_repeat consecutive times, optionally restricted to values within a specified range (lower_limit and upper_limit). [1]

  • short_run_test(ts, small_gap_len, min_run_len):
    Detects small clusters of valid data points surrounded by larger gaps. Flags these clusters as anomalies based on their isolation, considering the length of small gaps (small_gap_len) and the minimum length of a valid data run (min_run_len). [1]
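
As a rough illustration of the dip test idea, here is a minimal pandas sketch. It is not the auto_screen.py implementation; the sign convention (a downward step) and NaN handling are assumptions.

```python
import pandas as pd

def dip_test_sketch(ts: pd.Series, low: float, dip: float) -> pd.Series:
    """Flag points where the series drops from the previous sample by more
    than `dip` and the resulting value is below `low`. Illustrative only."""
    drop = ts.diff()                    # change from the previous sample
    return (drop < -dip) & (ts < low)   # large downward step into low range

# Example: the sharp dip to 1.2 (below low=2.0) is flagged.
s = pd.Series([5.0, 5.1, 1.2, 5.0, 4.9])
print(dip_test_sketch(s, low=2.0, dip=3.0).tolist())  # [False, False, True, False, False]
```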
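
A similar sketch for the repeat test, using run-length encoding via pandas groupby. Whether the limits restrict which repeated values are flagged is an assumption about the real implementation.

```python
import pandas as pd

def repeat_test_sketch(ts, max_repeat, lower_limit=None, upper_limit=None):
    """Flag runs of identical consecutive values longer than `max_repeat`.
    Illustrative sketch; auto_screen.py's version may differ in detail."""
    run_id = (ts != ts.shift()).cumsum()            # label runs of equal values
    run_len = ts.groupby(run_id).transform("size")  # run length at each point
    flagged = run_len > max_repeat
    # Assumed semantics: only flag repeats inside [lower_limit, upper_limit].
    if lower_limit is not None:
        flagged &= ts >= lower_limit
    if upper_limit is not None:
        flagged &= ts <= upper_limit
    return flagged
```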
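
And a sketch of the short-run idea: bridge small gaps, then flag runs of valid data that remain shorter than min_run_len. The bridging rule is an assumption about how small_gap_len is used.

```python
import pandas as pd

def short_run_test_sketch(ts, small_gap_len, min_run_len):
    """Flag small, isolated clusters of valid data. Illustrative only."""
    valid = ts.notna()
    # Run-length encode the valid/invalid pattern.
    run_id = (valid != valid.shift()).cumsum()
    run_len = valid.groupby(run_id).transform("size")
    # Bridge gaps of length <= small_gap_len so they do not split a run.
    bridged = valid | (~valid & (run_len <= small_gap_len))
    # Re-measure runs on the bridged mask and flag data points in short runs.
    run_id2 = (bridged != bridged.shift()).cumsum()
    run_len2 = bridged.groupby(run_id2).transform("size")
    return valid & (run_len2 < min_run_len)
```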

These screening methods operate within a broader framework defined by a configuration file (config). This file specifies the steps and parameters for each screening method, allowing users to customize the data screening process based on their specific needs and data characteristics.
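
The configuration schema itself is not reproduced on this page. Purely as a hypothetical illustration (the key names and structure below are assumptions, not the actual format), a configuration step list might look like:

```python
# Hypothetical configuration; the real config format read by auto_screen.py
# (file type, key names, step vocabulary) may differ.
screen_config = {
    "steps": [
        {"method": "dip_test", "args": {"low": 2.0, "dip": 3.0}},
        {"method": "repeat_test", "args": {"max_repeat": 6, "lower_limit": 0.0}},
        {"method": "short_run_test", "args": {"small_gap_len": 2, "min_run_len": 5}},
    ]
}
```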


Error Detection Methods

The error_detect.py module defines several error detection methods; hedged sketches of the main techniques follow the list:

  • nrepeat(ts):
    Returns the lengths of consecutive runs of repeated values in a DataFrame or Series. [2]

  • threshold(ts, bounds, copy=True):
    Masks values in a time series (ts) that fall outside specified bounds. If copy is True, a copy of the series is modified; otherwise, the original series is altered. [2]

  • bounds_test(ts, bounds):
    Detects anomalies in a time series (ts) based on specified bounds. Returns a DataFrame of boolean values indicating whether each data point is an anomaly. [3]

  • median_test(ts, level=4, filt_len=7, quantiles=(0.005, 0.095), copy=True):
    Detects outliers in a time series (ts) using a median filter. [3]

  • median_test_oneside(ts, scale=None, level=4, filt_len=6, quantiles=(0.005, 0.095), copy=True, reverse=False):
    Uses a one-sided median filter to detect outliers: it computes the difference between actual values and predictions from a rolling median, and flags values whose difference exceeds level times the interquartile range (used as scale when none is supplied). [3]

  • med_outliers(ts, level=4.0, scale=None, filt_len=7, range=(None, None), quantiles=(0.01, 0.99), copy=True, as_anomaly=False):
    Compares the difference between the original series and a median-filtered version against the interquartile range (IQR), flagging values that deviate by more than level times the IQR (used as scale when none is supplied). [4]

  • median_test_twoside(ts, level=4, scale=None, filt_len=7, quantiles=(0.01, 0.99), copy=True, as_anomaly=True):
    Similar to med_outliers, but uses a two-sided median filter to detect outliers. [5]

  • gapdist_test_series(ts, smallgaplen=0):
    Fills small gaps in a time series (ts) with a placeholder value (-99999999) for gap analysis. [6]

  • steep_then_nan(ts, level=4.0, scale=None, filt_len=11, range=(None, None), quantiles=(0.01, 0.99), copy=True, as_anomaly=True):
    Identifies outliers near large gaps in the data. Similar to med_outliers, but considers proximity to gaps. [7]

  • despike(arr, n1=2, n2=20, block=10):
    Implements a despiking algorithm to remove spikes from an array of data (arr). [8]
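
A minimal pandas sketch of the run-length idea behind nrepeat (the module's own implementation, including its DataFrame handling, may differ):

```python
import pandas as pd

def nrepeat_sketch(ts: pd.Series) -> pd.Series:
    """Return, at each point, the length of the run of equal consecutive
    values that the point belongs to. Illustrative sketch only."""
    run_id = (ts != ts.shift()).cumsum()
    return ts.groupby(run_id).transform("size")

s = pd.Series([1, 1, 1, 2, 3, 3])
print(nrepeat_sketch(s).tolist())  # [3, 3, 3, 1, 2, 2]
```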
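
The threshold and bounds tests are both variations on an interval check; the sketch below shows the documented behavior (masking vs. boolean flagging) under the assumption that bounds is a (lower, upper) pair:

```python
import numpy as np
import pandas as pd

def threshold_sketch(ts, bounds, copy=True):
    """Set values outside [bounds[0], bounds[1]] to NaN. Sketch only."""
    out = ts.copy() if copy else ts
    out[(out < bounds[0]) | (out > bounds[1])] = np.nan
    return out

def bounds_test_sketch(ts, bounds):
    """Boolean mask marking values outside the bounds as anomalies."""
    return (ts < bounds[0]) | (ts > bounds[1])
```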
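
The median-filter family (median_test, med_outliers, median_test_twoside) shares one core idea: compare each point to a centered rolling median and flag residuals larger than level times a robust scale. A sketch of that idea, with the quantile-based scale estimate as an assumption:

```python
import pandas as pd

def med_outliers_sketch(ts, level=4.0, scale=None, filt_len=7,
                        quantiles=(0.01, 0.99)):
    """Flag points deviating from a centered rolling median by more than
    level * scale. Sketch; parameter handling in error_detect.py may differ."""
    med = ts.rolling(filt_len, center=True, min_periods=1).median()
    resid = ts - med
    if scale is None:
        # Assumed: scale estimated as the spread between residual quantiles.
        scale = resid.quantile(quantiles[1]) - resid.quantile(quantiles[0])
    return resid.abs() > level * scale
```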
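
The one-sided variant predicts each point from preceding samples only, and reverse runs the filter in the opposite direction. Again a sketch under the same assumptions:

```python
import pandas as pd

def median_test_oneside_sketch(ts, level=4, filt_len=6, scale=None,
                               quantiles=(0.005, 0.095), reverse=False):
    """Predict each value from a rolling median of the preceding filt_len
    samples (following samples if reverse=True); flag large residuals.
    Illustrative sketch only."""
    data = ts[::-1] if reverse else ts
    pred = data.shift(1).rolling(filt_len, min_periods=1).median()
    resid = data - pred
    if scale is None:
        scale = resid.quantile(quantiles[1]) - resid.quantile(quantiles[0])
    flags = resid.abs() > level * scale
    return flags[::-1] if reverse else flags
```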
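
For gap analysis, here is a sketch of the small-gap fill in gapdist_test_series; the placeholder value is as documented, while the run-length mechanics are assumptions:

```python
import pandas as pd

def gapdist_sketch(ts, smallgaplen=0, placeholder=-99999999):
    """Fill NaN runs no longer than smallgaplen with a placeholder so that
    later gap analysis sees only the larger gaps. Sketch only."""
    filled = ts.copy()
    isnan = filled.isna()
    run_id = (isnan != isnan.shift()).cumsum()
    run_len = isnan.groupby(run_id).transform("size")
    filled[isnan & (run_len <= smallgaplen)] = placeholder
    return filled
```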
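
Finally, a block-wise despiking sketch. The exact roles of n1, n2, and block in the module's despike() are not documented here, so the pass structure below (a coarse global cut at n2 standard deviations, then block-local clipping at n1) is an assumption:

```python
import numpy as np

def despike_sketch(arr, n1=2, n2=20, block=10):
    """Remove spikes: global cut at n2 sigma, then block-local cut at n1
    sigma within windows of `block` samples. Assumed interpretation."""
    data = np.asarray(arr, dtype=float).copy()
    mu, sd = np.nanmean(data), np.nanstd(data)
    data[np.abs(data - mu) > n2 * sd] = np.nan   # drop extreme excursions
    for start in range(0, data.size, block):
        seg = data[start:start + block]          # view into data
        if np.all(np.isnan(seg)):
            continue
        m, s = np.nanmean(seg), np.nanstd(seg)
        seg[np.abs(seg - m) > n1 * s] = np.nan   # clip local spikes
    return data
```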


Conclusion

These data screening and error detection methods provide a comprehensive toolkit for ensuring the quality of time series data. By leveraging appropriate configurations and methods, users can improve data reliability and enhance the validity of subsequent analyses.