Time Series Data Screening and Error Detection Methods
The `auto_screen.py` source code defines the following data screening methods:

- `dip_test(ts, low, dip)`: Checks for anomalies based on dips in the time series data. It identifies points where the difference between consecutive values exceeds `dip` and the value is below a specified threshold (`low`). [1] (See the sketch after this list.)
- `repeat_test(ts, max_repeat, lower_limit=None, upper_limit=None)`: Identifies anomalies based on repeated values in a time series. Flags instances where a value repeats more than `max_repeat` times, optionally within a specified range (`lower_limit` and `upper_limit`). [1]
- `short_run_test(ts, small_gap_len, min_run_len)`: Detects small clusters of valid data points surrounded by larger gaps. Flags these clusters as anomalies based on their isolation, considering the length of small gaps (`small_gap_len`) and the minimum length of a valid data run (`min_run_len`). [1]
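The repository's implementations live in `auto_screen.py` and are not reproduced here. As a rough illustration of the dip-based idea, a minimal pandas sketch might look like the following; the function name `simple_dip_test` and its exact flagging convention are assumptions for illustration, not the project's actual code:

```python
import pandas as pd

def simple_dip_test(ts: pd.Series, low: float, dip: float) -> pd.Series:
    """Illustrative dip-style screen (not the repository's implementation).

    Flags points where the drop from the previous value exceeds `dip`
    and the value itself sits below the `low` threshold.
    """
    # Difference from the previous sample; large negative values are sudden drops.
    drop = ts.diff()
    # Suspect where the series fell by more than `dip` and is below `low`.
    return (drop < -dip) & (ts < low)

# Example usage on a toy series containing one suspicious dip.
ts = pd.Series([5.2, 5.1, 5.0, 0.3, 5.1])
print(simple_dip_test(ts, low=1.0, dip=2.0))
```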
These screening methods operate within a broader framework defined by a configuration file (`config`). This file specifies the steps and parameters for each screening method, allowing users to customize the data screening process based on their specific needs and data characteristics.
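The configuration file's actual format and keys are not documented in this section, so the following is a purely hypothetical sketch of how a step-by-step screening configuration could be expressed as a Python dictionary; every key and value here is an assumption for illustration only:

```python
# Hypothetical configuration sketch; the real config's format and keys may differ.
screening_config = {
    "steps": [
        {"method": "dip_test", "params": {"low": 1.0, "dip": 2.0}},
        {"method": "repeat_test", "params": {"max_repeat": 5, "lower_limit": 0.0}},
        {"method": "short_run_test", "params": {"small_gap_len": 3, "min_run_len": 10}},
    ]
}
```

A driver following this pattern would iterate over `steps`, look up each named method, and apply it to the series with the given parameters.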
The `error_detect.py` module defines several error detection methods:

- `nrepeat(ts)`: Returns the lengths of consecutive runs of repeated values in a DataFrame or Series. [2]
- `threshold(ts, bounds, copy=True)`: Masks values in a time series (`ts`) that fall outside specified bounds. If `copy` is `True`, a copy of the series is modified; otherwise, the original series is altered. [2]
- `bounds_test(ts, bounds)`: Detects anomalies in a time series (`ts`) based on specified bounds. Returns a DataFrame of boolean values indicating whether each data point is an anomaly. [3]
- `median_test(ts, level=4, filt_len=7, quantiles=(0.005, 0.095), copy=True)`: Detects outliers in a time series (`ts`) using a median filter. [3]
- `median_test_oneside(ts, scale=None, level=4, filt_len=6, quantiles=(0.005, 0.095), copy=True, reverse=False)`: Uses a one-sided median filter to detect outliers. Calculates the difference between actual values and predicted values based on a rolling median, and flags values as anomalies if this difference exceeds `level` times the interquartile range (`scale`). [3]
- `med_outliers(ts, level=4.0, scale=None, filt_len=7, range=(None, None), quantiles=(0.01, 0.99), copy=True, as_anomaly=False)`: Compares the difference between the original series and a median-filtered version to the interquartile range (IQR), and flags values as outliers if they deviate by more than `level` times the IQR (`scale`). [4] (See the sketch after this list.)
- `median_test_twoside(ts, level=4, scale=None, filt_len=7, quantiles=(0.01, 0.99), copy=True, as_anomaly=True)`: Similar to `med_outliers`, but uses a two-sided median filter to detect outliers. [5]
- `gapdist_test_series(ts, smallgaplen=0)`: Fills small gaps in a time series (`ts`) with a placeholder value (-99999999) for gap analysis. [6]
- `steep_then_nan(ts, level=4.0, scale=None, filt_len=11, range=(None, None), quantiles=(0.01, 0.99), copy=True, as_anomaly=True)`: Identifies outliers near large gaps in the data. Similar to `med_outliers`, but considers proximity to gaps. [7]
- `despike(arr, n1=2, n2=20, block=10)`: Implements a despiking algorithm to remove spikes from an array of data (`arr`). [8]
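Several of these tests share a common pattern: compute a rolling-median estimate of the signal, take the residual between the data and that estimate, and flag points whose residual is large relative to an interquantile-range scale. The following is a minimal sketch of that shared idea, not the repository's `med_outliers` or `median_test` code; the name `median_outlier_sketch` and its defaults are invented for illustration:

```python
import numpy as np
import pandas as pd

def median_outlier_sketch(ts: pd.Series, level: float = 4.0, filt_len: int = 7,
                          quantiles: tuple = (0.01, 0.99)) -> pd.Series:
    """Illustrative median-filter outlier test (not the repository's implementation).

    Flags points whose deviation from a centered rolling median exceeds
    `level` times an interquantile-range scale of those deviations.
    """
    # Centered rolling median gives a robust local estimate of the signal.
    filtered = ts.rolling(filt_len, center=True, min_periods=1).median()
    residual = ts - filtered
    # Robust scale: spread of the residuals between the chosen quantiles.
    lo, hi = residual.quantile(quantiles[0]), residual.quantile(quantiles[1])
    scale = hi - lo
    # Anomaly where the residual magnitude is large relative to that scale.
    return residual.abs() > level * scale

# Example usage: a smooth series of 200 points with one injected spike.
rng = np.random.default_rng(0)
ts = pd.Series(np.sin(np.linspace(0, 6, 200)) + rng.normal(0, 0.05, 200))
ts.iloc[100] += 5.0  # inject an artificial spike
flags = median_outlier_sketch(ts)
print(flags[flags].index.tolist())  # should flag the injected spike at index 100
```

In this framing, a one-sided variant would presumably use a trailing (or, with `reverse`, leading) rolling window instead of a centered one, which is useful near the edges of a record or for streaming data.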
These data screening and error detection methods provide a comprehensive toolkit for ensuring the quality of time series data. By leveraging appropriate configurations and methods, users can improve data reliability and enhance the validity of subsequent analyses.