
Is this newly published OD algorithm valid or comparable with PyOD detection models? #610

clevilll commented Nov 22, 2024

I'm exploring outlier detection algorithms, and I found a newly published journal paper proposing a new cross-density-distance outlier detection algorithm. I checked the paper, and it mainly performs the following steps, with human-chosen parameters that assign weights to manipulate thresholds:

After applying the kernel height estimation function (KHE):

$$KHE(x \mid Z)=\sum_{k=1}^n z_k^{(j)} \cdot K\left(\frac{x-z_k^{(i)}}{h}\right)$$

$$K(u)=e^{-0.5 \cdot u^2}$$
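To make sure I read this correctly, here is my own minimal NumPy sketch of the KHE step (my interpretation only, since the paper's notation is ambiguous; treating $z_k^{(i)}$ as histogram bin positions and $z_k^{(j)}$ as bin heights is an assumption on my part):

```python
import numpy as np

def gaussian_kernel(u):
    # K(u) = exp(-0.5 * u^2), as defined in the paper
    return np.exp(-0.5 * u ** 2)

def khe(x, bin_positions, bin_heights, h):
    # My reading of KHE(x | Z): a height-weighted kernel sum,
    # with z_k^(i) as bin positions and z_k^(j) as bin heights
    # (both are assumptions; the paper does not define them clearly).
    u = (x - np.asarray(bin_positions)) / h
    return float(np.sum(np.asarray(bin_heights) * gaussian_kernel(u)))
```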


... A substantial amount of academic literature exists that explains the optimal selection of bandwidth ($h$) (Scott, 2015). However, throughout our experiments, we utilized Silverman's rule of thumb (Silverman, 2018):

$$ h=0.9 \cdot \min \left(\hat{\sigma}, \frac{\mathrm{IQR}}{1.34}\right) \cdot n^{-1/5} $$
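The bandwidth rule itself is easy to reproduce; here is a direct transcription of the formula above (SciPy and statsmodels also ship variants of this rule):

```python
import numpy as np

def silverman_bandwidth(z):
    # h = 0.9 * min(sigma_hat, IQR / 1.34) * n^(-1/5)
    z = np.asarray(z, dtype=float)
    sigma_hat = z.std(ddof=1)              # sample standard deviation
    q75, q25 = np.percentile(z, [75, 25])  # interquartile range bounds
    return 0.9 * min(sigma_hat, (q75 - q25) / 1.34) * z.size ** (-0.2)
```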

In this work, we propose another transform function $C_w \in[d, 1]$, to derive a confidence weight for an outlier score given the total number of observations beneath the corresponding histogram. The fundamental concept entails starting from the minimum weight $(d)$ and elevating the outlier score as we approach the anticipated minimum baseline size expectation:

$$C_w(n, d)=1-(1-d) \cdot e^{\frac{-e\,n^2}{h^2}}$$

$$S_c\left(X_i \mid M_i\right)=C_w(n, d)$$
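Taken literally, the confidence weight would look like the sketch below. Note I have to assume the $e$ in the exponent is Euler's number and that $h$ is the same bandwidth as above, because the paper defines neither explicitly:

```python
import numpy as np

def confidence_weight(n, d, h):
    # C_w(n, d) = 1 - (1 - d) * exp(-e * n^2 / h^2)
    # Assumptions: `e` is Euler's number and `h` is the KDE bandwidth;
    # the paper leaves both undefined. Note C_w also depends on h,
    # even though the paper writes it as C_w(n, d).
    return 1.0 - (1.0 - d) * np.exp(-np.e * n ** 2 / h ** 2)
```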

In sub-section 2.2 (Ensemblers), the paper states:

Formally, the objective is to derive the ensemble score $\bar{E}_i$ for a given data point $X_i$ passed through all submodels in $\mathbf{M}$,

where each submodel $\left(M_i:=\mathcal{M}_{f \mid c, y}^\omega\right)$ outputs the following attributes for every data point $X_i$ :

  • Outlier score $S(X_i \mid M_i) \in [0,1]$
  • Outlier confidence score $S_c(X_i \mid M_i) \in [0,1]$
  • Submodel's weight $\omega \in [0,1]$

and in equation (#18):

$$ E(O, W)=\sum_{i=1}^n\left(s_i \cdot s_{c i} \cdot \omega_i\right)^3 $$

Compute the final ensemble score $O_i$ for data point $X_i$ passed through all submodels.
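Taken at face value, equation (#18) is just a cubed sum of per-submodel products, something like the sketch below. Even here the indexing is confusing: $n$ now seems to count submodels rather than data points.

```python
import numpy as np

def ensemble_score(s, s_c, omega):
    # E(O, W) = sum_i (s_i * s_ci * omega_i)^3, my reading of equation (#18),
    # where s, s_c, omega hold each submodel's outlier score, confidence
    # score, and weight for a single data point, all in [0, 1].
    s, s_c, omega = map(np.asarray, (s, s_c, omega))
    return float(np.sum((s * s_c * omega) ** 3))
```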

Despite some vague mathematical relationships and figures that try to justify the proposed scoring approach, I gather one core idea: the algorithm attempts to distinguish outlierness with respect to contextualized features.

The author states, after equation (#15) in the paper (page 7):

Throughout our experiment, we set the density score weight ($\omega_{density}$) to 0.8 and the distance score weight ($\omega_{distance}$) to 0.2.
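If I understand equation (#15) correctly, this boils down to a fixed convex combination of the two scores. A minimal sketch of my interpretation (not the authors' code):

```python
def mixed_score(s_density, s_distance, w_density=0.8, w_distance=0.2):
    # My reading of equation (#15): a hand-tuned linear mix of the
    # density-based and distance-based scores. The 0.8 / 0.2 defaults
    # are the paper's unexplained choices.
    return w_density * s_density + w_distance * s_distance
```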
It's not clear how the author arrives at these coefficients, even empirically; they weight and mix the scores however the user prefers! The impact of this approach is shown in Fig. 4, which compares density-based and distance-based scoring using the empirical coefficients from equation (#15) as weights, i.e., the impact of human assistance!

[Fig. 4: density-based vs. distance-based scoring (figure credit: the paper's author)]

But the soundness of the claimed algorithm is questionable: there is no open-source code to reproduce the results, and it includes some unnecessary or purely empirical stages, such as reformulating ensemble models like IsolationForest or ensembled HBOS detectors. The authors showed that the expert version of this algorithm has very strong performance (see the results figure in the paper).


I would be happy if someone familiar with outlier detection mathematics could check the math and share insight on whether this algorithm is as valid as it claims, since there is no open-source GitHub repo to reproduce the results or investigate further. In my personal opinion, the math is not clear and sound; it uses:

  • kernel density estimation
  • a quantile transformer

to modify the scoring with human-chosen factors that steer the scores toward the best results, while the art lies in finding non-parametric detection that minimizes human parameterization. Maybe I'm not fully informed; as an enthusiast, I want to understand the reliability and validity of this promising detector.
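In the meantime, the only sanity check I can do is to benchmark the paper's reported numbers against standard PyOD baselines on the same datasets, along these lines (toy data here as a placeholder; KDE, HBOS, and IForest are all available in PyOD):

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from pyod.models.hbos import HBOS
from pyod.models.iforest import IForest
from pyod.models.kde import KDE

rng = np.random.default_rng(42)
# Toy data as a placeholder: 500 inliers around the origin plus
# 25 scattered outliers; swap in the paper's datasets for a real test.
X = np.vstack([rng.normal(0, 1, size=(500, 2)),
               rng.uniform(-6, 6, size=(25, 2))])
y = np.r_[np.zeros(500), np.ones(25)]

for name, model in [("KDE", KDE()), ("HBOS", HBOS()),
                    ("IForest", IForest(random_state=42))]:
    model.fit(X)
    # decision_scores_ holds training outlier scores (higher = more outlying)
    print(f"{name}: ROC-AUC = {roc_auc_score(y, model.decision_scores_):.3f}")
```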
