
Is this newly published OD algorithm valid or comparable with PyOD detection models? #610

clevilll commented Nov 22, 2024

I'm exploring outlier detection algorithms, and I found a newly published journal paper proposing a new cross-density-distance outlier detection algorithm. I checked the paper, and it mainly performs the following steps, with human-chosen parameters that assign weights to manipulate thresholds:

After applying the kernel height estimation function (KHE):

$$KHE(x \mid Z)=\sum_{k=1}^n z_k^{(j)} \cdot K\left(\frac{x-z_k^{(i)}}{h}\right)$$

$$K(u)=e^{-0.5 \cdot u^2}$$
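To make sure I read this correctly, here is my own minimal NumPy sketch of the KHE step (my interpretation only, since the paper's notation is ambiguous; treating $z_k^{(i)}$ as histogram bin positions and $z_k^{(j)}$ as bin heights is an assumption on my part):

```python
import numpy as np

def gaussian_kernel(u):
    # K(u) = exp(-0.5 * u^2), as defined in the paper
    return np.exp(-0.5 * u ** 2)

def khe(x, bin_positions, bin_heights, h):
    # My reading of KHE(x | Z): a height-weighted kernel sum,
    # with z_k^(i) as bin positions and z_k^(j) as bin heights
    # (both are assumptions; the paper does not define them clearly).
    u = (x - np.asarray(bin_positions)) / h
    return float(np.sum(np.asarray(bin_heights) * gaussian_kernel(u)))
```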


... A substantial amount of academic literature exists that explains the optimal selection of bandwidth ($h$) (Scott, 2015). However, throughout our experiments, we utilized Silverman's rule of thumb (Silverman, 2018):

$$ h=0.9 \cdot \min \left(\hat{\sigma}, \frac{\mathrm{IQR}}{1.34}\right) \cdot n^{-1/5} $$
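The bandwidth rule itself is easy to reproduce; here is a direct transcription of the formula above (SciPy and statsmodels also ship variants of this rule):

```python
import numpy as np

def silverman_bandwidth(z):
    # h = 0.9 * min(sigma_hat, IQR / 1.34) * n^(-1/5)
    z = np.asarray(z, dtype=float)
    sigma_hat = z.std(ddof=1)              # sample standard deviation
    q75, q25 = np.percentile(z, [75, 25])  # interquartile range bounds
    return 0.9 * min(sigma_hat, (q75 - q25) / 1.34) * z.size ** (-0.2)
```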

In this work, we propose another transform function $C_w \in[d, 1]$, to derive a confidence weight for an outlier score given the total number of observations beneath the corresponding histogram. The fundamental concept entails starting from the minimum weight $(d)$ and elevating the outlier score as we approach the anticipated minimum baseline size expectation:

$$C_w(n, d)=1-(1-d) \cdot e^{\frac{-e\,n^2}{h^2}}$$

$$S_c\left(X_i \mid M_i\right)=C_w(n, d)$$
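Taken literally, the confidence weight would look like the sketch below. Note I have to assume the $e$ in the exponent is Euler's number and that $h$ is the same bandwidth as above, because the paper defines neither explicitly:

```python
import numpy as np

def confidence_weight(n, d, h):
    # C_w(n, d) = 1 - (1 - d) * exp(-e * n^2 / h^2)
    # Assumptions: `e` is Euler's number and `h` is the KDE bandwidth;
    # the paper leaves both undefined. Note C_w also depends on h,
    # even though the paper writes it as C_w(n, d).
    return 1.0 - (1.0 - d) * np.exp(-np.e * n ** 2 / h ** 2)
```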

In sub-section 2.2 (Ensemblers), the paper states:

Formally, the objective is to derive the ensemble score $\bar{E}_i$ for a given data point $X_i$ passed through all submodels in $\mathbf{M}$,

where each submodel $\left(M_i:=\mathcal{M}_{f \mid c, y}^\omega\right)$ outputs the following attributes for every data point $X_i$ :

  • Outlier score $S(X_i \mid M_i) \in [0,1]$
  • Outlier confidence score $S_c(X_i \mid M_i) \in [0,1]$
  • Submodel's weight $\omega \in [0,1]$

and in equation (#18):

$$ E(O, W)=\sum_{i=1}^n\left(s_i \cdot s_{c i} \cdot \omega_i\right)^3 $$

Compute the final ensemble score $O_i$ for data point $X_i$ passed through all submodels.
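Taken at face value, equation (#18) is just a cubed sum of per-submodel products, something like the sketch below. Even here the indexing is confusing: $n$ now seems to count submodels rather than data points.

```python
import numpy as np

def ensemble_score(s, s_c, omega):
    # E(O, W) = sum_i (s_i * s_ci * omega_i)^3, my reading of equation (#18),
    # where s, s_c, omega hold each submodel's outlier score, confidence
    # score, and weight for a single data point, all in [0, 1].
    s, s_c, omega = map(np.asarray, (s, s_c, omega))
    return float(np.sum((s * s_c * omega) ** 3))
```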

Despite some vague mathematical relationships and figures that try to justify the proposed scoring approach, I gather one core idea: the algorithm attempts to distinguish outlierness with respect to contextualized features.

The author states, after equation (#15) in the paper (page 7):

Throughout our experiment, we set the density score weight ($\omega_{density}$) to 0.8 and the distance score weight ($\omega_{distance}$) to 0.2.
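If I understand equation (#15) correctly, this boils down to a fixed convex combination of the two scores. A minimal sketch of my interpretation (not the authors' code):

```python
def mixed_score(s_density, s_distance, w_density=0.8, w_distance=0.2):
    # My reading of equation (#15): a hand-tuned linear mix of the
    # density-based and distance-based scores. The 0.8 / 0.2 defaults
    # are the paper's unexplained choices.
    return w_density * s_density + w_distance * s_distance
```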
It's not clear how the author arrives at these coefficients, even empirically; they weight and mix the scores however the user prefers! The impact of this approach is shown in Fig. 4, which compares density-based and distance-based scoring using the empirical coefficients from equation (#15) as weights, i.e., the impact of human assistance!

[Fig. 4: density-based vs. distance-based scoring (figure credit: the paper's author)]

But the soundness of the claimed algorithm is questionable: there is no open-source code to reproduce the results, and it includes some unnecessary or purely empirical stages, such as reformulating ensemble models like IsolationForest or ensembled HBOS detectors. The authors showed that the expert version of this algorithm has very strong performance (see the results figure in the paper).


I would be happy if someone familiar with outlier detection mathematics could check the math and share insight on whether this algorithm is as valid as it claims, since there is no open-source GitHub repo to reproduce the results or investigate further. In my personal opinion, the math is not clear and sound; it uses:

  • kernel density estimation
  • a quantile transformer

to modify the scoring with human-chosen factors that steer the scores toward the best results, while the art lies in finding non-parametric detection that minimizes human parameterization. Maybe I'm not fully informed; as an enthusiast, I want to understand the reliability and validity of this promising detector.
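In the meantime, the only sanity check I can do is to benchmark the paper's reported numbers against standard PyOD baselines on the same datasets, along these lines (toy data here as a placeholder; KDE, HBOS, and IForest are all available in PyOD):

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from pyod.models.hbos import HBOS
from pyod.models.iforest import IForest
from pyod.models.kde import KDE

rng = np.random.default_rng(42)
# Toy data as a placeholder: 500 inliers around the origin plus
# 25 scattered outliers; swap in the paper's datasets for a real test.
X = np.vstack([rng.normal(0, 1, size=(500, 2)),
               rng.uniform(-6, 6, size=(25, 2))])
y = np.r_[np.zeros(500), np.ones(25)]

for name, model in [("KDE", KDE()), ("HBOS", HBOS()),
                    ("IForest", IForest(random_state=42))]:
    model.fit(X)
    # decision_scores_ holds training outlier scores (higher = more outlying)
    print(f"{name}: ROC-AUC = {roc_auc_score(y, model.decision_scores_):.3f}")
```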
