Issues with the DenStream Implementation #1555

CayJoBla · 2024-06-10T17:37:58Z

Hello River community,

As I've been using River's implementation of the DenStream clustering algorithm, I have noticed several inconsistencies and potential bugs compared to the original algorithm described in the paper. I wanted to share my findings and see if others have observed similar issues. Additionally, I am new to contributing to open-source projects, but I am willing to contribute to fixing these issues and implementing the algorithm correctly, if that is okay.

Identified Issues

Initialization Inconsistency: The current implementation uses $\mu$ as the threshold when first initializing the p-micro-clusters using DBSCAN (line 305), which is not consistent with the original paper's use of the $\beta \mu$ threshold (see "Initialization").
Overwriting o-Micro-Clusters: The way that micro-clusters are deleted and added results in an indexing error that potentially overwrites micro-clusters. This is because micro-clusters are stored in a dictionary, and when a micro-cluster is deleted from the dictionary, the indices are not updated. This is seen in the following sections of the source code:
- The _merge method
  - Deleting an o-micro-cluster (line 221)
  - Assigning p-micro-cluster to index based on dictionary length (line 222)
  - Assigning o-micro-cluster to index based on dictionary length (line 232)
- The learn_one method
  - Deleting a p-micro-cluster (line 339)
  - Deleting an o-micro-cluster (line 352)
Prediction Neighbor Search: In the predict_one method during clustering, the current implementation adds neighbors of neighbors that have already been labeled, instead of adding the ones that have not been labeled (lines 386-388). As a result, already-labeled neighbors of neighbors are added and then removed later (lines 378-379), so the algorithm never completes the full BFS.
Final Clustering: The final clustering prediction does not allow for noise or outlier points, as it naively finds the closest cluster to each point without considering the $\epsilon$ threshold. This results in all points being assigned to a cluster, even if they are not within the $\epsilon$ neighborhood of any cluster. Furthermore, the current implementation doesn't appear to consider cluster weight in the final clustering result, as mentioned in section 4.2 of the paper.
Micro-Cluster Deletion: The implementation does not handle the deletion of micro-clusters as described in section 5 of the paper. For example, it does not address the issue of overlapping micro-clusters.
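
To make the overwriting issue concrete, here is a minimal sketch, independent of River's code, of how keying a new entry by the dictionary's length can replace an existing entry after a deletion. The names `add_by_length` and `ClusterStore` and the string placeholders are hypothetical stand-ins, not River's API, and the monotonic counter is only one possible fix.

```python
from itertools import count


def add_by_length(store, item):
    """Insert an item using len(store) as its key (the pattern described above)."""
    store[len(store)] = item


# Hypothetical stand-ins for three micro-clusters keyed 0, 1, 2.
buggy = {0: "mc_a", 1: "mc_b", 2: "mc_c"}
del buggy[0]                   # keys 1 and 2 remain, so len(buggy) == 2
add_by_length(buggy, "mc_d")   # key 2 is already taken, so "mc_c" is replaced


class ClusterStore:
    """Assumed alternative: keys come from a counter that never reuses a value."""

    def __init__(self):
        self._next_id = count()
        self.items = {}

    def add(self, item):
        key = next(self._next_id)
        self.items[key] = item
        return key

    def remove(self, key):
        del self.items[key]


safe = ClusterStore()
for name in ("mc_a", "mc_b", "mc_c"):
    safe.add(name)
safe.remove(0)
safe.add("mc_d")               # receives key 3, nothing is overwritten

print(buggy)       # "mc_c" is gone
print(safe.items)  # all three surviving entries are present
```

Re-indexing the dictionary after every deletion would also avoid the collision, at the cost of changing the keys that other code may be holding on to.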

I would greatly appreciate it if others could confirm these issues or provide any additional insights. A lot of these seem like fairly simple fixes, but I wanted to make sure I wasn't missing anything before I started working on them.

Thank you!

Edit: I think the final clustering as it stands actually does consider the cluster weight when determining whether two micro-clusters are directly density-reachable.

smastelini · 2024-06-13T15:03:27Z

Maintainer

Pinging @hoanganhngo610 :D

CayJoBla · 2024-06-13T19:42:53Z

Author

As I have looked more into the code, I have also discovered the following issues:

The DenstreamMicroCluster class does not properly handle the running weighted linear sum and squared sum values, resulting in erroneous computations of the center, radius, and weight values for each micro-cluster. The discrepancy arises because the current implementation keeps a running sum and applies a single decay factor to the entire sum, whereas the paper uses a separate weight for each point.
As part of the merge operation, micro-clusters are copied so that a new point can be added to test the radius value. If the radius does not meet a certain threshold, the point is not officially added to the micro-cluster. In the current implementation, the copy operation is not deep, so adding the point to the copy still modifies values of the original cluster (specifically the linear and squared sums), even when the point is not officially added.
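
The copy problem can be reproduced outside River with a toy class. The sketch below assumes the sums are held in a mutable dict attribute; `ToyMicroCluster` and its fields are hypothetical and are not River's actual class. It shows that `copy.copy` shares that dict with the original, while `copy.deepcopy` does not.

```python
import copy


class ToyMicroCluster:
    """Hypothetical stand-in for a micro-cluster; not River's implementation."""

    def __init__(self, linear_sum):
        self.linear_sum = linear_sum  # dict: feature name -> running sum
        self.n = 1                    # number of points absorbed

    def insert(self, x):
        # Mutates the dict in place, which is what makes shallow copies unsafe.
        for key, value in x.items():
            self.linear_sum[key] = self.linear_sum.get(key, 0.0) + value
        self.n += 1


# Shallow copy: the trial insert changes the original's sums as well.
original = ToyMicroCluster({"x": 1.0})
trial = copy.copy(original)
trial.insert({"x": 5.0})

# Deep copy: the trial insert leaves the original untouched.
original_deep = ToyMicroCluster({"x": 1.0})
trial_deep = copy.deepcopy(original_deep)
trial_deep.insert({"x": 5.0})

print(original.linear_sum, original.n)            # sums changed, count did not
print(original_deep.linear_sum, original_deep.n)  # unchanged
```

Note that the integer counter stays correct in the shallow case because `+=` rebinds it on the copy, so only the in-place mutated containers leak back into the original.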

hoanganhngo610 · 2024-06-14T10:51:28Z

Maintainer

Hi @CayJoBla. Thank you very much for your discussion. I will have a look at it ASAP and will make the changes wherever appropriate.
