Replies: 3 comments
- Pinging @hoanganhngo610 :D
- As I have looked more into the code, I have also discovered the following issues:
- Hi @CayJoBla. Thank you very much for your discussion. I will have a look at it ASAP and will make the changes wherever appropriate.
Hello River community,
As I've been using River's implementation of the DenStream clustering algorithm, I have noticed several inconsistencies and potential bugs compared to the original algorithm described in the paper. I wanted to share my findings and see if others have observed similar issues. Additionally, I am new to contributing to open-source projects, but I am willing to contribute to fixing these issues and implementing the algorithm correctly, if that is okay.
Identified Issues
Initialization Inconsistency: The current implementation uses $\mu$ as the threshold when first initializing the p-micro-clusters using DBSCAN (line 305), which is not consistent with the original paper's use of the $\beta \mu$ threshold (see "Initialization").
Overwriting o-Micro-Clusters: The way that micro-clusters are deleted and added results in an indexing error that potentially overwrites micro-clusters. This is because micro-clusters are stored in a dictionary, and when a micro-cluster is deleted from the dictionary, the indices are not updated. This is seen in the following sections of the source code:
- `_merge` method
- `learn_one` method
Prediction Neighbor Search: In the `predict_one` method during clustering, the current implementation adds neighbors of neighbors that have already been labeled, instead of adding the ones that have not been labeled (lines 386-388). This results in labeled neighbors of neighbors being added and then removed later (lines 378-379), and prevents the algorithm from completing the full BFS.
Final Clustering: The final clustering prediction does not allow for noise or outlier points, as it naively finds the closest cluster to each point without considering the $\epsilon$ threshold. This results in all points being assigned to a cluster, even if they are not within the $\epsilon$ neighborhood of any cluster. Furthermore, the current implementation doesn't appear to consider cluster weight in the final clustering result, as mentioned in section 4.2 of the paper.
Micro-Cluster Deletion: The implementation does not handle the deletion of micro-clusters as described in section 5 of the paper. For example, it does not address the issue of overlapping micro-clusters.
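The overwriting issue listed above can be reproduced in isolation. The following is only a minimal sketch, not River's actual code: it assumes micro-clusters are kept in a dict with integer keys and that a new entry is keyed by the current length of the dict, which is an assumption made for illustration. The names `add_by_length` and `add_by_next_key` are hypothetical.

```python
def add_by_length(clusters: dict, item) -> None:
    """Insert using len(clusters) as the new key (the assumed buggy pattern)."""
    clusters[len(clusters)] = item


def add_by_next_key(clusters: dict, item) -> None:
    """Insert using a key one past the largest existing key."""
    clusters[max(clusters, default=-1) + 1] = item


# Three micro-clusters stored under keys 0, 1, 2.
buggy = {0: "mc_a", 1: "mc_b", 2: "mc_c"}
safe = dict(buggy)

# Delete the middle entry; the remaining keys are 0 and 2.
del buggy[1]
del safe[1]

# len(buggy) == 2, so the new item lands on key 2 and replaces "mc_c".
add_by_length(buggy, "mc_new")
# The largest key is 2, so the new item lands on key 3 instead.
add_by_next_key(safe, "mc_new")

print(buggy)  # {0: 'mc_a', 2: 'mc_new'}
print(safe)   # {0: 'mc_a', 2: 'mc_c', 3: 'mc_new'}
```

Other fixes would work as well, such as re-indexing the dict after each deletion or keeping a separate monotonically increasing counter; which one suits the existing code best is an open question.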
I would greatly appreciate it if others could confirm these issues or provide any additional insights. A lot of these seem like fairly simple fixes, but I wanted to make sure I wasn't missing anything before I started working on them.
Thank you!
Edit: I think the final clustering as it stands actually does consider the cluster weight when calculating whether two micro-clusters are directly density-reachable.
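To make the neighbor-search point concrete, here is a generic breadth-first search over micro-cluster indices. It is a sketch under stated assumptions rather than River's implementation: `label_components` and the `reachable` predicate are hypothetical names, and the predicate stands in for whatever directly density-reachable test (distance and weight) the implementation applies. The relevant detail is that only neighbors that have not yet been labeled are added to the queue.

```python
from collections import deque


def label_components(n, reachable):
    """Label indices 0..n-1 by BFS; `reachable(i, j)` is a hypothetical predicate."""
    labels = {}
    next_label = 0
    for start in range(n):
        if start in labels:
            continue  # already assigned to an earlier cluster
        labels[start] = next_label
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for other in range(n):
                # Enqueue only neighbors that have NOT been labeled yet.
                if other not in labels and reachable(current, other):
                    labels[other] = next_label
                    queue.append(other)
        next_label += 1
    return labels


# Toy 1-D centers: a chain 0.0 - 1.0 - 2.0 and an isolated center at 10.0.
centers = [0.0, 1.0, 2.0, 10.0]
labels = label_components(
    len(centers), lambda i, j: abs(centers[i] - centers[j]) <= 1.5
)
print(labels)  # {0: 0, 1: 0, 2: 0, 3: 1}
```

In the toy data, index 2 is not reachable from index 0 directly and is only found through index 1, which is the case that breaks if already-labeled neighbors are enqueued in place of unlabeled ones.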