
Commit b446096

committed
Complete requested edits
1 parent a032d62 commit b446096

File tree

6 files changed: +78 −49 lines changed


chapter_2/2_dbscan.tex

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@ \section{The DBSCAN algorithm}
 dense region, where dense is defined precisely as follows: given a point $x$,
 consider the $\epsilon$-ball around it. If the $\epsilon$-ball has
 at least $m$ data points around it, then we consider it a core point. Formally,
-if we denote the set of core points as $C$, then $C=\{x \in X: |X \cap B(x,\epsilon)|\geq m\}$.
+if we denote the set of core points as $C$, then $C=\{x \in X: |X \cap B(x,\epsilon)|\geq m\}$.
 \item \textbf{Border points}: These are points that do not satisfy the $\epsilon$-ball
 density condition for core points, but are within $\epsilon$ of a core point.
 Note that if a border point is within $\epsilon$ of more than one core point
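The core-point condition in the hunk above can be sketched in Python. This is an illustration of the definition only, not code from the repository; the function name and the convention of counting $x$ itself inside its own $\epsilon$-ball are our assumptions.

```python
import numpy as np

def core_points(X, eps, m):
    """Indices of core points: points whose eps-ball contains at least
    m dataset points (each point counts itself, at distance 0)."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances between all points: n x n.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # |X ∩ B(x, eps)| >= m for each point x.
    return np.flatnonzero((d <= eps).sum(axis=1) >= m)

# Two nearby points plus one distant outlier: only the close pair are cores.
X = [[0.0, 0.0], [0.5, 0.0], [10.0, 0.0]]
print(core_points(X, eps=1.0, m=2))
```

The quadratic pairwise-distance matrix is what DBSCAN++ (below in this commit) avoids computing in full.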

chapter_2/3_dbscan_variations.tex

Lines changed: 36 additions & 22 deletions
@@ -1,18 +1,18 @@
 % Contributors: William Tong, Will Zheng
 \section{Variations on a Theme of DBSCAN}

-With an algorithm as old and widely adopted as DBSCAN, many have dedicated considerable energy towards improving the method. Common themes have included accelerating the runtime, improving robustness, alleviating sensitivity to parameters, and discovering clusters of varying densities. To name some examples:\cite{dbscan_improvs}
+With an algorithm as old and widely adopted as DBSCAN, many have dedicated considerable energy towards improving the method. Common themes have included accelerating the runtime, improving robustness, alleviating sensitivity to parameters, and discovering clusters of varying densities. To give a flavor of the breadth, some examples are:\cite{dbscan_improvs}

 \begin{itemize}
-\item \textbf{GDBSCAN}: Generalized DBSCAN: generalizes the algorithm to non-Euclidean spaces
-\item \textbf{ST-DBSCAN}: Spatial-Temporal DBSCAN: considers an added time dimension
-\item \textbf{DVBSCAN}: Density Variation Based Spatial Clustering of Applications with Noise: addresses local density variation within a cluster
-\item \textbf{MR-DBSCAN}: MapReduce DBSCAN: uses the popular big data algorithm to speed computation; also addresse skewed data
-\item \textbf{HDBSCAN\textsuperscript{*}}: Hierarchical DBSCAN: excludes border points from clusters; more closely follows the original algorithm proposed by Hartigan
-\item \textbf{PACA-DBSCAN}: Polymorphic Ant Colony Algorithm DBSCAN: applies ant colony optimization techniques to aid clustering in higher dimensions
+\item \textbf{GDBSCAN} (Generalized DBSCAN): generalizes the algorithm to non-Euclidean spaces, for example general metric spaces
+\item \textbf{ST-DBSCAN} (Spatial-Temporal DBSCAN): considers an added time dimension when clustering
+\item \textbf{DVBSCAN} (Density Variation Based Spatial Clustering of Applications with Noise): addresses local density variation within a cluster, allowing decent performance on clusters with non-uniform density
+\item \textbf{MR-DBSCAN} (MapReduce DBSCAN): uses the popular big-data framework to speed computation; also addresses clusters with uneven density distribution
+\item \textbf{HDBSCAN\textsuperscript{*}} (Hierarchical DBSCAN): extends DBSCAN to extract hierarchical clusters
+\item \textbf{PACA-DBSCAN} (Polymorphic Ant Colony Algorithm DBSCAN): applies ant colony optimization (ACO) techniques to aid clustering in higher dimensions. ACOs reduce the problem to a graph traversal setting, finding an optimal path through exploration methods inspired by ants.
 \end{itemize}

-In the following sections, we will focus particularly on two improvements: DBSCAN++ and OPTICS \\
+In the following sections, we will focus particularly on two improvements that have shown substantial promise or success: DBSCAN++ and OPTICS. \\

 \noindent\textbf{DBSCAN++}

@@ -37,7 +37,12 @@ \section{Variations on a Theme of DBSCAN}
 \end{algorithmic}
 \end{algorithm}

-Hence instead of iterating over all points to discover cores, we iterate over a subset $S$. The cluster queries then occur only for core points within the subset, reducing the overall runtime to $O(sn)$, where $s = |S| < n$, which is sub-quadratic. See Figure \ref{fig:runtime} for a direct comparison of runtimes.
+Hence instead of iterating over all points to discover cores, we iterate over a subset $S$. The cluster queries then occur only for core points within the subset, reducing the overall runtime to $O(sn)$, where $s = |S| < n$. In fact, under reasonable regularity conditions, it can be shown that the runtime can be subquadratic:
+
+$$ O\left(n^{2 - 2\beta/(2\beta + D)}\right) $$
+
+where $D$ is the number of dimensions and $\beta$ is a positive constant representing a smoothness assumption placed on the density level sets (elaborated below in the description of Robust DBSCAN). The bound is achieved when $|S|$ is roughly greater than $n^{D/(2\beta + D)}$, on a dataset with $n$ points. Note that a higher $\beta$ corresponds to a ``smoother'' dataset, where the differences in density are harder to discern, resulting in a smaller number of larger clusters. The chance of a random sample including core points from a cluster rises. This effect is reflected in the runtime, in which a higher $\beta$ yields a faster runtime. On the other hand, higher-dimensional data implies sampling more points to cover the space. So as $D$ increases, the algorithm naturally slows.
+
+See Figure \ref{fig:runtime} for a direct comparison of runtimes.
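The subsample-then-query idea behind the $O(sn)$ bound can be sketched as follows. This is our own minimal illustration of the sampling step (uniform variant), not the commit's pseudocode; the function name and signature are assumptions.

```python
import numpy as np

def dbscan_pp_cores(X, eps, m, s, rng=None):
    """DBSCAN++-style core discovery: test the core condition only on a
    uniform subsample S of size s, costing O(s*n) distance computations
    instead of the O(n^2) of plain DBSCAN."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    S = rng.choice(len(X), size=s, replace=False)
    # Distances from sampled points to the full dataset: s x n, not n x n.
    d = np.linalg.norm(X[S][:, None, :] - X[None, :, :], axis=-1)
    # Keep sampled indices whose eps-ball (over the full dataset) has >= m points.
    return S[(d <= eps).sum(axis=1) >= m]
```

Cluster assignment then proceeds as in DBSCAN, but expanding only from the discovered sampled cores. With $s = n$ the recovered core set coincides with DBSCAN's.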

 \begin{figure}[h]
 \centering
@@ -52,7 +57,7 @@ \section{Variations on a Theme of DBSCAN}
 \item \textbf{K-Center}: greedy furthest-first traversal approximation to the K-Center problem applied to the dataset, where $k$ is the number of elements in the sample
 \end{enumerate}

-Both methods have the same consistency guarantees for approximating the density in a dataset. But in practice, the K-Center approach may yield a more even spread of samples than uniform sampling, potentially contributing to better result (see Figure \ref{fig:dbscan_plus}).
+Both methods have the same consistency guarantees for approximating the density in a dataset. But in practice, the K-Center approach may yield a more even spread of samples than uniform sampling, potentially contributing to better results (see Figures \ref{fig:dbscan_plus} and \ref{fig:dbscan_plus_scores}).
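The greedy furthest-first traversal named in the K-Center item can be sketched in a few lines. This is a generic rendering of the standard heuristic (start point and names are our choices, not from the commit):

```python
import numpy as np

def k_center_sample(X, k, start=0):
    """Greedy furthest-first traversal: repeatedly add the point furthest
    from the current sample, a 2-approximation to the K-Center problem."""
    X = np.asarray(X, dtype=float)
    chosen = [start]
    # Distance from every point to its nearest chosen center so far.
    d = np.linalg.norm(X - X[start], axis=1)
    while len(chosen) < k:
        nxt = int(np.argmax(d))          # furthest point joins the sample
        chosen.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return chosen
```

Because each new sample is maximally far from the existing ones, the resulting subset spreads evenly over the dataset, which is the intuition behind its empirical edge over uniform sampling.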

 \begin{figure}[h]
 \centering
@@ -61,45 +66,52 @@ \section{Variations on a Theme of DBSCAN}
 \label{fig:dbscan_plus}
 \end{figure}

+\begin{figure}[h]
+\centering
+\includegraphics[scale=0.20,angle=90]{chapter_2/files/dbscan_plus_scores.png}
+\caption{An empirical comparison of DBSCAN++ and DBSCAN on different datasets. Rand index and mutual information scores are generated to compare the clusters produced by DBSCAN and the two sampling methods of DBSCAN++, with varying $\epsilon$. It is apparent that DBSCAN++ is able to achieve comparable performance to DBSCAN, despite sampling a fraction of the total points. Source: \cite{dbscanpp}}
+\label{fig:dbscan_plus_scores}
+\end{figure}
+
 \noindent \textbf{OPTICS}

 An important issue with DBSCAN is its inability to capture clusters of varying density. Furthermore, cluster density must be manually tuned with the $\epsilon$ bandwidth parameter. Small changes in $\epsilon$ may devastate the quality of the resulting clusters.

-A popular solution to both problems is OPTICS: Ordering Points To Identify the Clustering Structure. The basic setup is similar to DBSCAN, but defines two additional metrics: core distance and reachability distance. Suppose $p_{(m)}$ is the $m$th closest point to $p$. Then with $\epsilon$ bandwidth and $m$ minimum points to qualify a cluster, we define:\cite{wiki:optics}, \cite{medium:optics}
+A popular solution to both problems is OPTICS: Ordering Points To Identify the Clustering Structure. The basic setup is similar to DBSCAN, but defines two additional metrics: core distance and reachability distance. Suppose $p_{(m)}$ is the $m$th closest point to $p$. Then with $\epsilon$ bandwidth and $m$ minimum points to qualify a cluster, considering the $\epsilon$-neighborhood $N_\epsilon(p)$ of a point $p$, we define the values:

-\begin{align}
+\begin{align*}
 \text{core-dist}_{\epsilon, m}(p) &=
 \begin{cases}
-d(p, p_{(m)}) & |N_\epsilon(p)| \geq m \\
+d(p, p_{(m)}) & \text{if}\; |N_\epsilon(p)| \geq m \\
 \text{undefined} & \text{otherwise} \\
 \end{cases}\\
 \text{reachability-dist}_{\epsilon, m}(p, o) &=
 \begin{cases}
-\max \lbrace d(p, o), \text{core-dist}_{\epsilon, m} (p) \rbrace & |N_\epsilon(p)| \geq m \\
+\max \lbrace d(p, o), \text{core-dist}_{\epsilon, m} (p) \rbrace & \text{if}\; |N_\epsilon(p)| \geq m \\
 \text{undefined} & \text{otherwise} \\
 \end{cases}\\
-\end{align}
+\end{align*} \cite{wiki:optics}, \cite{medium:optics}

 First, observe that both metrics apply only to core points. For border and noise points, they are undefined. Indeed, under most OPTICS implementations, border and noise points do not earn membership in any cluster, and are ignored.

 \begin{figure}[h]
 \centering
 \includegraphics[scale=0.42]{chapter_2/files/optics_distances.png}
-\caption{An example of core and reachability distances, from a core point $p$ and $m=4$. Source: \cite{medium:optics}}
-\label{fig:dbscan_plus}
+\caption{An example of core and reachability distances, from a core point $p$ and $m=4$. $\epsilon$ remains the bandwidth from the original DBSCAN algorithm. $\epsilon'$ is the core distance. Source: \cite{medium:optics}}
+\label{fig:optics}
 \end{figure}

 Core distance is a measure of a core point's local density. If $\text{core-dist}_{\epsilon, m}(p) = \epsilon'$, then $\epsilon' \ll \epsilon$ implies that the core point must be in a region of high density.

-Reachability distance is a measure of the distance between a core point $p$ and a nearby point $o$. In most implementations, $o$ is drawn from $N_\epsilon(p)$, in which case reachability distancec will always be less than $\epsilon$. If $o$ is especially close to $p$, then reachability distance is the same as core distance. Otherwise, it is the actual distance between the two points.
+Reachability distance is a measure of the distance between a core point $p$ and a nearby point $o$. In most implementations, $o$ is drawn from $N_\epsilon(p)$, in which case reachability distance will always be less than $\epsilon$. If $o$ is especially close to $p$ (i.e.\ within $\epsilon'$ distance of $p$), then reachability distance is the same as core distance. Otherwise, it is the actual distance between the two points.

-The goal of OPTICS is to assign a reachability distance to all core points between themselves and the closest, untraversed neighbor. In this way, the algorithm proceeds along a greedy shortest path traversal of the dataset, generating a reachability plot that encodes the clusters.
+The goal of OPTICS is to assign a reachability distance to all core points between themselves and the closest neighbor. In this way, the algorithm proceeds along a greedy shortest path traversal of the dataset, generating a reachability plot that encodes the clusters.

 The algorithm proceeds as follows:

 \begin{algorithm}[H]
 \caption{OPTICS Algorithm}
-\label{dbscan++ alg}
+\label{optics alg}
 \begin{algorithmic}[1]
 \renewcommand\algorithmicrequire{\textbf{input}}
 \REQUIRE $X = \{x_1 \cdots x_n\}$, $\epsilon \in \bbR^+$, $m \in \bbN^*$
@@ -115,6 +127,8 @@ \section{Variations on a Theme of DBSCAN}
 \RETURN $P$
 \end{algorithmic}
 \end{algorithm}
+
+$P$ encodes the set of reachability distances. To produce it, we traverse the core points, tracking the current point $p$ and gradually performing a shortest-path traversal until all core points have been touched. The output is the set of reachability distances (along with their corresponding points) in $P$.
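The traversal that produces $P$ can be sketched with a priority queue. This is our reconstruction of the standard OPTICS loop (as described in the literature, not the commit's pseudocode); it greedily visits the unprocessed point with the smallest tentative reachability and records the ordered `(point, reachability)` pairs.

```python
import heapq
import numpy as np

def optics_order(X, eps, m):
    """Sketch of OPTICS: return the ordered list of (point, reachability)
    pairs that make up the reachability plot P."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    # core-dist per point: distance to the m-th closest other point, or None.
    core = [np.sort(D[i])[m] if (D[i] <= eps).sum() >= m else None
            for i in range(n)]
    reach = [None] * n          # best reachability seen so far, per point
    visited = [False] * n
    order = []
    for start in range(n):      # restart for each disconnected component
        if visited[start]:
            continue
        heap = [(0.0, start)]
        while heap:
            _, p = heapq.heappop(heap)
            if visited[p]:
                continue        # stale heap entry
            visited[p] = True
            order.append((p, reach[p]))
            if core[p] is None:
                continue        # non-core points do not expand the frontier
            for o in np.flatnonzero(D[p] <= eps):
                if visited[o]:
                    continue
                r = max(D[p, o], core[p])       # reachability-dist(p, o)
                if reach[o] is None or r < reach[o]:
                    reach[o] = r
                    heapq.heappush(heap, (r, o))
    return order
```

Each component's first point keeps an undefined (`None`) reachability, which is why reachability plots show a spike at the start of every new cluster.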

 One important quality to observe is that the bandwidth parameter $\epsilon$ has less direct impact on clustering than in traditional DBSCAN. Indeed, an $\epsilon$ large enough to cover every data point may still induce high-quality clusters, though of course runtime will suffer (worst case is $O(n^2)$). In this way, $\epsilon$ functions more like a check on the minimum density for which a cluster is interesting.

@@ -125,8 +139,8 @@ \section{Variations on a Theme of DBSCAN}
 \begin{figure}[h]
 \centering
 \includegraphics[scale=0.36]{chapter_2/files/optics_run.png}
-\caption{An example run of OPTICS. The upper left is a plot of the dataset. The bottom shows the reachability plot. The upper right shows the resulting clusters. Overall, note that the varying densities of the clusters would pose a significant problem to DBSCAN. Source: \cite{wiki:optics}}
+\caption{An example run of OPTICS. The upper left is a plot of the dataset. The bottom shows the reachability plot. The upper right shows the resulting clusters. A separate algorithm is needed to perform postprocessing to extract the clusters. Overall, note that the varying densities of the clusters would pose a significant problem to DBSCAN. Source: \cite{wiki:optics}}
 \label{fig:optics_plot}
 \end{figure}

-Of particular importance is the reachability plot at the bottom. The horizontal axis corresponds to the order in which the dataset was traversed, and the vertical axis the calculated reachability score. Because of the greedy shortest path traversal, points of the same cluster are also adjacent on the reachability plot. Regions of short reachability distance also appear as valleys. Denser clusters will have deeper values than sparser clusters, but the valley formation nonetheless remains intact. A separate algorithm (or manual identification) may then be used to extract the valley points as clusters, completing the process.
+Of particular importance is the reachability plot at the bottom of the figure. The horizontal axis corresponds to the order in which the dataset was traversed, and the vertical axis to the calculated reachability score. Because of the greedy shortest path traversal, points of the same cluster are adjacent on the reachability plot. Regions of short reachability distance appear as valleys. Denser clusters will have deeper valleys than sparser clusters, but the valley formation nonetheless remains intact. A separate algorithm (or manual identification) may then be used to extract the valley points as clusters, completing the process. Note also that the version of the algorithm presented here has no way of addressing potential noise points in the data, but extensions have been proposed that do provide meaningful ways of doing so (for example, see \href{https://link.springer.com/content/pdf/10.1007\%2F978-3-540-48247-5_28.pdf}{OPTICS-OF}).
