An implement of OPTICS(Ordering Points To Identify The Clustering Structure) algorithm in python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from optics import OPTICS
OPTICS
class need a pandas DataFrame
to construct, the by-passed DataFrame
's labels colum should have been trimmed off. The native distance calculating method was only designed for contineous attributes. If need to handle nominal or categorical attributes, re-write the distance calculating function.
df = pd.read_table("dataset.txt")
df1 = df.iloc[:,:-1]
model = OPTICS(df1)
Run the ordering method, this could take some minutes when deal with a large amount of data. 2 Arguments need to be passed in, the Eps
for max epsilon and MinPts
for min points number. Usually, Eps
could be set with inf
and MinPts
with the dimensions of data entries + 1.
Notice: the MinPts
does NOT contain the core point itself
model.optics(Eps=float('inf'), MinPts=4)
Now the ordering result should be written to model.result_queue
, e.g. model.result_queue[0]
is 17 means data entry df1.iloc[17,:]
is the first element of ordering result queue.
And in model.core_distances
and model.reachabel_distances
are the cd and rd of every point accroding to the order of dataframe (not result queue).
Accroding to spirit of OPTICS, now an ordered reachable distance graph should be painted to see which Eps
we should choose to get the best clustering result. This module dosen't provide visualization method, a 3rd-party plot lib is needed such as matplotlib
.
Once you decide which Eps
to use, you can extract the final clustering result into model.cluster_labels
, it's ordered by the dataframe instead of model.result_queue
.
model.cluster_extract(Eps=4)
- python after version 3.5
- numpy
- pandas
- matplotlib