Abstract
The rapid growth of high-dimensional datasets in recent years has created a pressing
need to extract the knowledge underlying them. Clustering is the process of automatically
finding groups of similar data points in the space of the dimensions or attributes
of a dataset. Finding clusters in high-dimensional datasets is an important and
challenging data mining problem. Data group together differently under different subsets
of dimensions, called subspaces. Quite often a dataset can be better understood by
clustering it in its subspaces, a process called subspace clustering. The number of
subspaces, however, grows exponentially with the dimensionality of the data, which makes
subspace clustering computationally very expensive. There is a growing
demand for efficient and scalable subspace clustering solutions in many big data application
domains such as biology, computer vision, astronomy and social networking. Apriori-based
hierarchical clustering is a promising approach to finding all possible higher-dimensional
subspace clusters from the lower-dimensional clusters using a bottom-up process. However,
the performance of the existing algorithms based on this approach deteriorates drastically
with the increase in the number of dimensions. Most of these algorithms require multiple
database scans and generate a large number of redundant subspace clusters, either
implicitly or explicitly, during the clustering process. In this paper, we present
SUBSCALE, a novel clustering algorithm that finds non-trivial subspace clusters with
minimal cost and requires only k database scans for a k-dimensional dataset. Our
algorithm scales very well with the dimensionality of the dataset and is highly
parallelizable. We describe the SUBSCALE algorithm in detail and report its evaluation.