Font Size: a A A

Research Of Clustering Methods On High Dimensional Data

Posted on:2015-01-05Degree:DoctorType:Dissertation
Country:ChinaCandidate:Y Z RenFull Text:PDF
GTID:1268330422981519Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
In recent years, with the rapid development of information technology and sharp increaseof the size of data, how to effectively analyze these massive amount of data and extract use-ful information has become one of the research hotspots and difficulties. Cluster analysis, anunsupervised machine learning method, attracts more and more attentions and has developedrapidly. It has been used in bioinformatics, Internet technology and image analysis, et al. Mostclustering algorithms can achieve good performance on low-dimensional data, but will occur”curse of dimensionality” when dealing with high-dimensional data. With the increase of di-mensionality, the data becomes sparse, distances among objects tend to be the same and noisyand redundant features increase as well, thereby reducing the effectiveness of clustering algo-rithms. Therefore, research of clustering methods on high-dimensional data becomes a popularand difficult reasearch area. Meanwhile, many clustering algorithms have a lot of constraints onthe data, such as restrictions on the number of clusters and on their shapes, which probably arenot satisfied in real-world tasks. So how to design effective nonparameter clustering techniquesis very important as well.To address the problem associated with clustering on high-dimensional data, we make useof ensemble technique and Boosting, et al., and propose several novel clustering methods. Themain contributions of the thesis are summaried as follows:(1) Ensemble clustering aims to generate a stable and robust clustering through the con-solidation of multiple base clusterings. In recent years many ensemble clustering methods havebeen proposed, most of which treat each clustering and each object as equally important. Someapproaches make use of weights associated with clusters, or with clusterings, when assemblingthe different base clusterings. However, not much effort has been put towards incorporatingweighted objects into the consensus process. To fill this gap, in this paper we propose an ap-proach called Weighted-Object Ensemble Clustering (WOEC). We first estimate how difficultit is to cluster an object by constructing the co-association matrix that summarizes the baseclustering results, and we then embed the corresponding information as weights associated toobjects. We propose three different consensus techniques to leverage the weighted objects. Allthree reduce the ensemble clustering problem to a graph partitioning one. We present extensive experimental results which demonstrate that our WOEC approach outperforms state-of-the-artconsensus clustering methods and is robust to parameter settings.(2) The mean shift algorithm is a nonparametric clustering technique that does not makeassumptions on the number of clusters and on their shapes. It achieves this goal by performingkernel density estimation, and iteratively locating the local maxima of the kernel mixture. Thesetofpointsthatconvergetothesamemodedefinesacluster. Whileappealing, theperformanceof the mean shift algorithm significantly deteriorates with high dimensional data due to thesparsity of the input space. Noisy features can also disguise the mean shift procedure.Inthispaperweextendthemeanshiftalgorithmtoovercometheselimitations, whilemain-taining its desirable properties. To achieve this goal, we first estimate the relevant subspace foreachdatapoint, andthenembedsuchinformationwithinthemeanshiftalgorithm, thusavoidingcomputing distances in the full dimensional input space. The resulting approach achieves thebest-of-two-worlds: effective management of high dimensional data and noisy features, whilepreserving a nonparametric nature. Our approach can also be combined with random samplingto speedup the clustering process with large scale data, without sacrificing accuracy. Extensiveexperimental results on both synthetic and real-world data demonstrate the effectiveness of theproposed method.(3)Mean shift is a nonparametric clustering technique that does not require the number ofclusters in input and can find clusters of arbitrary shapes. While appealing, the performance ofthe mean shift algorithm is sensitive to the selection of the bandwidth, and can fail to capture thecorrect clustering structure when multiple modes exist in one cluster. DBSCAN is an efficientdensity based clustering algorithm, but it is also sensitive to its parameters and typically mergesoverlapping clusters. In this paper we propose Boosted Mean Shift Clustering (BMSC) to ad-dress these issues. BMSC partitions the data across a grid and applies mean shift locally on thecells of the grid, each providing a number of intermediate modes (iModes). A mode-boostingtechnique is proposed to select points in denser regions iteratively, and DBSCAN is utilized topartition the obtained iModes iteratively. Complexity analysis shows its potential to deal withlarge-scale data and extensive experimental results on both synthetic and real benchmark datademonstrate its effectiveness and robustness to parameter settings.
Keywords/Search Tags:Clustering, High Dimensional Data, Ensemble Clustering, Mean Shift
PDF Full Text Request
Related items