Research Of Clustering Methods On High Dimensional Data

Posted on:2015-01-05

Degree:Doctor

Type:Dissertation

Country:China

Candidate:Y Z Ren

Full Text:PDF

GTID:1268330422981519

Subject:Computer application technology

Abstract/Summary:

PDF Full Text Request

In recent years, with the rapid development of information technology and sharp increaseof the size of data, how to effectively analyze these massive amount of data and extract use-ful information has become one of the research hotspots and difficulties. Cluster analysis, anunsupervised machine learning method, attracts more and more attentions and has developedrapidly. It has been used in bioinformatics, Internet technology and image analysis, et al. Mostclustering algorithms can achieve good performance on low-dimensional data, but will occur”curse of dimensionality” when dealing with high-dimensional data. With the increase of di-mensionality, the data becomes sparse, distances among objects tend to be the same and noisyand redundant features increase as well, thereby reducing the effectiveness of clustering algo-rithms. Therefore, research of clustering methods on high-dimensional data becomes a popularand difficult reasearch area. Meanwhile, many clustering algorithms have a lot of constraints onthe data, such as restrictions on the number of clusters and on their shapes, which probably arenot satisfied in real-world tasks. So how to design effective nonparameter clustering techniquesis very important as well.To address the problem associated with clustering on high-dimensional data, we make useof ensemble technique and Boosting, et al., and propose several novel clustering methods. Themain contributions of the thesis are summaried as follows:(1) Ensemble clustering aims to generate a stable and robust clustering through the con-solidation of multiple base clusterings. In recent years many ensemble clustering methods havebeen proposed, most of which treat each clustering and each object as equally important. Someapproaches make use of weights associated with clusters, or with clusterings, when assemblingthe different base clusterings. However, not much effort has been put towards incorporatingweighted objects into the consensus process. To fill this gap, in this paper we propose an ap-proach called Weighted-Object Ensemble Clustering (WOEC). We first estimate how difficultit is to cluster an object by constructing the co-association matrix that summarizes the baseclustering results, and we then embed the corresponding information as weights associated toobjects. We propose three different consensus techniques to leverage the weighted objects. Allthree reduce the ensemble clustering problem to a graph partitioning one. We present extensive experimental results which demonstrate that our WOEC approach outperforms state-of-the-artconsensus clustering methods and is robust to parameter settings.(2) The mean shift algorithm is a nonparametric clustering technique that does not makeassumptions on the number of clusters and on their shapes. It achieves this goal by performingkernel density estimation, and iteratively locating the local maxima of the kernel mixture. Thesetofpointsthatconvergetothesamemodedefinesacluster. Whileappealing, theperformanceof the mean shift algorithm significantly deteriorates with high dimensional data due to thesparsity of the input space. Noisy features can also disguise the mean shift procedure.Inthispaperweextendthemeanshiftalgorithmtoovercometheselimitations, whilemain-taining its desirable properties. To achieve this goal, we first estimate the relevant subspace foreachdatapoint, andthenembedsuchinformationwithinthemeanshiftalgorithm, thusavoidingcomputing distances in the full dimensional input space. The resulting approach achieves thebest-of-two-worlds: effective management of high dimensional data and noisy features, whilepreserving a nonparametric nature. Our approach can also be combined with random samplingto speedup the clustering process with large scale data, without sacrificing accuracy. Extensiveexperimental results on both synthetic and real-world data demonstrate the effectiveness of theproposed method.(3)Mean shift is a nonparametric clustering technique that does not require the number ofclusters in input and can find clusters of arbitrary shapes. While appealing, the performance ofthe mean shift algorithm is sensitive to the selection of the bandwidth, and can fail to capture thecorrect clustering structure when multiple modes exist in one cluster. DBSCAN is an efficientdensity based clustering algorithm, but it is also sensitive to its parameters and typically mergesoverlapping clusters. In this paper we propose Boosted Mean Shift Clustering (BMSC) to ad-dress these issues. BMSC partitions the data across a grid and applies mean shift locally on thecells of the grid, each providing a number of intermediate modes (iModes). A mode-boostingtechnique is proposed to select points in denser regions iteratively, and DBSCAN is utilized topartition the obtained iModes iteratively. Complexity analysis shows its potential to deal withlarge-scale data and extensive experimental results on both synthetic and real benchmark datademonstrate its effectiveness and robustness to parameter settings.

Keywords/Search Tags:

Clustering, High Dimensional Data, Ensemble Clustering, Mean Shift

PDF Full Text Request

Related items

1	Research On Clustering Algorithms For High-Dimensional Data
2	Study On Ensemble Clustering For High-dimensional Data
3	Research On Personal Recommendation Algorithm Based On Clustering Ensemble Of High-dimensional Data
4	Research On Clustering Ensemble Methods And Their Applications
5	Research On Key Technologies Of Clustering Ensemble
6	The Researches On Related To Key Technologies Among Clustering Based On High-dimensional Data Space
7	Research On The Effectiveness Element Theory And Method Of Clustering Ensemble
8	Adaptive Semi-supervised Clustering Ensemble For High Dimensional Data
9	Research And Application Of Rough Clustering Algorithm For High Dimensional Data Sets
10	Research On Key Technologies Of Clustering High-dimensional Data Based On Sparse Subspace And Their Applications