Font Size: a A A

Research And Implementation Of Parallel Clustering Algorithm Based On Approximate Spectrum Hadoop MapReduce

Posted on:2015-02-10Degree:MasterType:Thesis
Country:ChinaCandidate:Y YangFull Text:PDF
GTID:2268330422967672Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
With the increasing exponentially of the scale of data on the Internet, spectralclustering suffers from a new challenge in both computational time and memory usefor large-scale high-dimensional data. Based on the challenge, Hadoop MapReduceparallel approximate spectral clustering algorithm starts to embrace. First of andforemost, the thesis focuses on the process of constructing sparse similarity matrix inapproximate spectral clustering algorithm. Investigate in-depthly the t nearestneighbors sparsifying similarity matrix method and Nystr m low-rank submatrixsampling similarity matrix method insight into approximate similarity matrix and,Focusing on the t nearest sparse neighbor similarity matrix due to using subjectiveparameter t which is set so oversize that it expands the scope of neighbors, resultingin the inaccuracy impact of quality of approximate spectral clustering algorithmresulting from outliers in sparse matrix similar. Therefore, render a novel outliersoptimization solution based on the t nearest neighbors similarity matrix. By thetheoretical proofing the t nearest neighbors similarity matrix which contains theoutliers exist an optimization solution, and the optimization solution was applied toapproximate spectral clustering algorithm and, proposed optimized approximatespectral clustering algorithm in order to improve the quality of approximate spectralclustering for large-scale high-dimensional data. Additionally, for the sake ofavoiding the empty clusters’ generations and obtaining the non-optimal clusteringsolutions for clustering large-scale high-dimensional data, the thesis impose on nearest neighbors rough set model to choose k-means initialization clustering centerposition in the designed approximate spectral clustering algorithm. Through auxiliaryand comparative experiments between the above-mentioned approximate spectralclustering algorithm and the state-of-the-art based on orthogonalization Nystr mlow-rank submatrix sampling approximate similarity matrix spectral clusteringalgorithm along with t nearest neighbors sparse approximate similarity matrix spectralclustering algorithm, experimental results show that although the approximate similarity matrix optimization time remarkably high, but its clustering accuracyscores outperforms the latter’s.Under the Hadoop Distributed File System MapReduce parallel computingprogramming model, this thesis is of paramount importance to design and implementapproximate spectral clustering algorithm for clustering large-scale high-dimensionaldata. By exhausting the Mapper and Reducer parallel programming processes andparallel algorithms interdependent steps decomposition in Hadoop MapReduce, thethesis abundantly investigate and detailedly design the parallel strategy as to based onMapReduce optimization outliers of t nearest neighbors approximate similarity matrix,Laplacian eigenvalue decomposition and based on nearest neighbors’ choice ofk-means initialization clustering center position respectively and, as well as mapfunction and reduce function, of which the designed parallel strategies and dependentsteps decomposition perhaps regard as reference ideas for analyzing large-scalehigh-dimensional data in fields of machine learning, data mining, pattern recognition,information retrieval, web log data analysis, computer vision, medical imaging, signaland image processing and bioinformatics, etc. In addition, analyze their timecomplexity before and after Hadoop MapReduce parallel. With the Hadoop clustercomposed of12Dell server (Dell2161DS), using Bag of Words datasets (UCI) tovalidate the performance and clustering quality of the designed MapReduceapproximate spectral clustering algorithm. Experimental results show that thedesigned parallel approximate spectral clustering algorithm achieving some of theexpected assumptions, the representative spectral clustering evaluation criteria used inparallel experiment also further confirms the correctness and effectiveness of thedesigned parallel approximate spectral clustering algorithm in dealing withlarge-scale high-dimensional data.
Keywords/Search Tags:Hadoop distributed system, MapReduce parallel computing, Approximate spectral clustering algorithm, Approximate similarity matrix, Eigenvalue decomposition, k-means initialization methods, Large-scale highdimensional datasets
PDF Full Text Request
Related items