Font Size: a A A

Spectral Clustering In Distributed Systems

Posted on:2014-02-04Degree:MasterType:Thesis
Country:ChinaCandidate:C MengFull Text:PDF
GTID:2248330398972135Subject:Detection Technology and Automation
Abstract/Summary:PDF Full Text Request
Cluster analysis is a basic understanding activity in human life and is a classic problem in machine learning. The clustering algorithms gather different things into a class according to certain attribute of things, so that the similarity between the classes as small as possible, within the class as great as possible. K-means clustering algorithm is a classical and popular algorithm as a center-based clustering algorithm. When the data set distributed for the convex spherical, K-means algorithm has good performance and be able to get a better clustering results. However, when the sample space is convex, K-means algorithm often fails, and the algorithm using iterative optimization method for solving the optimal solution, so the algorithm will fall into the local optimal solution.Recent years, An unsupervised clustering algorithm that spectral clustering algorithm emerge for clustering in the sample space of any shape, overcome the K-means algorithm emerging into a local optimum the shortcomings of the solution, and can converge to the global optimal solution. The algorithm has the ability to identify the sample space of arbitrary shape, and will not fall into the local optimal solution. The application in practical problems is well for the algorithm. However, spectral clustring suffers from a problem in both memory use and computational time when the size of data set is large. To perform clustering on large data sets, We can transplanted the algorithm in a distributed environment, at the same time we use different sparse matrix methods to reduce the use of memory space.This article focuses on how to achieve efficient spectral clustering algorithm based on a distributed environment. Specific content including:1. Thinning similarity matrix using two different methods and compare two different sparse methods. The two methods are t nearest neighbor method and Nystrom method, in order to select a more excellent method; the two methods were compared from different angles. Finally, we found that the use of the t nearest neighbor method can get better clustering results verified by experiments.2. By the t nearest neighbor to the similarity matrix sparse, we can use the MPI model and Google’s MapReduce model to build a distributed environment. After that we transplant spectral clustering algorithm to distributed platforms. The results show that the algorithm is able to fully utilize the resources of each node to improve the operating efficiency of the algorithm and has good scalability. It provides a good solution for the application of massive data.
Keywords/Search Tags:spectral clustering, distributed computing, Nystromalgorithm
PDF Full Text Request
Related items