Research Of Parallel Text Spectral Clustering Algorithm Based On Spark

Posted on: 2017-11-09
Degree: Master
Type: Thesis
Country: China
Candidate: H Wu
GTID: 2348330503989901
Subject: Computer application technology
Abstract/Summary:
With the rapid development of information technology, the volume of data on the Internet is exploding, and most of it is text. At this scale, serial text clustering algorithms become a bottleneck in both storage and computing speed. Spectral clustering, which is based on graph theory, overcomes some shortcomings of traditional clustering algorithms and converges to a globally optimal solution. This thesis combines the spectral clustering algorithm with the distributed computing framework Spark to cluster large-scale text data sets. Apache Spark is a general-purpose parallel computing framework that greatly speeds up large-scale data processing through in-memory computing. The main work of this thesis is a Spark-based parallelization of the spectral clustering algorithm for text clustering. By exploiting the scalability of the Spark platform and its in-memory computing model, the spectral clustering algorithm can adapt to growing data volumes and improve the performance of text clustering. Compared with traditional clustering methods such as k-means, the experimental results show that spectral clustering achieves better precision, recall, and F-score.
Following the Spark programming model, the thesis studies and designs parallel versions of three steps: computing the similarity matrix between text vectors, computing the eigenvectors that correspond to the k smallest eigenvalues of the Laplacian matrix, and running k-means clustering on the resulting reduced-dimension feature matrix. The time complexity of each step is analyzed, as is the speedup in running time of each step on clusters of different sizes. The experimental results show that the Spark-based spectral clustering algorithm delivers good performance and clustering quality for text clustering.
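The three steps named in the abstract (similarity matrix, eigenvectors of the k smallest Laplacian eigenvalues, k-means on the reduced matrix) can be illustrated with a small serial sketch. This is not the thesis's Spark implementation, and it does not attempt any parallelization. All function names are hypothetical. The use of cosine similarity, the symmetric normalized Laplacian, row normalization of the embedding, the Jacobi eigen-solver, and farthest-first k-means seeding are assumptions chosen to keep the example self-contained; the abstract does not say which variants the thesis uses. The sketch also assumes every document vector is non-zero and shares at least one term with some other document.

```python
import math

def cosine_similarity_matrix(vectors):
    """Step 1: pairwise cosine similarity, with the diagonal left at zero."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    n = len(vectors)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            w[i][j] = w[j][i] = dot / (norms[i] * norms[j])
    return w

def normalized_laplacian(w):
    """L_sym = I - D^{-1/2} W D^{-1/2} (assumed variant, not confirmed by the source)."""
    n = len(w)
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in w]
    return [[(1.0 if i == j else 0.0) - inv_sqrt_deg[i] * w[i][j] * inv_sqrt_deg[j]
             for j in range(n)] for i in range(n)]

def jacobi_eigh(matrix, sweeps=100, tol=1e-20):
    """Step 2 helper: eigenvalues and eigenvectors (columns) of a small symmetric matrix."""
    n = len(matrix)
    a = [row[:] for row in matrix]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        if sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j) < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Rotation angle that zeroes the off-diagonal entry a[p][q].
                theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for m in (a, v):  # rotate columns p and q of both matrices
                    for r in range(n):
                        m[r][p], m[r][q] = c * m[r][p] - s * m[r][q], s * m[r][p] + c * m[r][q]
                for r in range(n):  # rotate rows p and q of a
                    a[p][r], a[q][r] = c * a[p][r] - s * a[q][r], s * a[p][r] + c * a[q][r]
    return [a[i][i] for i in range(n)], v

def kmeans(points, k, iters=20):
    """Step 3: Lloyd's k-means with deterministic farthest-first seeding."""
    def dist2(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def spectral_clustering(vectors, k):
    """Embed each document with the k smallest eigenvectors, then cluster the rows."""
    lap = normalized_laplacian(cosine_similarity_matrix(vectors))
    vals, vecs = jacobi_eigh(lap)
    smallest = sorted(range(len(vals)), key=lambda i: vals[i])[:k]
    embedding = []
    for row in vecs:
        coords = [row[i] for i in smallest]
        norm = math.sqrt(sum(x * x for x in coords)) or 1.0
        embedding.append([x / norm for x in coords])
    return kmeans(embedding, k)

# Toy term-count vectors (made up for illustration): the first three documents
# mostly use terms 0-1, the last three mostly use terms 2-3.
docs = [[3, 2, 0, 0, 1], [2, 3, 0, 0, 1], [4, 1, 0, 0, 0],
        [0, 0, 3, 2, 1], [0, 0, 2, 4, 1], [0, 0, 1, 3, 0]]
labels = spectral_clustering(docs, k=2)
```

On this toy input the first three documents should share one label and the last three the other. The similarity matrix costs O(n^2) vector products and the dense eigen-decomposition is more expensive still, which is consistent with the abstract's motivation for distributing these steps at scale.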
Keywords/Search Tags:Spark, Text Clustering, Spectral Clustering, Parallelization