
Research On Parallelization Of Text Clustering Based On Hadoop Cloud Computing Platform

Posted on: 2019-06-06
Degree: Master
Type: Thesis
Country: China
Candidate: R Q Zhang
GTID: 2428330545454465
Subject: Computer application technology
Abstract/Summary:
With the popularity of the Internet, the number of network texts has exploded, and it is of great significance to find valuable information in large-scale data quickly and efficiently. Discovering useful knowledge in text is an important branch of data mining: it uses an appropriate text representation model to represent texts and clusters semantically similar texts together. Because the single-machine serial programming model performs poorly when clustering massive data, this thesis combines big data technology with text clustering technology to process text data. Distributed storage and computation of the text data are implemented on Hadoop, and both text vectorization and clustering are parallelized using the MapReduce programming model. The idea of the MapReduce framework can be briefly summarized as segmentation and reduction: the text data is divided into data blocks, which are stored in HDFS (the Hadoop Distributed File System); each worker node in the cluster processes the split data blocks in parallel; and the results of the parallel processing are combined and stored back in HDFS.

The traditional k-means algorithm is a typical algorithm for solving clustering problems and scales well to large data sets. However, its initial centers are chosen randomly, so the result of each run of the algorithm is unstable. To solve this problem, an initial center optimization method based on the ideas of density segmentation and sampling is proposed. The data set is sampled in parallel to obtain a sample set. The maximum and minimum values of each sample in the sample set are searched to find the best candidate cluster centers. Data objects are then merged in parallel, and noise is eliminated according to the density of the candidate centers. The densest objects and the remaining objects within the specified range are merged into clusters, and the centers of these clusters are the output of the initial center optimization selection strategy. The selected initial cluster centers replace the center points randomly selected by the k-means algorithm, and the clustering algorithm itself is parallelized.

Experiments show that the improved k-means algorithm can effectively reduce the number of iterations. Using cluster quality and efficiency as evaluation criteria, the clustering results of the parallel k-means algorithm based on initial center optimization are compared with those of other clustering algorithms. The improved parallel k-means algorithm proposed in this thesis shows advantages in cluster quality, efficiency, and parallel performance.
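The MapReduce formulation of k-means described in the abstract can be illustrated with a minimal single-process sketch. It is not the thesis's implementation: the function names (`map_assign`, `combine`, `reduce_centers`, `kmeans`), the Euclidean distance, and the convergence tolerance are all assumptions made for illustration, and the data blocks are plain Python lists rather than HDFS splits. The initial centers are passed in as an argument, which is where centers chosen by an initial center optimization step would be supplied instead of random points.

```python
from math import dist  # Euclidean distance, Python 3.8+


def map_assign(point, centers):
    """Map step: emit (index of nearest center, (point, 1))."""
    nearest = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
    return nearest, (point, 1)


def combine(pairs):
    """Shuffle/combine step: sum the vectors and counts for each center index."""
    sums = {}
    for idx, (vec, n) in pairs:
        if idx in sums:
            acc, cnt = sums[idx]
            sums[idx] = ([a + b for a, b in zip(acc, vec)], cnt + n)
        else:
            sums[idx] = (list(vec), n)
    return sums


def reduce_centers(sums, centers):
    """Reduce step: new center = mean of assigned points.

    A center with no assigned points keeps its previous position.
    """
    new_centers = list(centers)
    for idx, (acc, cnt) in sums.items():
        new_centers[idx] = tuple(a / cnt for a in acc)
    return new_centers


def kmeans(blocks, centers, max_iter=20, tol=1e-6):
    """Iterate map/reduce rounds until no center moves more than tol.

    `blocks` stands in for the data splits that worker nodes would
    process in parallel; here they are processed sequentially.
    """
    centers = [tuple(c) for c in centers]
    iteration = 0
    for iteration in range(1, max_iter + 1):
        pairs = [map_assign(p, centers) for block in blocks for p in block]
        new_centers = reduce_centers(combine(pairs), centers)
        shift = max(dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, iteration
```

Calling `kmeans(blocks, initial_centers)` returns the final centers and the number of rounds used, so the effect of different initial centers on the iteration count can be compared directly. On a real Hadoop cluster each round would be a separate MapReduce job, which is why reducing the number of iterations matters for overall run time.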
Keywords/Search Tags: Text Clustering, MapReduce, Parallel Computing, K-means