Parallel Text Clustering Based On MapReduce

Posted on: 2015-05-19
Degree: Master
Type: Thesis
Country: China
Candidate: X S Yu
Full Text: PDF
GTID: 2298330422989795
Subject: Computer application technology
Abstract/Summary:
Text clustering has important applications, and in some application areas the volume of data it must process is growing very quickly. Large-scale data requires efficient large-scale analysis technology. The traditional sequential programming model scales poorly, so it cannot meet the computing and storage demands of large-scale data processing. Distributed computing technology, represented by MapReduce, scales well: it can greatly improve the efficiency of data-intensive algorithms and make full use of computing clusters built on commodity hardware.

The emergence of MapReduce distributed computing frameworks has greatly lowered the barrier to parallel computing, and the MapReduce programming model, with its excellent architectural design, has become the best choice for large-scale data processing. To address the poor scalability of traditional hierarchical clustering on large-scale corpora, this thesis proposes a parallel hierarchical text clustering algorithm based on MapReduce. Because the traditional hierarchical clustering algorithm was designed for the sequential programming model, parallelizing it requires considering the differences between the sequential and distributed programming models and making full use of the characteristics of the distributed parallel computing platform. The main contents of this thesis are summarized as follows:

1) A further study of the MapReduce distributed computing framework, including its data distribution strategy, its sorting features, and the conditions a traditional sequential algorithm must satisfy to be ported to the MapReduce programming model. The most important open-source implementation of MapReduce, which is one of the key modules of Hadoop, is analyzed in detail.

2) An in-depth analysis of the key technologies for MapReduce parallelization of text clustering. The key steps of text vectorization, including text segmentation, feature selection, and feature weighting, are redesigned, laying a solid foundation for MapReduce parallelization of the whole text clustering process.

3) Because the hierarchical text clustering algorithm is difficult to parallelize, a parallel hierarchical clustering algorithm based on data partitioning is proposed, completing the parallelization of hierarchical text clustering. The parallel algorithm introduces data partitioning into the traditional hierarchical clustering algorithm and uses the sorting characteristics of MapReduce together with the secondary sort technique to select the merge point efficiently. Data partitioning adopts a vertical partitioning algorithm based on group statistics of text feature vector components, because it is simple and efficient and divides large-scale data effectively.

4) The critical steps of these algorithms are implemented and evaluated on two different data sets. A small Hadoop cluster is built first; the accuracy and parallel performance of the clustering algorithms are then verified through five major experiments; finally, the influence of several important configuration parameters and of the input mode of the data set is discussed. Experimental results show that the parallel text clustering algorithm based on MapReduce can effectively cluster large-scale text with good scalability.
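The abstract does not give the details of how secondary sorting selects the merge point, so the following is only a minimal single-process sketch of the general secondary sort idea, not the thesis's implementation. It assumes that candidate document pairs are keyed by a composite key (partition id, distance), that records are sorted on the full key but grouped on the partition id alone, and that the first record in each group is therefore the closest pair. The toy data, the Euclidean distance, and all function names are hypothetical.

```python
from itertools import groupby
from math import sqrt

# Hypothetical toy input: (partition_id, doc_id, feature_vector).
DOCS = [
    (0, "d1", (1.0, 0.0)),
    (0, "d2", (0.9, 0.1)),
    (0, "d3", (0.0, 1.0)),
    (1, "d4", (0.5, 0.5)),
    (1, "d5", (0.0, 0.2)),
    (1, "d6", (0.4, 0.5)),
]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (an assumed metric)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def emit_pairs(records):
    """Emit ((partition_id, distance), (doc_a, doc_b)) for every pair in a partition.

    The composite key carries the distance so that the framework's sort,
    not the reducer, puts the closest pair first.
    """
    by_partition = {}
    for pid, doc_id, vec in records:
        by_partition.setdefault(pid, []).append((doc_id, vec))
    for pid, members in by_partition.items():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                (a, va), (b, vb) = members[i], members[j]
                yield (pid, euclidean(va, vb)), (a, b)


def shuffle(pairs):
    """Simulate the shuffle: sort on the full composite key, group on partition id only."""
    ordered = sorted(pairs, key=lambda kv: kv[0])
    for pid, group in groupby(ordered, key=lambda kv: kv[0][0]):
        yield pid, [(key[1], value) for key, value in group]


def reduce_merge_point(pid, sorted_values):
    """Values arrive sorted by distance, so the first one is the merge candidate."""
    distance, pair = sorted_values[0]
    return pid, pair, distance


merge_points = [reduce_merge_point(pid, vals) for pid, vals in shuffle(emit_pairs(DOCS))]
for pid, pair, distance in merge_points:
    print(pid, pair, round(distance, 4))
```

In a real Hadoop job the same effect is usually obtained with a custom partitioner and grouping comparator on the natural key plus a sort comparator on the composite key; here `sorted` and `groupby` stand in for that machinery.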
Keywords/Search Tags: Text clustering, Hierarchical clustering, Data partitioning, MapReduce, Parallel computing