Research On Parallelization Of Text Clustering Based On Hadoop

Posted on: 2017-02-27
Degree: Master
Type: Thesis
Country: China
Candidate: F M Cui
Full Text: PDF
GTID: 2308330503485261
Subject: Communication and Information System
Abstract/Summary:
Text is one of the most important information carriers on the Internet. As the network develops rapidly and the volume of text keeps growing, obtaining valuable information from massive text collections quickly and efficiently has important practical significance. As an important text mining technology, text clustering can automatically discover deep knowledge hidden in texts, providing an effective way to obtain text information. However, traditional serial text clustering cannot meet the requirements of large-scale text processing in either efficiency or scalability, while the development of cloud computing technology provides an effective solution.

Hadoop, currently the most widely used distributed cloud computing platform, can process large-scale data sets efficiently, reliably, and scalably in a distributed way. It uses HDFS to store data and MapReduce to process data in parallel. Hadoop allows users to build a cost-effective computing cluster easily. In addition, parallel programs built on it are simpler to design and scale better than traditional ones.

To improve the capability of text clustering to process large-scale text data, this thesis combines text clustering with Hadoop to implement distributed parallel text clustering. Based on an analysis of text clustering and Hadoop technologies, and following the characteristics and process of text clustering, parallelization is implemented in two stages: text preprocessing and the clustering algorithm. First, techniques applicable to text preprocessing are analyzed and compared to determine the preprocessing method and workflow, and parallel text preprocessing is designed and implemented using the MapReduce processing model. Second, K-means, Canopy-K-means, and MMK-means are each used as the text clustering algorithm, and these algorithms are parallelized on Hadoop. Finally, a Hadoop cluster environment is built and two experiments are conducted: one tests and analyzes the running efficiency of parallel text clustering, and the other compares the performance of the three parallel clustering algorithms. The experimental results show that parallel text clustering based on Hadoop is highly effective and scalable, and that parallel MMK-means achieves higher efficiency and better clustering quality than the other two algorithms.
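
The abstract does not give the thesis's implementation, so the following is only a minimal sketch of how a single K-means iteration is commonly expressed in the MapReduce model: the map step assigns each document vector to its nearest centroid, and the reduce step averages the vectors in each cluster. It is plain Python rather than Hadoop code, and the function names, the dense-list vectors, and the squared Euclidean distance are all assumptions made for illustration.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(points, centroids):
    """Map step (hypothetical name): emit (cluster_id, (vector, 1)) per document."""
    for point in points:
        yield nearest(point, centroids), (point, 1)


def reduce_phase(pairs, centroids):
    """Reduce step (hypothetical name): average the vectors of each cluster."""
    sums = {}
    counts = defaultdict(int)
    for cluster_id, (vector, n) in pairs:
        if cluster_id in sums:
            sums[cluster_id] = [a + b for a, b in zip(sums[cluster_id], vector)]
        else:
            sums[cluster_id] = list(vector)
        counts[cluster_id] += n
    # A cluster that received no points keeps its previous centroid.
    return [
        [s / counts[i] for s in sums[i]] if i in sums else list(centroids[i])
        for i in range(len(centroids))
    ]


def kmeans_iteration(points, centroids):
    """Run one map/reduce round and return the updated centroids."""
    return reduce_phase(map_phase(points, centroids), centroids)
```

In a real Hadoop job the map and reduce steps would run on separate nodes over HDFS splits, and the driver would repeat the iteration until the centroids stop changing.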
Keywords/Search Tags:text cluster, Hadoop, distributed computing, MMK-means