Research On Parallelization Of Text Clustering Based On Hadoop

Posted on:2017-02-27

Degree:Master

Type:Thesis

Country:China

Candidate:F M Cui

Full Text:PDF

GTID:2308330503485261

Subject:Communication and Information System

Abstract/Summary:

Text is one of the most important information carriers on the Internet. As networks develop rapidly and the volume of text keeps growing, obtaining valuable information from massive text collections quickly and efficiently has important practical significance. As an important text mining technology, text clustering can automatically discover knowledge hidden in texts, which provides an effective way to obtain textual information. However, traditional serial text clustering cannot meet the requirements of large-scale text processing in either efficiency or scalability, and the development of cloud computing technology offers an effective solution.

Hadoop, the most widely used distributed cloud computing platform at present, can process large-scale data sets efficiently, reliably, and scalably in a distributed manner. It uses HDFS to store data and MapReduce to process data in parallel. Hadoop allows users to build a cost-effective computing cluster easily. In addition, parallel programs built on it are simpler to design and scale better than traditional ones.

To improve the capacity of text clustering to handle large-scale text data, this thesis combines text clustering with Hadoop to implement distributed parallel text clustering. Based on an analysis of text clustering and Hadoop technologies, and following the characteristics and workflow of text clustering, parallelization is implemented in two stages: text preprocessing and the clustering algorithm. First, the techniques applicable to text preprocessing are analyzed and compared to determine the preprocessing method and workflow, and parallel text preprocessing is designed and implemented using the MapReduce processing model. Second, K-means, Canopy-K-means, and MMK-means are each used as the text clustering algorithm, and each is parallelized on Hadoop.

Finally, a Hadoop cluster environment is built and two experiments are conducted: one tests and analyzes the running efficiency of parallel text clustering, and the other compares the performance of the three parallel clustering algorithms. The experimental results show that Hadoop-based parallel text clustering is highly effective and scalable, and that parallel MMK-means achieves higher efficiency and better clustering quality than the other two algorithms.
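The abstract does not give implementation details, so the following is only a minimal single-process sketch of how one K-means iteration is commonly split into a map step (assign each document vector to its nearest centroid) and a reduce step (average the vectors in each cluster). It is plain Python, not Hadoop code, and every name in it (`nearest`, `map_phase`, `reduce_phase`, `kmeans`) is hypothetical and does not come from the thesis. Documents are assumed to be already converted to numeric vectors by the preprocessing stage.

```python
import math
from collections import defaultdict


def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean distance)."""
    best, best_d = 0, math.inf
    for i, c in enumerate(centroids):
        d = sum((p - q) ** 2 for p, q in zip(point, c))
        if d < best_d:
            best, best_d = i, d
    return best


def map_phase(points, centroids):
    """Map step: emit (cluster_id, (vector, 1)) for every document vector."""
    for p in points:
        yield nearest(p, centroids), (p, 1)


def reduce_phase(pairs, centroids):
    """Reduce step: sum the vectors of each cluster and divide by the count."""
    sums = {}
    counts = defaultdict(int)
    for cid, (vec, n) in pairs:
        if cid not in sums:
            sums[cid] = list(vec)
        else:
            sums[cid] = [a + b for a, b in zip(sums[cid], vec)]
        counts[cid] += n
    # A cluster that received no points keeps its previous centroid.
    new_centroids = [list(c) for c in centroids]
    for cid, total in sums.items():
        new_centroids[cid] = [x / counts[cid] for x in total]
    return new_centroids


def kmeans(points, centroids, max_iter=20, tol=1e-9):
    """Repeat map + reduce until the centroids stop moving or max_iter is reached."""
    for _ in range(max_iter):
        new_centroids = reduce_phase(map_phase(points, centroids), centroids)
        shift = max(math.dist(a, b) for a, b in zip(new_centroids, centroids))
        centroids = new_centroids
        if shift < tol:
            break
    return centroids
```

On a real cluster, each iteration would typically run as one MapReduce job, with the current centroids distributed to all mappers. This sketch only mirrors that data flow in memory.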

Keywords/Search Tags:

text cluster, Hadoop, distributed computing, MMK-means

Related items

1	Research On The Hadoop-based Distributed Full-text Retrieval And Related Technologies
2	The Research And Application Of Distributed System Based On Hadoop
3	A Research And Implementation With Improved K-Means Clustering Algorithm To Image Retrieval System Based On Hadoop Platform
4	Research And Application Of Text Mining Based On Hadoop
5	Design And Implementation Of Distributed Text Clustering System Based On K-means
6	Research On Parallelization Of Text Clustering Based On Hadoop Cloud Computing Platform
7	Distributed SVM Algorithm With K-means
8	The Research And Design Of Distributed Data Mining System Based On Hadoop
9	The Research And Development Of Distributed Web Text Retrieval System Based On Hadoop
10	Design And Implementation Of Distributed Clustering Framework Based On Model Fusion