
Study Of Parallelized Text Mining Algorithm Based On Cloud Computing Framework

Posted on: 2016-04-03
Degree: Master
Type: Thesis
Country: China
Candidate: J Y Teng
Full Text: PDF
GTID: 2308330479985769
Subject: Information and Communication Engineering
Abstract/Summary:
With the development of the information society, network data is growing rapidly, and most of it exists as text. How to mine useful information from massive text collections within an acceptable time has become a research hotspot, and parallelized text mining has therefore attracted increasing attention. In recent years, many parallelized text mining algorithms have been implemented on MapReduce and can handle large-scale text. However, these methods still suffer from problems such as poor parallelization efficiency and algorithms that are difficult to implement. This thesis proposes a novel parallel algorithm for large-scale text mining based on Spark, a new generation of big data processing architecture. Its main purpose is to improve the efficiency of text mining while preserving effectiveness.

Text clustering and classification are the foundation and core of text mining technology. Because traditional text mining algorithms process massive text slowly, or cannot process it at all, the work of this thesis is summarized as follows:

(1) Large-scale parallel data processing technology is studied. This includes an analysis of the distributed file system and MapReduce model of the traditional parallel framework Hadoop, and a study of the key technologies of the new-generation parallel computing system Spark: resilient distributed datasets and its programming model.
(2) Technologies related to text clustering and classification are studied and most of them are introduced, with text preprocessing described in detail.
(3) Parallel K-Means clustering and Naive Bayes text categorization are designed and implemented on the Spark programming framework. The system is also optimized and compared with an implementation based on Hadoop.

Finally, experimental results on a cluster show that the parallelized text mining algorithms in this thesis improve the efficiency of large-scale text mining while maintaining validity and accuracy, and that they offer high reliability and easy scalability. In addition, compared with the Hadoop-based experiments, the Spark-based algorithms show outstanding performance on the main performance indicators (speedup, scalability, run time, etc.), which supports the validity of this work.
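To make the idea of parallel K-Means concrete, the following is a minimal sketch in plain Python. It is not the thesis's Spark implementation. It only illustrates the general pattern such designs tend to follow: each data partition computes partial per-cluster sums (a map step), and the partial results are merged to update the centroids (a reduce step). All function names here (`closest`, `map_partition`, `merge`, `kmeans_step`, `kmeans`) are hypothetical, and the partitions are simulated as ordinary Python lists rather than resilient distributed datasets.

```python
from functools import reduce


def closest(point, centroids):
    """Return the index of the centroid nearest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_partition(points, centroids):
    """Map step: for one partition, accumulate (vector sum, count) per cluster index."""
    partial = {}
    for pt in points:
        k = closest(pt, centroids)
        s, n = partial.get(k, ([0.0] * len(pt), 0))
        partial[k] = ([a + b for a, b in zip(s, pt)], n + 1)
    return partial


def merge(a, b):
    """Reduce step: combine the partial sums produced by two partitions."""
    out = dict(a)
    for k, (s, n) in b.items():
        if k in out:
            s0, n0 = out[k]
            out[k] = ([x + y for x, y in zip(s0, s)], n0 + n)
        else:
            out[k] = (s, n)
    return out


def kmeans_step(partitions, centroids):
    """One iteration: map over partitions, reduce the partials, recompute centroids."""
    totals = reduce(merge, (map_partition(p, centroids) for p in partitions), {})
    # A cluster that received no points keeps its previous centroid (an assumption).
    return [
        [x / totals[k][1] for x in totals[k][0]] if k in totals else list(centroids[k])
        for k in range(len(centroids))
    ]


def kmeans(partitions, centroids, iterations=10):
    """Run a fixed number of iterations (no convergence test, for brevity)."""
    for _ in range(iterations):
        centroids = kmeans_step(partitions, centroids)
    return centroids
```

In a Spark setting, `map_partition` would correspond roughly to an RDD `mapPartitions` call and `merge` to the function passed to `reduce` or `reduceByKey`. Only the small per-cluster sums cross partition boundaries, which is what makes the update step cheap to parallelize.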
Keywords/Search Tags: Text Mining, Parallelization, K-Means, Naive Bayes, Spark