Study Of Parallelized Text Mining Algorithm Based On Cloud Computing Framework

Posted on:2016-04-03

Degree:Master

Type:Thesis

Country:China

Candidate:J Y Teng

Full Text:PDF

GTID:2308330479985769

Subject:Information and Communication Engineering

Abstract/Summary:

With the development of the information society, network data is growing rapidly, and most of it exists in the form of text. How to mine useful information from massive text collections within an acceptable time has become a research hotspot, so parallelized text mining has attracted increasing attention. In recent years, many parallelized text mining algorithms have been implemented on MapReduce and can handle large-scale text, but these methods still suffer from poor parallelization efficiency and are difficult to implement. This thesis proposes a parallel algorithm for large-scale text mining based on Spark, a new-generation big data processing architecture. Its main purpose is to improve the efficiency of text mining while preserving effectiveness.

Text clustering and classification are the foundation and core of text mining technology. Because traditional text mining algorithms process massive text slowly, or cannot process it at all, the work completed in this thesis is summarized as follows:

(1) Large-scale parallel data processing technology is studied, including an analysis of the distributed file system and MapReduce model of the traditional parallel framework Hadoop, and a study of the key technologies of the new-generation parallel computing system Spark: resilient distributed datasets and the programming model.
(2) Technologies related to text clustering and classification are studied and most of them are introduced, with text preprocessing described in detail.
(3) Parallel K-Means clustering and Naive Bayes text categorization are designed and implemented on the Spark programming framework. In addition, the system is optimized and compared with an implementation based on Hadoop.

Finally, experimental results on a cluster show that the parallelized text mining algorithms in this thesis improve the efficiency of large-scale text mining while maintaining validity and accuracy, and that they offer high reliability and easy scalability. In addition, compared with the Hadoop-based experiments, the Spark-based algorithms perform better on the main performance indicators (speedup, scalability, run time, etc.), which supports the validity of this work.
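To make the idea of parallelized K-Means concrete, the following is a minimal sketch in plain Python, not the thesis's implementation and not Spark code. It assumes the data is already split into partitions and expresses one iteration as a per-partition map step (partial sums and counts per cluster) followed by a reduce step that merges the partial results. The function names `nearest`, `map_partition`, `merge`, and `kmeans` are hypothetical and chosen only for illustration.

```python
from functools import reduce


def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_partition(points, centroids):
    """Map step: per-partition partial (sum vector, count), keyed by cluster index."""
    partial = {}
    for point in points:
        k = nearest(point, centroids)
        total, count = partial.get(k, ([0.0] * len(point), 0))
        partial[k] = ([a + b for a, b in zip(total, point)], count + 1)
    return partial


def merge(left, right):
    """Reduce step: combine the partial results of two partitions."""
    out = dict(left)
    for k, (total, count) in right.items():
        if k in out:
            total0, count0 = out[k]
            out[k] = ([x + y for x, y in zip(total0, total)], count0 + count)
        else:
            out[k] = (total, count)
    return out


def kmeans(partitions, centroids, iterations=10):
    """Run a fixed number of K-Means iterations over pre-partitioned data."""
    for _ in range(iterations):
        # In a distributed setting each partition would be processed by a
        # different worker; here the partitions are processed sequentially.
        partials = [map_partition(part, centroids) for part in partitions]
        totals = reduce(merge, partials, {})
        # Recompute each centroid as the mean of its assigned points;
        # a centroid with no assigned points is left unchanged.
        centroids = [
            [x / totals[k][1] for x in totals[k][0]] if k in totals else centroids[k]
            for k in range(len(centroids))
        ]
    return centroids
```

Only the small per-cluster sums and counts cross partition boundaries, which is what makes this formulation suit frameworks such as MapReduce or Spark. How the thesis maps these steps onto resilient distributed datasets is not stated in the abstract.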

Keywords/Search Tags:

Text Mining, Parallelization, K-Means, Naive Bayes, Spark

Related items

1	Data Mining Systems And Their Applications - Improve The Performance Of The Naive Bayes Text Classifier, Associated Characteristics
2	Text Categorization Based On Naive Bayes Method
3	Research On Web Text Classification Algorithm Based On Parallelism
4	Research On The Text Categorization Based On Spark
5	The Parallelization And Optimization Of K-means Algorithm Based On Spark
6	Research Of Sentiment Analysis In Text Based On Spark
7	A Text Classifier About High Blood Pressure Based On Naive Bayes
8	Analysis Of Laptop Network Scoring Based On Text Mining
9	Research On Text Sentiment Analysis Via Spark And Machine Learning
10	Research On Text Classification Algorithm Based On Naive Bayes Method