Research On Text Mining Based On MapReduce

Posted on:2016-11-08

Degree:Master

Type:Thesis

Country:China

Candidate:M He

Full Text:PDF

GTID:2308330473954510

Subject:Statistics

Abstract/Summary:

With the rapid development of Internet and communication technology, data has become massive, heterogeneous, and diverse. About 80% of the information on the Internet currently exists in the form of text. People often face the awkward situation of being "rich in data but poor in knowledge". The challenge today is no longer how to access information, but how to extract useful knowledge from complex, massive data quickly. Text mining, which largely resolves the problem of information clutter, can locate the information a user needs conveniently and accurately. As an important foundation and a research hotspot in text mining and information retrieval, text classification has received extensive attention and made rapid progress in information retrieval, public opinion analysis, information filtering, and news classification.

With the exponential growth of text data, traditional serial algorithms struggle to provide the storage space and computing capacity that massive text data analysis requires, which poses new problems and challenges for text classification. MapReduce emerged to address this issue through distributed and parallel processing, and it has been widely recognized and studied in both academia and industry.

Focusing on text classification and parallel processing, this paper surveys the current state of both areas. On the Hadoop parallel computing platform, it implements a simple and effective algorithm: an average multinomial Naive Bayes method for text classification. Through experiments on corpora of different sizes and languages, the paper compares this method with the general Bayesian method in terms of training time and classification performance.

Because it reduces the impact of redundant feature information and benefits from the good scalability of parallel computing, the results indicate that the method is better suited to massive text data classification, especially in cases that traditional serial algorithms cannot handle. For datasets of similar size in different languages, it performs better on the English corpus than on the Chinese corpus. In the classification performance experiments, the method was superior to the general Bayesian method in classification accuracy and also showed good speedup.
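
As an illustration only, the counting stage of a standard multinomial Naive Bayes classifier can be written as a map step and a reduce step. The sketch below runs in a single Python process rather than on Hadoop, assumes whitespace tokenization and Laplace (add-one) smoothing, and does not implement the "average" variant named in the abstract, whose details are not given here. All function names (`map_doc`, `reduce_counts`, `train`, `classify`) and the toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import chain


def map_doc(label, text):
    """Map step: emit one document-count pair and one pair per token."""
    yield ("doc", label, None), 1
    for word in text.lower().split():
        yield ("word", label, word), 1


def reduce_counts(pairs):
    """Reduce step: sum the values of each key, as a reducer would per key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


def train(docs):
    """Build per-class document counts, per-class word counts, and the vocabulary."""
    totals = reduce_counts(chain.from_iterable(map_doc(l, t) for l, t in docs))
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    for (kind, label, word), n in totals.items():
        if kind == "doc":
            doc_counts[label] = n
        else:
            word_counts[label][word] = n
    vocab = {w for counts in word_counts.values() for w in counts}
    return doc_counts, word_counts, vocab


def classify(model, text):
    """Return the class with the highest log posterior (add-one smoothing)."""
    doc_counts, word_counts, vocab = model
    n_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(doc_counts):
        total = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / n_docs)
        for word in text.lower().split():
            if word in vocab:  # words unseen in training are ignored
                count = word_counts[label][word]
                score += math.log((count + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy corpus, invented for this sketch.
docs = [
    ("sport", "football match goal"),
    ("sport", "goal team match"),
    ("tech", "hadoop cluster node"),
    ("tech", "mapreduce hadoop job"),
]
model = train(docs)
print(classify(model, "hadoop job cluster"))
```

Because the map step only emits key and count pairs and the reduce step only sums them, the same counting logic could in principle be split across many workers, which is the property that makes Naive Bayes training a natural fit for MapReduce.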

Keywords/Search Tags:

Data mining, Text classification, Naive Bayes, Parallel computing, Big data

Related items

1	Data Mining Systems And Their Applications - Improve The Performance Of The Naive Bayes Text Classifier, Associated Characteristics
2	Research And Application On Naive Bayes Classification Algorithm
3	Research And Implement On Data Mining Algorithm Parallel Based On Hadoop
4	Research And Application On The Technology Of Web Text Mining
5	Research Of Chinese Text Classification Based On Naive Bayesian Method And Application Of Microblogging Data Classification
6	Research On Text Classification Algorithm Based On Naive Bayes Method
7	Text Categorization Based On Naive Bayes Method
8	Research On Improved Naive Bayes Classification Model For Imbalanced E-commerce Review Text
9	Research And Improvement Of Attribute Weighted Naive Bayes Classification Algorithm
10	The Research And Implementation Of Parallel Algorithm For Bayesian Text Classification Based Spark Computing Environment