Research On Text Mining Based On MapReduce

Posted on: 2016-11-08
Degree: Master
Type: Thesis
Country: China
Candidate: M He
Full Text: PDF
GTID: 2308330473954510
Subject: Statistics
Abstract/Summary:
With the rapid development of the Internet and communication technology, data has become massive, heterogeneous, and diverse. About 80% of the information on the Internet currently exists in the form of text. People often face the awkward situation of being "rich in data but poor in knowledge". The challenge today is no longer how to access information, but how to extract useful knowledge from complex, massive data quickly. Text mining, which largely resolves the problem of information clutter, can locate the information a user needs conveniently and accurately. As an important foundation and research hotspot in text mining and information retrieval, text classification has received extensive attention and made rapid progress in information retrieval, public opinion analysis, information filtering, and news classification. With the exponential growth of text data, traditional serial algorithms struggle to provide the storage space and computing power that massive text data analysis requires, which poses new problems and challenges for text classification. MapReduce emerged to address this issue through distributed, parallel processing, and has been widely recognized and studied in both academia and industry. Focusing on text classification and parallel processing, this paper reviews the state of research in both areas. On the Hadoop parallel computing platform, it implements a simple and effective algorithm: an average multinomial Naive Bayes text classification method. Through experiments on corpora of different sizes and languages, it compares this method with the general Bayesian method in terms of training time and classification performance.
Because the method reduces the impact of redundant feature information and benefits from the good scalability of parallel computing, the results indicate that it is better suited to massive text data classification, especially in cases that traditional serial algorithms cannot handle. For datasets of similar size in different languages, it performs better on the English corpus than on the Chinese corpus. In the classification performance experiments, the method was superior to the general Bayesian method in classification accuracy and also showed good speedup.
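To make the idea of training Naive Bayes with map and reduce steps concrete, the following is a minimal single-process sketch in Python. It is an illustration under stated assumptions, not the thesis's implementation: it uses standard multinomial Naive Bayes with add-one (Laplace) smoothing rather than the "average" variant, whose details the abstract does not give, and it simulates the map and reduce phases in memory instead of running on Hadoop. The function names (`map_doc`, `reduce_counts`, `train`, `classify`) and the whitespace tokenizer are hypothetical choices made for this example.

```python
import math
from collections import Counter, defaultdict
from itertools import chain


def map_doc(label, text):
    """Map step (hypothetical): emit ((label, word), 1) for every token."""
    # A key whose word is None counts documents, used later for class priors.
    yield (label, None), 1
    for word in text.lower().split():
        yield (label, word), 1


def reduce_counts(pairs):
    """Reduce step (hypothetical): sum the values that share a key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


def train(docs):
    """Run map, then reduce, over (label, text) pairs and build the model."""
    mapped = chain.from_iterable(map_doc(label, text) for label, text in docs)
    totals = reduce_counts(mapped)
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    for (label, word), n in totals.items():
        if word is None:
            doc_counts[label] = n
        else:
            word_counts[label][word] = n
    vocab = {w for counts in word_counts.values() for w in counts}
    return doc_counts, word_counts, vocab


def classify(model, text):
    """Return the label with the highest smoothed log-posterior score."""
    doc_counts, word_counts, vocab = model
    n_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / n_docs)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are skipped
                count = word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy corpus, invented purely for illustration.
docs = [
    ("sports", "ball game team win"),
    ("sports", "team score goal"),
    ("tech", "code software computer"),
    ("tech", "computer network code"),
]
model = train(docs)
print(classify(model, "team win goal"))
```

Because the map step emits independent key-value pairs and the reduce step only sums values per key, the counting work could in principle be split across many machines, which is the property that makes this classifier a natural fit for MapReduce.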
Keywords/Search Tags: Data mining, Text classification, Naive Bayes, Parallel computing, Big data