Research and Implementation of Parallel Data Mining Algorithms Based on Hadoop

Posted on: 2015-08-18
Degree: Master
Type: Thesis
Country: China
Candidate: Z L Wu
Full Text: PDF
GTID: 2298330467963265
Subject: Computer Science and Technology
Abstract/Summary:
With the spread of Internet and cloud computing technology, Internet companies that provide network services must handle data that is generated daily and growing explosively. Massive data increasingly surrounds us. This growth brings great value but also a huge challenge: how to analyze data and mine the valuable information hidden in it has become a focus for many large companies.

Automated processing of large-scale document collections is one of the more closely watched areas of massive data processing. By classifying text data, companies can organize digital resources effectively and ensure that those resources are fully searchable and fully used to meet users' demand for information services. However, the text data that Internet companies generate is massive and complex, and it is growing rapidly, so traditional methods alone can no longer meet these needs. How to classify massive text efficiently and mine valuable information from it is the concern of this thesis.

Hadoop is one of the most popular open-source frameworks for distributed processing of massive data. Its main components are HDFS and MapReduce. HDFS is the distributed file system that serves a Hadoop cluster, and MapReduce is a distributed computing framework. Combined, the two can process massive text data effectively.

This thesis studies the procedures and principles of distributed processing in Hadoop and, on that basis, parallelizes each stage of the text classification process. It uses statistical methods to compute the conditional and prior probabilities of the Bayesian algorithm, and it adopts a parallel training method in which the samples are divided into groups. With these methods it implements two distributed text classification systems, one based on Naive Bayes and one based on the support vector machine (SVM). A comparison with a stand-alone system shows that text classification on Hadoop is more efficient than on a single machine, while still achieving good classification results.
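To make the counting idea behind the Naive Bayes step concrete, the following is a minimal sketch in plain Python, not the thesis's Hadoop code, which the abstract does not show. It imitates a map phase that emits count pairs and a reduce phase that sums them in memory, then derives the prior and conditional probabilities from those counts. The function names, the key layout (`"doc"` and `"word"` tuples), whitespace tokenization, and add-one smoothing are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def map_document(label, text):
    """Map step (hypothetical key layout): emit (key, 1) pairs for one labelled document."""
    yield ("doc", label), 1                  # contributes to the prior of this class
    for word in text.lower().split():        # assumption: whitespace tokenization
        yield ("word", label, word), 1       # contributes to P(word | class)


def reduce_counts(pairs):
    """Reduce step: sum the values that share a key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def train(documents):
    """Run map then reduce over (label, text) pairs and organize the counts."""
    pairs = (pair for label, text in documents
             for pair in map_document(label, text))
    counts = reduce_counts(pairs)
    doc_counts = {}
    word_counts = defaultdict(dict)
    vocabulary = set()
    for key, value in counts.items():
        if key[0] == "doc":
            doc_counts[key[1]] = value
        else:
            word_counts[key[1]][key[2]] = value
            vocabulary.add(key[2])
    return doc_counts, dict(word_counts), vocabulary


def classify(text, model):
    """Return the class with the highest log posterior (add-one smoothing assumed)."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in doc_counts.items():
        score = math.log(n_docs / total_docs)            # log prior
        label_words = word_counts.get(label, {})
        total_words = sum(label_words.values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue                                  # skip unseen words
            count = label_words.get(word, 0)
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Because each document is mapped independently and the reduce step only adds integers, the same counting could in principle be spread across many machines, which is the property a MapReduce job relies on. On a real cluster the pairs would be shuffled by key between mappers and reducers instead of being summed in one dictionary.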
Keywords/Search Tags: Text Classification, Hadoop, Naive Bayes, SVM