Research On Classification Algorithm Used HADOOP

Posted on:2017-05-05

Degree:Master

Type:Thesis

Country:China

Candidate:Q H Dong

Full Text:PDF

GTID:2308330482982432

Subject:Computer Science and Technology

Abstract/Summary:

Classification is one of the important research directions in data mining. Bayesian classification is a method grounded in classical probability and statistics. When faced with massive data, however, Bayesian classification suffers from slow computation, low data-processing efficiency, and unstable classification results. Cloud computing is a form of Internet-based parallel computing that has clear advantages for data sets that are large, varied in type, complex in structure, and fast-growing. Hadoop is a cloud computing platform developed by the Apache Software Foundation that provides a low-cost scheme for parallel data processing. Parallelizing the Bayesian classification method, so that it retains its accuracy and stability in a big-data environment, therefore has both theoretical significance and practical value.

Bayesian classification is a statistical method of data classification. Naive Bayes is the most basic and most commonly used Bayesian classifier. It requires that the attribute values of the data be independent of one another, but in real data the attributes are generally not independent, or their independence is difficult to establish, which limits both the scope of application and the classification performance of naive Bayes. To address this problem, a weighted Bayesian classification method based on the chi-square distribution is proposed, which effectively relaxes the naive Bayes independence assumption that is hard to satisfy in practice. Classification accuracy is compared and analyzed on classical data sets, and the experimental results verify that the improvement is effective.

Traditional mail filtering systems suffer from low filtering efficiency, slow processing of massive e-mail data, and high consumption of computing resources. Because naive Bayes classification is highly parallelizable, self-organizing, and computationally intensive, a parallel Bayesian classification algorithm based on cloud computing and running on a Hadoop cluster is proposed. The traditional Bayesian classification algorithm is parallelized with the MapReduce framework of the Hadoop platform, exploiting the advantages of parallel computing to improve the traditional Bayesian classification model and raise filtering speed and efficiency.
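
The idea of parallelizing naive Bayes training for mail filtering can be illustrated with a minimal single-machine sketch in Python. It is not the thesis's implementation: the function names (`map_counts`, `reduce_counts`, `train`, `classify`), the `spam`/`ham` labels, whitespace tokenization, and add-one (Laplace) smoothing are all assumptions made for illustration. On a real Hadoop cluster the map and reduce steps would run as distributed MapReduce tasks rather than through Python's built-in `map` and `reduce`.

```python
import math
from collections import Counter
from functools import reduce


def map_counts(record):
    """Map step (hypothetical): turn one (label, text) record into partial counts."""
    label, text = record
    counts = Counter()
    counts[("doc", label)] += 1                 # one document seen for this class
    for word in text.lower().split():
        counts[("word", label, word)] += 1      # word occurrence within this class
    return counts


def reduce_counts(left, right):
    """Reduce step (hypothetical): merge two partial count tables by summing."""
    merged = Counter(left)
    merged.update(right)
    return merged


def train(records):
    """Each record is mapped independently, so the map step could run in parallel."""
    return reduce(reduce_counts, map(map_counts, records), Counter())


def classify(model, text, labels=("spam", "ham")):
    """Pick the label with the highest smoothed log-posterior."""
    vocab = {key[2] for key in model if key[0] == "word"}
    total_docs = sum(model[("doc", c)] for c in labels)
    best_label, best_score = None, -math.inf
    for c in labels:
        words_in_class = sum(
            n for key, n in model.items() if key[0] == "word" and key[1] == c
        )
        # Add-one smoothing on the prior and the likelihoods (an assumption).
        score = math.log((model[("doc", c)] + 1) / (total_docs + len(labels)))
        for word in text.lower().split():
            score += math.log(
                (model[("word", c, word)] + 1) / (words_in_class + len(vocab))
            )
        if score > best_score:
            best_label, best_score = c, score
    return best_label
```

Because the count tables are merged by plain addition, the order in which partial results are combined does not matter, which is what makes this training step a natural fit for a MapReduce job.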

Keywords/Search Tags:

classification algorithm, naive bayesian, HADOOP, cloud computing, MapReduce

Related items

1	Research And Application On Naive Bayes Classification Algorithm
2	Research On Algorithms For Naive Bayes Classification And Its Tools Based On Hadoop
3	Research On Decision Tree Classification Algorithm Based On Hadoop
4	Cloud Computing Research And Application Of Filtering Spam Messages Based On Bayesian Classification
5	Researches About Cloud Computing And Exploit And Test Hadoop Program
6	The Research And Application Of Text Classification Based On Cloud Computing
7	The Research Of Mapreduce Implementing Of Text Classification Algorithm Based On Mass Data
8	Research Of Data Mining Classification Algorithm Based On Cloud Computing And The Solar Wind Data
9	Research On Cloud Computing Technologies For Mass Spam Mail Filtering
10	Design And Implementation Of Data Mining Classification System Based On Hadoop