
Research on Classification Algorithms Using HADOOP

Posted on: 2017-05-05
Degree: Master
Type: Thesis
Country: China
Candidate: Q H Dong
Full Text: PDF
GTID: 2308330482982432
Subject: Computer Science and Technology
Abstract/Summary:
Classification is one of the important research directions in the field of data mining. Bayesian classification is a classification method based on classical probability theory and statistics. When faced with massive amounts of data, however, traditional Bayesian classification suffers from slow processing, low efficiency, and unstable classification results. Cloud computing is a form of Internet-based parallel computing that has clear advantages for data sets that are large, varied in data type, complex in structure, and fast-growing. Hadoop is a cloud computing platform developed by the Apache Software Foundation that provides a low-cost scheme for parallel data processing. Parallelizing the Bayesian classification method, so that it retains its accuracy and stability in a big data environment, therefore has both theoretical significance and practical application value.
Bayesian classification is a statistical method of data classification. Naive Bayes is the most basic and most commonly used Bayesian classifier. It requires the attribute values of the data to be independent of each other, but in real data the attributes are generally not independent, or their independence is difficult to determine, which limits both the scope of application and the classification performance of naive Bayes. To address this problem, a weighted Bayesian classification method based on the chi-square distribution statistic is proposed, which relaxes the independence assumption of naive Bayes that is hard to satisfy in practice. Classification accuracy is compared and analyzed on classical data sets, and the experimental results verify that the method is an effective improvement.
Traditional mail filtering systems have shortcomings such as low filtering efficiency, slow speed on massive e-mail data, and high consumption of computing resources. Because the naive Bayesian classification method is highly parallelizable, self-organizing, and computationally intensive, a parallel Bayesian classification algorithm based on cloud computing is proposed for a Hadoop cluster. The traditional Bayesian classification algorithm is parallelized using the MapReduce framework of the Hadoop platform, which exploits the advantages of parallel computing to improve the traditional Bayesian classification model and increase filtering speed and efficiency.
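The abstract does not give the thesis's actual implementation. As a rough illustration only, the sketch below shows how naive Bayes training can be expressed as a map step (emit a count of 1 per class and per class/word pair) and a reduce step (sum the counts per key), which is the general shape a MapReduce job takes. It runs locally in plain Python rather than on a Hadoop cluster, it does not include the chi-square weighting, and every name in it (`map_record`, `reduce_counts`, `train`, `classify`, `TOY_MAIL`) as well as the toy mail data and the use of add-one smoothing are assumptions made for the example.

```python
import math
from collections import defaultdict


def map_record(label, text):
    """Map step: emit a count of 1 for the class and for each (class, word) pair."""
    yield ("class", label), 1
    for word in text.lower().split():
        yield ("word", label, word), 1


def reduce_counts(pairs):
    """Shuffle and reduce step: sum all values that share the same key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def train(records):
    """Run the map step over every record, then reduce the emitted pairs."""
    pairs = (pair for label, text in records for pair in map_record(label, text))
    return reduce_counts(pairs)


def classify(counts, text):
    """Pick the class with the highest log-posterior under naive Bayes."""
    labels = sorted(key[1] for key in counts if key[0] == "class")
    vocab = {key[2] for key in counts if key[0] == "word"}
    n_docs = sum(counts[("class", c)] for c in labels)
    best_label, best_score = None, -math.inf
    for c in labels:
        n_words = sum(v for k, v in counts.items() if k[0] == "word" and k[1] == c)
        score = math.log(counts[("class", c)] / n_docs)
        for word in text.lower().split():
            # Add-one (Laplace) smoothing so unseen words do not zero the product.
            seen = counts.get(("word", c, word), 0)
            score += math.log((seen + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_label, best_score = c, score
    return best_label


# Hypothetical toy data standing in for a labelled e-mail corpus.
TOY_MAIL = [
    ("spam", "win money now"),
    ("spam", "cheap money offer"),
    ("ham", "meeting schedule today"),
    ("ham", "project meeting notes"),
]
```

Calling `train(TOY_MAIL)` returns the summed counts, and `classify(counts, "cheap money")` then scores a new message against them. Because the counts for different keys are independent of each other, the map and reduce steps can be spread across many machines, which is the property that makes this kind of classifier a natural fit for a parallel platform.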
Keywords/Search Tags: classification algorithm, naive bayesian, HADOOP, cloud computing, MapReduce