
A Parallel Decision Tree Using Sampling Splits With Estimation

Posted on: 2015-11-19
Degree: Master
Type: Thesis
Country: China
Candidate: Y Y Yang
GTID: 2298330452959586
Subject: Computer Science and Technology
Abstract/Summary:
Decision tree learning is an important branch of inductive learning; in essence, a decision tree is a set of classification rules generalized from the training data. The decision tree has become one of the most popular classification models because of its high efficiency and prediction accuracy, its good readability, and its robustness to noise.

In recent years, with the rapid development of information and network technology and the widespread use of computers, the volume of generated data has been increasing day by day, and classification of very large datasets has become an important research area in machine learning and data mining. The precision of a decision tree classification model depends on the scale of the training dataset. However, because of data storage problems, memory bottlenecks, and high time complexity when processing massive datasets, existing decision tree classification methods are difficult to apply widely. In addition, their efficiency is too low when dealing with continuous attributes. As a result, further improving the performance of the decision tree so that it meets the requirements of developing data mining technology has important theoretical and practical significance.

Considering the disadvantages of existing decision tree methods, in this thesis we study the parallelization of decision tree algorithms based on sampling-splits techniques and propose a parallel decision tree classifier using sampling splits with estimation, which is suitable for handling large datasets. Our main work is as follows:

1. To improve the accuracy and efficiency of the decision tree when dealing with continuous attributes, we use the SSE method, which can greatly reduce the computational cost of finding the best split point.
2. We design the MRSPDT algorithm and bound its error through theoretical analysis, and we then implement MRSPDT on the Hadoop platform.
3. Experiments conducted on benchmark databases demonstrate the efficiency and scalability of the algorithm.
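
The cost saving behind sampling-based split search on a continuous attribute can be illustrated with a minimal sketch. This is not the SSE method or MRSPDT, whose details are not given in this abstract; it is a generic illustration that assumes Gini impurity as the split criterion and a uniform random sample, and the function names (`gini`, `weighted_gini`, `sampled_best_split`) are hypothetical. Candidate thresholds are taken only from the sample, so the search cost depends on the sample size rather than on the full dataset size.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def weighted_gini(values, labels, threshold):
    """Size-weighted Gini impurity after splitting at `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return (len(left) * gini(left) + len(right) * gini(right)) / n


def sampled_best_split(values, labels, sample_size, seed=0):
    """Pick a split threshold using only a uniform random sample.

    Assumption: this is a generic sampling heuristic, not the thesis's SSE.
    Returns None when the sample has fewer than two distinct values.
    """
    rng = random.Random(seed)
    k = min(sample_size, len(values))
    idx = rng.sample(range(len(values)), k)
    s_values = [values[i] for i in idx]
    s_labels = [labels[i] for i in idx]

    # Candidate thresholds: midpoints between consecutive distinct sampled values.
    distinct = sorted(set(s_values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    if not candidates:
        return None

    # Score each candidate on the sample only, never on the full data.
    return min(candidates, key=lambda t: weighted_gini(s_values, s_labels, t))


if __name__ == "__main__":
    values = list(range(100))
    labels = [0 if v < 50 else 1 for v in values]
    t = sampled_best_split(values, labels, sample_size=30)
    print("threshold from sample:", t)
    print("impurity on full data:", weighted_gini(values, labels, t))
```

A threshold chosen this way is only an estimate of the best split on the full data; quantifying that gap is the kind of error bound the abstract says is derived for MRSPDT.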
Keywords/Search Tags: Classification, Decision Tree, Sampling, Parallel