
Study And Parallel Implementation Of Decision Tree Classification Based On MapReduce

Posted on: 2012-02-06  Degree: Master  Type: Thesis
Country: China  Candidate: M Zhu  Full Text: PDF
GTID: 2218330338968489  Subject: Computer Science and Technology
Abstract/Summary:
Classification is an important topic in data mining and machine learning and is widely used in many fields. Existing classification methods include Bayesian classification, decision trees, neural networks, support vector machines, and others. Because decision trees are quick to construct, simple in structure, and highly accurate, they have become one of the most popular classification methods. Some of the most widely used decision tree algorithms are ID3, SLIQ, and SPRINT. The accuracy of a decision tree classification model depends on the scale of the training dataset, which means that the computational complexity becomes extremely high when processing massive datasets. This makes the decision tree method hard to apply widely, so studying the parallelization of decision tree classification is necessary.

This thesis systematically studies the parallelization of decision tree algorithms based on MapReduce. It first reviews the main ideas of several commonly used decision tree algorithms, analyzes several existing parallel programming models, and concludes that MapReduce is a suitable model for handling massive datasets. The parallelization of a decision tree algorithm can be divided into the following steps: partitioning the training dataset, determining the best split attribute of each node in parallel, and dividing the attribute lists of parent nodes among their child nodes in parallel. The thesis studies the parallel design of decision tree algorithms on three implementation frameworks of MapReduce, and implements the SPRINT algorithm on Phoenix. Experimental results show that, compared with implementations based on other parallel programming models, decision tree classification parallelized with MapReduce is not only easier to program but also performs better and achieves a higher speedup ratio as the number of computing nodes increases.
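The three steps named in the abstract (partitioning the training data, finding the best split attribute in parallel, and dividing records among child nodes) can be illustrated with a minimal Python sketch. This is not the thesis's Phoenix implementation: the function names, the toy records, and the restriction to categorical attributes with a multiway split scored by weighted Gini index are all assumptions made for illustration, and the map phase runs sequentially here as a stand-in for a real MapReduce runtime.

```python
from collections import Counter, defaultdict
from functools import reduce


def partition(records, n_parts):
    """Step 1: split the training records into n_parts chunks."""
    return [records[i::n_parts] for i in range(n_parts)]


def map_counts(part):
    """Map: count (attribute, value, label) occurrences in one partition."""
    counts = Counter()
    for features, label in part:
        for attr, value in features.items():
            counts[(attr, value, label)] += 1
    return counts


def reduce_counts(a, b):
    """Reduce: merge the partial counts of two partitions."""
    return a + b


def gini(label_counts):
    """Gini index of one group of records, given its class counts."""
    total = sum(label_counts.values())
    return 1.0 - sum((c / total) ** 2 for c in label_counts.values())


def find_best_split(records, n_parts=2):
    """Step 2: choose the attribute with the lowest weighted Gini index."""
    # A real runtime would execute map_counts on the partitions in parallel.
    mapped = map(map_counts, partition(records, n_parts))
    merged = reduce(reduce_counts, mapped, Counter())

    by_attr = defaultdict(lambda: defaultdict(Counter))
    for (attr, value, label), c in merged.items():
        by_attr[attr][value][label] += c

    scores = {}
    for attr, values in by_attr.items():
        n = sum(sum(lc.values()) for lc in values.values())
        scores[attr] = sum(
            sum(lc.values()) / n * gini(lc) for lc in values.values()
        )
    return min(scores, key=scores.get), scores


def split_records(records, attr):
    """Step 3: hand each record to the child node for its attribute value."""
    children = defaultdict(list)
    for features, label in records:
        children[features[attr]].append((features, label))
    return dict(children)


if __name__ == "__main__":
    # Hypothetical toy data: (features, class label).
    data = [
        ({"outlook": "sunny", "windy": "no"}, "play"),
        ({"outlook": "sunny", "windy": "yes"}, "play"),
        ({"outlook": "rainy", "windy": "no"}, "stay"),
        ({"outlook": "rainy", "windy": "yes"}, "stay"),
    ]
    best, scores = find_best_split(data)
    print(best, scores)
    print(split_records(data, best))
```

Only per-partition counts cross the map/reduce boundary in this sketch, which is why the split evaluation can be spread over many workers. SPRINT itself also handles continuous attributes through pre-sorted attribute lists, which the sketch leaves out.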
Keywords/Search Tags:decision tree classification, SPRINT, parallel programming model, MapReduce, Phoenix