
The Research On Classification And Regression Tree's Parallelization Based On Spark Platform

Posted on: 2017-12-27
Degree: Master
Type: Thesis
Country: China
Candidate: X N Wang
Full Text: PDF
GTID: 2348330509453994
Subject: Computer software and theory
Abstract/Summary:
With the development of computer and information technology, the volume of data is growing exponentially. To take advantage of this data, data mining technology has developed rapidly, and classification, an important data mining technique, has developed along with it.

Decision tree classification is an important branch of classification technology. This thesis studies decision tree classification in depth and details its key concepts, including the process of building a decision tree and the major research topics in the field: data preprocessing, attribute selection strategy, decision tree pruning, decision tree parallelization, and incremental decision trees. Of these, the attribute selection strategy is the most important, and it is the main point on which decision tree algorithms differ from one another.

The thesis then describes the CART decision tree algorithm in detail, including its attribute selection strategy, its different treatment of the two attribute types (discrete and continuous), and its pruning algorithm. The two main phases of the CART algorithm, "building the tree" and "pruning", are illustrated with examples.

It also introduces the Spark distributed processing framework, describes its features in detail, and shows how it differs from the Hadoop distributed processing framework.

Finally, the thesis analyzes the shortcomings of the CART decision tree algorithm and proposes improvements: increasing the parallelism of the algorithm itself and eliminating unnecessary calculations. These improvements are then combined with the Spark distributed processing framework to increase the parallelism of the CART algorithm further. The thesis closes by describing the Spark cluster environment and the experimental procedure, and the experiments show that the improvements effectively increase the computational efficiency of the CART algorithm.
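The attribute selection step described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation. It assumes the Gini index that CART conventionally uses for classification, places thresholds for continuous attributes at midpoints between sorted distinct values, and simplifies discrete attributes to one-vs-rest binary splits. All function names and the toy data are hypothetical. Because each attribute is scored independently, the scoring can be mapped in parallel. A thread pool stands in here for a distributed map, so the sketch shows that the tasks are independent rather than demonstrating a real speedup.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def weighted_gini(left, right):
    """Size-weighted Gini impurity of a binary partition."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_split_continuous(values, labels):
    """Best split of the form 'value <= t', with t at midpoints of sorted distinct values."""
    distinct = sorted(set(values))
    best = (float("inf"), None)
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        score = weighted_gini(left, right)
        if score < best[0]:
            best = (score, t)
    return best


def best_split_discrete(values, labels):
    """Best split of the form 'value == c' vs. the rest (a one-vs-rest simplification)."""
    best = (float("inf"), None)
    for c in sorted(set(values)):
        left = [y for v, y in zip(values, labels) if v == c]
        right = [y for v, y in zip(values, labels) if v != c]
        if not right:  # only one category present, nothing to split
            continue
        score = weighted_gini(left, right)
        if score < best[0]:
            best = (score, c)
    return best


def evaluate_attribute(task):
    """Score one attribute; independent of every other attribute."""
    name, values, labels, is_continuous = task
    finder = best_split_continuous if is_continuous else best_split_discrete
    score, split = finder(values, labels)
    return name, score, split


def best_attribute(columns, labels, continuous):
    """Score all attributes in parallel and return (name, score, split) with the lowest impurity."""
    tasks = [(name, vals, labels, name in continuous) for name, vals in columns.items()]
    with ThreadPoolExecutor() as pool:  # local stand-in for a distributed map
        results = list(pool.map(evaluate_attribute, tasks))
    return min(results, key=lambda r: r[1])


# Toy data, invented purely for illustration.
columns = {
    "age": [22, 25, 47, 52, 46, 56],
    "owns_home": ["no", "no", "yes", "yes", "no", "yes"],
}
labels = ["no", "no", "yes", "yes", "yes", "yes"]

name, score, split = best_attribute(columns, labels, continuous={"age"})
print(name, score, split)
```

On the toy data, the split on `age` separates the two classes completely, so it is chosen over `owns_home`. A full CART implementation would apply this selection recursively to each partition and then prune the resulting tree.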
Keywords/Search Tags: Data Mining, Decision Tree, CART, Spark, Parallelism