The Research On Classification And Regression Tree's Parallelization Based On Spark Platform

Posted on:2017-12-27

Degree:Master

Type:Thesis

Country:China

Candidate:X N Wang

Full Text:PDF

GTID:2348330509453994

Subject:Computer software and theory

Abstract/Summary:

With the development of computer and information technology, the volume of data is growing exponentially. To make use of this data, data mining technology has developed rapidly, and classification, an important data mining technique, has been widely developed along with it.

Decision tree classification is an important branch of classification technology. This thesis studies decision tree classification in depth and details its key concepts, such as the process of building a decision tree and the major research topics in the field: data preprocessing, attribute selection strategy, decision tree pruning, decision tree parallelization, and incremental decision trees. Of these, the attribute selection strategy is the most important; it is the main point on which decision tree algorithms differ from one another.

The thesis also describes the CART decision tree algorithm in detail, including its attribute selection strategy, its different treatment of the two types of attributes (discrete and continuous), and its pruning algorithm. It then illustrates the two key phases of the CART algorithm, "tree building" and "pruning", with examples.

It then introduces the Spark distributed processing framework, describes its features in detail, and shows how it differs from the Hadoop distributed processing framework.

Finally, the thesis analyzes the shortcomings of the CART decision tree algorithm and proposes improvements: improving the parallelization of the algorithm and reducing unnecessary calculations. It also combines these improvements with the Spark distributed processing framework to improve the parallelism of CART in another way. It closes by describing the Spark cluster environment and the experimental procedure, and shows through experiments that the improvements effectively increase the computational efficiency of the CART algorithm.
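As an illustration of the attribute selection step for a continuous attribute, the following is a minimal single-machine sketch, not the thesis's own implementation. It assumes the standard CART classification criterion (Gini impurity) and candidate thresholds at midpoints between adjacent sorted values; the function names `gini` and `best_threshold` are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Find the binary split `value <= threshold` with the lowest weighted Gini.

    Returns (threshold, weighted_gini); threshold is None if no split exists
    (for example, when all values are equal).
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # equal values cannot be separated by a threshold
        threshold = (low + high) / 2  # assumed convention: midpoint candidates
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best
```

For example, `best_threshold([1, 2, 3, 4], ["a", "a", "b", "b"])` returns `(2.5, 0.0)`, since splitting at 2.5 separates the two classes perfectly. Each candidate threshold is scored independently of the others, which is one reason this step is a natural target for parallelization.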

Keywords/Search Tags:

Data Mining, Decision Tree, CART, SPARK, Parallelism

Related items

1	Improved Algorithm And Application Of CART Decision Tree Based On GA
2	Data Mining Technology In The Audit
3	Research On Cybersecurity Analysis Method For Big Data
4	The Application Of Data Mining's Decision Tree Induction In Accelerated Freight Transportation
5	Research And Design Of The Off-network User Analyzer Of Mobile Telecommunications On The Basis Of Decision Tree
6	Research On Parallel Decision Tree Algorithm Based On Spark
7	Data Mining Applications In The Census Data
8	The Research Of Decision Tree Algorithm In Data Mining
9	The Parallel Research On Decision Tree Classification Algorithm Based On Hadoop
10	The Design And Implementation Of Bank Customer Rating System Based On Data Mining