Research On Parallel Decision Tree Algorithm Based On Spark

Posted on:2021-04-07

Degree:Master

Type:Thesis

Country:China

Candidate:X Lu

Full Text:PDF

GTID:2428330611487193

Subject:Computer application technology

Abstract/Summary:

Data mining is an important means of exploring large-scale data sets. It reveals the rules hidden in a data set, and applying these rules in different scenarios can solve practical problems directly. As an important branch of data mining, decision tree classification is easy to understand and flexible to operate, which has led to its wide use in practice. The emergence of distributed parallel decision tree algorithms is a major change from the traditional decision tree algorithm. Such an algorithm frees the construction of the decision tree model from single-machine operation and builds the tree through the joint computation of multiple machines. The advantage of the multi-machine mode is that the computing task is no longer concentrated on one machine but is distributed evenly across the data nodes of the cluster, which cooperate to complete computation-intensive tasks. This mode therefore places no high demands on the configuration of individual data nodes, and because the computing tasks of the data nodes are independent of one another, they can run in parallel.

Among the many distributed parallel decision tree algorithms, the decision tree algorithm of the Spark platform (MLlib DecisionTree, MLDT for short in this article), which is based on in-memory computing, is widely used. The Spark platform processes data 10 to 100 times faster than the Hadoop platform and is better suited to large-scale data sets, so decision tree models for big data sets can be trained more quickly on Spark. However, the MLDT algorithm also has many shortcomings, such as the large amount of information transferred between data nodes in the cluster during distributed tree construction, which occupies considerable network resources, and the information entropy calculations required when a tree node splits, among others.

This article takes MLDT as the basis of its research and proposes a parallel decision tree algorithm (SPDT) based on the Spark platform. SPDT makes three main improvements. First, the training data set is preprocessed and partitioned by columns so that each attribute is stored in full on a data node of the distributed cluster; the information entropy can then be calculated independently during tree building, which reduces the network resources consumed by information transfer between nodes. Second, the data stored on each data node is compressed to leave more space for the computing task. Finally, a continuous attribute discretization method based on class judgment at boundary points is used to reduce the number of information entropy calculations, and the weighted average information gain ratio is used as the criterion for selecting tree nodes, which reduces the bias of tree node selection toward attributes with many distinct values. The experimental results show that the improved algorithm builds distributed decision tree models more efficiently while maintaining a classification accuracy similar to that of the MLDT algorithm.
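
The boundary-point idea mentioned in the abstract can be illustrated with a minimal single-machine sketch. It is not the SPDT implementation: the abstract does not give the exact discretization procedure or the definition of the weighted average information gain ratio, so the code below assumes the common reading that a threshold only needs to be evaluated between adjacent attribute values whose class labels differ, and it substitutes the standard C4.5 gain ratio as the split criterion. All function names are hypothetical, and plain Python is used instead of Spark.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def boundary_cut_points(values, labels):
    """Candidate thresholds between adjacent distinct values whose classes differ."""
    # Collect the set of classes observed at each distinct attribute value.
    classes_at = {}
    for v, y in zip(values, labels):
        classes_at.setdefault(v, set()).add(y)
    distinct = sorted(classes_at)
    cuts = []
    for lo, hi in zip(distinct, distinct[1:]):
        left, right = classes_at[lo], classes_at[hi]
        # Skip the cut only when both neighbours hold the same single class.
        if len(left) == 1 and left == right:
            continue
        cuts.append((lo + hi) / 2)
    return cuts


def gain_ratio(values, labels, threshold):
    """Standard C4.5 gain ratio of the binary split `value <= threshold`.

    Assumption: stands in for the thesis's weighted average information
    gain ratio, which the abstract does not define.
    """
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    total = len(labels)
    weights = [len(left) / total, len(right) / total]
    gain = entropy(labels) - (weights[0] * entropy(left) + weights[1] * entropy(right))
    split_info = -sum(w * log2(w) for w in weights if w > 0)
    return gain / split_info if split_info > 0 else 0.0


def best_split(values, labels):
    """Evaluate only the boundary points and return the best threshold, or None."""
    cuts = boundary_cut_points(values, labels)
    return max(cuts, key=lambda t: gain_ratio(values, labels, t), default=None)
```

For six sorted values whose first three belong to one class and last three to another, `boundary_cut_points` yields a single candidate instead of five midpoints, so the entropy is computed for one threshold only. In a column-partitioned layout such as the one the abstract describes, a routine of this kind could run on each attribute column independently.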

Keywords/Search Tags:

Distributed, Decision tree, Spark, Data partition, Data compression, Boundary points

Related items

1	Thermal Power Plant Energy Saving Analysis Based On Spark Big Data Platform
2	Design And Implementation Of Distributed Data Mining Algorithms Based On Spark
3	The Research On Classification And Regression Tree's Parallelization Based On Spark Platform
4	Research On The Classification Algorithm Of Unbalance Data Based On Spark
5	Research And Implementation On Anti-skew Spark Intermediate Data Partition Mechanism
6	Research On Optimization Methods Of Dynamic Equilibrium Partition Method For Data Skew In Spark Shuffle
7	The Research Of Nonparametric Clustering Boundary Detection Algorithm
8	Research On Test Data Partition Compression Of System-on-Chip
9	Research And Optimization Of Data Placement Method In Spark Partitioner
10	Research Of Data Skew On Spark Based On Improved Partition Method