Research On Parallel Decision Tree Algorithm Based On Spark

Posted on: 2021-04-07
Degree: Master
Type: Thesis
Country: China
Candidate: X Lu
Full Text: PDF
GTID: 2428330611487193
Subject: Computer application technology
Abstract/Summary:
Data mining is an important means of exploring large-scale data sets. It reveals the hidden rules in a data set, and applying those rules in different scenarios can solve practical problems directly. As an important branch of data mining, decision tree classification is easy to understand and flexible to apply, which has made it widely used.

Distributed parallel decision tree algorithms are a major change from the traditional decision tree algorithm. They free the model-building process from a single machine and use several machines computing together to build the tree. The advantage of the multi-machine mode is that the computing task is no longer concentrated on one machine: it is distributed evenly across the data nodes of the cluster, which cooperate to complete computation-intensive tasks. This mode therefore places no high demands on the configuration of individual data nodes, and because the computing tasks on the data nodes are independent of each other, they can run in parallel.

Among the many distributed parallel decision tree algorithms, the Spark decision tree algorithm (MLlib DecisionTree, abbreviated MLDT here), which is based on in-memory computing, is widely used. Spark processes data 10 to 100 times faster than Hadoop and is better suited to large-scale data sets, so training a decision tree model on a large data set is faster on Spark. However, the MLDT algorithm also has shortcomings. One is the large amount of information transferred between data nodes while the tree is built across the cluster, which occupies a large share of network resources. Another is the number of information entropy calculations performed when a tree node splits.

This thesis takes MLDT as its basis and proposes a parallel decision tree algorithm (SPDT) on the Spark platform. SPDT improves on MLDT in three main ways. First, the training data set is preprocessed and partitioned by columns, so that each data node in the cluster stores complete attributes. Information entropy can then be computed independently on each node during tree building, which reduces the network resources consumed by information transfer between nodes. Second, the data stored on each data node is compressed to leave more space for the computing task. Third, a continuous attribute discretization method based on judging the classes of boundary points is used to reduce the number of information entropy calculations, and the weighted average information gain ratio is used as the criterion for selecting tree nodes, which reduces the bias of node selection toward attributes with many distinct values.

The experimental results show that the improved algorithm builds the distributed decision tree model more efficiently while maintaining classification accuracy similar to that of MLDT.
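The boundary-point idea mentioned in the abstract can be illustrated with a small single-machine sketch. This is not the thesis's SPDT implementation, which is not given here; it only assumes the standard observation that an entropy-optimal threshold for a continuous attribute falls between two adjacent values whose class labels differ, so other midpoints need not be evaluated. The function names `entropy`, `boundary_cut_points`, and `best_split` are hypothetical, and plain information gain is used in place of the weighted average information gain ratio, whose exact definition the abstract does not state.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def boundary_cut_points(values, labels):
    """Candidate thresholds: midpoints between adjacent distinct values
    that are not both pure in the same class (assumed boundary-point rule)."""
    classes_by_value = {}
    for v, y in zip(values, labels):
        classes_by_value.setdefault(v, set()).add(y)
    distinct = sorted(classes_by_value)
    cuts = []
    for lo, hi in zip(distinct, distinct[1:]):
        a, b = classes_by_value[lo], classes_by_value[hi]
        if len(a) > 1 or len(b) > 1 or a != b:
            cuts.append((lo + hi) / 2)
    return cuts


def best_split(values, labels):
    """Return (threshold, information gain) over boundary cut points only,
    or None if the attribute has no boundary point."""
    base = entropy(labels)
    n = len(labels)
    best = None
    for t in boundary_cut_points(values, labels):
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        gain = (base
                - (len(left) / n) * entropy(left)
                - (len(right) / n) * entropy(right))
        if best is None or gain > best[1]:
            best = (t, gain)
    return best


# Six sorted values give five possible midpoints, but only two are boundary points.
values = [1, 2, 3, 4, 5, 6]
labels = ["a", "a", "a", "b", "b", "a"]
print(boundary_cut_points(values, labels))
print(best_split(values, labels))
```

In this toy example entropy is evaluated at two thresholds instead of five, which is the kind of saving the abstract attributes to boundary-point discretization.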
Keywords/Search Tags: Distributed, Decision tree, Spark, Data partition, Data compression, Boundary points