As information technology advances, the volume of useful data keeps growing. These data contain abundant knowledge, and the most urgent task is to extract valuable information from massive data sets and turn it into organized knowledge. Different types of data call for different data mining methods. The decision tree algorithm is an important classification method and a typical data mining technique; it is widely used in data analysis because it is simple, efficient, and highly interpretable. Traditional decision tree algorithms, however, can no longer process massive data effectively, and parallelization offers a new way to handle data at this scale. Against this background, this paper focuses on three aspects. First, the decision tree algorithm is studied in depth and the C4.5 algorithm is optimized; the optimized algorithm is called C4.5_YH. C4.5_YH is applied to data from a bank in Portugal to identify potential customers who would subscribe to a deposit. The experimental results show that C4.5_YH requires less computation, improves classification accuracy, and produces a decision tree that is consistent with reality. Second, the current Hadoop framework is explored. The paper explains the programming model and working principle of MapReduce and describes the architecture and working principle of YARN in detail. A comparison between Hadoop 1.0 and YARN shows that the YARN framework has clear advantages in processing massive data. Finally, building on the detailed study of the decision tree algorithm, a parallel process is designed that covers attribute-level parallelism, parallel discretization of continuous attributes, node-level parallelism, and parallel pruning. The implementation of this parallelization is then described in detail, using the C4.5_YH algorithm as an example. The parallel computing experiment for C4.5_YH is carried out on the YARN framework of the Hadoop platform. The experimental results show that parallel computation of the decision tree algorithm on the YARN framework is efficient and reliable.
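The attribute-level parallelism mentioned above can be illustrated with a minimal sketch. It assumes the standard C4.5 gain ratio for categorical attributes, not the C4.5_YH optimization, whose details are not given here, and it imitates the map and reduce steps with plain Python functions rather than the Hadoop API. All function names (`map_record`, `reduce_counts`, `gain_ratios`) and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def map_record(record, label):
    """Map step: emit one ((attribute index, value, label), 1) pair per attribute."""
    for i, value in enumerate(record):
        yield (i, value, label), 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals


def entropy(counts):
    """Shannon entropy (base 2) of a collection of non-negative counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def gain_ratios(totals, label_counts):
    """Standard C4.5 gain ratio per attribute, computed from the reduced counts."""
    n = sum(label_counts.values())
    base = entropy(label_counts.values())
    # per_attr[i][value][label] = number of records with that value and label
    per_attr = defaultdict(lambda: defaultdict(Counter))
    for (i, value, label), c in totals.items():
        per_attr[i][value][label] += c
    ratios = {}
    for i, by_value in per_attr.items():
        groups = list(by_value.values())
        sizes = [sum(g.values()) for g in groups]
        conditional = sum(s / n * entropy(g.values()) for s, g in zip(sizes, groups))
        split_info = entropy(sizes)
        ratios[i] = (base - conditional) / split_info if split_info > 0 else 0.0
    return ratios


# Toy data (made up for illustration): two categorical attributes and a label.
rows = [(("a", "x"), "yes"), (("a", "y"), "yes"),
        (("b", "x"), "no"), (("b", "y"), "no")]

pairs = [kv for record, label in rows for kv in map_record(record, label)]
totals = reduce_counts(pairs)
label_counts = Counter(label for _, label in rows)
ratios = gain_ratios(totals, label_counts)
best_attribute = max(ratios, key=ratios.get)
```

In this toy data set, attribute 0 separates the labels perfectly, so it receives a higher gain ratio than attribute 1. Because each emitted key carries its attribute index, the counts for different attributes can in principle be summed on separate reducers, which is the sense in which this kind of split evaluation parallelizes.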