
Research On Parallel Decision Tree Algorithm Based On Hadoop Platform

Posted on: 2019-04-23  Degree: Master  Type: Thesis
Country: China  Candidate: T Lv  Full Text: PDF
GTID: 2428330566991429  Subject: Software engineering
Abstract/Summary:
The development of science and technology brings convenience to people, but it also brings new problems and new challenges. When we use the Internet to transmit and exchange information, a large amount of data is generated. Traditional single-machine algorithms cannot meet present computing needs, which pushes people to look for new technologies to process and analyze large amounts of data. Parallel computing and big data platforms are the leading solutions at present. Classification is an important data mining task: it classifies and predicts transactions and helps people understand things correctly. Hadoop is a distributed system infrastructure with the advantages of cross-platform support and high fault tolerance. Using distributed block storage, it can handle large-scale data with high concurrency and high fault tolerance. This thesis carries out its research on classification and parallel algorithms on Hadoop in two main parts.

(1) Based on a study of the C4.5 algorithm, a Hadoop-based parallel classification algorithm, HD_C4.5, is proposed and implemented in parallel with MapReduce. HD_C4.5 makes full use of the MapReduce framework to parallelize, as far as possible, the key task of computing the attribute selection measure, which reduces the computing resources consumed in finding the best split attribute and improves efficiency. The experiment is completed on a fully distributed Hadoop cluster. Comparative analysis of the results shows that the algorithm proposed in this thesis has better performance.

(2) This thesis proposes an improved pruning algorithm for the Hadoop-based parallel shared decision tree mining algorithm. The algorithm reduces the influence of an unreliable training set on the model by using the number of classification errors under uncertain probability as the basis for pruning decisions, and the advantage of the improved algorithm becomes more obvious as the data set grows. A Hadoop big data platform is built and a comparative experiment is carried out. The results show that the improved algorithm is less time-consuming, more efficient, and better adapted to the needs of big data processing.
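The attribute-selection step described in part (1) can be illustrated with a minimal single-process sketch. It is not the thesis's HD_C4.5 implementation, which this abstract does not give. It only shows, under stated assumptions, how the counts needed for the C4.5 gain ratio can be produced by a map step and a reduce step. The function names (`map_record`, `reduce_counts`, `gain_ratios`, `best_split`), the record layout (categorical features followed by a class label), and the in-memory simulation in place of a real Hadoop job are all hypothetical.

```python
from collections import Counter, defaultdict
from math import log2


def map_record(record):
    """Map step: emit ((attribute index, attribute value, class label), 1)."""
    # Assumption: the last field of each record is the class label.
    *features, label = record
    for index, value in enumerate(features):
        yield (index, value, label), 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals


def entropy(counts):
    """Shannon entropy (in bits) of a list of non-negative counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def gain_ratios(totals):
    """Compute the C4.5 gain ratio of every attribute from the reduced counts."""
    # Regroup the flat counts as attribute -> value -> class label -> count.
    per_attribute = defaultdict(lambda: defaultdict(Counter))
    for (index, value, label), count in totals.items():
        per_attribute[index][value][label] += count

    ratios = {}
    for index, by_value in per_attribute.items():
        class_totals = Counter()
        for label_counts in by_value.values():
            class_totals.update(label_counts)
        n = sum(class_totals.values())
        base = entropy(list(class_totals.values()))
        value_sizes = [sum(lc.values()) for lc in by_value.values()]
        # Entropy of the class label after splitting on this attribute.
        conditional = sum(
            size / n * entropy(list(lc.values()))
            for size, lc in zip(value_sizes, by_value.values())
        )
        # Split information penalizes attributes with many distinct values.
        split_info = entropy(value_sizes)
        gain = base - conditional
        ratios[index] = gain / split_info if split_info > 0 else 0.0
    return ratios


def best_split(records):
    """Simulate map and reduce locally, then pick the highest gain ratio."""
    pairs = (pair for record in records for pair in map_record(record))
    ratios = gain_ratios(reduce_counts(pairs))
    return max(ratios, key=ratios.get), ratios
```

Calling `best_split` on a list of tuples such as `("sunny", "hot", "no")` returns the index of the attribute with the highest gain ratio together with the ratio of every attribute. In a real cluster the map and reduce functions would run as distributed tasks over data blocks, and only the small table of counts would reach the node that evaluates the gain ratios.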
Keywords/Search Tags:Hadoop, Parallel Decision Tree, Parallel Sharing, Uncertainty Probability, Pruning Algorithm