The Study Of Decision Tree Algorithm Based On Hadoop Platform

Posted on: 2016-06-26  Degree: Master  Type: Thesis
Country: China  Candidate: Y Sun  Full Text: PDF
GTID: 2308330473965522  Subject: Software engineering
Abstract/Summary:
With the development and popularization of the Internet, the amount of data has grown explosively. Traditional data mining algorithms that run on a single processor are limited by that machine's computing power and storage capacity, which makes them inefficient on very large data sets and greatly increases the burden on data mining technology. The Hadoop platform offers a new direction for optimizing data mining algorithms: it provides large computing power and storage capacity at low cost and with high fault tolerance, which suits it to processing massive data. Hadoop uses the HDFS distributed file system for file storage and the MapReduce programming model for distributed computing. Deploying traditional data mining algorithms on the Hadoop platform makes it possible to carry out mining tasks over huge amounts of data.

This paper introduces the key technologies and operating mechanisms of the Hadoop platform. It describes in depth how HDFS reads and writes files, and explains the principles and working mechanism of the MapReduce parallel programming model. It also briefly introduces the KDD process and data mining algorithms, and presents the characteristics of decision trees together with their construction and use for classification. Building on an introduction to the C4.5 algorithm and the typical SPRINT algorithm, it proposes a parallelization strategy for both and gives a detailed algorithm design that combines them with the MapReduce model. The paper then analyzes the improved algorithms, C4.5bH and SPRINTbH, theoretically.
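To make the idea of combining C4.5 with MapReduce concrete, the following is a minimal single-process sketch, not the thesis's C4.5bH design. It assumes that the map step emits one count per (attribute, value, class) triple, that the reduce step sums those counts, and that the gain ratio C4.5 uses to choose a split is then computed from the aggregated counts alone. The function names and the toy records are hypothetical, and no Hadoop API is used.

```python
import math
from collections import Counter, defaultdict


def map_record(features, label):
    """Map step: emit ((attribute index, value, class label), 1) per attribute."""
    for attr, value in enumerate(features):
        yield (attr, value, label), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts for each key."""
    counts = Counter()
    for key, n in pairs:
        counts[key] += n
    return counts


def entropy(freqs):
    """Shannon entropy (base 2) of a list of non-negative frequencies."""
    freqs = list(freqs)
    total = sum(freqs)
    return -sum(f / total * math.log2(f / total) for f in freqs if f)


def gain_ratio(counts, attr):
    """C4.5 gain ratio of one attribute, using only the aggregated counts."""
    by_value = defaultdict(Counter)   # attribute value -> class histogram
    class_totals = Counter()          # overall class histogram
    for (a, value, label), n in counts.items():
        if a == attr:
            by_value[value][label] += n
            class_totals[label] += n
    total = sum(class_totals.values())
    base = entropy(class_totals.values())
    conditional = sum(
        sum(hist.values()) / total * entropy(hist.values())
        for hist in by_value.values()
    )
    split_info = entropy(sum(hist.values()) for hist in by_value.values())
    return (base - conditional) / split_info if split_info else 0.0


# Hypothetical toy records: ((outlook, temperature), class label).
records = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rain", "hot"), "yes"),
    (("rain", "mild"), "yes"),
]
counts = reduce_counts(
    pair for features, label in records for pair in map_record(features, label)
)
best_attr = max(range(2), key=lambda a: gain_ratio(counts, a))
print("best attribute index:", best_attr)
```

Because the split criterion depends only on summed counts, the counting can in principle be distributed across nodes while the comparison of candidate attributes stays cheap; how the thesis actually partitions this work is not stated in the abstract.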
A comparison of classification accuracy between the original SPRINT and C4.5 algorithms and the improved algorithms shows that the improved algorithms leave classification accuracy unchanged. A study of scalability finds that the parallel computing time of the improved algorithms is significantly less than the serial computing time, and that the parallel computing time decreases as the number of cluster nodes increases. Finally, by combining the data-processing advantages of the C4.5 and SPRINT algorithms, a new data mining algorithm, CS, is designed, and a parallelization process for CS that differs from that of SPRINTbH is given. The algorithms are deployed on a Hadoop experimental platform; their accuracy is verified with five-fold cross-validation, and their scalability is verified by increasing the number of nodes. The experiments indicate that all three algorithms are useful.

Experimental analysis and comparison show that, on massive data, the improved C4.5bH and SPRINTbH algorithms achieve a higher speed-up on the Hadoop platform, and that their running time decreases as the number of processing nodes increases. The algorithms scale well. To a certain extent, this addresses the problems that the C4.5 and SPRINT algorithms face on massive data, namely the heavy computational workload and the long time needed to build a decision tree. The new CS algorithm, which combines the advantages of the SPRINT and C4.5 algorithms, achieves higher accuracy than either SPRINT or C4.5 in data mining.
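The two evaluation measures mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the thesis's experimental code: the interleaved fold assignment is an assumption (the abstract does not say how folds were formed), speed-up is taken in its usual sense of serial time divided by parallel time, and the timing values in the usage lines are hypothetical placeholders rather than results from the thesis.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train, test) index lists for k-fold cross-validation.

    Folds are interleaved: sample i belongs to fold i % k.
    """
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def speedup(serial_time, parallel_time):
    """Speed-up of a parallel run relative to the serial run."""
    return serial_time / parallel_time


# Each sample appears in exactly one test fold across the five rounds.
for train, test in k_fold_indices(10, k=5):
    print("test fold:", test, "train size:", len(train))

# Hypothetical timings in seconds, for illustration only.
print("speed-up:", speedup(serial_time=120.0, parallel_time=30.0))
```

Averaging the accuracy over the five test folds gives a single accuracy estimate per algorithm, and repeating the timing for different node counts gives the speed-up curve used to judge scalability.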
Keywords/Search Tags: Hadoop, Data Mining, C4.5, SPRINT, Parallelization, CS