
The Research Of Decision Tree Mining Based On Hadoop

Posted on: 2016-02-05
Degree: Master
Type: Thesis
Country: China
Candidate: Z C Zhao
GTID: 2308330461967254
Subject: Software engineering
Abstract/Summary:
An era of large-scale data production, sharing, and application is beginning. The development and application of cloud computing technology bring new opportunities for data mining. The real value of big data is, for the most part, hidden beneath the surface, and the need to extract it in turn drives cloud computing forward. Data mining algorithms process data and uncover this hidden value, which helps companies base their decisions on it and keep pace with social development. However, current data mining algorithms take a long time to process massive data. An effective way to process massive data and extract its value is to combine traditional mining algorithms with mature cloud computing technology.

Hadoop is an open-source distributed system framework from Apache, developed in Java. Its core consists of HDFS and MapReduce. HDFS provides file storage with high fault tolerance and high read/write throughput. MapReduce provides a parallel programming framework with which users can develop parallel applications without knowing the details of distributed parallel programming. Hadoop thus offers both a storage platform and a parallel computing platform for massive data, which gives traditional data mining algorithms a basis for processing such data.

Building on a detailed analysis of current mature cloud computing platforms, this thesis analyzes the key technologies of the Hadoop platform, namely the HDFS file system and the MapReduce programming model. It then studies current data mining algorithms in depth, especially mature decision tree classification algorithms. On this basis, starting from the typical decision tree classification algorithm SPRINT, it proposes IRSPRINT, an optimized algorithm based on the RainForest framework, and then HIRSPRINT, a parallel algorithm for the Hadoop platform.

The experimental results show that HIRSPRINT achieves a high speedup ratio on the Hadoop platform when processing massive data, and effectively reduces the time SPRINT needs to construct the decision tree and process massive data. Overall, it effectively improves the ability of decision tree algorithms to handle massive data.
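
To make the combination of ideas above more concrete, the following is a minimal single-machine sketch in Python, not the thesis's HIRSPRINT implementation, whose details the abstract does not give. It assumes a map step that emits one count per (attribute, value, class) triple and a reduce step that sums those counts into per-attribute class-count tables, similar in spirit to the AVC-sets of the RainForest framework. It then scores each attribute with the Gini index, the split criterion SPRINT uses. All function names, the toy records, and the multiway split on categorical values are illustrative assumptions; SPRINT itself evaluates binary splits.

```python
from collections import Counter, defaultdict


def map_record(record, class_key):
    """Map step (assumed design): emit ((attribute, value, label), 1) pairs."""
    label = record[class_key]
    for attr, value in record.items():
        if attr != class_key:
            yield (attr, value, label), 1


def reduce_counts(pairs):
    """Reduce step: sum the counts into attribute -> value -> class counts."""
    avc = defaultdict(lambda: defaultdict(Counter))
    for (attr, value, label), n in pairs:
        avc[attr][value][label] += n
    return avc


def gini(class_counts):
    """Gini index of one partition: 1 - sum over classes of p_c squared."""
    total = sum(class_counts.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in class_counts.values())


def gini_split(value_counts):
    """Size-weighted Gini index of a multiway split on one attribute."""
    total = sum(sum(c.values()) for c in value_counts.values())
    return sum(
        sum(c.values()) / total * gini(c) for c in value_counts.values()
    )


def best_attribute(avc):
    """Pick the attribute whose split has the lowest weighted Gini index."""
    return min(avc, key=lambda attr: gini_split(avc[attr]))


# Hypothetical toy data set; "play" is the class label.
records = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]

# Simulate the map and reduce phases in a single process.
pairs = [kv for rec in records for kv in map_record(rec, "play")]
avc = reduce_counts(pairs)
print(best_attribute(avc))
```

Because the per-attribute tables are built only from summed counts, each mapper could in principle process its own block of records independently. Whether HIRSPRINT partitions the work this way is not stated in the abstract.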
Keywords/Search Tags:big data, cloud computing, Hadoop, parallel decision tree algorithm