Research On Decision Tree Classification Algorithm Based On Hadoop

Posted on: 2014-07-29
Degree: Master
Type: Thesis
Country: China
Candidate: S D Lin
Full Text: PDF
GTID: 2208330422452542
Subject: Computer application technology
Abstract/Summary:
Internet applications are constantly changing and affect every aspect of people's lives, and the amount of data generated on the Internet is growing at a pace that is hard to imagine. Today, storing and processing massive data sets has become a major challenge for enterprises and has drawn more and more of their attention. To succeed in the future, enterprises need to manage their own data and also to obtain valuable information from the data of other organizations or enterprises. At the same time, the capacity to process massive data has become one of the core competencies of the modern enterprise. Data mining, which extracts valuable information from large volumes of data and then analyzes and transforms it into understandable knowledge, has become a hot topic because it helps enterprises make decisions scientifically.

In recent years, the rise of cloud computing has provided an important opportunity for the development of data mining technology and has become an effective way to address this problem. The cloud is a large, managed pool of highly virtualized resources. To achieve strong storage and computing capacity, the storage and computation of large amounts of data are distributed evenly across the cluster. Moreover, the nodes in the cluster are inexpensive computers, so expensive servers are not needed. With the emergence of cloud computing technology, data mining has entered a new, cloud-based era.

Data mining covers many areas, and decision tree classification is one of its classical methods. It is a very effective classification method and has attracted the attention of many researchers.
So far, many decision tree classification algorithms have been proposed, each with its own advantages in execution efficiency, scalability, comprehensibility of the results, or accuracy of the classification results. However, in an era of explosive data growth, the performance of these algorithms on huge amounts of data is barely passable. Cloud computing is a good way to deal with massive data, and deploying an algorithm on a cloud computing platform for distributed computation is an effective approach. Researchers have parallelized the classic ID3, C4.5, and SPRINT decision tree algorithms on cloud computing platforms in various ways, greatly improving the performance of the algorithms and their ability to analyze massive data.

Hadoop is an open source distributed cloud computing platform from the Apache Software Foundation. Its core consists of the Hadoop Distributed File System (HDFS) and MapReduce (an open source implementation of Google's MapReduce), and it provides users with a distributed infrastructure whose underlying details are transparent [9]. HDFS, with its high fault tolerance and high scalability, allows users to deploy Hadoop on inexpensive hardware, and the MapReduce distributed programming model allows users to develop parallel applications without understanding the underlying details of the distributed system. Users can therefore take full advantage of the cluster's computing and storage capacity to process vast amounts of data. Traditional data mining algorithms can be parallelized and then, according to their own characteristics, combined with the MapReduce programming model, so that porting them to the Hadoop platform allows data analysis tasks to be completed efficiently and in parallel.

We parallelize the C4.5 algorithm after studying existing parallelization schemes for decision tree classification algorithms.
We then study the core architecture and operating mechanism of the open source Hadoop cloud platform. Following the Hadoop MapReduce programming model, we give a detailed description of the parallel implementation of the C4.5 algorithm in MapReduce, together with its execution flow. Finally, the parallelized algorithm is run on the Hadoop platform to classify massive text data, in order to validate the efficiency and scalability of the algorithm.
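As an illustration of how C4.5 attribute selection can fit the MapReduce model, the following minimal Python sketch counts (attribute, value, class) triples in a map step, sums them in a reduce step, and computes the C4.5 gain ratio from the aggregated counts. It is not the thesis's implementation: it runs in a single process rather than on Hadoop, handles only categorical attributes, and the function names (`map_record`, `reduce_counts`, `gain_ratios`, `best_attribute`) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def map_record(features, label):
    """Map step: emit ((attribute_index, value, class_label), 1) per attribute."""
    for attr, value in enumerate(features):
        yield (attr, value, label), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts for each key."""
    counts = Counter()
    for key, n in pairs:
        counts[key] += n
    return counts


def entropy(counts):
    """Shannon entropy (base 2) of a collection of non-negative counts."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts)


def gain_ratios(counts):
    """Compute the C4.5 gain ratio of every attribute from aggregated counts."""
    # attribute -> value -> Counter of class labels
    per_attr = defaultdict(lambda: defaultdict(Counter))
    for (attr, value, label), n in counts.items():
        per_attr[attr][value][label] += n

    ratios = {}
    for attr, by_value in per_attr.items():
        class_totals = Counter()
        for class_counts in by_value.values():
            class_totals.update(class_counts)
        total = sum(class_totals.values())

        partitions = list(by_value.values())
        sizes = [sum(p.values()) for p in partitions]

        # Information gain = entropy before split - weighted entropy after split.
        before = entropy(class_totals.values())
        after = sum(s / total * entropy(p.values())
                    for s, p in zip(sizes, partitions))
        gain = before - after

        # Split information penalizes attributes with many distinct values.
        split_info = entropy(sizes)
        ratios[attr] = gain / split_info if split_info > 0 else 0.0
    return ratios


def best_attribute(ratios):
    """Pick the attribute index with the highest gain ratio."""
    return max(ratios, key=ratios.get)


if __name__ == "__main__":
    # Toy data set (an assumption for illustration): (features, class label).
    records = [
        (("a", "x"), "yes"),
        (("a", "y"), "yes"),
        (("b", "x"), "no"),
        (("b", "y"), "no"),
    ]
    emitted = (kv for feats, lab in records for kv in map_record(feats, lab))
    ratios = gain_ratios(reduce_counts(emitted))
    print(ratios, best_attribute(ratios))
```

On a real cluster the map and reduce functions would run as distributed tasks over HDFS input splits, and only the small table of aggregated counts would be needed to choose the split attribute at each tree node.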
Keywords/Search Tags: Cloud Computing, Hadoop MapReduce, Data Classification, C4.5 Algorithm