The Research On Decision Tree Algorithm's Parallelization Based On Hadoop Platform

Posted on: 2013-01-04
Degree: Master
Type: Thesis
Country: China
Candidate: T M Pan
Full Text: PDF
GTID: 2218330374967522
Subject: Computer applications
Abstract/Summary:
The CEO of Google first proposed the concept of cloud computing at a search engine conference in 2006, and over the following five years the concept spread widely. Microsoft, Google, IBM, and other well-known companies have been carrying out research on cloud computing. The emergence of cloud computing platforms makes it possible to construct scalable, inexpensive, and efficient computing models.

In modern society, information is growing rapidly. This information contains large amounts of data, including personal data, transaction data, and industrial data. It is expected that by 2020 over one third of the digital information produced every year will be stored on, or processed by, cloud computing platforms. As data grows, diversified and personalized data mining is needed, and the traditional centralized data mining approach is no longer suitable. How to mine massive data efficiently and reliably on a cloud computing platform is a major issue.

This paper first introduces Google, IBM, Hadoop, and other cloud computing platforms, focusing on the key technologies of the MapReduce programming model and the Hadoop Distributed File System (HDFS). Second, it studies decision tree data mining and introduces a number of commonly used decision tree algorithms. Third, it describes the installation and deployment of the Hadoop platform. Finally, it targets two algorithms, C4.5 and SPRINT, and proposes improvements and parallelization strategies for them on the Hadoop cloud computing platform. The experimental results show that both algorithms can be parallelized on the Hadoop platform and that, on massive data, the decision tree algorithms achieve good running speed.

This paper thereby addresses a shortcoming of both decision tree algorithms: their difficulty in handling massive training sets when building the decision tree.
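
The abstract does not spell out the parallelization strategy, so the following is only an illustrative sketch and not the method used in the thesis. It assumes one common way to fit C4.5 attribute selection into the MapReduce model: the map step emits a count for every (attribute, value, class) triple, the reduce step sums those counts, and the gain ratio of each attribute is then computed from the aggregated counts. The function names, the toy data set, and the in-process simulation of map and reduce are all hypothetical; no Hadoop API is used.

```python
from collections import defaultdict
from math import log2


def map_record(record, label):
    """Map step: emit ((attribute, value, label), 1) for each attribute."""
    for attr, value in record.items():
        yield (attr, value, label), 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def entropy(counts):
    """Shannon entropy (base 2) of a collection of non-negative counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def gain_ratio(counts, attr, n_rows):
    """C4.5 gain ratio of one attribute, computed from aggregated counts."""
    by_value = defaultdict(lambda: defaultdict(int))
    by_label = defaultdict(int)
    for (a, value, label), c in counts.items():
        if a == attr:
            by_value[value][label] += c
            by_label[label] += c
    base = entropy(by_label.values())
    conditional = sum(
        sum(d.values()) / n_rows * entropy(d.values())
        for d in by_value.values()
    )
    split_info = entropy(sum(d.values()) for d in by_value.values())
    gain = base - conditional
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical toy training set: (attributes, class label).
rows = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "play"),
    ({"outlook": "rain", "windy": "no"}, "stay"),
    ({"outlook": "rain", "windy": "yes"}, "stay"),
]

# Simulate the map phase over all records, then the reduce phase.
emitted = [pair for record, label in rows for pair in map_record(record, label)]
counts = reduce_counts(emitted)

for attr in ("outlook", "windy"):
    print(attr, gain_ratio(counts, attr, len(rows)))
```

In this toy set, "outlook" separates the two classes perfectly while "windy" carries no information, so the first attribute would be chosen for the split. On a real cluster the map and reduce functions would run as distributed tasks over HDFS blocks, and only the small table of aggregated counts would need to reach the node that selects the split.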
Keywords/Search Tags: Cloud Computing, Hadoop, Decision Tree, Parallel