The Research Of Data Mining Based On HADOOP

Posted on: 2011-04-14
Degree: Master
Type: Thesis
Country: China
Candidate: C Z Yang
Full Text: PDF
GTID: 2178330338482901
Subject: Computer software and theory
Abstract/Summary:
As computer and Internet technology spreads into more aspects of human society, data is growing explosively. Processing and storing large data sets has become a new challenge for enterprises. A central challenge for data mining is how to extract valuable, understandable knowledge from massive data quickly, efficiently, and at low cost, in order to support business decisions.

The emergence of cloud computing brings new opportunities for data mining. Cloud computing distributes storage and computation across multiple nodes in a cluster, which makes it possible to store and process huge data sets. Because a cluster can be built from many inexpensive computers instead of a high-priced server, cloud computing greatly reduces costs. With the large storage capacity and computing power that cloud computing provides, data mining enters the era of cloud-based data mining.

HADOOP is an open source Apache project for building cloud platforms. The HADOOP framework makes clusters easier, faster, and more effective to implement. HADOOP uses HDFS (its distributed file system) to provide large-file storage and fault tolerance, and uses the MapReduce programming model for computation. To apply HADOOP to data mining, a key question is how to parallelize traditional data mining algorithms. Some traditional algorithms are easy to parallelize because of their characteristics, while others are hard to parallelize. Algorithms that can be parallelized can be expressed in the MapReduce programming model and ported to the HADOOP platform, where they complete data mining tasks more efficiently in parallel.

This paper describes the cloud computing platform and the core architecture and operating mechanisms of HADOOP in detail. We then present a HADOOP-based data mining system model that builds on the traditional system. Finally, we parallelize the SPRINT decision tree algorithm and port it to the HADOOP platform. After describing the algorithm in detail, we verify its validity experimentally.
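To make the idea of parallelizing a decision tree step with MapReduce concrete, the following is a minimal sketch and not the algorithm from the thesis. It simulates the map and reduce phases in a single Python process (no HADOOP API is used) and scores candidate splits with the Gini index, the split measure SPRINT relies on. The function names, the toy records, and the simplification to one partition per categorical value (SPRINT itself evaluates binary splits over attribute lists) are all assumptions made for illustration.

```python
from collections import defaultdict


def map_record(record):
    """Map step: emit ((attribute, value, class_label), 1) for every attribute."""
    features, label = record
    for attr, value in features.items():
        yield (attr, value, label), 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def run(chunks):
    """Each chunk stands in for an input split handled by a separate mapper."""
    pairs = [kv for chunk in chunks for rec in chunk for kv in map_record(rec)]
    return reduce_counts(pairs)


def gini(class_counts):
    """Gini index of one partition, given its class -> count histogram."""
    n = sum(class_counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in class_counts.values())


def gini_split(counts, attr):
    """Weighted Gini index when splitting on attr, one partition per value."""
    partitions = defaultdict(dict)
    for (a, value, label), c in counts.items():
        if a == attr:
            partitions[value][label] = c
    total = sum(sum(p.values()) for p in partitions.values())
    return sum(sum(p.values()) / total * gini(p) for p in partitions.values())


# Hypothetical toy data: two chunks of (features, class_label) records.
chunks = [
    [({"outlook": "sunny", "windy": "no"}, "yes"),
     ({"outlook": "sunny", "windy": "yes"}, "yes")],
    [({"outlook": "rainy", "windy": "no"}, "no"),
     ({"outlook": "rainy", "windy": "yes"}, "no")],
]
counts = run(chunks)
best_attr = min(["outlook", "windy"], key=lambda a: gini_split(counts, a))
```

In this sketch the mappers only need their own chunk of records, and the reducer only needs the emitted counts, which is the property that lets the counting step spread across cluster nodes. The attribute with the lowest weighted Gini index would be chosen as the split.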
Keywords/Search Tags: Cloud computing, data mining, HADOOP, SPRINT, parallel computing