The Research Of Data Mining Based On HADOOP

Posted on:2011-04-14

Degree:Master

Type:Thesis

Country:China

Candidate:C Z Yang

GTID:2178330338482901

Subject:Computer software and theory

Abstract/Summary:

As computers and the Internet spread into more areas of human society, the volume of data is growing explosively. Processing and storing large data sets has become a new challenge for enterprises. A central challenge for data mining is how to extract valuable, understandable knowledge from massive data quickly, efficiently, and at low cost in order to support business decisions.

The emergence of cloud computing brings new opportunities for data mining. Cloud computing distributes storage and computation across multiple nodes in a cluster, which makes it possible to store and process very large data sets. Because a cluster can be built from many inexpensive computers instead of a high-priced server, cloud computing also greatly reduces cost. With the storage capacity and computing power that cloud computing provides, data mining enters a cloud-based era.

HADOOP is an open source Apache project for building cloud platforms. The HADOOP framework makes clusters easier, faster, and more effective to implement. It uses HDFS (the Hadoop Distributed File System) to provide large-file storage and fault tolerance, and the MapReduce programming model for computation. To apply HADOOP to data mining, a key question is how to parallelize traditional data mining algorithms. Some algorithms are easy to parallelize because of their own characteristics, while others are hard to parallelize. An algorithm that can be parallelized can be combined with the MapReduce programming model and ported to the HADOOP platform, where it completes data mining tasks more efficiently in parallel.

This thesis describes cloud computing platforms and the core architecture and operating mechanisms of HADOOP in detail. It then presents a data mining system model based on HADOOP, combined with the traditional system. Finally, the SPRINT decision tree algorithm is parallelized and ported to the HADOOP platform. The parallel algorithm is described in detail, and its validity is verified experimentally.
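
The abstract does not give the parallel algorithm itself, so the following is only a minimal sketch of how one step of decision tree induction could fit the MapReduce model. It assumes that choosing a split attribute by the Gini index (the measure SPRINT uses) can be reduced to counting class labels per attribute value: the map step emits one count per record and attribute, and the reduce step sums them. The toy records and the function names (`map_record`, `reduce_counts`, `split_gini`, `best_split`) are hypothetical and are not taken from the thesis. The sketch runs in a single process and partitions on every value of an attribute, which is a simplification of SPRINT's binary splits.

```python
from collections import defaultdict

# Toy training records: (attributes, class label). Values are illustrative only.
RECORDS = [
    ({"age": "young", "income": "high"}, "no"),
    ({"age": "young", "income": "low"}, "no"),
    ({"age": "old", "income": "high"}, "yes"),
    ({"age": "old", "income": "low"}, "yes"),
    ({"age": "old", "income": "high"}, "yes"),
    ({"age": "young", "income": "high"}, "yes"),
]


def map_record(record):
    """Map step: emit ((attribute, value, label), 1) for each attribute of a record."""
    attributes, label = record
    for name, value in attributes.items():
        yield (name, value, label), 1


def reduce_counts(pairs):
    """Shuffle and reduce step: sum the counts that share the same key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def gini(class_counts):
    """Gini index of one partition, given its label -> count mapping."""
    n = sum(class_counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in class_counts.values())


def split_gini(counts, attribute):
    """Weighted Gini index of partitioning the data on every value of `attribute`."""
    partitions = defaultdict(lambda: defaultdict(int))
    for (name, value, label), count in counts.items():
        if name == attribute:
            partitions[value][label] += count
    total = sum(sum(p.values()) for p in partitions.values())
    return sum(sum(p.values()) / total * gini(p) for p in partitions.values())


def best_split(records):
    """Run map and reduce over all records, then pick the lowest-Gini attribute."""
    pairs = (pair for record in records for pair in map_record(record))
    counts = reduce_counts(pairs)
    attributes = sorted({name for (name, _, _) in counts})
    scores = {a: split_gini(counts, a) for a in attributes}
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    attribute, scores = best_split(RECORDS)
    print("best split attribute:", attribute)
    print("weighted Gini per attribute:", scores)
```

On a real cluster the map and reduce functions would run on different nodes over data stored in HDFS. Only the aggregated counts, which are small compared with the training data, would need to reach the node that chooses the split.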

Keywords/Search Tags:

Cloud computing, data mining, HADOOP, SPRINT, parallel computing

Related items

1	The Process And Research Of Massive Data Mining Based On Cloud Computing
2	Research On Decision Tree Mining Algorithm Based On Cloud Computing
3	Research And Implementation Of Data Classification Algorithm Based On Decision Tree
4	The Parallel Research On Decision Tree Classification Algorithm Based On Hadoop
5	Research On Optimization Of Map Reduce For Interactive Analysis On Big Data
6	Parallel Data Mining Algorithm Research In Cloud
7	Study On Data Mining Platform Based On Cloud Computing
8	Research On Web Data Mining Algorithms In Cloud Computing Environment
9	Research About Data Mining Technologies Based On Cloud Computing
10	Data Mining Association Algorithm Research And Realization Based On Cloud Computing