
Research On Data Mining Algorithm In Cloud Computing Environment

Posted on: 2015-02-13    Degree: Master    Type: Thesis
Country: China    Candidate: H X Qin    Full Text: PDF
GTID: 2208330434451419    Subject: Computer application technology
Abstract/Summary:
In recent years, the Internet and related computer technologies, including photography, video, and e-commerce, have developed rapidly, and the data generated around us have grown explosively, especially since the rise of mobile Internet technology represented by smartphones. Analyzing and processing data on such a large scale has become a serious problem, and this in turn has created an opportunity for the development of data mining. Data mining can extract information valuable to users from these massive, heterogeneous, random data, from which interesting usage patterns can be discovered.

Traditional data mining techniques often take too long when handling massive amounts of data. The emergence of cloud computing offers data mining a way to solve this problem. Cloud computing is typically built on large physical clusters or large-scale data centers. Through economies of scale, it can provide powerful and inexpensive computing capacity and low-cost network storage. Moreover, public clouds allow many users to access computing resources simultaneously, on demand.

This thesis introduces the concepts and features of cloud computing and data mining, and focuses on the open-source cloud computing framework Hadoop. Hadoop is an open-source distributed computing framework for building cloud platforms: users can easily build their own Hadoop cluster without needing to understand the complexities of the underlying communication mechanism. Among Hadoop's many components, the two most important are the distributed file system HDFS and the MapReduce computing model. HDFS provides a secure and reliable file system, while MapReduce offers users an easy-to-use yet efficient programming model based on message passing.
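The map → shuffle → reduce flow described above can be illustrated with a minimal single-process sketch. Real Hadoop jobs are typically written in Java and run distributed across a cluster; this Python version only simulates the three stages of the model, using the classic word-count job as the example.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every input record, collecting (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its grouped list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Word count: the mapper emits (word, 1) for each word in a line,
# the reducer sums the counts for each word.
def wc_mapper(line):
    return [(word, 1) for word in line.split()]

def wc_reducer(word, counts):
    return sum(counts)

lines = ["hadoop mapreduce", "hadoop hdfs"]
result = reduce_phase(shuffle_phase(map_phase(lines, wc_mapper)), wc_reducer)
print(result)  # {'hadoop': 2, 'mapreduce': 1, 'hdfs': 1}
```

Because each call to the mapper depends only on its own record, the map phase can be scattered across cluster nodes; the shuffle then routes all values for one key to the same reducer, which is what makes the reduce phase parallelizable by key.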
In the MapReduce model, the master node monitors and manages the cluster, assigning tasks to multiple hosts. To run existing data mining algorithms on a Hadoop cluster and exploit the cluster's parallelism to improve efficiency, these algorithms must be redesigned and re-implemented in the MapReduce programming model. Taking collaborative filtering as an example, this thesis presents a scalable item-based collaborative filtering algorithm running on a Hadoop cluster. Exploiting the characteristics of Hadoop and MapReduce, we divide the compute-intensive tasks so that they run in parallel on different nodes. Collaborative filtering can be split into a sequence of phases, each implemented in the MapReduce model, because within a phase the processing of one record does not depend on the processing of any other. The most important step is computing the similarity between two items in parallel: the map phase extracts the pair of ratings for two items, and the reduce phase computes their similarity. With these two phases parallelized, the overall efficiency of the algorithm is greatly improved. Similarly, for the K-means algorithm, the key is to compute distances in parallel.

Finally, experiments and analysis show that collaborative filtering implemented under the Hadoop framework achieves a substantial efficiency improvement over the serial version. Through these studies we gain a better understanding of data mining in the cloud environment, data mining in a stand-alone environment, and the advantages and disadvantages of each. The thesis studies how to adapt traditional data mining algorithms to the open-source Hadoop distributed framework so as to achieve parallelism and run efficiently.
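The two-phase similarity computation described above can be sketched as follows. This is an illustrative single-process simulation, not the thesis's actual Hadoop implementation: the toy ratings, item names, and the choice of cosine similarity are assumptions made for the example. The map loop processes each user record independently (so it could run on separate nodes), emitting a rating product for every item pair the user co-rated, plus (item, item) self-pairs that accumulate the squared norms; the reduce step combines the grouped sums into similarities.

```python
import math
from collections import defaultdict
from itertools import combinations

# Toy user-item ratings (user -> {item: rating}); values are illustrative.
ratings = {
    "u1": {"A": 4.0, "B": 2.0},
    "u2": {"A": 5.0, "B": 1.0, "C": 3.0},
    "u3": {"B": 4.0, "C": 4.0},
}

# --- Map phase: each user record is processed independently, so this loop
# could run in parallel across cluster nodes. Emit the rating product for
# every co-rated item pair; (item, item) pairs accumulate squared norms.
pairs = []
for user, items in ratings.items():
    for item, r in items.items():
        pairs.append(((item, item), r * r))
    for (i, ri), (j, rj) in combinations(sorted(items.items()), 2):
        pairs.append(((i, j), ri * rj))

# --- Shuffle: group the emitted values by item pair and sum them.
sums = defaultdict(float)
for key, value in pairs:
    sums[key] += value

# --- Reduce phase: compute cosine similarity for each distinct item pair.
similarity = {}
for (i, j), dot in sums.items():
    if i != j:
        similarity[(i, j)] = dot / math.sqrt(sums[(i, i)] * sums[(j, j)])

print(similarity)
```

The same pattern carries over to K-means: the map phase assigns each point to its nearest centroid (the distance computations are independent per point), and the reduce phase averages the points in each cluster to produce the new centroids.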
Keywords/Search Tags: data mining, cloud computing, Hadoop, K-means algorithm, collaborative filtering algorithm