
Research On Data Mining Algorithm In Cloud Computing Environment

Posted on: 2015-02-13    Degree: Master    Type: Thesis
Country: China    Candidate: H X Qin    Full Text: PDF
GTID: 2208330434451419    Subject: Computer application technology
Abstract/Summary:
In recent years, the Internet and related computer technologies, including photography, video, and e-commerce, have developed rapidly, and the data generated around us have grown explosively, especially since the rise of mobile Internet technology represented by smartphones. Analyzing and processing data on such a large scale has become a serious problem, and this in turn has created an opportunity for the development of data mining. Data mining can extract information valuable to users from these massive, heterogeneous, random data, from which interesting usage patterns can be discovered.

Traditional data mining techniques often take too long when handling massive amounts of data. The emergence of cloud computing offers data mining a way to solve this problem. Cloud computing is typically built on large physical clusters or large-scale data centers. Through economies of scale, it can provide powerful and inexpensive computing capacity and low-cost network storage. Moreover, public clouds allow many users to access computing resources simultaneously, on demand.

This thesis introduces the concepts and features of cloud computing and data mining, and focuses on the open-source cloud computing framework Hadoop. Hadoop is an open-source distributed computing framework for building cloud platforms: users can easily build their own Hadoop cluster without needing to understand the complexities of the underlying communication mechanism. Among Hadoop's many components, the two most important are the distributed file system HDFS and the MapReduce computing model. HDFS provides a secure and reliable file system, while MapReduce offers users an easy-to-use yet efficient programming model based on message passing.
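The map → shuffle → reduce flow described above can be illustrated with a minimal single-process sketch. Real Hadoop jobs are typically written in Java and run distributed across a cluster; this Python version only simulates the three stages of the model, using the classic word-count job as the example.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every input record, collecting (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its grouped list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Word count: the mapper emits (word, 1) for each word in a line,
# the reducer sums the counts for each word.
def wc_mapper(line):
    return [(word, 1) for word in line.split()]

def wc_reducer(word, counts):
    return sum(counts)

lines = ["hadoop mapreduce", "hadoop hdfs"]
result = reduce_phase(shuffle_phase(map_phase(lines, wc_mapper)), wc_reducer)
print(result)  # {'hadoop': 2, 'mapreduce': 1, 'hdfs': 1}
```

Because each call to the mapper depends only on its own record, the map phase can be scattered across cluster nodes; the shuffle then routes all values for one key to the same reducer, which is what makes the reduce phase parallelizable by key.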
In the MapReduce model, the master node monitors and manages the cluster, assigning tasks to multiple hosts. To run existing data mining algorithms on a Hadoop cluster and exploit the cluster's parallelism to improve efficiency, these algorithms must be redesigned and re-implemented in the MapReduce programming model. Taking collaborative filtering as an example, this thesis presents a scalable item-based collaborative filtering algorithm running on a Hadoop cluster. Exploiting the characteristics of Hadoop and MapReduce, we divide the compute-intensive tasks so that they run in parallel on different nodes. Collaborative filtering can be split into a sequence of phases, each implemented in the MapReduce model, because within a phase the processing of one record does not depend on the processing of any other. The most important step is computing the similarity between two items in parallel: the map phase extracts the pair of ratings for two items, and the reduce phase computes their similarity. With these two phases parallelized, the overall efficiency of the algorithm is greatly improved. Similarly, for the K-means algorithm, the key is to compute distances in parallel.

Finally, experiments and analysis show that collaborative filtering implemented under the Hadoop framework achieves a substantial efficiency improvement over the serial version. Through these studies we gain a better understanding of data mining in the cloud environment, data mining in a stand-alone environment, and the advantages and disadvantages of each. The thesis studies how to adapt traditional data mining algorithms to the open-source Hadoop distributed framework so as to achieve parallelism and run efficiently.
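The two-phase similarity computation described above can be sketched as follows. This is an illustrative single-process simulation, not the thesis's actual Hadoop implementation: the toy ratings, item names, and the choice of cosine similarity are assumptions made for the example. The map loop processes each user record independently (so it could run on separate nodes), emitting a rating product for every item pair the user co-rated, plus (item, item) self-pairs that accumulate the squared norms; the reduce step combines the grouped sums into similarities.

```python
import math
from collections import defaultdict
from itertools import combinations

# Toy user-item ratings (user -> {item: rating}); values are illustrative.
ratings = {
    "u1": {"A": 4.0, "B": 2.0},
    "u2": {"A": 5.0, "B": 1.0, "C": 3.0},
    "u3": {"B": 4.0, "C": 4.0},
}

# --- Map phase: each user record is processed independently, so this loop
# could run in parallel across cluster nodes. Emit the rating product for
# every co-rated item pair; (item, item) pairs accumulate squared norms.
pairs = []
for user, items in ratings.items():
    for item, r in items.items():
        pairs.append(((item, item), r * r))
    for (i, ri), (j, rj) in combinations(sorted(items.items()), 2):
        pairs.append(((i, j), ri * rj))

# --- Shuffle: group the emitted values by item pair and sum them.
sums = defaultdict(float)
for key, value in pairs:
    sums[key] += value

# --- Reduce phase: compute cosine similarity for each distinct item pair.
similarity = {}
for (i, j), dot in sums.items():
    if i != j:
        similarity[(i, j)] = dot / math.sqrt(sums[(i, i)] * sums[(j, j)])

print(similarity)
```

The same pattern carries over to K-means: the map phase assigns each point to its nearest centroid (the distance computations are independent per point), and the reduce phase averages the points in each cluster to produce the new centroids.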
Keywords/Search Tags: data mining, cloud computing, Hadoop, K-means algorithm, collaborative filtering algorithm