Data Mining Algorithm Parallelization In Cloud Environment

Posted on: 2017-02-11
Degree: Master
Type: Thesis
Country: China
Candidate: P P He
Full Text: PDF
GTID: 2308330503453806
Subject: Communication and Information System
Abstract/Summary:
Data mining is the process of extracting useful information and knowledge from large, incomplete, noisy, fuzzy and random data. It can both summarize past developments and predict future trends, turning dormant data into valuable knowledge, and it has contributed greatly to business decision-making, databases, the military, medicine and other fields. With the arrival of the big data era, however, the sheer volume of data challenges traditional data mining. Cloud computing and distributed computing platforms offer powerful computing capacity, and combining data mining with cloud computing has become an industry trend that continues to show strong advantages and potential. Applying cloud computing to data mining provides a solution for mining ever larger data sets.

Association rules and clustering analysis are two important classes of data mining algorithms. Apriori is the core association rule algorithm: it finds all frequent itemsets by scanning the database multiple times. On massive data, these repeated scans consume a great deal of time and memory. K-means, a typical clustering algorithm, is likewise limited by memory capacity on large-scale data and cannot always run effectively. This thesis therefore builds on the Hadoop cloud computing platform, with its powerful distributed computing and storage capacity, and uses the MapReduce programming model to improve the traditional serial algorithms, addressing the problem of massive data in association rule mining and clustering analysis.

This thesis first introduces the Hadoop framework, the Apriori association rule algorithm and the K-means clustering algorithm, focusing on the two core technologies of Hadoop: the HDFS distributed file system and the MapReduce programming model. Based on the MapReduce programming model, it then improves the traditional Apriori and K-means algorithms and proposes a MapReduce parallel design for each. After parallelization, the repeated computation is distributed across the nodes and executed simultaneously, reducing the computational load and running time of each node. Finally, the parallelized algorithms are deployed on a Hadoop cluster and tested on data sets of different sizes, and their performance is analyzed from the experimental results. The experiments show that the MapReduce-based Apriori and K-means algorithms address the long running time and low efficiency of traditional data mining.
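The abstract does not give the thesis's actual parallel designs, so the following is only a minimal sketch of how one Apriori counting pass and one K-means iteration might be split into map and reduce steps. All function names are hypothetical, the code runs in a single Python process rather than on Hadoop, and the Apriori map step simply enumerates every k-item subset of each transaction instead of using pruned candidate sets.

```python
from collections import defaultdict
from itertools import combinations


def apriori_map(split, k):
    """Map: emit (k-itemset, 1) for every k-item subset of each transaction.

    Simplification: a real Apriori pass would only count candidates built
    from the frequent (k-1)-itemsets.
    """
    for transaction in split:
        for itemset in combinations(sorted(set(transaction)), k):
            yield itemset, 1


def apriori_reduce(pairs, min_support):
    """Reduce: sum the counts per itemset and keep those reaching min_support."""
    totals = defaultdict(int)
    for itemset, count in pairs:
        totals[itemset] += count
    return {s: n for s, n in totals.items() if n >= min_support}


def kmeans_map(split, centroids):
    """Map: emit (index of nearest centroid, point) for each point in a split."""
    for point in split:
        nearest = min(
            range(len(centroids)),
            key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
        )
        yield nearest, point


def kmeans_reduce(pairs, centroids):
    """Reduce: replace each centroid with the mean of its assigned points."""
    groups = defaultdict(list)
    for index, point in pairs:
        groups[index].append(point)
    updated = list(centroids)  # centroids with no points stay unchanged
    for index, points in groups.items():
        updated[index] = tuple(sum(axis) / len(points) for axis in zip(*points))
    return updated


# Each inner list stands in for one data split handled by a separate node.
transaction_splits = [
    [["bread", "milk"], ["bread", "butter", "milk"]],
    [["butter", "milk"], ["bread", "milk"]],
]
frequent_pairs = apriori_reduce(
    (pair for split in transaction_splits for pair in apriori_map(split, 2)),
    min_support=2,
)

point_splits = [
    [(0.0, 0.0), (0.0, 2.0)],
    [(10.0, 10.0), (10.0, 12.0)],
]
start = [(1.0, 1.0), (9.0, 9.0)]
new_centroids = kmeans_reduce(
    (pair for split in point_splits for pair in kmeans_map(split, start)),
    start,
)
```

In both cases the map step touches each record independently, which is what allows the work to be distributed across nodes; only the small per-key aggregates need to be brought together in the reduce step. A full run would repeat the Apriori pass for increasing k and the K-means iteration until the centroids stop moving.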
Keywords/Search Tags: cloud computing, MapReduce, parallelization, association rules, clustering analysis