Data Mining Algorithm Parallelization In Cloud Environment

Posted on:2017-02-11

Degree:Master

Type:Thesis

Country:China

Candidate:P P He

Full Text:PDF

GTID:2308330503453806

Subject:Communication and Information System

Abstract/Summary:

Data mining is the process of extracting useful information and knowledge from large volumes of incomplete, noisy, fuzzy, and random data. It can both summarize past developments and predict future trends, turning dormant data into valuable knowledge, and it has contributed greatly to business decision-making, databases, the military, medicine, and other fields. However, with the arrival of the big data era, massive data sets pose a challenge to traditional data mining. Cloud computing and distributed computing platforms provide powerful computing capabilities, so combining data mining with cloud computing has become an industry trend that continues to show strong advantages and potential. Applying cloud computing to data mining offers a solution for mining ever-larger data sets.

In data mining, association rule mining and clustering analysis are two important classes of algorithms. Apriori is the core association rule algorithm: it finds all frequent itemsets by scanning the database multiple times. With massive data, however, these repeated database scans consume a great deal of time and memory. Likewise, the typical clustering algorithm, K-means, is constrained by memory capacity when processing large-scale data and cannot always run effectively. This thesis therefore builds on the Hadoop cloud computing platform, with its powerful distributed computing and storage capacity, and uses the MapReduce programming model to improve the traditional serial algorithms, thereby addressing the problem of massive data in association rule mining and clustering analysis.

This thesis mainly introduces the Hadoop framework, the Apriori association rule algorithm, and the K-means clustering algorithm, with a focus on the two core technologies of Hadoop: the HDFS distributed file system and the MapReduce programming model.
Building on the MapReduce programming model, the thesis improves the traditional Apriori and K-means data mining algorithms and proposes MapReduce-based parallel designs for them. After parallelization, the repeated computation is distributed across the nodes and executed simultaneously, which reduces the computational load and running time of each node. Finally, the parallelized algorithms are deployed on a Hadoop cluster and tested on data sets of different sizes, and the performance of the parallel algorithms is analyzed from the experimental results. The experiments show that the MapReduce-based Apriori and K-means algorithms can alleviate the long running time and low efficiency of traditional data mining.
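
The abstract does not give the parallel designs themselves. As a rough illustration only, the sketch below shows how Apriori support counting and one K-means iteration could each be phrased as map and reduce steps. It simulates the map, shuffle, and reduce phases in a single Python process rather than on Hadoop, and every name in it (run_mapreduce, make_support_mapper, apriori, kmeans_step) is hypothetical and not taken from the thesis.

```python
from collections import defaultdict
from itertools import combinations


def run_mapreduce(splits, mapper, reducer):
    """Simulate map -> shuffle -> reduce in one process (no Hadoop involved)."""
    groups = defaultdict(list)
    for split in splits:                    # each split stands in for one node's block
        for key, value in mapper(split):
            groups[key].append(value)       # shuffle: group emitted values by key
    return {key: reducer(key, values) for key, values in groups.items()}


def make_support_mapper(candidates):
    """Map step: emit (candidate, 1) for every candidate contained in a transaction."""
    def mapper(transactions):
        for transaction in transactions:
            items = set(transaction)
            for candidate in candidates:
                if candidate <= items:
                    yield candidate, 1
    return mapper


def apriori(splits, min_count):
    """Level-wise Apriori; each level is one simulated MapReduce counting job."""
    items = {item for split in splits for t in split for item in t}
    candidates = [frozenset([item]) for item in items]
    frequent, k = {}, 1
    while candidates:
        counts = run_mapreduce(splits, make_support_mapper(candidates),
                               lambda _key, ones: sum(ones))
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_step(splits, centroids):
    """One K-means iteration: map assigns points, reduce averages each cluster."""
    def mapper(points):
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(point, centroids[i]))
            yield nearest, point

    def reducer(_key, points):
        return tuple(sum(coords) / len(points) for coords in zip(*points))

    means = run_mapreduce(splits, mapper, reducer)
    # A centroid that received no points keeps its previous position.
    return [means.get(i, centroids[i]) for i in range(len(centroids))]
```

In both cases the per-record work (testing candidates against a transaction, finding the nearest centroid) sits in the map step and can run independently on each split, while the reduce step only aggregates small intermediate results. A driver would call apriori once, and would call kmeans_step repeatedly until the centroids stop moving.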

Keywords/Search Tags:

cloud computing, MapReduce, parallelization, association rules, clustering analysis

Related items

1	The Parallel Association Rules Algorithm Based On Mapreduce In The Application Of Community Analysis Research
2	Parallel Association Rules Algorithm Based On Hadoop
3	The Research And Implementation Of Parallel Association Rules Algorithm Based On Cloud Environment Data Mining
4	Research For Association Rules Algorithm On Big Data
5	Research On Data Mining Technology Of Internet Of Things Based On Cloud Computing
6	Research And Application Of Association Rules Algorithm Based On MapReduce
7	The Analysis Of Mass Travel Data Based On Cloud Computing
8	The Research Of Parallel Association Rules Mining Algorithms Based On Cloud Platform
9	The Research Of Spatial Clustering Analysis Based On Cloud Computing
10	The Study Of The Improvement And Transplantation Of Apriori Algorithm Based On Hadoop