The Research On The Improvement And Parallelization Of CLIQUE Algorithm In Hadoop Environment

Posted on:2019-05-12

Degree:Master

Type:Thesis

Country:China

Candidate:P Lin

Full Text:PDF

GTID:2428330572495084

Subject:Computer Science and Technology

Abstract/Summary:

As an important branch of data mining, clustering analysis has long attracted researchers in China and abroad. It divides a user-supplied set of data objects into several clusters according to the relationships among the objects. The goal is to make objects in the same cluster as similar as possible and objects in different clusters as dissimilar as possible. With the rapid development of modern technology, the amount of data generated on the Internet keeps growing. Traditional serial clustering algorithms face several challenges: for example, the data objects are difficult to load into memory all at once, and the algorithms execute inefficiently. How to cluster massive data stably and efficiently has become a new research topic. The emergence and rise of the Hadoop distributed computing platform offers an effective way to address the performance problems of traditional clustering algorithms. In clustering analysis, reasonable settings of the partition parameter and the density threshold allow the grid-based CLIQUE algorithm to produce high-quality clustering results. When the initialization parameters, especially the partition parameter, are set unreasonably, both the execution efficiency of the algorithm and the final clustering result suffer. This thesis studies the dividing strategy of the CLIQUE algorithm in depth and proposes a boundary-correcting method and a grid-sliding method to improve the quality of the grid meshing. The time complexity of the traditional CLIQUE algorithm rises sharply as the dimensionality of the data set increases, and the serial algorithm cannot meet the requirements of processing massive data. This thesis therefore combines the improved CLIQUE algorithm with the MapReduce framework and implements a distributed CLIQUE algorithm on the Hadoop platform. The algorithm processes massive data in a distributed manner in two phases: the grid meshing phase and the clustering phase. Finally, a series of experiments tests the clustering accuracy, processing time, speedup, and scalability of the proposed algorithm. The experimental results show that the proposed algorithm effectively improves clustering quality and, in particular, improves the efficiency of the CLIQUE algorithm when processing massive data.
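To make the roles of the partition parameter and the density threshold concrete, the following is a minimal Python sketch of the dense-cell counting step of a plain grid-based approach, arranged in a map-then-reduce shape. It is not the thesis's improved algorithm: the boundary-correcting and grid-sliding methods are not described in the abstract and are not reproduced here. The names `xi` (partition parameter), `tau` (density threshold), and all function names are assumptions chosen for illustration, and the sketch counts cells only in the full-dimensional space rather than building up from lower-dimensional subspaces as CLIQUE does.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple

Cell = Tuple[int, ...]


def map_point_to_cell(point: Sequence[float], mins: Sequence[float],
                      maxs: Sequence[float], xi: int) -> Cell:
    """Map step (illustrative): assign one point to its grid cell.

    Each dimension is split into xi equal-width intervals.
    """
    cell = []
    for x, lo, hi in zip(point, mins, maxs):
        width = (hi - lo) / xi
        idx = int((x - lo) / width) if width > 0 else 0
        # Clamp so a point on the upper boundary falls in the last interval.
        cell.append(min(idx, xi - 1))
    return tuple(cell)


def reduce_dense_cells(cells: Iterable[Cell], n_points: int,
                       tau: float) -> Dict[Cell, int]:
    """Reduce step (illustrative): count points per cell and keep the
    cells whose share of all points exceeds the density threshold tau."""
    counts = Counter(cells)
    return {c: k for c, k in counts.items() if k / n_points > tau}


def dense_cells(points: Sequence[Sequence[float]], xi: int,
                tau: float) -> Dict[Cell, int]:
    """Run both steps in a single process on an in-memory data set."""
    dims = len(points[0])
    mins = [min(p[d] for p in points) for d in range(dims)]
    maxs = [max(p[d] for p in points) for d in range(dims)]
    cells = (map_point_to_cell(p, mins, maxs, xi) for p in points)
    return reduce_dense_cells(cells, len(points), tau)


if __name__ == "__main__":
    sample = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.05), (1.0, 1.0)]
    print(dense_cells(sample, xi=2, tau=0.5))
```

In this sketch a larger `xi` yields finer cells and a larger `tau` keeps fewer of them, which illustrates why poorly chosen parameters can change both the running time and the clusters found. On a Hadoop cluster the map and reduce steps would run as separate distributed tasks rather than as two function calls in one process.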

Keywords/Search Tags:

Clustering Analysis, CLIQUE Clustering Algorithm, Hadoop Platform, MapReduce Framework

Related items

1	Distributed EM Clustering Algorithm Based On Hadoop Platform
2	Research And Implementation Of Distributed Clustering Algorithm Based On Hadoop Platform
3	The Research And Application Of Security Log Clustering Mining Algorithm Based On Hadoop Platform
4	Parallel Clustering Algorithm Based On MapReduce
5	The Research Of Parallel Clustering Algorithm Based On Hadoop Platform
6	Research On Parallel Clustering Algorithm On Hadoop Platform
7	Research And Optimization On K-medoids Clustering Algorithm Based On Hadoop Platform
8	Research And Implementation Of Mapreduce-based Graph Clustering Algorithm
9	The Research Of Clustering Algorithm Based On Hadoop Cloud Computing Platform
10	Study On Iterative Mapreduce Computation Model For Clustering Analysis