
The Research On The Improvement And Parallelization Of CLIQUE Algorithm In Hadoop Environment

Posted on: 2019-05-12  Degree: Master  Type: Thesis
Country: China  Candidate: P Lin  Full Text: PDF
GTID: 2428330572495084  Subject: Computer Science and Technology
Abstract/Summary:
Clustering analysis is an important branch of data mining and has long attracted researchers in China and abroad. It divides the data objects supplied by a user into clusters according to the relationships among those objects. The goal is for objects in the same cluster to be as similar as possible and for objects in different clusters to be as dissimilar as possible. With the rapid development of modern technology, the amount of data generated on the Internet keeps growing, and traditional serial clustering algorithms face several challenges: for example, the data cannot easily be loaded into memory at one time, and execution efficiency is poor. How to cluster massive data stably and efficiently has become a new research topic. The emergence and rise of the Hadoop distributed computing platform offers an effective way to address the performance problems of traditional clustering algorithms.

The grid-based CLIQUE algorithm produces high-quality clustering results when the partition parameter and the density threshold are set reasonably. When these initialization parameters, especially the partition parameter, are set unreasonably, both the execution efficiency of the algorithm and the final clustering result suffer. This thesis studies the partitioning strategy of the CLIQUE algorithm in depth and proposes a boundary-correcting method and a grid-sliding method to improve the quality of the grid meshing. In addition, the time complexity of the traditional CLIQUE algorithm rises sharply as the dimensionality of the data set increases, and the serial algorithm cannot meet the requirements of processing massive data. The thesis therefore combines the improved CLIQUE algorithm with the MapReduce framework and implements a distributed CLIQUE algorithm on the Hadoop platform. The algorithm processes massive data in a distributed manner in two phases: the grid meshing phase and the clustering phase.

Finally, a series of experiments tests the clustering accuracy, processing time, speedup and scalability of the proposed algorithm. The experimental results show that the proposed algorithm effectively improves clustering quality and, in particular, improves the efficiency of the CLIQUE algorithm when processing massive data.
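
To make the roles of the partition parameter and the density threshold concrete, the following is a minimal sketch of dense-cell counting written in a map/reduce style. It is not the thesis's implementation: the function names, the symbols `xi` (intervals per dimension) and `tau` (minimum fraction of points for a cell to count as dense), and the use of plain Python in place of Hadoop are all assumptions made for illustration. It counts only full-dimensional cells, whereas CLIQUE itself searches subspaces bottom-up, and it does not attempt the boundary-correcting or grid-sliding methods, which the abstract does not describe.

```python
from collections import Counter
from itertools import chain

def cell_of(point, lows, highs, xi):
    """Map a point to its grid cell: one interval index per dimension."""
    cell = []
    for x, lo, hi in zip(point, lows, highs):
        width = (hi - lo) / xi
        idx = int((x - lo) / width) if width > 0 else 0
        # Clamp so that x == hi falls into the last interval.
        cell.append(min(max(idx, 0), xi - 1))
    return tuple(cell)

def map_phase(split, lows, highs, xi):
    """Mapper (hypothetical): emit (cell, 1) for each point in one input split."""
    return [(cell_of(p, lows, highs, xi), 1) for p in split]

def reduce_phase(pairs, n_points, tau):
    """Reducer (hypothetical): sum counts per cell, keep cells with fraction >= tau."""
    counts = Counter()
    for cell, c in pairs:
        counts[cell] += c
    return {cell: n for cell, n in counts.items() if n / n_points >= tau}

# Toy data in the unit square, processed as two "splits".
points = [(0.1, 0.1), (0.15, 0.2), (0.2, 0.1), (0.9, 0.9), (0.5, 0.5), (0.12, 0.18)]
splits = [points[:3], points[3:]]
pairs = list(chain.from_iterable(map_phase(s, (0, 0), (1, 1), 4) for s in splits))
dense = reduce_phase(pairs, len(points), 0.5)
print(dense)
```

With `xi = 4`, four of the six toy points land in cell `(0, 0)`, so that cell passes a threshold of `tau = 0.5`. Changing `xi` moves the cell boundaries and can split the same group of points across neighbouring cells, which is the sensitivity to the partition parameter that the abstract refers to.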
Keywords/Search Tags:Clustering Analysis, CLIQUE Clustering Algorithm, Hadoop Platform, MapReduce Framework