Research On The Clustering Algorithm Of Parallel Partition Based On MapReduce

Posted on:2022-03-06

Degree:Master

Type:Thesis

Country:China

Candidate:T Tao

Full Text:PDF

GTID:2518306524498934

Subject:Computer software and theory

Abstract/Summary:

Clustering is an unsupervised learning technique in data mining. It groups similar data objects into the same class according to their characteristics and assigns objects that differ substantially to different classes, which lets it uncover latent distribution patterns in sample data. It is widely used in image segmentation, data mining, information retrieval systems, anomaly detection, medicine, computer vision, and construction management. Among clustering algorithms, partitioning-based algorithms such as k-means and k-medoids are easy to understand and converge quickly, and have therefore attracted widespread attention.

With the continuing development of Internet information technology and the arrival of the big data era, data now exhibits the "4V" characteristics that distinguish it from traditional data: large Volume, Variety, Velocity, and high Value. Traditional partitioning-based clustering algorithms have high time complexity and are therefore suitable only for small data sets; on large data sets their computational cost becomes prohibitive. How to reduce the computational complexity of partitioning-based clustering so that it can process big data is thus a key problem. Although existing parallel partitioning-based clustering algorithms have achieved some results, the following problems remain: (1) how to further reduce the sensitivity to initial centers caused by selecting the initial cluster centers at random, so as to improve the stability of the clustering results; (2) how to reduce the communication overhead between nodes; (3) how to address the poor parameter optimization ability of partitioning-based clustering algorithms; (4) how to improve clustering efficiency, so as to improve the overall performance of the parallel clustering algorithm.

To address these problems, and building on a study and analysis of parallel partitioning-based clustering algorithms with mining efficiency in mind, this thesis proposes two parallel partitioning-based clustering algorithms: (1) a partitioning-based clustering algorithm using grid density and locality-sensitive hash functions based on MapReduce; (2) a partitioning-based clustering algorithm using an improved artificial bee colony based on MapReduce. The main research work on these two algorithms is as follows:

(1) The partitioning-based clustering algorithm using grid density and locality-sensitive hash functions based on MapReduce

To address initial-center sensitivity, high inter-node communication overhead, and low clustering efficiency in partitioning-based big data clustering, this thesis proposes a partitioning-based clustering algorithm using grid density and locality-sensitive hash functions based on MapReduce, named PBGDLSH-MR. First, a GDS (grid density strategy) is proposed to obtain the initial cluster centers from the initial data set, which avoids the sensitivity caused by selecting the initial cluster centers at random. Second, DP-LSH (data partitioning based on locality-sensitive hash functions) is proposed to map closely related data objects into the same sub-dataset and to produce the data partitions in the Map phase. A formula, SI (similarity improvement), is designed to evaluate the data partitioning results. Together these reduce the communication overhead between nodes. In addition, an AGS (adaptive grouping strategy) is designed to handle data skew in the data partitions, which improves clustering efficiency. Finally, the cluster centers are mined in parallel with MapReduce to generate the final clustering results.

(2) The partitioning-based clustering algorithm using an improved artificial bee colony based on MapReduce

To address poor parameter optimization ability, initial-center sensitivity, and data skew in partitioning-based big data clustering, this thesis proposes a partitioning-based clustering algorithm using an improved artificial bee colony based on MapReduce, named MR-PBIABC. First, BLCCF (backward learning and the clustering criterion function) is proposed to improve solution quality when searching with the artificial bee colony (ABC) algorithm. In addition, the ABC algorithm is combined with the AFS algorithm, exploiting the latter's strong ability to find optimal solutions, to improve the optimization ability of ABC. The resulting IABC algorithm is then used to obtain the initial cluster centers, which avoids the sensitivity caused by selecting the initial cluster centers at random. Second, a DBS (data balancing strategy) is designed to handle data skew in the data partitions, which improves clustering efficiency. Finally, the cluster centers are mined in parallel with MapReduce to generate the final clustering results.
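For readers unfamiliar with how a partitioning-based algorithm maps onto MapReduce, the following is a minimal sketch of one generic parallel k-means iteration, written in plain Python rather than against a real MapReduce framework. It is not PBGDLSH-MR or MR-PBIABC: the function names (`map_phase`, `reduce_phase`, `kmeans_iteration`) are hypothetical, the initial centers are simply passed in (in the thesis they would come from GDS or IABC), and the partitions are plain lists (in the thesis they would be produced by DP-LSH with AGS, or balanced by DBS). The mapper emits one partial sum per cluster instead of one record per point, which illustrates why the amount of data exchanged between nodes depends on how the data is partitioned.

```python
def nearest(point, centers):
    """Index of the center closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
    )


def map_phase(partition, centers):
    """Map step with in-mapper combining (illustrative, hypothetical name).

    Emits one (center_index, (coordinate_sums, count)) pair per cluster
    seen in this partition, rather than one pair per point.
    """
    partial = {}
    for point in partition:
        idx = nearest(point, centers)
        sums, count = partial.get(idx, ([0.0] * len(point), 0))
        partial[idx] = ([s + x for s, x in zip(sums, point)], count + 1)
    return list(partial.items())


def reduce_phase(emitted, centers):
    """Reduce step: merge partial sums per cluster and recompute the centers.

    A cluster that received no points keeps its previous center.
    """
    totals = {}
    for idx, (sums, count) in emitted:
        if idx in totals:
            old_sums, old_count = totals[idx]
            sums = [a + b for a, b in zip(old_sums, sums)]
            count += old_count
        totals[idx] = (sums, count)

    new_centers = [tuple(c) for c in centers]
    for idx, (sums, count) in totals.items():
        new_centers[idx] = tuple(s / count for s in sums)
    return new_centers


def kmeans_iteration(partitions, centers):
    """One iteration: run the map step on every partition, then reduce."""
    emitted = []
    for partition in partitions:  # each partition would run on its own node
        emitted.extend(map_phase(partition, centers))
    return reduce_phase(emitted, centers)


if __name__ == "__main__":
    # Assumed toy data: two partitions, two initial centers.
    partitions = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
    centers = [(0.0, 0.0), (10.0, 10.0)]
    print(kmeans_iteration(partitions, centers))
```

In a real deployment the iteration would be repeated until the centers stop moving, with the current centers broadcast to every mapper at the start of each round.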

Keywords/Search Tags:

big data, parallel clustering, data partitioning, data skew, MapReduce

Related items

1	The Research Of Handling Data Skew In MapReduce Computing Model
2	The Research On Clustering Technology For Big Data
3	Load Balancing Algorithm Based On Data Skew Of MapReduce
4	A Key-Value Skew Model Based Dynamic Data Partitioning Algorithm In Spark
5	Research Of Join Algorithm With Skew Data On Mapreduce
6	Algorithm To Deal With The Problem Of Data Skew In MapReduce Model
7	Research And Strategy On Data Skew Problem Based On MapReduce
8	Research On Partition Loading Balance Based On Spark Data Skew
9	Research On Dynamic Data Partitioning Algorithm For Large-scale Streaming Data Online Processing
10	The Research Of Scheduling Algorithms For Performance And Energy Consumption Under The Condition Of Data Skew