
A Density-based Distributed Clustering Method

Posted on: 2019-12-23  Degree: Master  Type: Thesis
Country: China  Candidate: Y Wang  Full Text: PDF
GTID: 2428330548461169  Subject: Engineering
Abstract/Summary:
As the amount of information on the network increases, people increasingly demand information from specific fields. Clustering, an important method of data analysis in the field of data mining, aims to divide unlabeled data into several clusters according to the degree of similarity between objects. Unlike classification, clustering is an unsupervised learning method and does not require any labeled training data. Researchers have proposed many clustering algorithms, such as distance-based clustering (K-Means), density-based clustering (DENCLUE), and grid-based clustering (CLIQUE). Clustering also has a wide range of real-world applications, such as natural language processing, multi-document automatic summarization, and search engines.

CSDP is a density-based clustering method, but its efficiency is relatively low when the data set is large or the data dimensionality is high. To improve the efficiency of the clustering algorithm, this paper proposes a density-based distributed clustering method, called MRCSDP, which uses MapReduce to cluster text data. To describe the proposed algorithm clearly, the paper first presents the concept and significance of clustering, then gives the details of the CSDP algorithm and analyzes its advantages and disadvantages. After that, the structure of the MapReduce distributed computing framework is given. The framework consists of two phases: the Map stage and the Reduce stage. The paper also introduces the Hadoop distributed computing ecosystem, focusing on two of its components, HDFS and YARN.

In the section describing the algorithm, this paper defines the concepts of the independent calculation unit and the independent calculation block, and then gives the details of MRCSDP. The independent calculation units and independent calculation blocks are constructed so that the computing tasks are distributed evenly across the cluster. First, the data are split into several equal-sized blocks. The local density of each data block is then computed in a distributed manner, and the local densities are merged to obtain the global density. The center value is computed from the global density, and candidate cluster centers are determined for each data block from the density and the center value. Finally, the global cluster centers are selected from the candidate cluster centers. MRCSDP achieves a better clustering effect while reducing the time complexity.

To verify the correctness of the algorithm and its advantages and disadvantages compared with other distributed algorithms, this paper carried out five groups of experiments. The first group compares the influence of different parameters on clustering accuracy. The second and third groups compare MRCSDP with the original algorithm, mainly in terms of accuracy and efficiency. The fourth and fifth groups compare MRCSDP with existing distributed clustering algorithms. The experimental results show that, relative to CSDP, MRCSDP in a distributed environment can process massive data more quickly and efficiently, balances the load across nodes, and has certain advantages over other distributed clustering algorithms in some respects.
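
The split, local-density, and merge steps described in the abstract can be illustrated with a minimal in-process sketch. This is not the thesis's implementation: it runs in plain Python rather than on Hadoop, it assumes density is a count of neighbours within a cutoff distance, and it treats each pair of data blocks as one independently computable task, which is only one possible reading of the "independent calculation block". The function names and the `cutoff` parameter are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product


def split_into_blocks(points, n_blocks):
    """Split points into blocks of nearly equal size, keeping each point's global index."""
    indexed = list(enumerate(points))
    size = max(1, math.ceil(len(indexed) / n_blocks))
    return [indexed[i:i + size] for i in range(0, len(indexed), size)]


def map_partial_density(block_a, block_b, cutoff):
    """Map step (assumed form): for each point in block_a, count the points
    of block_b that lie within `cutoff`, and emit (point id, partial count)."""
    for i, p in block_a:
        count = sum(
            1 for j, q in block_b
            if i != j and math.dist(p, q) < cutoff
        )
        yield i, count


def reduce_density(pairs):
    """Reduce step (assumed form): sum the partial counts per point id,
    merging block-local densities into a global density."""
    total = defaultdict(int)
    for point_id, count in pairs:
        total[point_id] += count
    return dict(total)


def global_density(points, n_blocks, cutoff):
    """Run the map step over every ordered pair of blocks, then reduce.
    On a real cluster each block pair would be a separate task."""
    blocks = split_into_blocks(points, n_blocks)
    emitted = []
    for block_a, block_b in product(blocks, repeat=2):
        emitted.extend(map_partial_density(block_a, block_b, cutoff))
    return reduce_density(emitted)


if __name__ == "__main__":
    sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
    print(global_density(sample, n_blocks=2, cutoff=1.5))
```

Because each partial count depends only on the two blocks it is given, the merged result is the same however many blocks the data are split into. That property is what allows the work to be spread evenly across nodes in this sketch.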
Keywords/Search Tags: Clustering, Distributed computing, MapReduce, Independent calculation unit, Independent calculation block