
The Research On Clustering Technology For Big Data

Posted on: 2020-11-17 | Degree: Master | Type: Thesis
Country: China | Candidate: Y J Guo | Full Text: PDF
GTID: 2428330602961126 | Subject: Communication and Information System
Abstract/Summary:
Clustering is a technique that groups a collection of physical or abstract objects into multiple classes, each composed of similar objects. It is widely used in many fields and is an important research topic in data mining, pattern recognition, and related areas, where it plays a key role in identifying the internal structure of data. With the development of the information industry, the attribute types of data are becoming increasingly complex. Traditional clustering algorithms such as k-means can only process data with a single attribute type, whereas the k-prototypes clustering algorithm can process mixed-attribute data, which greatly expands the range of applications of clustering and improves the efficiency of cluster analysis. With the advent of the big data era, traditional clustering methods can no longer process large-scale data. Combining clustering techniques with a cluster computing environment has therefore become a new trend for processing massive data and extracting valuable information from it. The main contents of this thesis are summarized as follows:

(1) This thesis presents an effective clustering algorithm, GK-prototypes. The algorithm is based on the classical k-prototypes clustering algorithm: the defuzzy-similarity matrix is used to construct coarse particle sets, granular computing and the max-min distance method are used to determine the initial cluster centers, and the objective function is modified. Experimental results and theoretical analysis show that GK-prototypes is more accurate, more effective, and more robust than other prototypes-type clustering algorithms.

(2) This thesis proposes MK-prototypes, a clustering algorithm for big data. One characteristic of big data is that its attributes are of mixed types, including numerical and categorical attributes. On this basis, the thesis presents a method for processing large-scale mixed data that uses the MapReduce model to parallelize the k-prototypes clustering algorithm. Experimental results and theoretical analysis show that, while maintaining clustering accuracy, the parallel algorithm scales well as the data set grows and achieves near-linear speedup.
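To make the idea of parallelizing k-prototypes concrete, the following is a minimal single-process sketch of one iteration written in a map/reduce style. It is not the thesis's GK-prototypes or MK-prototypes algorithm, whose modified objective function and initialization are not given in the abstract. It assumes the standard k-prototypes dissimilarity (squared Euclidean distance on numerical attributes plus a weight times the number of categorical mismatches). The function names, the weight `gamma`, and the toy data are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

# A point (or prototype) is a pair: (numeric_tuple, categorical_tuple).

def dissimilarity(point, center, gamma):
    """Standard k-prototypes distance (assumed here, not the thesis's modified one):
    squared Euclidean on the numeric part + gamma * categorical mismatches."""
    (num, cat), (cnum, ccat) = point, center
    numeric = sum((a - b) ** 2 for a, b in zip(num, cnum))
    mismatches = sum(a != b for a, b in zip(cat, ccat))
    return numeric + gamma * mismatches

def map_assign(chunk, centers, gamma):
    """Map step: emit (index of nearest prototype, point) for each point in one split."""
    for point in chunk:
        best = min(range(len(centers)),
                   key=lambda j: dissimilarity(point, centers[j], gamma))
        yield best, point

def reduce_update(points):
    """Reduce step: new prototype = mean of numeric attributes, mode of categorical ones."""
    n = len(points)
    num_dim = len(points[0][0])
    cat_dim = len(points[0][1])
    mean = tuple(sum(p[0][d] for p in points) / n for d in range(num_dim))
    mode = tuple(Counter(p[1][d] for p in points).most_common(1)[0][0]
                 for d in range(cat_dim))
    return mean, mode

def kprototypes_iteration(chunks, centers, gamma):
    """One iteration: map over splits, group by cluster index, reduce per cluster."""
    groups = defaultdict(list)
    for chunk in chunks:  # in a real cluster, each split would go to a separate mapper
        for j, point in map_assign(chunk, centers, gamma):
            groups[j].append(point)  # stands in for the shuffle phase
    # Keep the old prototype when a cluster receives no points.
    return [reduce_update(groups[j]) if groups[j] else centers[j]
            for j in range(len(centers))]

# Toy data: two splits, two numeric attributes and one categorical attribute per point.
chunks = [
    [((1.0, 2.0), ("a",)), ((1.0, 4.0), ("a",))],
    [((9.0, 8.0), ("b",)), ((11.0, 8.0), ("b",))],
]
centers = [((0.0, 0.0), ("a",)), ((10.0, 10.0), ("b",))]
new_centers = kprototypes_iteration(chunks, centers, gamma=1.0)
print(new_centers)
```

In an actual MapReduce job the map step would run on separate workers over data splits and the grouping would be done by the framework's shuffle phase; the loop above only imitates that flow in one process. Iterations would be repeated until the prototypes stop changing.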
Keywords/Search Tags:Clustering Algorithm, Mixed Attribute Clustering, Granular Computing, MapReduce, Parallelize Clustering