Font Size: a A A

Study On Clustering Algorithms For Big Data

Posted on:2020-05-14Degree:DoctorType:Dissertation
Country:ChinaCandidate:X J LiuFull Text:PDF
GTID:1488306494469324Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
With the rapid development of information technology,pattern recognition has attracted more and more attention,and it has been widely applied in many fields.As an important part of pattern recognition,clustering partitions a data set into clusters according to a given similarity measure,such that intra-cluster similarity is maximized and inter-cluster similarity is minimized.Although many achievements has been made in the developments of clustering algorithms,a difficulty of fast clustering also exits duing to the limitation of computing resources such as time,space and CPU in today's big data era.How to design a new clustering algorithm to cluster big data efficiently has been the focus and has become one of the research hotspots in the field of pattern recognition.This work proposes a series of clustering algorithms based on sampling technology,feature reduction and classical clustering methods to explored the efficient clustering problems in big data.Meanwhile,it also attempts to embed quantum computing in the idea of algorithm design,and to integrate the methods and techniques of quantum computing into the solution of efficient clustering to develop quantum clustering schemes.The contributions and research contents in this paper include the following three aspects:(1)Two fast partitioning clustering algorithms based on feature reduction and quantum computing are proposed to improve the efficiency of partition in big data.The first is a k-means++clustering algorithm based on feature reduction and weighting(FRW-KC),which uses an unsupervised feature reduction method based on approximate Markov blanket to reduce the scale of data sets,then assesses the weight of each feature of the preserved subset based on information entropy,finally,utilize the k-means++ method to cluster the preprocessed data.The second is a quantum k-means algorithm which aims to improve the efficiency by embedding quantum computation.In the quantum k-means algorithm,the clustering data and the centers of k clusters is prepared to be quantum state which is used to compute the similarity.Then the quantum phase estimation algorithm converts the similarity into qubit whose minimum then will be searched by using a quantum algorithm for finding the minimum.Theoretical analysis shows that the time complexity of quantum k-means algorithm is reduced compared to the classical especially when the feature dimension of the data and the category number are very big,and the space complexity is reduced exponentially.(2)Density and delta distance clustering(DDC)is an ideal clustering algorithm,but it has a high complexity and requires manual identification of clustering centers.Two efficient DDC algorithms are proposed.The former is an efficient and adaptive DDC(EADDC)algorithm,which embeds a sampling method based on Locality Sensitive Hashing(LSH)to reduce time complexity and utilizes an outlier detection based on Density-Based Spatial Clustering of Applications With Noise(DBSCAN)to adaptively recognize cluster centers.Experiment results show the efficiency and adaptability of EADDC algorithm.The latter is a DDC algorithm accelerated by quantum computation,which introduces related quantum algorithms to take advantage of their parallel computing ability.The quantum counting algorithm is developed to accelerate the processing of density for each data point and the quantum algorithm for the finding the minimum is incorporated to find delta distance of each point.The theoretical analysis shows that the efficiency of DDC can be improved significantly by using quantum computation for acceleration.(3)A spectral clustering(SC)algorithm has a good clustering effect on an arbitrary structure dataset for it can converge whole optimization.However,the required time complexity in SC is extremely high.An spectral clustering algorithm based on optimizable k-dissimilarity selection(SOKS)is proposed in this study.SOKS starts from the selection of representative sampling points from a dataset by using an optimizable k-dissimilarity selection algorithm,then uses the Nystr?m method to approximate the eigenvectors of the similarity matrix to cluster the entire dataset.Theoretical analysis and experiments show that the proposed algorithm can effectively reduce computational complexity compared with the original SC algorithm.
Keywords/Search Tags:big data, clustering, sampling, feature reduction, quantum computing, quantum algorithm
PDF Full Text Request
Related items