
Core-points Based Big Data Clustering Algorithm

Posted on: 2018-03-21  Degree: Master  Type: Thesis
Country: China  Candidate: Y Yang  Full Text: PDF
GTID: 2348330536487818  Subject: Applied Mathematics
Abstract/Summary:
With the rapid development of storage and network technology, the amount of data is increasing explosively, and data structures are becoming increasingly complex. How to mine valuable information from massive data has become a focus of current research. Clustering is an important data processing technique in data mining and has been widely used in machine learning, pattern recognition, and related fields. Because the initial conditions and application criteria of clustering are not unique, a variety of clustering algorithms have emerged. However, some classical clustering algorithms are usually inapplicable to massive data. Spectral clustering and Affinity Propagation (AP) can handle arbitrary data sets with high clustering quality, but neither can cluster big data because of their high computational complexity. In recent years, researchers have proposed a number of ideas for big data clustering, among which sampling-based algorithms are widely used. The existing sampling methods, however, cannot balance the quality of the sample set against the computational complexity.

To address the disadvantages of the existing sampling methods, this thesis proposes a similarity-based sampling method with which big data are grouped. First, a small sample subset is randomly selected from the big data set. Second, the similarity between the big data set and the sample set is calculated, and the core-points are selected according to the highest similarity. Finally, each core-point represents a group, and each remaining point is assigned to the group of the core-point with which it has the highest similarity, which groups the big data. This sampling method greatly improves the quality of the sample set at the cost of a small loss of precision. Theoretical analysis and experimental results show that the proposed method has low computational complexity and is easy to apply, that the core-set better reflects the overall information of the big data, and that the method is robust to noise. These results support the applicability and effectiveness of the sampling method.

Because classical clustering algorithms fail on big data, this thesis also proposes a similarity-based big data clustering framework that builds on the foregoing sampling idea and incorporates classical spectral clustering and AP. The resulting algorithms, CBSC (based on spectral clustering) and CBAP (based on AP), extend the excellent performance of these two algorithms to big data. First, the above sampling method is used to obtain the core-set and group the big data. Second, spectral clustering or AP is applied to the core-set to obtain the core-set clustering result. Finally, the big data clustering result is obtained from the correspondence between the core-set and the big data. Theoretical analysis and experimental results show that CBSC and CBAP can not only handle big data but also inherit the advantages of the original clustering algorithms: they can handle arbitrary data sets and are insensitive to noisy data, yet have only nearly linear time complexity. The efficiency of the classical algorithms is greatly improved at the cost of a small loss of precision. Experimental results on artificial and real data sets illustrate the high efficiency of CBSC and CBAP.
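The three-step framework described in the abstract (sample and group, cluster the core-set, map the result back) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the thesis: the similarity measure (negative squared Euclidean distance), the reading of "core-point" as a sampled point that attracts at least one data point, and all function names. In particular, `cluster_core_set` is a simple threshold-graph stand-in for spectral clustering or AP, not an implementation of either.

```python
import random


def similarity(p, q):
    """Assumed similarity: negative squared Euclidean distance."""
    return -sum((a - b) ** 2 for a, b in zip(p, q))


def group_by_core_points(data, n_samples, seed=0):
    """Step 1: draw a random sample, then attach every point to the
    sampled point it is most similar to (hypothetical reading of the
    sampling method). Returns core-point indices and a group id per point."""
    rng = random.Random(seed)
    sample_idx = rng.sample(range(len(data)), n_samples)
    nearest = [
        max(sample_idx, key=lambda s: similarity(data[i], data[s]))
        for i in range(len(data))
    ]
    # Sampled points that attract at least one point act as core-points.
    core_idx = sorted(set(nearest))
    position = {c: k for k, c in enumerate(core_idx)}
    group_of = [position[c] for c in nearest]
    return core_idx, group_of


def cluster_core_set(core_points, threshold):
    """Step 2 stand-in: connected components of the graph that links
    core-points closer than `threshold`. Spectral clustering or AP
    would take this function's place in the framework."""
    labels = [-1] * len(core_points)
    current = 0
    for start in range(len(core_points)):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            u = stack.pop()
            for v in range(len(core_points)):
                close = similarity(core_points[u], core_points[v]) >= -threshold ** 2
                if labels[v] == -1 and close:
                    labels[v] = current
                    stack.append(v)
        current += 1
    return labels


def cluster_big_data(data, n_samples, threshold, seed=0):
    """Step 3: every point inherits the label of its core-point."""
    core_idx, group_of = group_by_core_points(data, n_samples, seed)
    core_labels = cluster_core_set([data[c] for c in core_idx], threshold)
    return [core_labels[g] for g in group_of]


# Toy data: two well-separated 10 x 10 grids of points.
grid = [(i * 0.1, j * 0.1) for i in range(10) for j in range(10)]
data = grid + [(x + 10.0, y + 10.0) for x, y in grid]
labels = cluster_big_data(data, n_samples=20, threshold=3.0)
```

In this sketch the grouping step needs one similarity evaluation per (data point, sampled point) pair, so its cost grows linearly with the data size for a fixed sample size, while the more expensive clustering step only ever sees the core-set.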
Keywords/Search Tags:big data, spectral clustering, Affinity Propagation, sampling, similarity