Research On High Scalable Clustering Analysis Method

Posted on: 2014-07-29
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C Liu
GTID: 1268330425467050
Subject: Signal and Information Processing

Abstract/Summary:
Clustering has long been a hot topic in pattern recognition, with a wide range of applications including statistics, image processing, medical diagnostics, information retrieval, biology, and machine learning. Many clustering methods have emerged in recent years. Most of them are limited by the scalability of the underlying algorithm: they perform very well on data sets of a particular scale, but are often ineffective on data sets outside that scale and sometimes cannot run at all. With the rapid development of information collection and storage technologies, the diversity of data has become more prominent, and the search for highly scalable clustering methods has attracted growing interest.

This thesis studies the scalability of clustering algorithms and the difficulty of applying them to large data sets, which stems from their high computational complexity and large memory requirements. The main contributions are as follows:

(1) Many classic clustering algorithms achieve very good results on small clustering tasks, but their weak scalability makes most of them ill-suited to large-scale clustering tasks, or unable to complete them at all. To explore highly scalable clustering and make the method adapt to a wide range of data sets, this thesis builds on the idea of piecewise processing and further studies a way of handling the data set by segmentation first and partition second. A clustering method based on this approach, the clustering method based on data segmentation and partition, is proposed. The proposed method does not need to read all the data into main memory at the same time, which greatly reduces the demand for hardware resources. Compared with traditional methods that generate centers iteratively, it is also less likely to fall into local optima.

(2) DP is a highly elastic clustering method that has shown excellent performance on both small and large data collections. Its applicability is still limited, however: when the data scale is very large, the local characteristic sample set becomes too large to fit in main memory, and DP no longer works well. To address this shortcoming, the thesis analyzes the theory of DP in depth, develops the idea of step-by-step compression, and proposes an improved DP, the clustering method based on Means Radial Compression (MRC). The experimental results show that the means radial compression algorithm produces better solutions than DP, with a time complexity of O(n).

(3) A visual analysis method, MinDS, is proposed for assessing the clustering characteristics of data features based on the minimum distance spectrum. The data used in a clustering analysis is usually turned into features by a data representation step. These features should be inherently related, so that the data exhibits a grouping structure, and cluster analysis identifies these groups according to some similarity measure. The data representation and the choice of features therefore directly affect the final clustering result. MinDS first defines the minimum distance spectrum model; by analyzing the minimum distance spectrum characteristics, cube data can be mapped to a two-dimensional data space. The method gives very good results for intuitively evaluating how well the data features cluster and for analyzing why a clustering method fails. MinDS can also be used to deal with noise, to identify outliers, and to discover data categories.
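
As a rough illustration of the idea in contribution (1), processing the data piece by piece so that it never has to sit in main memory all at once, the following Python sketch clusters each chunk separately and then clusters the resulting weighted representatives. It is not the algorithm of the thesis, whose details the abstract does not give. The use of k-means at both stages and the names `kmeans`, `segment_then_partition`, `local_k` and `final_k` are assumptions made for this example.

```python
import numpy as np


def _nearest(points, centers):
    """Index of the nearest center for every point (squared Euclidean)."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def kmeans(points, k, weights=None, iters=20, seed=0):
    """Plain weighted Lloyd's k-means (illustrative helper, not from the thesis)."""
    rng = np.random.default_rng(seed)
    if weights is None:
        weights = np.ones(len(points))
    # Start from k distinct data points.
    start = rng.choice(len(points), size=k, replace=False)
    centers = points[start].astype(float)
    for _ in range(iters):
        labels = _nearest(points, centers)
        for j in range(k):
            mask = labels == j
            if mask.any():  # an empty cluster keeps its old center
                centers[j] = np.average(points[mask], axis=0,
                                        weights=weights[mask])
    return centers, _nearest(points, centers)


def segment_then_partition(chunks, local_k, final_k):
    """Two-stage clustering over an iterable of chunks.

    Stage 1 (segmentation): each chunk is clustered on its own, so only
    one chunk is held in memory at a time.
    Stage 2 (partition): the local centers, weighted by how many points
    they stand for, are clustered into the final centers.
    """
    summaries, counts = [], []
    for chunk in chunks:
        centers, labels = kmeans(chunk, local_k)
        summaries.append(centers)
        counts.append(np.bincount(labels, minlength=local_k))
    reps = np.vstack(summaries)
    weights = np.concatenate(counts).astype(float)
    keep = weights > 0  # drop local centers that ended up empty
    final_centers, _ = kmeans(reps[keep], final_k, weights=weights[keep])
    return final_centers


if __name__ == "__main__":
    # Hypothetical data: three chunks, each mixing two well-separated blobs.
    rng = np.random.default_rng(1)
    demo_chunks = (
        np.vstack([rng.normal(0.0, 0.5, size=(100, 2)),
                   rng.normal(10.0, 0.5, size=(100, 2))])
        for _ in range(3)
    )
    print(segment_then_partition(demo_chunks, local_k=4, final_k=2))
```

Because `chunks` can be any iterable, including a generator that reads one block of a file at a time, peak memory is governed by the chunk size and the number of local centers rather than by the size of the whole data set.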
Keywords/Search Tags: scalable clustering, data division and partition, means radial compression, the minimum distance spectrum