| Clustering is an important unsupervised learning method. As a tool for data analysis, its importance has been widely recognized in fields such as pattern recognition and image processing. The purpose of cluster analysis is to uncover the structure hidden in data and, as far as possible, to group objects with similar properties according to some similarity measure. In recent years, the volume of data and information has grown exponentially. Confronted with large scale data, traditional clustering algorithms are no longer efficient because of constraints on time, memory, CPU and other resources: they suffer from low throughput, long running times on large scale data, and difficulty in achieving the desired results. Effective clustering of large scale data has therefore attracted the attention of many researchers and has become a focus of the international data mining community. To address the inability of current clustering algorithms to process large scale data effectively, this thesis carries out research in two major parts. In the first part, for large and complex data, and targeting partitioning clustering algorithms and the spectral clustering algorithm, it designs two parallel clustering algorithms that combine parallel computing and cloud computing technologies: a sampling-based partitioning clustering algorithm on MapReduce and an efficient parallel spectral clustering algorithm for cloud environments. In the second part, it focuses on large scale data in specific areas such as complex networks, biological gene data and image segmentation, and designs F-DC, a clustering ensemble algorithm for community detection in complex networks; WSSC, a weighted semi-supervised clustering algorithm for image segmentation; and DCECluster, a biclustering algorithm based on differential co-expression for gene data. 
The main contributions of this thesis are as follows: (1) It introduces a common framework for sampling-based partitioning clustering of large scale datasets, together with its implementation on the MapReduce framework. In more detail, improved sampling techniques are first used to handle the large scale dataset, and a common framework is then designed for sample-based partitioning clustering algorithms, whose effectiveness is verified by implementing the k-means and k-medoids algorithms. On this basis, the MapReduce programming model is applied to implement the proposed framework. Experiments demonstrate the effectiveness of the proposed approach on large scale datasets. (2) It proposes an efficient parallel spectral clustering algorithm. The strategy is as follows. First, the construction of the distance matrix and the similarity matrix is improved, and kd-tree techniques are introduced to sparsify the similarity matrix. When computing the eigenvectors, the Laplacian matrix is stored in the distributed file system HDFS, and a distributed Lanczos method is used to compute the eigenvectors in parallel. Finally, an improved parallel k-means clustering is applied to the transposed eigenvector matrix to obtain the results. Because a different parallel strategy is used in each step, the algorithm achieves a linear speedup. Experiments show that, as the data processing scale expands, the clustering speed increases linearly, which makes the parallel spectral clustering algorithm suitable for massive data mining. (3) It designs an efficient community detection algorithm, F-DC. The details are as follows. First, a model of the time-evolving network is proposed to give a unified description of each cluster. Then, because real networks evolve over time, a method based on snapshot clustering segmentation is implemented to generate the clustering members. 
Finally, taking into account the differences between the distribution of the cluster centers and the actual distribution among the clustering members, the clustering results are combined by an ensemble method based on maximum likelihood. Extensive experiments verify the validity of the clustering ensemble algorithm for community detection in time-evolving networks. (4) It presents a weighted semi-supervised clustering algorithm for image segmentation, WSSC. The algorithm first introduces the concept of weights into the traditional semi-supervised clustering algorithm and gives the formula for computing them. On this basis, it obtains class labels by optimizing the probability matrix. Each image can be expressed as d-dimensional random vectors, and each pixel can be drawn independently from a mixture density. Applying WSSC, the image segmentation result is obtained from the class labels of the mixture components. A series of experiments on two groups of image data show that the proposed WSSC algorithm is more efficient, with advantages that are more pronounced on large scale color images. (5) It designs a new multi-valued sample discretization method based on rough sets, and then implements DCECluster, an algorithm for mining maximal biclusters of gene data based on a sample weighted graph of differential co-expression and a search pruning strategy. First, the discretized dataset is represented as a weighted graph based on differential sample relations, which effectively eliminates unrelated genes. Then the concept of support is redefined in terms of the differential co-expression relation between genes, and finally the candidate biclusters are pruned using an effective search strategy and pruning strategy. 
A comparison of four differential co-expression biclustering algorithms in terms of effectiveness and efficiency shows the advantages of the proposed algorithm: fast processing, a large number of effective biclusters, low memory cost, and so on. As an intangible means of production in the information society, large scale data is growing explosively in the construction of smart cities. It is like blood running through every aspect of smart city construction, including intelligent transportation, intelligent healthcare and daily life. The complex analysis and mining of large scale data will yield a series of regularities that support decision-making and services. The research results obtained in this thesis can provide good support for smart city construction. |
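
The sampling idea behind contribution (1) can be illustrated with a minimal single-machine sketch: cluster a random sample, then assign every point in the full dataset to its nearest center. This is not the thesis's framework or its MapReduce implementation; plain uniform sampling stands in for the improved sampling techniques, and all function names (`kmeans`, `sample_then_cluster`, and so on) and parameters are assumptions made for this example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the center closest to `point`."""
    return min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; initial centers are k random input points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Recompute each center as its group's mean; keep the old center
        # if a group ended up empty.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers


def sample_then_cluster(points, k, sample_size, seed=0):
    """Cluster a uniform random sample, then label the full dataset.

    Assumption: uniform sampling replaces the thesis's improved sampling.
    """
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    centers = kmeans(sample, k, seed=seed)
    labels = [nearest(p, centers) for p in points]
    return centers, labels


# Toy data: two well-separated groups of 50 points each.
blob_a = [(i % 5 * 0.1, i // 5 * 0.1) for i in range(50)]
blob_b = [(x + 10.0, y + 10.0) for x, y in blob_a]
points = blob_a + blob_b

centers, labels = sample_then_cluster(points, k=2, sample_size=30, seed=1)
```

In a MapReduce setting, the final labeling step is the natural candidate for a map task, since each point is assigned independently once the centers are known; how the thesis actually divides the work between map and reduce is not stated in this abstract.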