
Distributed Clustering And Evolutionary Clustering Algorithm Based On Semi-supervised Learning

Posted on: 2013-01-12  Degree: Doctor  Type: Dissertation
Country: China  Candidate: Y C Yu  Full Text: PDF
GTID: 1268330422452674  Subject: Computer application technology
Abstract/Summary:
Traditional clustering analysis is a method of unsupervised learning. During the clustering process, only the unlabeled data are used, and the various kinds of prior information inherent in the data are ignored. When semi-supervised clustering is applied to unlabeled data, this prior information is introduced into the clustering process, which improves clustering performance. Most semi-supervised clustering methods currently assume that all data are centralized on one site and that the number and distribution of samples do not change over time. In many data-mining applications, however, two special kinds of data sets arise: distributed data sets and evolutionary data sets. In the distributed scenario, data are naturally distributed among several sites, and because of restrictions on data security, data privacy, communication cost, and so on, it may be impractical to centralize all data at a central site. The goal of clustering evolutionary data is to produce a sequence of clusterings, where each clustering in the sequence must satisfy two criteria: it should be a high-quality clustering of the current data set, and it should be similar to the previous clusterings in the sequence.

The dissertation consists of two parts. First, for data sets horizontally partitioned across several sites, effective distributed semi-supervised clustering methods based on parallel k-means and on the distributed EM algorithm are developed. Second, the methods based on constrained distributed k-means and constrained distributed EM are extended and applied to clustering evolutionary data sets, yielding evolutionary clustering methods that work through constrained clustering. The main contributions of this dissertation are summarized as follows:

(1) Three frameworks of distributed semi-supervised clustering based on parallel k-means are developed. (a) The heuristic distributed k-means framework uses the local clustering results from the user sites as prior information; its main advantage is that it mitigates the strong sensitivity of parallel k-means to the choice of initial centers. (b) Constrained Parallel k-means (CPKM) builds on parallel k-means. Using pairwise constraints provided by the user sites, it alleviates the tendency of parallel k-means to converge to poor local minima. In addition, the mean of each constrained pair is taken as a representative point and its cluster assignment is used as the estimate for both points, which prevents the cluster assignments of constrained points from depending on the order in which the constrained points are processed. (c) Distributed k-means Based on Soft Constraints (DBSC) also uses pairwise constraints as prior information, but the constraints between pairs of points are soft, meaning that erroneous constraints may occur. By defining weights for constraint violations and adopting different strategies to handle them, DBSC can effectively exploit the constraints to improve the accuracy of distributed clustering.
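To make the constraint-handling idea in contribution (1) concrete, the following is a minimal Python sketch of k-means assignment with soft pairwise constraints: must-link and cannot-link violations add a weighted penalty to the assignment cost. The weight w, the single-pass assignment order, and the dictionary-based constraint encoding are illustrative assumptions, not the dissertation's CPKM/DBSC formulations.

```python
# Illustrative sketch only: soft pairwise constraints in k-means,
# not the dissertation's exact CPKM/DBSC algorithms.
import numpy as np

def soft_constrained_kmeans(X, k, must_link, cannot_link, w=1.0, n_iter=20, seed=0):
    """must_link / cannot_link map a point index to a list of constrained indices."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)               # provisional labels (all cluster 0)
    for _ in range(n_iter):
        for i, x in enumerate(X):
            # Distance cost to each center plus penalties for violated constraints.
            costs = np.sum((centers - x) ** 2, axis=1)
            for j in must_link.get(i, []):
                costs += w * (np.arange(k) != labels[j])   # penalize splitting a must-link pair
            for j in cannot_link.get(i, []):
                costs += w * (np.arange(k) == labels[j])   # penalize merging a cannot-link pair
            labels[i] = int(np.argmin(costs))
        for c in range(k):                              # standard center update
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return centers, labels

# Toy usage on two Gaussian blobs with one must-link and one cannot-link pair.
X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5.0])
centers, labels = soft_constrained_kmeans(X, 2, must_link={0: [1]}, cannot_link={0: [25]})
```

Because the penalties are added to the assignment cost rather than enforced as hard rules, an erroneous constraint only biases an assignment instead of forcing it, which is the spirit of the soft-constraint treatment in DBSC.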
(2) By introducing soft constraints into the Gaussian mixture model, the Constraints Consistency Gaussian Mixture Model (CCGMM) is developed; as its extension, the Distributed Constraints Consistency Gaussian Mixture Model (DCCGMM) is developed and applied to distributed clustering. (a) CCGMM has two major characteristics. First, the estimated parameters reflect both the inherent distribution of the data points and the prior information supplied by users. Second, the iteration formulae for the parameters have closed-form expressions. (b) DCCGMM adopts distributed EM as the iterative framework for estimating the parameters of the global model. During global parameter estimation, only the parameters of the local models from the corresponding sites are used, while the data points and constraints are used only at the site where they reside.

(3) By analyzing the relationship between the log-likelihood function and the objective function of k-means, a further framework of distributed semi-supervised clustering is developed. The framework is based on the MPCK-MEANS algorithm and can perform constrained clustering and local distance-metric learning simultaneously.

(4) Two frameworks of evolutionary clustering are developed. Unlike other evolutionary clustering methods, both treat the clustering results at the previous timestamp as prior information for the current data points and realize evolutionary clustering through constrained clustering. To guarantee temporal smoothness of the clustering sequence, the history-cost functions are defined in the form of constraint violations. (a) Constrained Evolutionary k-means (CKEM) uses the centers of the current cluster assignments to fit the historical data points and defines the total history cost as the total distance between the historical data points and the current centers. (b) In the framework of Evolutionary GMM Using Historical Cluster Labels (EGMM), pairs of historical data points that had the same cluster label at the previous timestamp are treated as pairwise constraints; the total difference in posterior probability over all such pairs, computed under the current parameter model, is then used to adjust the parameter estimation of the current model.
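As an illustration of the temporal-smoothness idea in contribution (4), the sketch below adds a history-cost term to a Lloyd-style k-means update: historical points are attached to the nearest current center and down-weighted by a factor alpha in the center update. The weighting scheme and the alpha parameter are assumptions for illustration; they are not the dissertation's CKEM history-cost definition.

```python
# Illustrative sketch only: evolutionary k-means with a weighted history cost,
# not the dissertation's exact CKEM procedure.
import numpy as np

def evolutionary_kmeans(X_t, X_prev, k, alpha=0.5, n_iter=20, seed=0):
    """Cluster the snapshot X_t while keeping centers close to the previous data X_prev."""
    rng = np.random.default_rng(seed)
    centers = X_t[rng.choice(len(X_t), k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Snapshot cost: assign current points to their nearest center.
        labels = np.argmin(((X_t[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        # History cost: historical points also attach to the nearest current center.
        hist = np.argmin(((X_prev[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            cur, old = X_t[labels == c], X_prev[hist == c]
            weight = len(cur) + alpha * len(old)
            if weight == 0:
                continue
            # Weighted mean trades current clustering quality against temporal smoothness.
            centers[c] = (cur.sum(axis=0) + alpha * old.sum(axis=0)) / weight
    return centers, labels

# Toy usage: the previous snapshot is slightly shifted relative to the current one.
X_prev = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 4.0])
X_t = X_prev + 0.3 * np.random.randn(*X_prev.shape)
centers, labels = evolutionary_kmeans(X_t, X_prev, k=2, alpha=0.5)
```

Setting alpha to zero recovers plain k-means on the current snapshot, while larger values pull the centers toward the historical data and smooth the clustering sequence over time.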
Keywords/Search Tags: Distributed Clustering, Semi-Supervised Clustering, Prior Information, Parallel k-means, Gaussian Mixture Model, Distributed EM, Evolutionary Clustering, Temporal Smoothness