Semi-Supervised Clustering Algorithm Based On Pairwise Constraints And Its Parallel Implementation

Posted on: 2014-02-14
Degree: Master
Type: Thesis
Country: China
Candidate: C Lin
GTID: 2248330398475070
Subject: Computer application technology

Abstract/Summary:
Cluster analysis is an important data mining method that uncovers the natural distribution structure of data objects. It partitions objects into groups according to their attributes, so that objects in the same group are more similar to each other than to objects in different groups. From a machine learning perspective, cluster analysis is an unsupervised learning method that requires no background knowledge about the data objects. In practice, however, some information about the data to be analyzed is usually available, and even a small amount of known information can supply labels for some data objects or constraints between pairs of instances. Adding such prior knowledge to a traditional unsupervised clustering algorithm and using it to guide the whole clustering process yields a semi-supervised clustering algorithm with more accurate clustering results.

In this thesis, we use pairwise constraints to guide the clustering process. Pairwise constraints come in two kinds, Must-link and Cannot-link, each describing the relationship between two data samples: Must-link means the two samples must be assigned to the same cluster, while Cannot-link means they must be assigned to different clusters. The thesis describes in detail Cop-Kmeans, a semi-supervised clustering algorithm based on pairwise constraints. We propose a new, improved method for resolving the constraint violations that arise in Cop-Kmeans; it is also more efficient than the traditional improvement scheme. In addition, we find that pairwise constraints can themselves have an adverse effect on clustering performance, so we propose a further improvement to address this.
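
To make the Must-link and Cannot-link rules concrete, here is a minimal Python sketch of a constrained assignment pass in the style of the standard Cop-Kmeans algorithm. It is not the improved method proposed in the thesis, whose details the abstract does not give. The function names `violates` and `assign_points`, and the representation of constraints as lists of index pairs, are assumptions made for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def violates(i, cluster, labels, must_link, cannot_link):
    """Return True if putting point i into `cluster` breaks a constraint."""
    for a, b in must_link:
        other = b if a == i else a if b == i else None
        # A Must-link partner that is already assigned elsewhere is a violation.
        if other is not None and labels[other] not in (None, cluster):
            return True
    for a, b in cannot_link:
        other = b if a == i else a if b == i else None
        # A Cannot-link partner already in this cluster is a violation.
        if other is not None and labels[other] == cluster:
            return True
    return False


def assign_points(points, centers, must_link, cannot_link):
    """One constrained assignment pass; returns labels, or None on failure."""
    labels = [None] * len(points)
    for i, p in enumerate(points):
        # Try clusters from the nearest center to the farthest.
        order = sorted(range(len(centers)), key=lambda c: dist(p, centers[c]))
        for c in order:
            if not violates(i, c, labels, must_link, cannot_link):
                labels[i] = c
                break
        else:
            # No feasible cluster for point i: the constraint-violation failure.
            return None
    return labels
```

With points `[(0, 0), (0.1, 0), (5, 5), (5.1, 5)]` and centers `[(0, 0), (5, 5)]`, a Cannot-link constraint on points 0 and 1 pushes point 1 into the farther cluster. Three mutually Cannot-linked points with only two clusters make the pass return `None`, which is the kind of constraint-violation failure the thesis sets out to handle.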
This improvement weakens such adverse effects and improves the accuracy of the clustering result to a certain extent.

Traditional serial clustering algorithms cannot meet memory or computing-speed requirements when the data to be clustered is large or high-dimensional. Inspired by the idea of "cloud computing", the thesis therefore processes large-scale data sets in parallel. We use Hadoop to set up a parallel processing platform and parallelize the proposed algorithm according to the MapReduce computing model, so that it can handle large data sets efficiently. Experiments show that the parallel computing model significantly improves clustering efficiency.
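
As a rough illustration of how a k-means style update fits the MapReduce model, the Python sketch below simulates one map, shuffle, and reduce round in a single process. It does not use the Hadoop API, it omits the pairwise constraints, and it is not the parallel algorithm of the thesis. The function names `map_point`, `reduce_cluster`, and `kmeans_iteration` are hypothetical.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def map_point(point, centers):
    """Map step: emit (index of nearest center, (point, 1))."""
    nearest = min(range(len(centers)), key=lambda c: dist(point, centers[c]))
    return nearest, (point, 1)


def reduce_cluster(values):
    """Reduce step: average all points emitted for one center index."""
    total = [0.0] * len(values[0][0])
    count = 0
    for point, n in values:
        total = [t + x for t, x in zip(total, point)]
        count += n
    return tuple(t / count for t in total)


def kmeans_iteration(points, centers):
    """Simulate one map -> shuffle -> reduce round in a single process."""
    groups = defaultdict(list)
    for p in points:
        key, value = map_point(p, centers)
        groups[key].append(value)  # shuffle: group emitted values by key
    # Centers that received no points are kept unchanged.
    return [reduce_cluster(groups[c]) if c in groups else centers[c]
            for c in range(len(centers))]
```

Each map call depends only on one point and the current centers, so on a real cluster the points could be split across machines and processed independently. Only the per-cluster averaging in the reduce step needs the grouped data.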
Keywords: Semi-supervised clustering, Pairwise constraints, Parallel computing, MapReduce