Semi-Supervised Clustering Algorithm Based On Pairwise Constraints And Its Parallel Implementation

Posted on:2014-02-14

Degree:Master

Type:Thesis

Country:China

Candidate:C Lin

Full Text:PDF

GTID:2248330398475070

Subject:Computer application technology

Abstract/Summary:

Cluster analysis is an important data mining method that uncovers the natural distribution structure of data objects. It partitions objects into groups according to their attributes, so that objects in the same group are more similar to one another than to objects in different groups. From a machine learning perspective, cluster analysis is an unsupervised learning method: it requires no background knowledge about the data objects being analyzed. In practice, however, some information about the data is usually available, and even a small amount of known information can provide labels for some data objects or constraint information between two instances. Adding such prior knowledge to a traditional unsupervised clustering algorithm, and using it to guide the whole clustering process, yields a semi-supervised clustering algorithm with high clustering accuracy.

In this thesis, we use pairwise constraints to guide the clustering process. Pairwise constraints come in two kinds, Must-link and Cannot-link, each describing the relationship between two data samples: a Must-link constraint requires that the two samples be assigned to the same cluster, while a Cannot-link constraint requires that they be assigned to different clusters. The thesis describes in detail Cop-Kmeans, a semi-supervised clustering algorithm based on pairwise constraints. We propose a new, improved method for resolving the constraint violation problem in Cop-Kmeans; it is also more efficient than the traditional improvement scheme. In addition, we find that pairwise constraints can themselves have an adverse effect on clustering performance, so we propose a further improvement that weakens such effects and improves the accuracy of the clustering result to a certain extent.

Traditional serial clustering algorithms cannot meet memory or computing-speed requirements when the data set is large or high-dimensional. Inspired by the idea of "cloud computing", the thesis therefore processes large-scale data sets in parallel. We use Hadoop to set up a parallel processing platform and parallelize the proposed algorithm according to the MapReduce computing model, so that it can handle large data sets efficiently. Experiments show that the parallel computing model significantly improves clustering efficiency.
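
To make the abstract's two technical ideas concrete, the following is a minimal Python sketch and not the thesis's own code. It assumes the standard Cop-Kmeans assignment rule (each point goes to the nearest centroid that breaks no Must-link or Cannot-link constraint) and a common way of writing the k-means centroid update as a map step and a reduce step. All function names (`violates`, `assign`, `map_point`, `reduce_cluster`, `update_centroids`) are hypothetical, constraints are given as pairs of point indices, and the map/reduce steps are simulated in-process rather than run on Hadoop. The thesis's improved violation handling is not reproduced here.

```python
from collections import defaultdict
from math import dist


def violates(i, cluster, labels, must_link, cannot_link):
    """Return True if putting point i into `cluster` breaks a constraint."""
    for a, b in must_link:
        if i in (a, b):
            other = b if i == a else a
            # Must-link partner is already assigned to a different cluster.
            if other in labels and labels[other] != cluster:
                return True
    for a, b in cannot_link:
        if i in (a, b):
            other = b if i == a else a
            # Cannot-link partner is already assigned to this cluster.
            if labels.get(other) == cluster:
                return True
    return False


def assign(points, centroids, must_link, cannot_link):
    """Assign each point to the nearest centroid that violates no constraint."""
    labels = {}
    for i, p in enumerate(points):
        by_distance = sorted(range(len(centroids)),
                             key=lambda c: dist(p, centroids[c]))
        for c in by_distance:
            if not violates(i, c, labels, must_link, cannot_link):
                labels[i] = c
                break
        else:
            # Plain Cop-Kmeans gives up here: the constraint violation case.
            raise ValueError(f"no feasible cluster for point {i}")
    return labels


def map_point(point, label):
    """Map step: emit (cluster id, (coordinate vector, count)) for one point."""
    return label, (tuple(point), 1)


def reduce_cluster(values):
    """Reduce step: average all vectors emitted for one cluster id."""
    n = sum(count for _, count in values)
    columns = zip(*(vec for vec, _ in values))
    return tuple(sum(col) / n for col in columns)


def update_centroids(points, labels):
    """Group map output by cluster id (a stand-in for the shuffle), then reduce."""
    grouped = defaultdict(list)
    for i, p in enumerate(points):
        key, value = map_point(p, labels[i])
        grouped[key].append(value)
    return {c: reduce_cluster(vals) for c, vals in grouped.items()}
```

In this sketch the assignment loop is order-dependent: a point's feasible clusters depend on the points assigned before it, which is why an unlucky order can leave a point with no feasible cluster and raise the error. The centroid update, by contrast, depends only on per-cluster sums and counts, which is what makes it easy to express as map and reduce steps.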

Keywords/Search Tags:

Semi-supervised clustering, Pairwise constraints, Parallel computing, MapReduce

Related items

1	Research On Active Learning Algorithms Of Pairwise Constraints In Semi-supervised Clustering
2	Semi-supervised Clustering Algorithm And Implementation Based On Seeds Set And Pairwise Constraints
3	Study Of Semi-supervised Fuzzy Clustering Algorithm Based On Pairwise Constraints
4	Research On Semi-supervised Clustering Algorithms With Pairwise Constraints
5	Studies On Semi-Supervised Clustering Algorithms Based On Pairwise Constraints
6	Research On Semi-supervised Clustering Ensemble Approach And Its Application
7	Research On Parallel Implementation Of Semi-Supervised Clustering
8	Research On Two Clustering Algorithms Based On Semi-Supervised Learning
9	Research On Semi-supervised Selective Clustering Ensemble
10	Research On Semi-supervised Learning And Its Application