
Research On Active Learning Algorithms Of Pairwise Constraints In Semi-supervised Clustering

Posted on: 2018-11-02 | Degree: Master | Type: Thesis
Country: China | Candidate: T H Yu | Full Text: PDF
GTID: 2348330542461672 | Subject: Software engineering
Abstract/Summary:
The performance of semi-supervised clustering depends heavily on the choice of side information. If the side information is poorly selected, it may even degrade clustering performance. Supervised information comes in two types: class labels and pairwise constraints. Compared with class labels, pairwise constraints are easier to obtain, and class labels can be easily converted into pairwise constraints. This thesis focuses on active learning methods for constraint-based semi-supervised clustering algorithms and explores more effective active learning strategies. Compared with active learning in other domains, research on active learning of instance-level constraints for semi-supervised clustering is relatively limited, and the existing methods all have deficiencies.

This thesis proposes an improved active learning method. Compared with the Min-Max algorithm, it makes two main improvements. First, a Select phase is added to the Min-Max algorithm. In this phase, the uncertainty of each data point is measured by the number of its neighbors that are not assigned to the same cluster as it, and an informative data set is selected. The Explore and Consolidate phases then work on this informative data set rather than on the whole data set. Second, the method chooses the most informative data point as the first point in the Explore phase instead of choosing it at random. Experiments on UCI datasets show that the proposed algorithm performs better.

In addition, because traditional serial clustering algorithms struggle to meet the demands of current big data processing, this thesis introduces the idea of "cloud computing" and makes two parallelization improvements. First, the proposed active learning algorithm is parallelized according to the MapReduce computation model. Second, the well-known MPCK-means semi-supervised clustering algorithm is parallelized according to the same model. The parallelized active learning algorithm is then combined with the parallelized MPCK-means algorithm to form a practical parallel semi-supervised clustering algorithm. Experiments on large data sets on a Hadoop cluster demonstrate the good scalability of the algorithm.
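The uncertainty measure used in the Select phase can be illustrated with a minimal sketch. The abstract does not specify how neighbors are defined or how the informative set is cut off, so the following are assumptions made only for illustration: neighbors are the k nearest points under Euclidean distance, a point counts as informative when at least `min_disagree` of its neighbors fall in a different cluster, and the function names `neighbor_disagreement` and `select_informative` are hypothetical.

```python
import numpy as np


def neighbor_disagreement(X, labels, k=5):
    """Count, for each point, how many of its k nearest neighbors
    are assigned to a different cluster (assumed uncertainty score)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise squared Euclidean distances (assumed metric).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.einsum("ijk,ijk->ij", diff, diff)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :k]
    return (labels[nearest] != labels[:, None]).sum(axis=1)


def select_informative(X, labels, k=5, min_disagree=1):
    """Return indices of points whose score reaches the (assumed) threshold,
    together with the scores for all points."""
    scores = neighbor_disagreement(X, labels, k)
    return np.flatnonzero(scores >= min_disagree), scores


# Toy data: two well-separated groups, with one point (index 2)
# placed in the "wrong" cluster by an initial clustering.
X = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
labels = [0, 0, 1, 1, 1, 1]

informative, scores = select_informative(X, labels, k=2)
first_point = int(np.argmax(scores))  # candidate starting point for Explore
```

In this sketch the Explore and Consolidate phases would then query constraints only among the points in `informative`, starting from `first_point`, rather than over the whole data set.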
Keywords/Search Tags:active learning, semi-supervised clustering, pairwise constraints, parallel computing, MapReduce