Research On Parallel Implementation Of Semi-Supervised Clustering

Posted on: 2018-07-28
Degree: Master
Type: Thesis
Country: China
Candidate: H J Wang
Full Text: PDF
GTID: 2348330536459569
Subject: Computer Science and Technology
Abstract/Summary:
As an important method of data analysis, cluster analysis divides sample objects into different clusters according to the similarity of the samples. It requires that the similarity of samples within the same cluster be as large as possible and that the similarity of samples in different clusters be as small as possible. As an unsupervised learning method, cluster analysis does not know the target attributes of the samples before they are partitioned. In many practical applications, however, some samples with supervision information are usually available alongside a large number of unlabeled samples. Semi-supervised clustering investigates how to use this small amount of supervision information to guide the clustering of the unlabeled samples. It usually involves two types of supervision information: the labels of samples, and pairwise constraints between two samples. Combining the sample labels with the sample information yields the Seeds set.

To overcome the insufficient use of supervision information in traditional semi-supervised clustering algorithms, this paper proposes a semi-supervised clustering algorithm that introduces both labeled samples and pairwise constraints into Kmeans to guide the clustering process. The algorithm is named SC-Kmeans (Kmeans based on Seeds set and pairwise constraints). First, the algorithm uses the Seeds set to expand the set of pairwise constraints. Then, to obtain a better initialization, the initial cluster centers are calculated from the Seeds set. Finally, the extended pairwise constraints are introduced into the algorithm to guide the clustering of the samples, and no sample is allowed to violate a pairwise constraint during clustering.

In addition, to obtain higher-quality supervision information, active learning is introduced into SC-Kmeans, based on an analysis of the information contained in the supervision information, to design an active semi-supervised clustering algorithm (Active SC-Kmeans). This algorithm uses the farthest-first traversal scheme to select unlabeled samples: the scheme selects the unlabeled samples farthest from the Seeds set and labels them. Active learning can select more informative supervision information at minimal cost and helps SC-Kmeans obtain better clustering results.

Because existing clustering algorithms cannot process large datasets efficiently, this paper uses the Spark computing framework to parallelize SC-Kmeans. Since SC-Kmeans involves frequent iterative computation, a parallel SC-Kmeans algorithm based on Spark (Spark SC-Kmeans) is proposed that exploits the in-memory computing of the Spark framework.

Experiments on UCI datasets show that the active semi-supervised algorithm proposed in this paper obtains more informative supervision information and improves clustering accuracy. In addition, using a large-scale artificial dataset as test data, we implement the parallelization of SC-Kmeans on the Spark framework and show that Spark SC-Kmeans adapts well to the data size and effectively reduces clustering time.
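The three SC-Kmeans steps described in the abstract (expanding the pairwise constraints from the Seeds set, computing initial centers from the seeds, and clustering without violating constraints) can be illustrated with a minimal Python sketch. The abstract does not give the exact update rules, so everything here is an assumption: the function names (`expand_constraints`, `seed_centers`, `sc_kmeans`) are hypothetical, seeds with the same label are assumed to form must-link pairs and seeds with different labels cannot-link pairs, and the assignment step borrows a greedy COP-Kmeans-style rule (nearest center that violates no constraint) rather than the thesis's own procedure.

```python
import math
from itertools import combinations


def pair(i, j):
    """Store an unordered pair of sample indices in a canonical order."""
    return (i, j) if i < j else (j, i)


def expand_constraints(seeds, must_link=(), cannot_link=()):
    """Assumed expansion rule: same-label seeds become must-link pairs,
    different-label seeds become cannot-link pairs."""
    must = {pair(i, j) for i, j in must_link}
    cannot = {pair(i, j) for i, j in cannot_link}
    for i, j in combinations(sorted(seeds), 2):
        (must if seeds[i] == seeds[j] else cannot).add(pair(i, j))
    return must, cannot


def mean(rows):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(col) / len(rows) for col in zip(*rows))


def seed_centers(points, seeds):
    """One initial center per seed label: the mean of that label's seeds."""
    groups = {}
    for i, label in seeds.items():
        groups.setdefault(label, []).append(points[i])
    return [mean(groups[label]) for label in sorted(groups)]


def violates(i, c, assign, must, cannot):
    """True if putting sample i in cluster c breaks a constraint with an
    already-assigned sample."""
    for a, b in must:
        if i in (a, b):
            other = b if a == i else a
            if other in assign and assign[other] != c:
                return True
    for a, b in cannot:
        if i in (a, b):
            other = b if a == i else a
            if assign.get(other) == c:
                return True
    return False


def sc_kmeans(points, seeds, must_link=(), cannot_link=(), iters=20):
    """Constrained k-means sketch; k equals the number of seed labels."""
    must, cannot = expand_constraints(seeds, must_link, cannot_link)
    centers = seed_centers(points, seeds)
    assign = {}
    for _ in range(iters):
        new_assign = {}
        for i, p in enumerate(points):
            # Try centers from nearest to farthest, skipping violations.
            ranked = sorted(range(len(centers)),
                            key=lambda c: math.dist(p, centers[c]))
            choice = next((c for c in ranked
                           if not violates(i, c, new_assign, must, cannot)),
                          None)
            if choice is None:
                raise ValueError(f"no feasible cluster for sample {i}")
            new_assign[i] = choice
        if new_assign == assign:  # assignments stable: converged
            break
        assign = new_assign
        for c in range(len(centers)):
            members = [points[i] for i, a in assign.items() if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = mean(members)
    return assign, centers
```

A call such as `sc_kmeans(points, {0: "a", 3: "b"}, must_link=[(0, 1)])` returns a mapping from sample index to cluster index together with the final centers. Because the greedy assignment depends on the order in which samples are visited, it can fail on constraint sets that a different order would satisfy, which is a known limitation of this style of rule.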
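The farthest-first traversal used by Active SC-Kmeans to choose samples for labeling can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `farthest_first` is hypothetical, the distance from a sample to the Seeds set is taken to be the Euclidean distance to its nearest already-selected sample, and the labeling step itself (asking an annotator for each chosen sample's label) is left out.

```python
import math


def farthest_first(points, seed_indices, n_queries):
    """Pick up to n_queries unlabeled samples, each time choosing the one
    farthest from everything selected so far (seeds plus earlier picks).

    Assumes seed_indices is non-empty.
    """
    chosen = set(seed_indices)
    # Distance from every sample to its nearest selected sample.
    nearest = [min(math.dist(p, points[s]) for s in chosen) for p in points]
    picked = []
    for _ in range(n_queries):
        candidates = [i for i in range(len(points)) if i not in chosen]
        if not candidates:
            break
        best = max(candidates, key=lambda i: nearest[i])
        picked.append(best)
        chosen.add(best)
        # The new pick may now be the nearest selected sample for others.
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[best]))
    return picked
```

The returned indices are the samples that would be sent for labeling and then added to the Seeds set. Spreading queries far from the existing seeds tends to cover regions of the data that the current supervision information does not yet describe.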
Keywords/Search Tags:Semi-supervised clustering, Pairwise constraint, Seeds set, Active learning, Spark