Research And Application Of Active Semi-supervised Gaussian Mixture Model Clustering Algorithm

Posted on: 2019-07-15
Degree: Master
Type: Thesis
Country: China
Candidate: Y Wang
GTID: 2428330596966491
Subject: Computer software and theory
Abstract/Summary:
With the surge in the amount of data in all areas of society, there is an urgent need for techniques that extract useful hidden information from large amounts of unknown data. Clustering is one such efficient data analysis technique: it groups a large number of unlabeled samples into several categories and provides a good foundation for further data mining. The Gaussian Mixture Model (GMM) is the most typical and most commonly used representative of model-based clustering and has been applied in many fields. However, the traditional GMM cannot make use of a small amount of already labeled data, whereas semi-supervised techniques can use such labeled data to improve clustering performance. The semi-supervised GMM (SGMM) therefore has greater research and practical value. The label-based SGMM uses a small number of labeled samples to guide the clustering of a large number of unlabeled samples, so that the clustering results satisfy the constraints imposed by the labels and the accuracy of model parameter estimation improves. However, when the data set has class imbalance or large overlap between classes, the convergence speed and accuracy of SGMM degrade seriously.

To address this, the thesis combines anti-annealing with the EM algorithm of SGMM and proposes a semi-supervised Gaussian Mixture Model clustering algorithm based on anti-annealing (ASGMM). The inverse temperature parameter of ASGMM rises slowly from a small value greater than 0 to a value greater than 1 and then gradually decreases to 1, and the EM algorithm iterates to convergence at each value of the inverse temperature. Experiments on synthetic data and UCI data sets show that the clustering performance of ASGMM is better than that of SGMM.

Although anti-annealing makes the EM algorithm less prone to getting trapped in local optima and improves accuracy on data sets with large class imbalance or inter-class overlap, ASGMM still depends heavily on the initial parameters of the model, and it cannot cluster network data directly. For this reason, active learning and representation learning are combined with ASGMM, and an active semi-supervised Gaussian Mixture Model based on representation learning (AASGMM) is proposed. AASGMM first uses an active learning algorithm to select a group of high-value nodes from the set of unlabeled nodes and label them, which augments the set of labeled nodes. Then, representation learning fuses the node content information and the link information into a node representation vector. Finally, the node representation vectors and the partial labels are fed to ASGMM as inputs for clustering. Experimental results on synthetic networks and real-world networks show that the clustering performance of AASGMM is better than that of ASGMM. To further test its effectiveness, AASGMM is applied to user profiling on CSDN, where it clusters blog documents and CSDN users respectively. The clustering results show that AASGMM has good practical application value.
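The inverse temperature schedule described for ASGMM can be illustrated with a minimal sketch. The abstract does not give the tempered update itself, so the code below assumes the common deterministic-annealing form, in which the E-step posterior of an unlabeled sample is proportional to (weight x Gaussian density) raised to the power beta, while labeled samples keep fixed one-hot responsibilities. The function names, the schedule endpoints (0.2 and 1.2), the step counts, and the convergence test on the component means are all hypothetical choices for illustration, not values taken from the thesis.

```python
import numpy as np


def anti_annealing_schedule(beta_min=0.2, beta_max=1.2, n_up=6, n_down=3):
    """Inverse temperatures: rise from beta_min (> 0) to beta_max (> 1), then fall to 1."""
    up = np.linspace(beta_min, beta_max, n_up)
    down = np.linspace(beta_max, 1.0, n_down)[1:]  # drop the duplicated peak value
    return np.concatenate([up, down])


def log_gauss(X, mean, cov):
    """Log density of N(mean, cov) evaluated at every row of X."""
    d = X.shape[1]
    L = np.linalg.cholesky(cov)
    z = np.linalg.solve(L, (X - mean).T)  # whitened residuals, shape (d, n)
    return -0.5 * (d * np.log(2 * np.pi) + (z ** 2).sum(axis=0)) - np.log(np.diag(L)).sum()


def tempered_em(X, labels, K, betas, max_iter=200, tol=1e-6, reg=1e-6, seed=0):
    """Semi-supervised GMM fitted by EM, run to convergence at each inverse temperature.

    labels[i] is a class index in 0..K-1 for labeled samples and -1 for unlabeled ones.
    Assumption: the tempered posterior is proportional to (pi_k * N_k(x)) ** beta.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    n, d = X.shape
    labeled = labels >= 0
    one_hot = np.eye(K)[labels[labeled]]

    # Initialise means from labeled samples where available, else from random samples.
    means = X[rng.choice(n, K, replace=False)].copy()
    for k in range(K):
        if np.any(labels == k):
            means[k] = X[labels == k].mean(axis=0)
    covs = np.stack([np.atleast_2d(np.cov(X.T)) + reg * np.eye(d)] * K)
    weights = np.full(K, 1.0 / K)

    for beta in betas:
        for _ in range(max_iter):
            # E-step: tempered responsibilities for unlabeled samples.
            logp = np.stack(
                [np.log(weights[k]) + log_gauss(X, means[k], covs[k]) for k in range(K)],
                axis=1,
            )
            t = beta * logp
            t -= t.max(axis=1, keepdims=True)  # numerical stability
            R = np.exp(t)
            R /= R.sum(axis=1, keepdims=True)
            R[labeled] = one_hot  # labeled samples stay in their known class

            # M-step: weighted maximum-likelihood updates.
            Nk = R.sum(axis=0)
            weights = Nk / Nk.sum()
            new_means = (R.T @ X) / Nk[:, None]
            shift = np.abs(new_means - means).max()
            means = new_means
            for k in range(K):
                diff = X - means[k]
                covs[k] = (R[:, k, None] * diff).T @ diff / Nk[k] + reg * np.eye(d)

            if shift < tol:  # converged at this inverse temperature
                break

    # R holds the responsibilities from the final E-step (beta = 1 for the schedule above).
    return weights, means, covs, R
```

Calling `tempered_em(X, labels, K, anti_annealing_schedule())` and taking `R.argmax(axis=1)` gives hard cluster assignments. Because the schedule ends at an inverse temperature of exactly 1, the last EM pass is ordinary semi-supervised EM, started from the parameters found at the higher inverse temperatures.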
Keywords/Search Tags: Semi-supervised Learning, Gaussian Mixture Model, Active Learning, Representation Learning, User Profiling