
Research And Application Of Active Semi-supervised K-means Clustering Algorithm

Posted on: 2019-04-19
Degree: Master
Type: Thesis
Country: China
Candidate: F Lv
Full Text: PDF
GTID: 2428330596966525
Subject: Computer application technology
Abstract/Summary:
With the rapid development of network information technology and data acquisition technology, people exchange data with the outside world all the time. How to extract valuable information from the massive data on the Internet to provide a basis for decision-making has become a new research topic. Cluster analysis, an important technology in machine learning and data mining, can partition a data set when the classification criteria are not known in advance, and it plays a crucial role in data mining and further scientific research. Because it is simple, fast, and easy to extend, the partition-based K-means clustering algorithm has become one of the most widely studied and applied clustering algorithms. However, traditional K-means considers only the attribute features of the samples and ignores any available prior information, so it is blind to a certain extent. Semi-supervised learning can use a small number of labeled points as prior information to guide the clustering process and improve clustering performance. The performance of semi-supervised clustering depends to a large extent on the quality of the prior information: the higher the quality, the greater the improvement. Acquiring prior information, however, is often costly. Active learning uses a selection strategy to choose the points that the current learner considers most informative and adds them to the prior information set, which greatly reduces the overhead while ensuring the quality of the prior information. Active semi-supervised clustering therefore obtains higher-quality prior information at lower acquisition cost, further improves the performance of semi-supervised clustering, and has high research and application value.

This thesis focuses on the active semi-supervised K-means clustering algorithm. The main research contents are as follows:

(1) A distributed K-means algorithm based on Spark is used to cluster CSDN data that combines content data and behavior data. The results provide a basis for the company to formulate marketing strategies, and reasonable suggestions are proposed accordingly. The experiment also verifies the feasibility of the K-means algorithm for user clustering.

(2) In Active Pairwise Constrained K-means (APCKmeans) and Pairwise Constrained K-means (PCKmeans), the cluster to which a node is assigned may not be the current optimum. To address this problem, an improved assignment strategy for prior nodes is proposed, which makes full use of prior information to guide sample assignment. Experiments on four UCI datasets show that the improved algorithm achieves better convergence speed and performance.

(3) In Active Learning of Constraints for Semi-Supervised Clustering (ALCSSC), a small number of initial prior nodes affects the results of subsequent clustering iterations and, in turn, the final clustering result. To address this problem, a step that constructs the initial set of prior nodes is added to the original algorithm. In addition, the original strategy of actively selecting a single node is replaced with an importance-based strategy that selects multiple nodes. Experimental results on four UCI datasets show that the improved algorithm outperforms the original algorithm in both performance and efficiency.
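For readers unfamiliar with pairwise-constrained K-means, the following is a minimal sketch of the baseline PCKmeans-style assignment step that the abstract refers to. It is not the improved strategy proposed in the thesis, whose details the abstract does not give. The function names, the `labels` dictionary, and the single uniform penalty weight `w` are assumptions made for illustration only.

```python
from typing import Dict, List, Sequence, Tuple

Point = Sequence[float]
Pair = Tuple[int, int]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_point(
    i: int,
    points: List[Point],
    centroids: List[Point],
    labels: Dict[int, int],
    must_link: List[Pair],
    cannot_link: List[Pair],
    w: float = 1.0,
) -> int:
    """Pick the cluster for point i that minimizes distance plus penalties.

    labels holds the clusters of points that are already assigned
    (hypothetical bookkeeping structure). Each violated constraint
    adds the assumed uniform penalty w to the cost of a cluster.
    """
    best_cluster, best_cost = -1, float("inf")
    for k, centroid in enumerate(centroids):
        cost = squared_distance(points[i], centroid)
        # A must-link is violated if the partner sits in another cluster.
        for a, b in must_link:
            if i in (a, b):
                j = b if a == i else a
                if j in labels and labels[j] != k:
                    cost += w
        # A cannot-link is violated if the partner sits in this cluster.
        for a, b in cannot_link:
            if i in (a, b):
                j = b if a == i else a
                if j in labels and labels[j] == k:
                    cost += w
        if cost < best_cost:
            best_cluster, best_cost = k, cost
    return best_cluster


points = [(0.0, 0.0), (0.9, 0.0), (2.0, 0.0)]
centroids = [(0.0, 0.0), (2.0, 0.0)]

# Without constraints, point 1 goes to the nearer centroid (cluster 0).
plain = assign_point(1, points, centroids, {}, [], [])
# A must-link with point 2 (already in cluster 1) pulls it to cluster 1.
linked = assign_point(1, points, centroids, {2: 1}, [(1, 2)], [])
# A cannot-link with point 0 (already in cluster 0) pushes it to cluster 1.
separated = assign_point(1, points, centroids, {0: 0}, [], [(0, 1)])
```

In this toy example the constraint penalty outweighs the small difference in distance, so the prior information changes the assignment. With a smaller `w` the point would stay with its nearest centroid.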
Keywords/Search Tags:K-means algorithm, Semi-supervised clustering, Active learning, User clustering